The Engine Code - 's dilemma

's funda

When a movie title containing 's is tokenized, the apostrophe (') is replaced with a space, so the trailing s appears as a separate word in the token list. For example, if the movie title is Schindler's List, then the tokenizing operation gives ['Schindler','s','List']. I am proposing the following solution for this:


  • We can call split() on the original movie title to get the total number of words in the title. If that count does not match the number of words in the tokenized list, then we have to remove the extra 's' that appears as a separate word in the tokenized list.
  • Before going ahead with this, we have to make sure that 's actually exists in the movie title; only then can we proceed with the removal. A regular expression matcher can give us this information.
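
The two steps above can be sketched in Python roughly as follows. This is only an illustration under assumptions: tokenize() is a hypothetical stand-in for the engine's tokenizer (it just replaces ' with a space and splits on whitespace), and the names POSSESSIVE_RE and remove_possessive_tokens are made up for this example.

```python
import re

# Matches a word character followed by 's at a word boundary, e.g. "Schindler's".
POSSESSIVE_RE = re.compile(r"\w's\b")


def tokenize(title):
    """Hypothetical stand-in for the engine's tokenizer (an assumption):
    replace each apostrophe with a space, then split on whitespace."""
    return title.replace("'", " ").split()


def remove_possessive_tokens(title, tokens):
    """Drop the stray 's' tokens produced by 's in the original title."""
    # Proceed only if 's really exists in the original title.
    if not POSSESSIVE_RE.search(title):
        return tokens

    # Compare the word count from split() with the tokenized list.
    extra = len(tokens) - len(title.split())
    if extra <= 0:
        return tokens

    cleaned = []
    for token in tokens:
        # A stray 's' always follows another word, so never drop the first token.
        if token == "s" and extra > 0 and cleaned:
            extra -= 1
            continue
        cleaned.append(token)
    return cleaned


title = "Schindler's List"
print(remove_possessive_tokens(title, tokenize(title)))
```

Titles without 's are returned unchanged. Note that this simple version removes the first lone 's' tokens it finds after the first word, so a title that contains both 's and a genuine standalone word s, or other apostrophes such as in Don't, would need a stricter check.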