Movie search algorithm: NLP part 1

Theethat Anuraksoontorn · Published in Analytics Vidhya · 5 min read · Aug 25, 2020

To be honest, this search algorithm is purely an accident: I mistook a recommendation system for a search engine. The idea started when I wanted to create a recommendation system based on natural language processing, one that responds not to your implicit data (past movie history, likes, reviews, etc.) but to explicit data. I imagined a recommender that simply tells you what movie you would like given the explicit description you provide, such as "the movie that has an alien" or "a lot of fighting, no drama". What I just described is an NLP search engine, not a recommendation system. It also means that this time we will be working on unstructured data, which, to partially quote Wikipedia, is

Information that either does not have a pre-defined data model or is not organized in a pre-defined manner.

The sentence simply means that unstructured data is a mess that needs to be transformed into a set of numbers before we can continue our task.

Imagine that you are about to use some data. The first thing that comes to mind is that it needs to be computable, so it must be either a categorical or a numerical value. That is not the case when you face a sentence, a text, or a paragraph; those are simply strings joined together.

Why we need the computer to compute

That is because we are doing machine learning, and for a computer to learn, it first needs to understand the data. Let's first explain what machine learning is: machine learning is simply how a computer creates a rule without being explicitly programmed. The rule we are talking about is how the computer makes a decision; in this case, we are trying to make it decide which movie it should search for you when it receives a text.

To turn text into quantitative data, we first need to transform it into numbers. This is the same as when we transform a categorical variable such as gender into male {1,0} and female {0,1}; in statistics this is called a dummy variable, but in programming it is called one-hot encoding. As I mentioned, this whole process of transforming unstructured text data into trainable, processable data is called encoding. However, there is a simple but significant difference from the usual case: the encoding of text data covers every word in your dataset. This changes everything, because a single paragraph of text can contain more than 200 characters or 40 unique words. Now, enough with the preparation, let's do some programming!
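To make the gender example concrete, here is a minimal sketch of one-hot encoding with pandas (the toy `gender` column is made up for illustration):

```python
import pandas as pd

# A toy categorical column, as in the gender example above
df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# get_dummies turns each category into its own 0/1 column:
# male -> {1,0}, female -> {0,1}
encoded = pd.get_dummies(df["gender"])
print(encoded)
```

Text encoding works the same way, except that every unique word in the dataset becomes its own column.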

Movie Data

Go to the data.world site, one of the most diverse open-source dataset repositories, and download "Hydra-Movie-Scrape.csv". Using the pandas library, we will first take a peek at our dataset.
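Loading the file is a one-liner with pandas. Since I can't bundle the dataset here, this sketch reads a tiny stand-in with assumed column names; with the real file downloaded, `pd.read_csv("Hydra-Movie-Scrape.csv")` is all you need:

```python
import io
import pandas as pd

# A tiny stand-in for Hydra-Movie-Scrape.csv; the column names are
# assumptions based on the dataset description, not the real file
csv_text = """Title,Year,Summary,Short Summary
Alien,1979,"The crew of a commercial spacecraft encounter a deadly lifeform.","Crew meets a deadly alien."
Heat,1995,"A group of professional bank robbers start to feel the heat from the police.","Robbers feel the heat."
"""

# With the real file in place: movies = pd.read_csv("Hydra-Movie-Scrape.csv")
movies = pd.read_csv(io.StringIO(csv_text))
print(movies.head())
```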

As you can see, it contains data about each movie: the title, release year, full summary, short summary, rating, genre, and more. Only the columns I have mentioned are potentially useful and interesting for our task.

First, before we can encode words, we have to separate them. This is simple: we can either use the "re" library to split a sentence into words, or use the "nltk" library's word_tokenize function to do it for us.
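As a quick sketch, here is the "re" route on a made-up summary sentence (the nltk alternative is noted in the comment):

```python
import re

text = "The crew of a commercial spacecraft encounter a deadly lifeform."

# A simple regex tokenizer: grab runs of word characters
tokens = re.findall(r"\w+", text)
print(tokens)

# The nltk route would be:
#   from nltk.tokenize import word_tokenize
#   tokens = word_tokenize(text)
# (after a one-off nltk.download('punkt'))
```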

You can see these are the dummy variables we created from one single data point. A total of 44 new variables were created; imagine tokenizing all 3,940 movies in the dataset, which would be a total nightmare for most ML beginners' computers. However, there are options: change the dataset, or preprocess it. We can opt for the Short Summary column to save us some time; the trade-off is that we lose some of the information contained in the full summary. This emphasizes the point of choosing proper training data, which is purely about optimizing the signal-to-noise (or useful-to-useless) ratio. Noise is something totally random and useless for our task, while a signal is a data point that gives our model a clean cut for generalization. To compare the noise, we first look at the number of characters.

This shows that the full Summary is about three times longer than its short version. This length difference alone does not tell us whether the extra length is noise or signal; we need a further look.
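The character-count comparison can be sketched like this, again on a hypothetical stand-in frame (the real numbers come from running it on the full dataset):

```python
import pandas as pd

# Hypothetical rows standing in for the dataset's two summary columns
movies = pd.DataFrame({
    "Summary": [
        "The crew of a commercial spacecraft encounter a deadly lifeform "
        "after investigating an unknown transmission from a nearby moon."
    ],
    "Short Summary": ["Crew meets a deadly alien."],
})

# Mean character count per column: a rough measure of the length gap
long_len = movies["Summary"].str.len().mean()
short_len = movies["Short Summary"].str.len().mean()
print(long_len, short_len, round(long_len / short_len, 1))
```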

This is clearly not what we want: it reduces the noise but also removes information that is useful for our task. Luckily, we have another option.

Preprocessing. Text preprocessing follows exactly the same idea as any other data preprocessing: we just need to increase the signal-to-noise ratio. So let's define what noise really is in text: a set of words that carry near-zero meaning or value but need to be in the text for a purpose. This leads us to words such as "a", "the", "is-am-are", and "with", that is, the verb to be, prepositions, and punctuation, collectively called stop-words. This is noise used purely for aesthetic and grammatical purposes; by removing the stop-words, we can minimize the noise while keeping all the signal.
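Stop-word removal can be sketched as a simple filter. The small stop-word set below is hand-picked for illustration; in practice you would use nltk's list, as noted in the comment:

```python
import re

# A hand-picked stand-in stop-word set; the usual source is
# nltk.corpus.stopwords.words('english') (after nltk.download('stopwords'))
stop_words = {"a", "the", "is", "am", "are", "of", "with", "and", "to", "in"}

text = "The crew of a commercial spacecraft encounter a deadly lifeform."
tokens = re.findall(r"\w+", text.lower())

# Keep the signal: drop every token that appears in the stop-word set
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
```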

This leads to more ways of keeping the signal while reducing the noise. The next thing that comes to mind is synonyms, such as "fire" and "flame" (sure, they might be used differently, but they are semantically the same). If two words share the same meaning, we should just map them to the same word to reduce our model's processing time. In NLP this process is called lemmatization and stemming: the former is what I just explained, while the latter is the reduction of a word to its stem, such as changing, changes -> change.
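To give a feel for stemming, here is a toy suffix-stripping stemmer; it is purely illustrative (note that a stem need not be a dictionary word), and in practice you would reach for nltk's PorterStemmer or WordNetLemmatizer instead:

```python
# A toy suffix stripper, not a real stemmer: it removes the first matching
# suffix while keeping at least a three-letter stem
def simple_stem(word):
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# All three inflections collapse to the same stem
for w in ("changing", "changes", "changed"):
    print(w, "->", simple_stem(w))
```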

Disclaimer: I will not demonstrate the hard-core stuff such as neural networks for POS tagging and word embeddings.

And that is it for the first part of my search algorithm on the movie dataset! I hope you guys stay tuned for the next part next week.
