Understanding Movie Quality from Plot Summaries using Natural Language Processing
For my capstone project in General Assembly’s Data Science Immersive program, I built a model that takes a plot summary as input and predicts whether that plot belongs to a good movie. In other words, I wanted to use Natural Language Processing (NLP) on Wikipedia plot summaries to predict Metacritic Metascores for a large set of movies. This blog post is a high-level overview of my process. All of the code can be found on my public GitHub. In the future, I will add more detail on specific parts of my capstone project.
I started off by web scraping Metacritic.com with a scraper built on BeautifulSoup. The site’s clean HTML made it easy to scrape, so kudos to Metacritic. The scraper pulled movie titles by genre (action through westerns), along with cast member names, director names, release years, plot summaries, and the compiled Metacritic Metascore. A “Metascore” is an average of critic reviews, weighted by each critic’s publication and quality (as determined by Metacritic). This process reminded me of how FiveThirtyEight weights and curates polls. I deemed any movie rated above 5.0 a “Good” movie and anything below that a bad movie, and encoded this as a binary label. The scraping was batched into groups by genre because attempting to scrape the website in one sweep was problematic. Lastly, I want to thank Mukul Ram for his assistance with this step.
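As a rough sketch of the parsing-and-labeling step: the HTML snippet and class names below are hypothetical stand-ins, not Metacritic’s actual markup, but the BeautifulSoup pattern is the same.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a scraped genre page;
# the tag/class names here are illustrative only.
html = """
<div class="movie">
  <span class="title">Example Heist</span>
  <span class="metascore">7.4</span>
</div>
<div class="movie">
  <span class="title">Example Western</span>
  <span class="metascore">3.1</span>
</div>
"""

def parse_movies(page_html):
    """Pull (title, score, label) tuples out of a genre page."""
    soup = BeautifulSoup(page_html, "html.parser")
    movies = []
    for div in soup.select("div.movie"):
        title = div.select_one("span.title").get_text(strip=True)
        score = float(div.select_one("span.metascore").get_text(strip=True))
        label = 1 if score > 5.0 else 0  # "Good" vs. bad, per the 5.0 cutoff
        movies.append((title, score, label))
    return movies

print(parse_movies(html))
```

The real scraper also had to handle pagination and batching by genre, which this sketch omits.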
I used 100,000 plot summaries and titles from Wikipedia that GitHub user markriedl (https://github.com/markriedl/WikiPlots) scraped. I found this dataset in the Data Is Plural archive (definitely subscribe to Jeremy Singer-Vine’s newsletter). Markriedl’s code is publicly available, so you can rerun the scraper if you would like. I used spaCy and NLTK (Natural Language Toolkit) to preprocess the data from Wikipedia: I stemmed, lemmatized, removed stop words, and removed punctuation (the basic natural language processing steps). spaCy has great out-of-the-box features that I was able to play around with: it can quickly tokenize, segment sentences, perform named entity recognition, and do part-of-speech tagging. I plan on diving deeper into spaCy in the future.
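A stripped-down, dependency-free sketch of that cleanup step (in the actual project this was done with spaCy and NLTK; the tiny stop-word list here is illustrative, not the real NLTK list):

```python
import re
import string

# Illustrative stand-in for NLTK's stop-word list; the real list is much longer.
STOP_WORDS = {"a", "an", "the", "and", "of", "in", "is", "to", "his", "her"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize, and drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The detective, in a panic, follows his suspect."))
# → ['detective', 'panic', 'follows', 'suspect']
```

Stemming and lemmatization are omitted here; with spaCy you get lemmas via `token.lemma_`, and with NLTK you would typically reach for `PorterStemmer` or `WordNetLemmatizer`.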
My analysis was built with the help of Patrick Harrison’s Modern NLP in Python lecture at PyData DC 2016. His Jupyter notebook can be found at https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb. Additionally, Bhargav Desikan’s talk on topic modeling with the NLP framework Gensim from PyData Berlin 2017 was helpful. His Jupyter notebook can be found at https://github.com/bhargavvader/personal/blob/master/notebooks/text_analysis_tutorial/topic_modelling.ipynb.
I used a Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer from scikit-learn to process the plot data. This approach multiplies how frequently a word appears in a document by the inverse of how frequently it appears across the corpus, so words that are distinctive to a given plot get boosted while common words get dampened. TF-IDF transformed the plots into numeric features.
At this point, I had a simple binary classification problem: the response variable was whether the movie was “Good” or not, and the features came from each movie’s plot. I trained a support vector machine, a random forest classifier, and XGBoost on the data. Overall, I ended up with really poor ROC-AUC scores.
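A minimal sketch of the classification-and-scoring loop, using a random forest and synthetic data in place of the real TF-IDF matrix and labels (with random features, the AUC hovers around 0.5, i.e., no better than chance):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the TF-IDF feature matrix and Good/bad labels.
rng = np.random.RandomState(0)
X = rng.rand(200, 50)
y = rng.randint(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Score with ROC-AUC: probability of the positive ("Good") class.
probs = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, probs)
print(round(auc, 3))
```

The SVM and XGBoost runs follow the same fit/predict_proba/score pattern, just with different estimators.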
The difference between a good crime movie and a bad crime movie may be more nuanced than my classification setup could pick up on. I think the similarity between good and bad movies, from a plot perspective, was too subtle for my first model to capture.
I will have a few more iterations of this model within the next week. I will update this blog in that time. Until then, have fun playing with text data and spaCy!