IMDB Recommendation Analysis
Introduction
We all agree movies allow us to escape, and there’s value in that, but it’s more than simple escapism. In 2017, more than three quarters (76%) of the U.S./Canada population went to the cinema at least once. That adds up to a whopping total of 263 million people. Additionally, adults tend to watch about 20 to 30 films a year on average. Streaming services such as Netflix, HBO, Amazon Prime, and Hulu add another dimension: Americans are watching more movies than ever. A CNBC All-American Economic Survey poll of 801 Americans around the country shows that 57 percent of the public has some form of streaming service. This begs the question: can we predict if a movie will be a smash hit?
What’s in a Movie?
Before any data was scraped, I had more than a few ideas about what made a movie truly great. First of all, the movie should have a great cast. More often than not, I will watch a movie for a specific actor or actress regardless of the plot.
Web scraping
With this project, I had some decisions to make regarding which movie ranking platform I wanted to use. I chose IMDB because its information was laid out in a way that let me get a lot of data from each page. There were 10,000 movies to scrape data from, which happened to be quite a bit more than any other site I looked at.
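The scraping itself boils down to fetching each movie page and pulling fields out of the HTML. Below is a minimal sketch of the parsing step using BeautifulSoup; the HTML snippet and the class names (`year`, `rating`) are invented stand-ins, since IMDB's real page structure is different and changes over time.

```python
# Sketch of parsing one IMDB-style title block. The markup below is a
# hypothetical stand-in; the real IMDB page layout differs.
from bs4 import BeautifulSoup

def parse_title_block(html: str) -> dict:
    """Extract title, year, and rating from one movie's HTML snippet."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.select_one("h3 a").get_text(strip=True),
        "year": soup.select_one(".year").get_text(strip=True),
        "rating": float(soup.select_one(".rating").get_text(strip=True)),
    }

sample = """
<div><h3><a href="/title/tt0111161/">The Shawshank Redemption</a></h3>
<span class="year">1994</span><span class="rating">9.3</span></div>
"""
print(parse_title_block(sample))
```

Looping a function like this over 10,000 title pages (with polite rate limiting) yields the raw data frame that the rest of the analysis works from.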
Hypothesis
I could have gone in a thousand different directions with this project. I did a little digging, saw some other projects people had created, and decided to incorporate sentiment analysis using Natural Language Processing. Most of the time, people were trying to create a score predictor based on a variety of features. I wanted to go in a different direction: I wanted to use NLP like the other projects I saw, but this time to see if there was any relationship between the words in a plot and a movie succeeding. My thinking was that different words were used for a rated R movie versus a rated G movie. I also hypothesized that rated R movies would be the most popular and highly rated category, because many of the movies widely regarded as the best films of all time seem to be rated R. There must be a shift in the language of the plot line across different ratings. Additionally, I thought that Drama or Action would be at the top of the list of most popular genres.
Exploratory Data Analysis
This is where things get fun. I started to explore the data in an attempt to piece together a story. EDA is the most important part of any data science project: it is where I can start to prove and/or disprove parts of my initial thoughts.
Issues
There are always issues when you load in a data set, especially one scraped from a website.
The info function made me aware of the cracks in the foundation of the data. What data am I missing? What are the data types of the information in my data frame? This is the blueprint of the data frame, so I can make some decisions on which columns to drop completely. Ultimately, I decided to trim the data frame and fill in missing data where needed. If I tried to add back information for all the missing values, I would have to fill in the most frequent string, integer, or float, and that risks swaying the data set too far towards the most frequently occurring value.

The Error column was an easy decision to remove. However, if I was ever going to test my hypothesis, I needed to fill in around 1,000 missing values in the rated column. I also needed to binarize the columns containing text so I could eventually build a model. There also needed to be a target against which to measure the correlation of the features in the data set. There were three options to choose from: the IMDB rating, the Metascore, and the total number of IMDB votes. I chose to drop Metascore because there were only 4,545 values present in the data frame; that was far too many missing data points to fill back in, in my opinion. Next, I needed to drop either IMDB votes or the rating. I decided on dropping the votes: all registered members of IMDB can cast votes for any movie, and IMDB takes those individual votes and uses them to calculate a single rating. So I kept the rating, which the votes already feed into.

Additionally, a vast number of the genres were grouped together. For example, a movie could have a genre of Action, Sci-Fi, Thriller. I made the decision to keep the first word and drop the rest, to keep things as simple as possible when constructing the model.
First Impressions
Looking at the IMDB rating, I could see that the top movie was “Till We Meet Again” by Bank Tangjaitrong, while the lowest was “Proud American” by Fred Ashman. It was great to see right off the bat which movies sat at the top and bottom. I could also see that the director involved in the most movies was Woody Allen, beating Clint Eastwood by 10 movies. Another finding was the distribution of running times: most movies clustered around 100 minutes long, with the rest above the third quartile at 150 minutes.
In addition I could also see a breakdown in the MPAA film rating.
Most of the movies, as I originally hypothesized, were rated R. I will say, however, that this is not a true count of how many rated R movies there were, since I had to fill in the missing values in the rated column. I chose the most frequently occurring rating to take the place of the NaNs. Keeping this in mind, even if we took those thousand imputed values away from the total count of rated R movies, the next highest rating, PG-13, would still trail R by a thousand titles. The further back we go, the more prominence R-rated movies had on the box office charts. Box Office Mojo’s earliest accounting shows a rating-best average of $52 million a year (adjusted), but that was before the existence of the PG-13 rating and before movie theaters cracked down on selling R-rated movie tickets to kids. Still, through the ’80s and ’90s, the average was mainly in the $30–$40 million range. Interestingly, I could also see that most of the movies were released within the last 18 years. This could be because of how IMDB constructs this specific list and how they continually add more recent films. There was also a fair amount of unfinished film projects in the data frame that I had to remove because of a lack of data.
As the previous graph showed, the vast majority of movies were R-rated. However, R sits near the bottom of the list with a low average rating of 6.3. Despite having more movies that could have pulled down the average, it is still above PG-13’s average score.
Natural Language Processing
This, to me, is one of the more interesting parts of the project. I chose to use TF-IDF, which is short for Term Frequency–Inverse Document Frequency. I wanted to use this tool to visualize which terms were showing up most frequently. TF-IDF boosts terms that are highly specific to a particular document while suppressing terms that are common to most documents. This helped me find the frequency of actors, and of specific words in the plots.
Making Models
Now is the time to find the stand-out features so I can report to Netflix what kind of movie they should focus on. Using regression methods, I could see the relationship between some of these key features and the IMDB rating.
Linear Regression
I chose linear regression for this part because it is a simple and efficient way to analyze the relationship between the features and IMDB ratings. After fitting and running the model, I could start to see how important particular features were and how much pull particular data points had on the IMDB rating. The output gave me a list of coefficients, which in turn gave me an idea of which data points to pay attention to. Using the y-intercept as a baseline figure, I could see how much sway a specific coefficient had.
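The intercept-plus-coefficients reading works like this: the intercept is the predicted rating when every binary feature is 0, and each coefficient is the points that feature adds or subtracts. A minimal sketch on synthetic data (the feature names and effect sizes in the comments are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# 200 movies, 3 binary features (e.g. is_R, is_drama, mentions_new_york --
# hypothetical names, not the project's actual columns).
X = rng.integers(0, 2, size=(200, 3)).astype(float)
true_coef = np.array([-0.4, 0.5, 2.07])

# Simulated ratings: baseline 6.25 plus feature effects plus a little noise.
y = 6.25 + X @ true_coef + rng.normal(0, 0.1, 200)

model = LinearRegression().fit(X, y)
print("baseline (intercept):", round(model.intercept_, 2))
print("coefficients:", model.coef_.round(2))
```

The fit recovers the baseline near 6.25 and each feature's pull near its true effect, which is exactly how the coefficients in the sections below are interpreted.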
Plotline Key Words
Using the top 100 words from my TF-IDF extraction, I saw that the term with the most pull was “New York”. Every instance of “New York” added 2.07 rating points to a baseline rating of 6.25. It is not too surprising that the city that never sleeps was a popular keyword in the data set: over 300 movies were shot in New York between 2015 and 2016, and a lot of our favorite movies are set there; Taxi Driver, Breakfast at Tiffany’s, When Harry Met Sally, and Gangs of New York, to name but a few. Oddly enough, the word that had the biggest negative pull on the rating of a movie was “York”, which dragged the baseline score down by 1.9. This could be due to an error in how the words were parsed, and possibly in how they were tokenized and given a frequency.
MPAA Rating
The ratings were a bit of a mess when looking over the importance of the coefficients. In hindsight, I should have removed the movies with a TV rating because, simply put, they are not movies. Additionally, I would have removed the “APPROVED”, “NOT RATED”, and “PASSED” ratings because they are not part of the traditional MPAA rating list.
I know now that I should have cleaned the list first and run a whole new regression model on the ratings to get a better output. Nevertheless, the highest rating on this list was “UNRATED”. This was surprising to me because it goes against my original hypothesis: I would have figured the highest pull on the rating of a movie would come from an “R” rating. Keeping in mind that another test after cleaning would show a different output, this run shows that “R” movies have a downward pull on the rating of a movie. The biggest downward pull, however, comes from the “X” rating.
Genres
This was probably the most surprising/intriguing finding to me. Again, my prior knowledge of movies failed me on what I thought the most important feature would be. “Music” ended up being the most important genre, with a coefficient of 2.14. The y-intercept for genres was a baseline of 6.40, so Music bumped the score up to a respectable 8.54. The surprise doesn’t stop there: the lowest coefficient for genre was, incredibly, “Action”. This blew me away; in line with my original hypothesis, I had figured that Thriller or Horror would pull a movie’s rating down the most.
Actors
The linear regression output for actors was another interesting discovery. The actor with the most pull was Woody Allen. The baseline for the actors was 6.33, so Woody brought up the score by a measly 0.30. There were quite a few actors with a score of zero and no negative numbers, so I can’t say unequivocally that any one actor has a greater negative pull on the rating of a movie than another.
Random Forest Regression
I decided that linear regression wasn’t the only model I wanted to use for this project; I wanted a more diverse method. The benefit of using Random Forest here is that it creates a multitude of decision trees, makes a variety of decisions regarding the features, and averages out the outputs from those trees. I also used a bagging method to resample the data and create multiple different models from a single training data set. The outputs for the Random Forest feature importances were a bit different from my linear regression output: each importance is a number between 0 and 1, accumulated from how much a feature improves the splits across the trees. However, because Random Forest has no baseline (y-intercept), I chose not to include these results. They were helpful for ranking feature importance, but because Random Forest is a non-parametric regression method, the number between 0 and 1 doesn’t translate into rating points the way a coefficient does.
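For completeness, here is what that looks like with scikit-learn's `RandomForestRegressor` on synthetic data (bagging via bootstrap resampling is the default). The data is invented; the point is only that importances are relative weights summing to 1, with no intercept to anchor them to the rating scale.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# 300 samples, 3 features; feature 0 matters most, feature 2 is pure noise.
X = rng.random((300, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 300)

# bootstrap=True by default, i.e. each tree is fit on a bagged resample.
forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)

# One value per feature, between 0 and 1, summing to 1 -- a ranking,
# not a "points added to the rating" figure like a linear coefficient.
print(forest.feature_importances_.round(3))
```

This is why the importances were useful for ranking features but couldn't be reported alongside the baseline-plus-coefficient numbers from the linear model.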
Conclusion
What makes a movie a smash hit, and can we predict one? Using data cleaning, EDA, and regression models, I have found that indeed we can. If there were an upcoming unrated musical set in New York and starring Woody Allen, my data suggests that a streaming service such as Netflix should invest heavily in getting that movie on the front page for its users. In conclusion, my original hypothesis was ultimately proven incorrect, and Netflix should be paying close attention to a very strange movie that could be coming out in the not-so-distant future.