#WakandaForever = $$$$$
Analysis of reddit interactions to predict opening box office revenue
In 2017, a Fortune 50 company gave me the opportunity to be an intern in a very selective program. So, I packed my bags and moved to a brand new city for a couple of months to take on this exciting new opportunity.
Working for such a big company actually wasn’t all it was cracked up to be. During the 4 months there, I’m pretty sure my manager only got my name right once. This reframed how I thought about working at these “luxury” tech companies that promise you the world.
Let’s get to the nitty-gritty. My main task during this internship was to predict the opening weekend box office revenue of newly released films in the industry from all sorts of studios. Intimidation was a mere blip on the radar of emotions I felt having to take this project on with almost no guidance. The data source given to predict this number, you ask? Social media.
In 2014, Twitter stated that about 500 million tweets are sent on the platform per day, and that number has surely surged since then. It has also been stated that 83% of the US population between the ages of 18 and 39 use Twitter. That means that, according to the chart below, there is a very high chance that the 44% of frequent moviegoers in the US are on Twitter and tweeting about upcoming films.
This data, AKA your tweets, is then sold by Twitter to corporations (hint hint) that want to analyze them for their own financial gain. I built a model and program that predicted the opening weekend box office revenue of 8 different movies over the course of 2 months with a <15% error margin. When you’re talking millions of dollars, even if the model was off by a little, seeing that number was invigorating.
This was such an exciting project, and I was very proud of my work, but unfortunately I was forced to turn over all of my code and was not allowed to use it outside of the company. So, I’ve decided to re-create it and document it this time. Since I don’t have stacks of cash to throw at Twitter for their tweets, and they have super restrictive APIs, I’m taking another route.
Reddit is another popular data source with an immersive community. In July 2018, the site had over 1.6 billion unique visitors. With specific niche communities, gathering data on specific upcoming movies should not be a challenge. With this being the main data source, I plan to narrow down the data used in this research to the subreddit r/movies. As a default subreddit, it appears on everyone’s reddit front page, which ensures maximum visibility of the posts.
As of today, 10/31/2018, there are 18.6 million subscribers to this subreddit. That serves as a great sample size packed with diverse opinions that will increase the overall accuracy of this study.
THE SCRAPING BEGINS
The first step is to actually collect the data. By instinct, I open up my R IDE and plan to use a web scraper to gather all of the data. But upon more research, a developer by the name of Ivan Rivera comes to the rescue. He developed a tool called RedditExtractoR that searches a specific subreddit using a provided search term and pulls in loads of data on the posts returned by the search query.
An upcoming movie that I’ve decided to base the first prediction on is Widows. I’ve recently gone through an HTGAWM binge-fest and Viola Davis is intoxicating. I ran the tool using “Widows” as the search term and sorted the results by number of comments, so that the posts with the most comments come first and I gather the most data. Here is a table of the output.
As you can see, this movie title is a little too broad for the search engine to pull back anything even relating to the movie in the first 10 results. Luckily, I am watching TV as I write this, and a trailer for the Queen-based film Bohemian Rhapsody has just finished playing. I pop that into the tool and the results do not disappoint.
With a return of 7,650 comments, and most of the posts dating from before the movie’s release, I can successfully capture the “hype” of the users to predict the opening weekend revenue. The same tool has another function that allowed me to dig a little deeper into all of the posts and pull the individual comments.
I actually ran into an issue on Mac where the tool wouldn’t pull the individual post information. The tool uses readLines() which is known to have issues accessing web content on some systems. In turn, I got my hands on a Windows VM to pull the data and exported it to my Mac.
Once I got the dataset into RStudio, I decided to make a new data frame summarizing the information pulled from reddit. I organized it by post, with the number of comments on each post, the number of upvotes on the post, the average number of upvotes for the comments, and the total number of upvotes for the comments on each post. R then allows you to export the data set into an Excel spreadsheet, and we can load it into data visualization software from there for analysis. The code used to make this data table and the finished spreadsheet are shown below.
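To make the shape of that summary table concrete, here is a minimal sketch of the same grouping step. It is written in Python purely for illustration (the original work was done in R), and the field names `post_title`, `post_score`, and `comment_score` are my own assumptions, not the column names RedditExtractoR returns. The sample rows are made up.

```python
from collections import defaultdict


def summarize_posts(comments):
    """Collapse one-row-per-comment data into one summary row per post."""
    comment_scores = defaultdict(list)
    post_scores = {}
    for row in comments:
        comment_scores[row["post_title"]].append(row["comment_score"])
        post_scores[row["post_title"]] = row["post_score"]

    summary = []
    for title, scores in comment_scores.items():
        total = sum(scores)
        summary.append({
            "post_title": title,
            "num_comments": len(scores),
            "post_score": post_scores[title],
            "avg_comment_score": total / len(scores),
            "total_comment_score": total,
        })
    return summary


# Made-up rows, only to show the input and output shapes.
comments = [
    {"post_title": "Trailer thread", "post_score": 120, "comment_score": 10},
    {"post_title": "Trailer thread", "post_score": 120, "comment_score": 30},
    {"post_title": "Poster thread", "post_score": 45, "comment_score": 5},
]
summary = summarize_posts(comments)
for row in summary:
    print(row)
```

Each output row carries the four measures described above: comment count, post upvotes, average comment upvotes, and total comment upvotes.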
Now that we have this data, what do we do next? In order to compare this movie with other movies, I decided to create a weighted score based on this data. First by post, we will weight the number of comments (0.4), the post score (0.3), the average comment score (0.2), and the total comment score (0.1).
The number of comments is weighted the highest since this is the biggest indicator of how many people felt strongly enough to comment on a post on reddit. The post score is the second highest weighted factor because it shows the amount of engagement with the post. First, we will create this weighted score for each post, then we will average this number to get our overall movie score.
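The weighting is easy to sketch. The weights below come straight from the description above; the dictionary keys and the two example posts are made up for illustration, and the code is Python rather than the R used for the actual project.

```python
# Weights as described in the text; the key names are assumptions.
WEIGHTS = {
    "num_comments": 0.4,
    "post_score": 0.3,
    "avg_comment_score": 0.2,
    "total_comment_score": 0.1,
}


def post_weighted_score(post):
    """Weighted sum of the four per-post measures."""
    return sum(weight * post[field] for field, weight in WEIGHTS.items())


def movie_score(posts):
    """Average of the per-post weighted scores."""
    if not posts:
        raise ValueError("need at least one post to score a movie")
    scores = [post_weighted_score(post) for post in posts]
    return sum(scores) / len(scores)


# Two made-up posts, only to show the arithmetic.
posts = [
    {"num_comments": 100, "post_score": 200,
     "avg_comment_score": 10, "total_comment_score": 1000},
    {"num_comments": 50, "post_score": 100,
     "avg_comment_score": 20, "total_comment_score": 1000},
]
print(movie_score(posts))
```

For these two posts the per-post scores work out to roughly 202 and 154, so the movie score is roughly 178.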
I come to the conclusion that Bohemian Rhapsody has a “movie score” of 2,388pts. What do we do with this value? We now have to do the same process with (just ball-parking) 50 movies from the past year to get their movie scores so we can compare. Instead of plugging in each individual movie title into the original R code, I decide to automate this process.
REPEATING THE PROCESS
First, I create a CSV with 50 movie titles. Pretty simple, right? Grabbing the top 50 grossing movies from this year should suffice. My plan is to have R import the CSV, parse for the movie title, and grab the reddit data. This process will run in the R console for ease since it will take a bit of time, and I don’t want to overload RStudio with all of this work.
After editing my R Markdown code and converting it into an R script, we can now gather all of the data for the movies on the list of 50. The cursed Mac issue once again forces me to run this on a VM, so the process took 2x the originally planned time 🙄.
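The batch step boils down to a loop over the title list. Here is a hedged Python sketch of that loop (again, the real script was R). The `title` column name, `load_titles`, `score_all`, and the stand-in `fake_fetch_score` are all hypothetical; in the real pipeline the fetch step would search reddit and compute the movie score. The only score used here is the 2,388 figure for Bohemian Rhapsody from earlier in the post.

```python
import csv
import io


def load_titles(csv_file):
    """Read movie titles from a CSV that has a 'title' header column."""
    reader = csv.DictReader(csv_file)
    return [row["title"].strip() for row in reader if row["title"].strip()]


def score_all(titles, fetch_score):
    """Score every title, skipping any title whose lookup fails."""
    results = {}
    for title in titles:
        try:
            results[title] = fetch_score(title)
        except LookupError as err:
            print(f"skipping {title!r}: {err}")
    return results


def fake_fetch_score(title):
    """Stand-in for the real reddit search plus scoring step."""
    known = {"Bohemian Rhapsody": 2388.0}
    if title not in known:
        raise LookupError("no usable posts found")
    return known[title]


titles = load_titles(io.StringIO("title\nBohemian Rhapsody\nWidows\n"))
scores = score_all(titles, fake_fetch_score)
print(scores)
```

Skipping failed titles instead of stopping matters for a run that takes hours: one overly broad title like Widows should not throw away the rest of the batch.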
After a few hours, I was able to compile a list of movies with their associated movie scores in an Excel document. But, now that I have these movie scores, what do I do with them? I can already infer which movies have the highest score based on their popularity. So, what actual use does the movie score give us?
It gives us a measure to use when completing our final analysis. The next step to take is to continue building the model for these movies. As a reminder, the question we are trying to answer is: Can we predict a movie’s opening weekend revenue using social media data? Therefore, the next measure we need to build our model is opening weekend revenue data.
To grab this data, I am also choosing to use R to scrape the data from a website that tracks movie revenues. The website I chose to grab this data from is BoxOfficeMojo. The layout of the website is fairly simple and doesn’t use advanced HTML, which should make scraping very seamless. I plan to have the R script search the site for the movie title, sort the search by the gross revenue, and go to the first search result. Then, once on the movie page, I will grab the values that I am looking for.
A useful tool I use to find which nodes to grab is a Chrome extension called SelectorGadget. When you select text on the webpage, this extension will tell you which node the text is stored in, which will help during scraping.
The search functionality of this website isn’t perfect. When creating a table with the final opening weekend and total revenues for each movie, I have the script grabbing the movie title from the same page as the revenues, even if it does not match the original search term. This is just to keep everything honest and to keep the data as accurate as possible.
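Two small details from this step are worth sketching: turning the scraped dollar text into a number, and storing the title as it appears on the movie page rather than the search term. This is an illustrative Python sketch, not the R scraping code, and it deliberately leaves out the page fetching and node selection. The function names and row fields are assumptions; the revenue figure is the Bohemian Rhapsody opening weekend quoted later in this post.

```python
def parse_dollars(text):
    """Convert a string like '$51,061,119' into an integer dollar amount."""
    digits = text.replace("$", "").replace(",", "").strip()
    return int(digits)


def build_row(search_term, page_title, opening_text):
    """Build one table row, trusting the page's title over the search term."""
    return {
        "search_term": search_term,
        "title": page_title.strip(),
        "opening_weekend": parse_dollars(opening_text),
        # Flag rows where the site returned a different movie than requested.
        "title_matches_search":
            page_title.strip().lower() == search_term.strip().lower(),
    }


row = build_row("bohemian rhapsody", "Bohemian Rhapsody", "$51,061,119")
print(row)
```

Keeping a match flag next to the page title makes it easy to review the rows where the site's search landed on the wrong movie.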
Once the script finished, I took a look at our final table and it looked great! Now for the fun part.
Now that we have all of our data, we must somehow visualize it. In order to create actual evidence that there is a correlation between social media buzz and a movie’s opening weekend, I decide to plot all of the points in Tableau, and create a trend line.
We have an issue first… Since we created two separate CSVs, we will have to join them. Luckily, Tableau has great data management and on the import of both files, you can join them on movie title.
This gives us one huge table with all of the data we need: the movie (duh!), the movie score generated from social media, and the movie’s opening weekend revenue. These figures will allow us to create a model in Tableau to help us predict future movie revenues. If we compare the movie’s opening weekend revenue to the movie score generated earlier, we can create a regression line.
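For anyone without Tableau, the same join can be sketched in a few lines of Python. This assumes an inner join (movies missing from either file are dropped) and case-insensitive title matching, neither of which is confirmed by the original setup. The column names are assumptions and the second movie row is a made-up placeholder.

```python
import csv
import io


def join_on_title(scores_file, revenue_file):
    """Inner-join two CSVs on movie title, ignoring case and outer spaces."""
    revenue_by_title = {
        row["title"].strip().lower(): row
        for row in csv.DictReader(revenue_file)
    }
    joined = []
    for row in csv.DictReader(scores_file):
        match = revenue_by_title.get(row["title"].strip().lower())
        if match is None:
            continue  # no revenue row for this title, so drop it
        joined.append({
            "title": row["title"].strip(),
            "movie_score": float(row["movie_score"]),
            "opening_weekend": int(match["opening_weekend"]),
        })
    return joined


# 'Placeholder Movie' is made up to show an unmatched row being dropped.
scores_csv = "title,movie_score\nBohemian Rhapsody,2388\nPlaceholder Movie,100\n"
revenue_csv = "title,opening_weekend\nbohemian rhapsody,51061119\n"
table = join_on_title(io.StringIO(scores_csv), io.StringIO(revenue_csv))
print(table)
```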
This line tells us that for each one-point increase in the movie score, the model predicts about $9,000 more in opening weekend revenue. Other important values to notice on this model are the R-squared value and the p-value. The R-squared value shows that the movie score explains about 43% of the variation in opening weekend revenue across these movies. If we included more movies in our data, we might be able to create a better model. The p-value is < 0.05, which is strong evidence against the idea that movie score and opening weekend revenue are unrelated, and so supports our hypothesis.
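To show what the trend line and R-squared actually compute, here is a small ordinary least squares sketch in Python. It is not what Tableau runs internally, just the textbook formulas, and the four data points are toy numbers invented for the example, not movie data.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variation in y that the line accounts for.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Toy numbers, not movie data.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
slope, intercept, r_squared = fit_line(xs, ys)
print(f"slope={slope:.2f} intercept={intercept:.2f} r_squared={r_squared:.3f}")
```

In the movie model, `xs` would be the movie scores and `ys` the opening weekend revenues, so the slope plays the role of the roughly $9,000 per point described above.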
If we wanted to take this a step further, as I did the first time I created this project, we could include the number of comments or posts in the big table for each movie. This would allow us to create another regression line to see how much individual revenue each comment or post could contribute to the movie’s opening revenue.
To follow back up with our original prediction, we calculated that Bohemian Rhapsody had a movie score of 2,388. If we plug that into our model, we get a predicted opening weekend of $43,203,259. The actual opening weekend revenue for Bohemian Rhapsody was $51,061,119. That gives us about a 15% margin of error, which isn’t too bad for the model we’ve created.
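The error figure is just the gap between prediction and actual, divided by the actual. A quick Python check using the two dollar amounts above (the function name is my own):

```python
def percent_error(predicted, actual):
    """Absolute prediction error as a percentage of the actual value."""
    return abs(predicted - actual) / actual * 100


predicted = 43_203_259  # model output for a movie score of 2,388
actual = 51_061_119     # reported opening weekend revenue
error = percent_error(predicted, actual)
print(f"{error:.1f}%")  # prints 15.4%
```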
In conclusion, social media giants and other corporations have the ability to take a simple tweet you posted about #WakandaForever, or how Ryan Reynolds as the voice of Pikachu is an awful idea, and turn it into money.
All of the code used to create this model is hosted on GitHub under the GNU GPL.
Edit (12/11/18): Adding predictions for Bohemian Rhapsody based on model and actuals.