Does Book Success Translate into Movie Success?

Natasha Borders
@natashaborders
4 min read · Apr 20, 2019
Photo by Jaredd Craig on Unsplash

From the moment I learned about web scraping techniques at Metis and was told I had to use them in this project, I knew I had to reach for the stars. Web scraping had always seemed to me something akin to magic: the machine coming alive and behaving like a person in order to swiftly and confidently retrieve any information I desired from the densest of websites. My stars were quite literal: I decided to look at the ratings of books that had become movies and see if I could predict a movie's rating from the rating of its source book.

Retrieving the data was an interesting and challenging procedure, introducing me to the intricacies of HTML and the aforementioned wizardry of Beautiful Soup and Selenium. While Beautiful Soup worked great for the majority of my fellow Metis students, who were perusing table-heavy government websites and Pokémon stats, I was going after the constantly updated Goodreads and IMDb listings, the latter full of pop-up trailers and ads. Selenium proved more useful for me, and I really enjoyed being able to iterate through multiple pages of results and search for movie titles on the fly. I also investigated the Goodreads API, but found that it was less well suited to my project.
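To give a feel for what pulling ratings out of HTML involves, here is a minimal sketch. It is not the scraper used in this project: it relies on Python's standard-library `html.parser` rather than Beautiful Soup or Selenium, and the markup in `SAMPLE_HTML` (the `result`, `title`, and `rating` class names) is entirely hypothetical. Real Goodreads and IMDb pages use different tags that change over time.

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration only; real listing pages differ.
SAMPLE_HTML = """
<div class="result">
  <span class="title">Example Book</span>
  <span class="rating">4.12</span>
</div>
<div class="result">
  <span class="title">Another Book</span>
  <span class="rating">3.87</span>
</div>
"""


class RatingParser(HTMLParser):
    """Collect (title, rating) pairs from <span class="title|rating"> tags."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished (title, rating) pairs
        self._field = None  # which span we are currently inside, if any
        self._title = None  # most recent title seen

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("title", "rating"):
            self._field = css_class

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return
        if self._field == "title":
            self._title = text
        else:
            # A rating closes out the current (title, rating) pair.
            self.rows.append((self._title, float(text)))

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def parse_ratings(html):
    """Return a list of (title, rating) tuples found in the HTML string."""
    parser = RatingParser()
    parser.feed(html)
    return parser.rows


ratings = parse_ratings(SAMPLE_HTML)
```

With a browser-driving tool like Selenium, the same idea applies to each page of results after it has loaded; only the way the HTML is obtained changes.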

Once the data was collected, cleaned, and transformed, I ended up with about 700 books that had been made into movies. The results lag by about a year, because I also had to supplement my scraped data with a readily available Kaggle dataset featuring information from MovieLens.

The initial results were somewhat surprising. Here are the top five rated books and movies:

Top 5 highest-rated books
Top 5 highest-rated movies

Digging further into the data, I noticed a curious trend. Top movies are produced from books ranging from adequate to great, whereas the best books produce good-enough movies, but not necessarily the best ones. It seems that some less-liked books were made into good movies that gained traction as cult classics while the book itself was forgotten. When the book is a hit, as with the Harry Potter series or The Lord of the Rings, the movies are well received but don’t seem to stand as well on their own.

After performing linear and ridge regression analysis, I concluded that the book rating alone explains only about 7% of the variation in movie ratings. With the book rating combined with the rest of the features, the regression models explain only about 25% of the variation in movie ratings (my adjusted R-squared for the ridge regression was only ~0.25). Clearly, there is much more to how highly a movie is rated than the few features I examined in my model.
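For readers unfamiliar with these numbers, the sketch below shows how R-squared and adjusted R-squared are computed. It is an illustration under stated assumptions, not the model from this project: it fits a plain one-feature least-squares line (not ridge regression), and the ratings in `book_ratings` and `movie_ratings` are made up.

```python
def r_squared(x, y):
    """R-squared of a one-feature least-squares fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Closed-form slope and intercept for simple linear regression.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Share of the variance in y that the fitted line accounts for.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R-squared for the number of features used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Made-up ratings for illustration only (not the project's data).
book_ratings = [3.5, 3.8, 4.0, 4.2, 4.5]
movie_ratings = [6.0, 7.5, 6.2, 7.8, 6.9]

r2 = r_squared(book_ratings, movie_ratings)
adj_r2 = adjusted_r_squared(r2, n_samples=len(book_ratings), n_features=1)
```

An R-squared of 0.07 means the fitted line accounts for 7% of the spread in movie ratings; the remaining 93% is left unexplained by the book rating.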

To challenge myself a little, and to test a hunch I got while looking at the data, I ran a k-means cluster analysis on the feature data to see if anything interesting might emerge. The model split the data into two clusters, which I dubbed “The Best” and “The Rest”. The difference between the two clusters was not highly pronounced, but I was slightly happier with these results than with those from the regression analysis.
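As a rough picture of what k-means does with two clusters, here is a minimal pure-Python sketch. The (book rating, movie rating) pairs are made up, the features are an assumption (the project used more of them), and the starting centroids are hand-picked for simplicity; library implementations such as scikit-learn's `KMeans` choose starting points more carefully.

```python
import math


def kmeans(points, centroids, n_iter=20):
    """Alternate between assigning points and updating centroids."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                mean = tuple(sum(c) / len(members) for c in zip(*members))
                new_centroids.append(mean)
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[k])
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return labels, centroids


# Made-up (book rating, movie rating) pairs for illustration only.
points = [(4.4, 8.1), (4.3, 7.9), (4.5, 8.4), (3.6, 6.0), (3.8, 6.3), (3.5, 5.8)]

# Hand-picked starting centroids: one point from each apparent group.
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
```

Here label 0 would play the role of “The Best” and label 1 “The Rest”. On real data the groups overlap far more than in this toy example.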

In conclusion, information about the book alone was not quite enough to successfully predict the rating of a movie based on that book. Some interesting extensions I would love to look into are the variations between book editions and movie remakes, as well as bringing monetary considerations into the picture to examine how a movie's profitability relates to the sales of its source book. This project definitely indulged my love of books and served as a perfect platform to study linear regression in greater detail.

Overall this was an amazing learning experience, and I am looking forward to applying everything that I learned to the next project at Metis. You can find the materials for this project over at my GitHub repo.

Natasha Borders

Data Scientist | Analyst seeking roles in the San Francisco Bay area. LinkedIn: in/natashaborders/ | GitHub: natashaborders | Me: natashaborders.com