Movies Similarity by Rating

Wismy Seide
Web Mining [IS688, Spring 2021]
Mar 17, 2021

With the streaming wars now at the top of the entertainment marketplace due to Covid-19, we are always looking for the next great thing to watch. Recently, I found that my movie recommendations were coming from friends and associates. Friends who were all watching and discussing Malcolm and Marie on Netflix, a film about a relationship, recommended it to me. It is safe to say I did not like the movie. I am a fan of both the actor and the actress in it. John David Washington, the son of Denzel Washington, was in Tenet and the TV show Ballers, and I felt he did a great job in both of those projects. Zendaya stars in the HBO show Euphoria, which was great as well, and she is also in Spider-Man. Although they are great at their craft, I did not like the movie.

I would love to go back to a movie recommendation system based on similarity rather than word of mouth. If you liked The Conjuring, you could be recommended Annabelle, since the two are similar movies with similar ratings. For this article I will be using the MovieLens dataset, which can be found at: https://grouplens.org/datasets/movielens/latest/

First things first: I will import numpy, pandas, seaborn, and matplotlib. Matplotlib is a library for creating static, animated, and interactive visualizations in Python. Seaborn is a Python data visualization library built on top of matplotlib, and it provides a high-level interface for drawing informative statistical graphics. Numpy is a Python library for working with arrays, and I have used it in other courses as well as in previous assignments. Pandas is a data analysis and manipulation tool that is widely used for machine learning and data analysis.
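As a rough sketch of this loading step, the snippet below reads the two tables with pandas. The column names follow the MovieLens CSV layout, but the handful of rows are made up so the example runs anywhere. With the real download, the inline text would be replaced by `pd.read_csv("movies.csv")` and `pd.read_csv("ratings.csv")` (file paths assumed).

```python
import io

import pandas as pd

# Made-up rows in the MovieLens column layout; they stand in for the
# real movies.csv and ratings.csv files (paths are an assumption).
movies_csv = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller\n"
)
ratings_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,964982703\n"
    "1,296,5.0,964982967\n"
    "2,296,4.5,1445714835\n"
)

movies = pd.read_csv(movies_csv)
ratings = pd.read_csv(ratings_csv)

# Peek at the first rows of each table.
print(movies.head())
print(ratings.head())
```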

After loading the movies file, I can see the movie titles, the movieId, and the genres of each movie. Some movies fall into multiple genres. Case in point: Toy Story falls under adventure, animation, children, comedy, and fantasy. My next task is to merge the ratings and movies dataframes.

I was able to merge the dataframes after some difficulty and Googling. I merged them on the movieId column, which in effect brought together the ratings csv and the movies csv so that every rating carries its movie title.
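A minimal sketch of that merge, using tiny made-up frames in place of the real files. The variable names (`movies`, `ratings`, `df`) are my own choices for illustration.

```python
import pandas as pd

# Tiny stand-ins for the two MovieLens tables (values are made up).
movies = pd.DataFrame({
    "movieId": [1, 296],
    "title": ["Toy Story (1995)", "Pulp Fiction (1994)"],
})
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 296, 296],
    "rating": [4.0, 5.0, 4.5],
})

# Join on the shared movieId column so each rating row gains a title.
df = pd.merge(ratings, movies, on="movieId")
print(df)
```

The default is an inner join, so ratings whose movieId does not appear in the movies table would be dropped.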

My next task was the most important one. I must group the ratings by movie title and compute the average rating for each movie. The average rating is what will let me compare the movies with one another. I used the groupby function to accomplish this.
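The grouping could look roughly like this, assuming the merged frame has `title` and `rating` columns (the sample rows are made up).

```python
import pandas as pd

# Made-up merged data: one row per (user, movie) rating.
df = pd.DataFrame({
    "title": ["Pulp Fiction (1994)", "Pulp Fiction (1994)", "Toy Story (1995)"],
    "rating": [5.0, 4.5, 4.0],
})

# Average rating per movie title.
mean_rating = df.groupby("title")["rating"].mean()
print(mean_rating)
```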

Next, let us sort the movies by average rating in descending order to see which movies are the highest rated.
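A small sketch of the sort, again on made-up rows. "Obscure Film (2000)" is a hypothetical title with a single 5-star rating, included to show how such a movie ends up first.

```python
import pandas as pd

# Made-up ratings; "Obscure Film (2000)" is hypothetical and rated only once.
df = pd.DataFrame({
    "title": [
        "Pulp Fiction (1994)", "Pulp Fiction (1994)",
        "Toy Story (1995)", "Obscure Film (2000)",
    ],
    "rating": [5.0, 4.5, 4.0, 5.0],
})

mean_rating = df.groupby("title")["rating"].mean()

# Highest average first.
top_rated = mean_rating.sort_values(ascending=False)
print(top_rated.head())
```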

Now, I ran into a problem. A single user could rate a movie a 5, and that movie would appear at the top of the list. To fix this, I counted how many times each movie was rated. Classics like Forrest Gump and Pulp Fiction have both a high rating and a high number of users who rated them. Taking the count into account makes our data harder to skew with a handful of ratings.
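Counting the ratings per movie can be sketched the same way. The rows are made up, and "Obscure Film (2000)" is again a hypothetical title.

```python
import pandas as pd

# Made-up ratings: three for Forrest Gump, two for Pulp Fiction, one outlier.
df = pd.DataFrame({
    "title": [
        "Forrest Gump (1994)", "Forrest Gump (1994)", "Forrest Gump (1994)",
        "Pulp Fiction (1994)", "Pulp Fiction (1994)",
        "Obscure Film (2000)",
    ],
    "rating": [4.5, 4.0, 5.0, 5.0, 4.5, 5.0],
})

# Number of ratings each movie received, most-rated first.
num_ratings = df.groupby("title")["rating"].count().sort_values(ascending=False)
print(num_ratings)
```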

Since the number of ratings and the average rating are the driving factors for this data, I created a dataframe that holds the average rating and the number of ratings for each title.
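One way to build such a summary table in a single pass is sketched below. The column names `rating` and `num_ratings` are assumptions chosen for illustration, and the rows are made up.

```python
import pandas as pd

# Made-up merged ratings.
df = pd.DataFrame({
    "title": ["Pulp Fiction (1994)", "Pulp Fiction (1994)", "Toy Story (1995)"],
    "rating": [5.0, 4.5, 4.0],
})

# Compute mean and count together, then give the columns readable names.
summary = (
    df.groupby("title")["rating"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "rating", "count": "num_ratings"})
)
print(summary)
```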

Now I can create a histogram of the average ratings.
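A possible version of the histogram with matplotlib is sketched here. The averages are made up, and the bin count and axis labels are my own choices. In a Jupyter notebook the figure renders inline.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up per-movie average ratings (MovieLens ratings run from 0.5 to 5.0).
summary = pd.DataFrame(
    {"rating": [4.75, 4.0, 3.5, 3.6, 2.0]},
    index=["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
)

fig, ax = plt.subplots(figsize=(8, 4))
# Nine bins over 0.5..5.0 gives one bin per half-star step.
counts, bin_edges, _ = ax.hist(summary["rating"], bins=9, range=(0.5, 5.0))
ax.set_xlabel("Average rating")
ax.set_ylabel("Number of movies")
ax.set_title("Distribution of average movie ratings")
```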

Now to find similarities between movies. I chose the movie Pulp Fiction, which came out in 1994. It was the sophomore film of Quentin Tarantino, and I remember watching it when I was a teenager. I feel like this movie led the way for hundreds of unique films to get made.

Next, I looked for movies similar to Pulp Fiction by correlating their ratings with its ratings.

So, from the output you can see movies that have a high correlation with Pulp Fiction.
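The correlation step can be sketched as follows, assuming a user-by-movie table built with `pivot_table` and compared with `corrwith`. The ratings are made up. User 5 rated only Pulp Fiction, which leaves missing values in the other columns. Pandas leaves those missing pairs out when computing each correlation.

```python
import pandas as pd

P, R, T = "Pulp Fiction (1994)", "Reservoir Dogs (1992)", "Toy Story (1995)"

# Made-up ratings: users 1-4 rated all three movies, user 5 only one.
ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5],
    "title": [P, R, T] * 4 + [P],
    "rating": [5.0, 4.5, 1.0, 4.0, 4.0, 2.0, 2.0, 2.5, 4.0, 1.0, 1.0, 5.0, 3.0],
})

# Rows are users, columns are movies, cells are ratings (NaN if unrated).
user_movie = ratings.pivot_table(index="userId", columns="title", values="rating")

# Pearson correlation of every movie's rating column with Pulp Fiction's.
similar = (
    user_movie.corrwith(user_movie[P])
    .drop(P)                      # a movie always correlates with itself
    .dropna()
    .sort_values(ascending=False)
)
print(similar)
```

On real data it usually helps to keep only movies with a reasonable number of ratings before trusting these correlations, since a movie rated by two or three users can correlate perfectly by chance.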

Issues I ran into:

One of the big issues I had was that the MovieLens dataset was very large, about 500 megabytes. Although it did not take long to download, I saw the effects in how slowly each notebook cell ran. When I ran one of the cells, I kept getting an error that the unstacked DataFrame was too big. After researching the error, I learned that my Jupyter Notebook was running out of memory while running that cell. This was a big issue. I had to find a desktop computer that would not run out of memory and run the notebook there to get the desired results.
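This is not what I did, but one possible way to shrink the table before the memory-hungry pivot is to keep only movies with enough ratings. The sketch below assumes a hypothetical threshold `min_ratings` and uses made-up rows.

```python
import pandas as pd

# Made-up ratings: movie 296 has three ratings, movie 1 has two, movie 999 one.
ratings = pd.DataFrame({
    "userId": [1, 2, 3, 1, 2, 3],
    "movieId": [296, 296, 296, 1, 1, 999],
    "rating": [5.0, 4.5, 4.0, 4.0, 3.5, 5.0],
})

min_ratings = 2  # hypothetical cutoff; tune it to the available memory

# Attach each movie's rating count to its rows, then filter.
counts = ratings.groupby("movieId")["rating"].transform("count")
popular = ratings[counts >= min_ratings]

# The pivot now only has columns for the movies that passed the cutoff.
user_movie = popular.pivot_table(index="userId", columns="movieId", values="rating")
print(user_movie.shape)
```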

Another issue I had was how to merge the dataframes. I was provided two csv files and needed to merge the data from both: the movie names from one file and the ratings from the other. It took a while to research how to get this done, but I was able to accomplish it. Finally, some data was missing from the columns because not all users rated every movie. This was an issue as well. At first, I thought I should drop the rows that had these null values, but then I decided to keep them. Deciding how to structure the data is very important.

I had fun correlating movies by ratings. I wish I could have played with it more, but I had limited time to find another system that could handle such a big dataframe. To run datasets this big, I would probably need a supercomputer, because the system I had could not handle such a load. I would love to use this dataset again in the future because I am a big fan of movies.
