Movie-Recommender System (Netflix/YouTube)
A Data Science Project that you should include in your Resume
A few days ago, while working on a data science project, I learnt about recommender systems. In this article, I’ll describe how I built this full-stack project and share the source code at the end. Let’s briefly go over the project’s overall logic.
Language Used: Python
Data from Kaggle: TMDb dataset
Framework used: Streamlit
Software Requirement: Jupyter Notebook and PyCharm
Hosted with: Heroku App
What is a Recommender System?
- A recommendation system is a subclass of information filtering systems that attempts to predict a user’s preference for an item.
In layman’s terms, it is an algorithm that recommends relevant items to users: for example, which movies to watch on Netflix, which products to buy on an e-commerce site, which books to read on Kindle, and so on.
Types of Recommender System:
There are 3 types of recommender systems: i) content-based, ii) collaborative and iii) hybrid.
Content-based filtering: This type of system recommends items whose attributes resemble those of items the user has previously searched for or liked. Here, “content” refers to the attributes or tags of those items.
Collaborative filtering: This type recommends new items to a user based on the interests and preferences of other, similar users.
Hybrid recommender system: It combines content-based and collaborative filtering.
Use Cases of Recommender Systems:
There are many use cases. Two of them are:
A. Personalized content: It improves the on-site experience by making dynamic recommendations for different types of audiences, similar to how Netflix does.
B. Better product search experience: It helps categorize products by their features, such as material or season.
Aim of the project:
This project is a content-based recommendation system.
I will also work on a collaborative filtering and hybrid project later. So, follow me on Twitter and Instagram to remain up to date and learn more. We can also connect on LinkedIn.
Full Project Link:
About the project:
Before we begin the project in our notebook, we must first collect the data from Kaggle.
Using the link above, download both CSV files and merge them into one data frame.
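The merge step can be sketched with pandas as below. This is only an illustration: the two small frames stand in for the downloaded CSV files, and the join key "title" and the column names are assumptions, so check them against your own download.

```python
import pandas as pd

# Tiny stand-ins for the two Kaggle CSV files (values are made up).
# With the real files you would load each one with pd.read_csv(<path>) instead.
movies = pd.DataFrame({
    "id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "overview": ["A space adventure.", "A courtroom drama."],
})
credits = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "cast": ["Actor One", "Actor Two"],
    "crew": ["Director One", "Director Two"],
})

# Join the two tables on the shared "title" column (an assumed key).
merged = movies.merge(credits, on="title")
print(merged.columns.tolist())
```

Merging on a shared column keeps one row per movie and places the columns of both files side by side.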
The merged data frame contains the following columns:
{ movie id, cast, crew, budget, genres, homepage, id, keywords, original language, original title, overview, popularity, production companies, production countries, release date, revenue, runtime, spoken languages, status, tagline, title, vote average and vote count. }
To make a content-based recommender system, we need only the columns specific to the content of the movies, like:
{ movie id, title, overview, genres, keywords, cast, crew }
and ignore all the remaining columns in the data frame.
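Keeping only the content columns could look like the sketch below. The one-row frame is a made-up stand-in for the merged data, and the underscore in "movie_id" is an assumed spelling of the column name.

```python
import pandas as pd

# A one-row stand-in for the merged data frame (values are made up).
merged = pd.DataFrame({
    "movie_id": [1],
    "title": ["Movie A"],
    "overview": ["A space adventure."],
    "genres": ["Adventure"],
    "keywords": ["space"],
    "cast": ["Actor One"],
    "crew": ["Director One"],
    "budget": [1000000],
    "runtime": [120.0],
})

# Columns that describe the content of a movie; everything else is dropped.
content_columns = ["movie_id", "title", "overview", "genres", "keywords", "cast", "crew"]
movies = merged[content_columns].copy()
print(movies.columns.tolist())
```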
Why do we only consider these columns in a content-based system?
- Because the title, overview, genres and keywords say the most about a movie’s content. People tend to prefer particular kinds of content, and these columns, genres for example, capture that.
- What about the cast and crew? Many people have a favourite lead actor or director, which is why we also use these two columns when building a content-based recommender system.
A collaborative filtering system works differently: it relies on how individual users rated or interacted with items, which is not among the columns listed above. Aggregate columns like:
{ vote_average, vote_count, popularity }
are better suited to a simple popularity-based ranking.
Data Preprocessing:
Since we are building a content-based system, we need to follow these steps:
- First, split the text in the columns ( overview, genres, keywords, cast, crew ) into words.
- Remove all the stop words.
- Gather the remaining words of each column in an array.
- Merge the columns ( overview, genres, keywords, cast, crew ) into a single new column named “tags”.
- Finally, we have a well-processed data frame with only 3 columns: movie id, title and tags.
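The steps above can be sketched as follows. This is a simplified illustration, not the project's actual code: the stop-word set is a tiny hand-picked sample, the helper `to_words` is hypothetical, and the text columns are assumed to already be plain strings (in the real dataset some of them may need parsing first).

```python
import pandas as pd

# A tiny, illustrative stop-word set; a real project would use a fuller list.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is", "to", "on"}

def to_words(text):
    """Split text into lowercase words, strip punctuation, drop stop words."""
    words = [w.strip(".,!?;:\"'()").lower() for w in str(text).split()]
    return [w for w in words if w and w not in STOP_WORDS]

# A one-row stand-in for the content columns (values are made up).
movies = pd.DataFrame({
    "movie_id": [1],
    "title": ["Movie A"],
    "overview": ["The crew of a ship is lost in space."],
    "genres": ["Adventure Fiction"],
    "keywords": ["space ship"],
    "cast": ["Actor One"],
    "crew": ["Director One"],
})

text_columns = ["overview", "genres", "keywords", "cast", "crew"]

# Collapse the cleaned words of all text columns into one "tags" string.
movies["tags"] = movies[text_columns].apply(
    lambda row: " ".join(w for col in text_columns for w in to_words(row[col])),
    axis=1,
)

# Keep only the three columns the recommender needs.
processed = movies[["movie_id", "title", "tags"]]
print(processed.iloc[0]["tags"])
```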
Text — Vectorization:
To recommend movies with similar content, we will be using the “cosine similarity” method, which can only be applied to numerical representations.
Since the tags in the data frame are all text, we need text vectorization to convert them into a numerical representation.
What is the “cosine similarity” method?
- Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.
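To make the definition concrete, here is a minimal pure-Python sketch: the cosine of the angle is the dot product of the two vectors divided by the product of their lengths. The function name and the choice to return 0.0 for an all-zero vector are my own assumptions for illustration.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # The angle is undefined for a zero vector; treat it as "no similarity".
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 2], [2, 4]))  # same direction
print(cosine_similarity([1, 0], [0, 1]))  # perpendicular
```

Two vectors pointing in the same direction score close to 1 regardless of their length, while perpendicular vectors score 0.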
What is “text vectorization” in NLP?
- The process of converting text into numerical representation is known as text vectorization.
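One simple way to vectorize text is a bag-of-words count, sketched below in plain Python. The helper `bag_of_words` and the sample tags are hypothetical; libraries such as scikit-learn provide a `CountVectorizer` built on the same idea, and I am not claiming this is the exact method the project uses.

```python
from collections import Counter

def bag_of_words(documents):
    """Turn each document into a vector of word counts over a shared vocabulary."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.split())
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

tags = ["space ship crew", "space war", "courtroom drama"]
vocabulary, vectors = bag_of_words(tags)
print(vocabulary)
print(vectors)
```

Each position in a vector corresponds to one word of the vocabulary, so movies that share many words end up with similar vectors.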
After all these steps, we are finally ready to build a function called “recommend” that recommends movies based on the similarity measured with the methods described above.
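Here is a minimal sketch of what such a function could look like. It is not the project's actual code: the signature, the `top_n` parameter, the helper names and the sample titles and tags are all assumptions made for illustration, and it uses plain Python instead of a library.

```python
import math
from collections import Counter

def vectorize(tags_list):
    """Bag-of-words count vectors over a shared, sorted vocabulary."""
    vocabulary = sorted({w for tags in tags_list for w in tags.split()})
    return [[Counter(tags.split())[w] for w in vocabulary] for tags in tags_list]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def recommend(title, titles, tags_list, top_n=5):
    """Return up to top_n titles whose tags are most similar to the given title."""
    if title not in titles:
        return []
    vectors = vectorize(tags_list)
    idx = titles.index(title)
    # Score every other movie against the chosen one.
    scored = [
        (cosine(vectors[idx], vec), other)
        for i, (other, vec) in enumerate(zip(titles, vectors))
        if i != idx
    ]
    # Highest similarity first; the sort is stable, so ties keep their order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:top_n]]

# Made-up sample data.
titles = ["Space Saga", "Star Battle", "Court Case"]
tags_list = [
    "space ship crew adventure",
    "space war adventure",
    "courtroom drama lawyer",
]
print(recommend("Space Saga", titles, tags_list, top_n=1))
```

For a large catalogue, you would normally compute the similarity matrix once and reuse it instead of re-vectorizing on every call.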
OUTPUT:
Outro:
I will be sharing more stories about programming languages, Data Science, Machine Learning, Artificial Intelligence, and Blockchain. If you like my work, do follow me on my socials to stay updated.
Follow me on Instagram: https://www.instagram.com/warepam.eth/
Follow me on Twitter: https://twitter.com/Warepam_eth
Let’s also connect on LinkedIn: https://www.linkedin.com/in/richard-warepam-3b817420b/
If you want me to write about your company or products, mail me at richardwarepam16