Hybrid Recommendation System Web-Application Part 1: Exploratory Data Analysis with PostgreSQL

Syed Muhammad Hamza
Published in Analytics Vidhya · 13 min read · May 24, 2021



Introduction

Over the last two decades, recommender systems have become state-of-the-art algorithms and can be regarded as success factors for companies such as Google, Amazon, and Netflix. The main objective of a recommender system is to produce relevant recommendations for a collection of users on items that might interest them. This article is part of an end-to-end machine learning project series in which I will build a web application for a regression-based movie recommender system, a hybrid of content-based and collaborative filtering algorithms, from scratch.

In this part of the series, I'm going to perform exploratory data analysis with PostgreSQL. The main reason I'm using PostgreSQL for EDA in this project is that most newcomers applying for any data role should have SQL experience in their pocket, and using SQL for EDA lets you demonstrate exactly that. Working with PostgreSQL also gives you a sense of how a relational database works, how data is stored in it, and, more importantly, how to get data out of it. SQL is a very flexible declarative language that can pose a rich set of queries against a database, so today I'm going to use SQL queries for the data manipulation behind almost every graphical EDA technique, along with a few Python libraries, mainly for visualization.

If this project interests you, the code is available on my Github along with a detailed description of the project.

The web application is deployed on Heroku and will be covered in a later part of the series, but if you want to see what the final product looks like, feel free to browse the following link.

Note: Familiarity with SQL (joins, subqueries, views, etc.) is a prerequisite for this part of the series. If you are not familiar with SQL, feel free to skip this article altogether and start with the next part; this is the only part of the series that uses SQL and PostgreSQL for EDA, and it is not a prerequisite for what follows.

PostgreSQL Database with ODBC

PostgreSQL is a powerful, open-source object-relational database system. To establish a connection to my PostgreSQL database server, I'm going to use ODBC (Open Database Connectivity), the standard API that buys you DBMS independence: write against this one protocol and you can work with any database that ships an ODBC driver. An ODBC driver can be thought of as analogous to a printer driver or other driver, providing a standard set of functions for the application to use. Drivers exist for all major DBMSs such as Oracle, PostgreSQL, MySQL, and Microsoft SQL Server, but you have to manually install the DBMS of your choice and configure its driver before using it in your Python code.
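As a minimal sketch of what this looks like from Python (assuming the pyodbc package and the "PostgreSQL Unicode" ODBC driver are installed; the exact driver name varies by platform):

import pyodbc

# The DRIVER entry selects the ODBC driver; pointing the same code at
# MySQL or SQL Server is mostly a matter of changing this string
conn = pyodbc.connect(
    'DRIVER={PostgreSQL Unicode};'
    'SERVER=localhost;PORT=5432;'
    'DATABASE=postgres;'
    'UID=postgres;PWD=your_password'
)
print(conn.cursor().execute('SELECT version();').fetchone()[0])
conn.close()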

Setting up and accessing PostgreSQL with ODBC

Step 1: install PostgreSQL from here

Step 2: The next step is to set up ODBC (Open Database Connectivity); for that, follow the steps specified in the following link

[Note]: skip (Connect to PostgreSQL Server [top]) and start from Setup ODBC [top]

Step 3: Open the SQL Shell installed with PostgreSQL in the first step and execute the following statement

CREATE DATABASE DRecommenderSystem_db;

Now I'm going to define a helper class named PostgreSQLconnection for PostgreSQL in a Python file named PostgreSQL_Database_wrapper.py, which will save me from rewriting the same lines of code over and over again. I've included docstrings for each module; please give it a read here before moving forward.
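Roughly, the class looks something like this (a minimal sketch assuming it wraps pyodbc; the constructor arguments and method names match how it is used below, the internals here are only a guess, and the full version with docstrings is in the repo):

import pyodbc
import pandas as pd

class PostgreSQLconnection:
    """Minimal sketch of the helper class (full version in the repo)."""

    def __init__(self, dbname, user, password, host):
        # One ODBC connection and one cursor shared by all helper methods
        self.db_connection = pyodbc.connect(
            f'DRIVER={{PostgreSQL Unicode}};SERVER={host};'
            f'DATABASE={dbname};UID={user};PWD={password}'
        )
        self.cursor = self.db_connection.cursor()

    def create_table(self, sqlstatement, table_name):
        # Execute a CREATE TABLE statement and commit it
        self.cursor.execute(sqlstatement)
        self.db_connection.commit()
        print('successfully created table ' + table_name)

    def run_command(self, command):
        # Execute a statement that returns no rows (CREATE VIEW, etc.)
        self.cursor.execute(command)
        self.db_connection.commit()

    def run_query(self, query):
        # Execute a SELECT and hand the result back as a pandas DataFrame
        return pd.read_sql(query, self.db_connection)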

Exploratory Data Analysis with PostgreSQL

Dataset:

The dataset is provided by GroupLens and can be downloaded from here. It contains the following files: links.csv, movies.csv, ratings.csv, and tags.csv.

“This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996, and September 24, 2018. This dataset was generated on September 26, 2018.”

Loading dataset

Importing libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#Loading data from csv files into pandas dataframes
moviesData=pd.read_csv(r'ml-latest-small\movies.csv')
ratingsData=pd.read_csv(r'ml-latest-small\ratings.csv')
linksData=pd.read_csv(r'ml-latest-small\links.csv')
tagsData=pd.read_csv(r'ml-latest-small\tags.csv')
'''
The ratingsData dataframe has a column named timestamp; I have to rename it to something
else before inserting the data into SQL tables because timestamp is a data type in SQL
'''
ratingsData['timestamp'] = pd.to_datetime(ratingsData['timestamp'], unit='s')
ratingsData.columns = ['userId', 'movieId','rating' ,'timing']

Creating Tables in PostgreSQL

I've loaded my four CSV data files into pandas dataframes. The next step is to establish a connection to my PostgreSQL database, create a table for each dataframe, and insert the data into them.

'''
importing PostgreSQLconnection class from my PostgreSQL_Database_wrapper file and creating an instance of it
'''
from PostgreSQL_Database_wrapper import PostgreSQLconnection
db=PostgreSQLconnection("DRecommenderSystem_db","postgres","*********","localhost")

The MovieLens dataset consists of four .csv files (links.csv, movies.csv, ratings.csv, and tags.csv), which I've loaded into pandas dataframes. Now I'm going to create the tables with SQL and load the data from the dataframes into them by calling PostgreSQLconnection.create_table() defined in PostgreSQL_Database_wrapper.py, passing each SQL statement as an argument.

Here I've four tables, each with a PRIMARY KEY constraint

The PRIMARY KEY constraint uniquely identifies each record in a table.

and, where needed, a FOREIGN KEY constraint.

The FOREIGN KEY constraint identifies the relationships between database tables by referencing a column, or set of columns, in the child table that contains the foreign key, to the PRIMARY KEY column, or set of columns, in the parent table.

Most importantly, after looking at the data in the dataframes I noticed my data is not in first normal form (1NF).

The first normal form (1NF) is a property of a relation in a relational database. A relation is in first normal form if and only if the domain of each attribute contains only atomic values.

That means each attribute/field/column should contain one and only one value, so a list of values in an instance of one field violates 1NF, and I have two problematic tables. The first is Tags, whose "tag" attribute/field/column can contain multiple values in an instance of one field; similarly, in the Movies table the "genres" attribute/field/column contains multiple values in an instance of one field. Generally, when designing a schema for a DBMS you have to make sure your data is in 1NF, which will always hold if you follow standard ER diagrams in the conceptual design: the ER model enforces atomicity in fields/attributes, so in the next phase of logical design, where you translate your ER model into a data model (most likely the relational model), your tables will be in 1NF. Now back to my problem: I'm going to do something stupid and treat values in fields like "genres" and "tag" as long atomic strings of characters, because I'm not building a critical application that uses this database and enforces the standard. If that were the case, I would have first done one-hot encoding for each non-atomic field, which extends the number of columns/fields, as sketched below.
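For reference, the 1NF-friendly route would look something like this in pandas (a sketch of the alternative I'm deliberately not taking here):

# One boolean column per genre: every field now holds a single atomic value
genre_dummies = moviesData['genres'].str.get_dummies(sep='|')
movies_1nf = pd.concat([moviesData[['movieId', 'title']], genre_dummies], axis=1)
print(movies_1nf.head())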

#sql statement to create table
sqlstatement='''CREATE TABLE Movies(
    movieId INT,
    title text,
    genres text,
    PRIMARY KEY(movieId)
)'''
db.create_table(sqlstatement, 'Movies')
#sql statement to create table
sqlstatement='''CREATE TABLE Rating(
    userId INT,
    movieId INT,
    rating REAL,
    timing TIMESTAMP,
    PRIMARY KEY(movieId,userId),
    FOREIGN KEY(movieId)
        REFERENCES Movies(movieId)
)'''
db.create_table(sqlstatement, 'Rating')
#sql statement to create table
sqlstatement='''CREATE TABLE Links(
    movieId INT,
    imdbId INT,
    tmdbId REAL,
    PRIMARY KEY(movieId)
)'''
db.create_table(sqlstatement, 'Links')
#sql statement to create table
sqlstatement='''CREATE TABLE Tags(
    userId INT,
    movieId INT,
    tag text,
    PRIMARY KEY(movieId,userId),
    FOREIGN KEY(movieId)
        REFERENCES Links(movieId)
)'''
db.create_table(sqlstatement, 'Tags')

The following tables are created in the database as a result of the code above: Movies, Rating, Links, and Tags.

As can be seen from my ER diagram, I've four entities, one for each table, along with their primary and foreign keys; to illustrate the relationships I've used crow's foot notation.

Loading Data into tables from DataFrames

After creating my tables, I now have to load the data from the dataframes into them.

'''
Accessing the cursor attribute of the PostgreSQLconnection object, which itself holds a cursor object, then using it to insert data
into my tables (it's bad practice to access an attribute from outside of your class; it violates encapsulation and information
hiding. Unfortunately, there is no way to enforce information hiding in Python, which is why hacks like this one work, but make
sure whenever you define your classes that they are as loosely coupled as possible and always follow SOLID principles. Tbh I forgot to
implement an insert_data method in the PostgreSQLconnection class; that's the reason I'm accessing it from outside.)
'''
cursor = db.cursor
for row in moviesData.itertuples():
    cursor.execute('''
        INSERT INTO Movies(movieId,title,genres)
        VALUES (?,?,?)
        ''',
        row.movieId,
        row.title,
        row.genres
    )
db.db_connection.commit()
print("successfully inserted data from DataFrame into table")

for row in ratingsData.itertuples():
    cursor.execute('''
        INSERT INTO Rating(userId,movieId,rating,timing)
        VALUES (?,?,?,?)
        ''',
        row.userId,
        row.movieId,
        row.rating,
        row.timing
    )
db.db_connection.commit()
print("successfully inserted data from DataFrame into table")

for row in linksData.itertuples():
    cursor.execute('''
        INSERT INTO Links(movieId,imdbId,tmdbId)
        VALUES (?,?,?)
        ''',
        row.movieId,
        row.imdbId,
        row.tmdbId
    )
db.db_connection.commit()
print("successfully inserted data from DataFrame into table")

for row in tagsData.itertuples():
    cursor.execute('''
        INSERT INTO Tags(userId,movieId,tag)
        VALUES (?,?,?)
        ''',
        row.userId,
        row.movieId,
        row.tag
    )
db.db_connection.commit()
print("successfully inserted data from DataFrame into table")

Now I'm all set for EDA. EDA with SQL queries can be more flexible than doing it on pandas dataframes, but be cautious: it can be a double-edged sword because it's not very efficient. The main reason I'm using PostgreSQL here is that most newcomers applying for any data role should have some SQL experience in their pocket, and using SQL for EDA lets you demonstrate exactly that. Nevertheless, the flexibility of the queries you can pose to a PostgreSQL database for EDA is phenomenal.

Top Ten Rated Movies

Here I'm joining the Rating and Movies tables, creating a view named "MoviesWithRatings" from the result, then executing a query over that view to extract the ten most-rated movies and plotting the figure.

#taking Join of tables(Rating and movies) then creating View(MoviesWithRatings) of result
command ='''
CREATE VIEW MoviesWithRatings AS
SELECT Rating.userId,Rating.movieId,Rating.rating,Movies.title
FROM Rating
LEFT JOIN Movies
ON Rating.movieId = Movies.movieId;
'''
db.run_command(command)
#executing query over view
query = '''
SELECT title, count(movieid) AS movies_count
FROM MoviesWithRatings
GROUP BY title,movieid
ORDER BY movies_count DESC
LIMIT 10;
'''
query_result=db.run_query(query)
plt.figure(figsize=(15, 8))
barh = plt.barh(query_result['title'], query_result['movies_count'], color='blue')
plt.title('top ten most rated movies')
plt.xticks(np.arange(0, 400, 50))
plt.tight_layout()

The output looks like this,

Observation: good gracious! The result is very similar to what you will find on Rotten Tomatoes and IMDb. Indeed, the greatest movies of all time endure for all kinds of reasons: they offer plots that twist and turn, give us characters that we fall in love with, depict experiences that change us, and thrill us with incredible filmmaking craft. Most importantly, this result suggests my small MovieLens dataset is representative enough to be used for training.

Most popular genres among directors

Now I'm posing a query for each genre that counts the number of movies containing that genre as a substring of the "genres" field/attribute/column, and appending the result to a list to plot later.

# extracting all unique genres from string field genres
genres_unique = pd.DataFrame(moviesData.genres.str.split('|').tolist()).stack().unique()
genres_list=genres_unique[:len(genres_unique)-1]
genres_count=[]
#QUERY
for genre in genres_list:
    query = '''
    SELECT count(Movies.movieId)
    FROM Movies
    WHERE Movies.genres LIKE '%'''+genre+'''%'
    '''
    query_result=db.run_query(query)
    genres_count.append(query_result.iloc[0]['count'])
data=pd.DataFrame({'x': genres_list, 'y': genres_count})
plt.subplots(figsize=(10,5))
sns.barplot(x=data.x, y=data.y, palette="rocket")
plt.xticks(rotation=-45)
plt.title('Most popular genres among directors');

The output looks like this,

Observation: Directors seem to direct movies in the "Drama" genre the most, which makes sense because most other genres require a decent budget; the next two most popular genres among directors are "Comedy" and "Thriller", and the same reasoning may hold for both of them as well. But this claim of mine is very shallow, given that correlation is not causation.
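One aside on the genre queries above: splicing genre into the SQL string is fine for this trusted, hard-coded genre list, but the parameterized form is the safer habit. A sketch of the same loop with a bound LIKE pattern:

# Same per-genre count, but the pattern is bound as a parameter
# instead of being concatenated into the SQL text
genres_count=[]
for genre in genres_list:
    cursor.execute(
        'SELECT count(movieId) FROM Movies WHERE genres LIKE ?',
        '%' + genre + '%'
    )
    genres_count.append(cursor.fetchone()[0])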

Number of ratings each year

Let's move on and check the growth in the number of ratings each year. I'm expecting exponential growth, given that the number of users joining the internet every day is growing exponentially; here, in this context, correlation is causation.

#executing nested query over table Rating
query = '''
SELECT year, count(year)
FROM (SELECT rating.movieId, EXTRACT(year FROM rating.timing) AS year
      FROM rating) AS derivedTable
GROUP BY year
ORDER BY year DESC;
'''
query_result=db.run_query(query)
plt.subplots(figsize=(10,5))
#plot the yearly count against the year rather than against the dataframe index
plt.plot(query_result['year'], query_result['count'])

plt.grid(True)
plt.title('Number of ratings each year');

The output looks like this,

Observation: Huh! I was expecting exponential growth in ratings each year, given that more and more people around the world join the internet every day. What went wrong? This result points towards the fact that this small MovieLens dataset is cherry-picked, which is why it is not reflective of real data. Moreover, this small dataset is only 1 MB, which is why I've used this dataset of 265 MB for training the deployed model.

Distribution of rating

Let's explore how users rate movies based on genre: how many different genres are in the dataset, and how are they distributed?

#executing query over rating table
query = '''
SELECT rating
FROM Rating
'''
rating_result=db.run_query(query)
rating_result.hist(bins=9, grid=False, edgecolor='w', density=True, label='All genres', figsize=(10,5))
#extracting all genres from moviesData using pandas
genres_unique = pd.DataFrame(moviesData.genres.str.split('|').tolist()).stack().unique()
genres_list=genres_unique[:len(genres_unique)-1]
#executing nested queries in a loop over the join of Rating and Movies, doing a string comparison for each genre
for genre in genres_list:
    query = '''
    SELECT rating
    FROM(SELECT Rating.userId, Rating.movieId, Rating.rating, Movies.genres
         FROM Rating
         LEFT JOIN Movies
         ON Rating.movieId=Movies.movieId) AS derivedTable
    WHERE genres LIKE '%'''+genre+'''%'
    '''
    query_result=db.run_query(query)
    query_result.rating.plot(grid=False, kind='kde', label=genre, figsize=(10,5))
plt.legend(loc=(1.05,0), ncol=2)
plt.xlim(0,5)
plt.xlabel('Movie rating')
plt.title('Movie rating histograms')
plt.show()

The output looks like this,

Observation: Looking at the distribution of ratings for each genre, they all have a very similar distribution: roughly bell-shaped but shifted toward the higher end, with a mean of around 3.5, because ratings between 3 and 5 are users' preferred choice when rating movies.
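A quick sanity check on that "mean of around 3.5" reading, asking the database directly (a one-off check, not part of the original notebook):

# Overall mean rating straight from PostgreSQL
print(db.run_query('SELECT AVG(rating) AS mean_rating FROM Rating;'))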

Time series analysis

The timestamp type in my data enables further exploration using time series analysis, a powerful technique that can help me understand the various temporal patterns in my data by decomposing it into different cyclic trends.

Time series analysis can also allow me to predict how levels of a variable will change in the future, taking into account what has happened in the past. Sometimes time series analysis can also be helpful during feature engineering to come up with a better set of features.
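As a sketch of what such a decomposition could look like on this data (assuming statsmodels is installed; this is illustrative, not part of the analysis below):

from statsmodels.tsa.seasonal import seasonal_decompose

# Weekly mean rating as one series, decomposed into trend, seasonal,
# and residual components (an annual, 52-week cycle is assumed)
weekly = db.run_query('SELECT rating, timing FROM Rating;')
weekly['timing'] = pd.to_datetime(weekly['timing'])
weekly = weekly.set_index('timing').resample('W')['rating'].mean().interpolate()
seasonal_decompose(weekly, model='additive', period=52).plot()
plt.show()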

Movie rating trend by week

genres_list=genres_list[:5]
plt.figure()
#executing nested queries in a loop over the join of Rating and Movies, doing a string comparison for each genre
for genre in genres_list:
    query = '''
    SELECT rating,timing
    FROM(SELECT Rating.userId, Rating.movieId, Rating.rating, Movies.genres, Rating.timing
         FROM Rating
         LEFT JOIN Movies
         ON Rating.movieId=Movies.movieId) AS derivedTable
    WHERE genres LIKE '%'''+genre+'''%'
    ORDER BY timing DESC
    '''
    query_result=db.run_query(query)
    query_result['timing']=pd.to_datetime(query_result['timing'])
    query_result.set_index('timing', inplace=True)
    query_result.resample("W").mean().plot(marker="o", markersize=1, figsize=(20,5))
    plt.title(genre+' Movie rating trends by week')

The output looks like this,

Observation: when resampled on a weekly basis there is no weekly trend in movie ratings; it's a random cyclic time series for each genre.

Movie rating trend by year

Let's resample and find out if there's any yearly trend in my movie ratings.

genres_list=genres_list[:5]
plt.figure()
#executing nested queries in a loop over the join of Rating and Movies, doing a string comparison for each genre
for genre in genres_list:
    query = '''
    SELECT rating,timing
    FROM(SELECT Rating.userId, Rating.movieId, Rating.rating, Movies.genres, Rating.timing
         FROM Rating
         LEFT JOIN Movies
         ON Rating.movieId=Movies.movieId) AS derivedTable
    WHERE genres LIKE '%'''+genre+'''%'
    ORDER BY timing DESC
    '''
    query_result=db.run_query(query)
    query_result['timing']=pd.to_datetime(query_result['timing'])
    query_result.set_index('timing', inplace=True)
    query_result.resample("AS").mean().plot(marker="o", markersize=1, figsize=(20,5))
    plt.title(genre+' Movie rating trends by year')

The output looks like this,

Observation: This one is interesting: there is no yearly trend in movie ratings either; when resampled on a yearly basis the result is a random cyclic time series. But if your objective is forecasting, you can forecast any genre based on the others, because they are all highly correlated and can be modeled together.

Rating distribution for Movies

Finally, let's take a quick glance at the rating distribution for movies.

#executing query over rating table
query = '''
SELECT rating, count(rating)
FROM Rating
GROUP BY rating
ORDER BY rating
'''
rating_result=db.run_query(query)
y = np.array(rating_result['count'])
mylabels = rating_result['rating']
explode =[0.3,0.1,0.1,0.1,0.1,0.1,0.2,0.1,0.1,0.1]
textprops = {"fontsize":15} # Font size of text in pie chart
plt.pie(y, labels = mylabels,explode = explode,shadow = True,radius = 3,textprops =textprops)
plt.legend(title = "Rating",loc="lower left")
plt.title("Rating distribution for the Movies",fontsize=15)
plt.show()

The output looks like this,

Observation: Looking at the rating distribution, it is easy to infer that people rarely rate movies in the 0.5–1.5 window, though it is super common for users to rate movies between 3.0 and 4.0.

Conclusion

Today, in this part of the series, I performed EDA, one of the most crucial steps in the ML pipeline. EDA helps you better understand your data before making any assumptions: it lets you investigate obvious errors and better understand the patterns within the data.

In the next part of this series, we will explore the hybrid recommendation algorithm and implement it from scratch using only NumPy and pandas; later parts of the series will cover feature engineering, model training, model evaluation, and deployment.

If you have any questions or concerns, please contact me here.

until next time…
