Why Is EDA Necessary for Machine Learning?

Srimal Ashish
Jul 29, 2018


Sometimes even what we see with our naked eyes is not the “naked” truth. It takes time, conviction and certainty to get to the truth. EDA (Exploratory Data Analysis) does this for the machine learning enthusiast. It is a way of visualizing, summarizing and interpreting the information hidden in rows and columns. EDA is one of the crucial steps in data science: it yields the insights and statistical measures that are essential for business continuity, stakeholders and data scientists. It also helps define and refine the selection of the important feature variables that will be used in our model.

Once EDA is complete and insights are drawn, the resulting features can be used for supervised and unsupervised machine learning modelling. EDA is carried out mainly through univariate visualization, bivariate visualization, multivariate visualization and dimensionality reduction.

We usually form several hypotheses by looking at the data before we start modelling. This is good practice because it engages you more with the EDA. EDA helps you confirm or reject the hypotheses you make. From there you start feature engineering and take off toward machine learning modelling.

Here, we will explore the power of EDA using a dataset of movies and the ratings users gave them. You can download the dataset here (use the education and development dataset).

Loading Libraries and Dataset

figure: 1
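The loading code in figure 1 is not reproduced in the text. A minimal sketch, assuming both files are CSVs with the columns visible in figures 2 and 3, could look like this. The inline sample rows and the file names in the comments are stand-ins for illustration, not the author's code.

```python
import io
import pandas as pd

# Tiny inline stand-ins for the two CSV files (hypothetical sample rows).
# With the real download you would pass file paths instead, e.g.
# pd.read_csv("movies.csv") and pd.read_csv("ratings.csv") (assumed names).
movies_csv = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
    "3,Heat (1995),Action|Crime|Thriller\n"
)
ratings_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,964982703\n"
    "1,3,4.5,964981247\n"
    "2,2,3.0,1141415820\n"
)

movie = pd.read_csv(movies_csv)
rating = pd.read_csv(ratings_csv)

print(movie.head(5))
print(rating.head(5))
```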

movie.head(5)

figure: 2 movie file

rating.head(5)

Hypothesis 1: Action, Comedy and Thriller might be the most released genres.

figure: 3. rating file

Let's look at the data structure and also check for missing values, using movie.describe() and rating.describe(), respectively.

figure: 4.

The 5-point summary (min, 25th percentile, median, 75th percentile, max) for rating suggests that most ratings are around 4, because the 75th percentile is 4 and the mean is 3.54. This can be our Hypothesis 2: “the most common rating given to movies is 4.” Note that the dataset is free from missing values. Voila!
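As a rough illustration of this step, the sketch below computes the summary statistics and counts missing values per column. The sample frame is hypothetical, and the isnull().sum() check is an addition of mine, since describe() only hints at missing values through its count row.

```python
import pandas as pd

# Hypothetical sample ratings; the real frame comes from the ratings file.
rating = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [1, 3, 1, 2, 3],
    "rating": [4.0, 3.0, 4.0, 5.0, 2.0],
})

# count, mean, std, min, 25%, 50%, 75%, max for the rating column
summary = rating["rating"].describe()

# Number of missing values in each column (all zeros means no gaps)
missing = rating.isnull().sum()

print(summary)
print(missing)
```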

Let's begin our EDA

Looking at the ‘genres’ column in the movie dataset, we find that it is a string with genres separated by ‘|’. So we will split it into a list of genres and keep it in a new column, ‘Genre_Cat’.

movie["Genre_Cat"] = movie["genres"].str.split("|")

figure: 5

Now we will filter films by genre, such as Animation or Adventure. To do that we will use a lambda function. The same approach extends to any other genre.

figure:6: Animation genre only

Look at the output. Each row contains Animation.

figure: 7.
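The lambda-based filter in the figures is not shown in the text, so here is one way it might be written. The helper name movies_in_genre and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample of the movie frame
movie = pd.DataFrame({
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"],
    "genres": [
        "Adventure|Animation|Children",
        "Adventure|Children|Fantasy",
        "Action|Crime|Thriller",
    ],
})
movie["Genre_Cat"] = movie["genres"].str.split("|")


def movies_in_genre(df, genre):
    """Return the rows whose Genre_Cat list contains the given genre."""
    mask = df["Genre_Cat"].apply(lambda genres: genre in genres)
    return df[mask]


animation = movies_in_genre(movie, "Animation")
print(animation[["title", "Genre_Cat"]])
```

Passing a different genre name, for example movies_in_genre(movie, "Adventure"), reuses the same filter.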

Now we will count the movies in each genre, that is, the number of movies in Animation, in Adventure, and so on.

figure: 8

Output:

figure: 9
figure: 10.Pie chart for Genre category
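One compact way to get per-genre counts, assuming a pandas version that provides Series.explode, is sketched below. This may differ from the code in the figures, and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample of the movie frame
movie = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "genres": ["Comedy|Drama", "Drama|Thriller", "Drama", "Comedy"],
})

# Split each genre string into a list, give every genre its own row,
# then count how often each genre appears (sorted from most to least common).
genre_counts = movie["genres"].str.split("|").explode().value_counts()

print(genre_counts)
```

With matplotlib installed, genre_counts.plot.pie() would draw a pie chart of these counts.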

Whoa, what? Contrary to our Hypothesis 1, we find that ‘Drama’ is the most produced genre in this dataset, followed by Comedy, Thriller and so on. Strange.

We all look for movies that have Romance, Comedy, Action and Thriller. So why not look for movies that have all of these genres and call them “Masala movies”? We will create a list of these genres and check whether it is a subset of each movie's genre list.

figure: 11
figure: 12
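The subset check described above can be sketched with Python sets. The titles and genre strings below are hypothetical placeholders, and this is only one possible way to express the check.

```python
import pandas as pd

# Hypothetical sample of the movie frame
movie = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": [
        "Action|Comedy|Drama|Romance|Thriller",
        "Action|Comedy",
        "Romance|Thriller",
    ],
})
movie["Genre_Cat"] = movie["genres"].str.split("|")

# The four genres that together define a "masala movie"
masala = {"Romance", "Comedy", "Action", "Thriller"}

# Keep a movie only if every masala genre appears in its genre list
is_masala = movie["Genre_Cat"].apply(lambda genres: masala.issubset(genres))
masala_movies = movie[is_masala]

print(masala_movies["title"])
```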

This is what I was expecting as an Indian: “Sholay” made it to the masala movies. Note that every movie listed has all 4 genres. I have seen 3 of these masala movies.

Now, let's look at the ratings: the rating scale, the maximum and minimum rating, the number of users who rated, and so on. First, look at the distinct rating values.

uni_rate = rating["rating"].unique()

[0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5] So, we have 10 distinct values for rating a movie. Let's look at the number of users who rated the movies.

len(rating["userId"].value_counts()) gives 671, so we have 671 users in total, who gave 100,004 ratings. Let's look at the bar plot.

figure: 13
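The counts behind a bar plot like this can be obtained with value_counts. The sketch below uses hypothetical sample ratings and leaves the plotting itself out.

```python
import pandas as pd

# Hypothetical sample ratings
rating = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3],
    "rating": [4.0, 4.0, 3.0, 4.5, 5.0, 4.0],
})

# How many times each rating value was given, ordered by rating value
rating_counts = rating["rating"].value_counts().sort_index()

# Number of distinct users (same result as len(value_counts()))
n_users = rating["userId"].nunique()

# The rating value that was given most often
most_common = rating_counts.idxmax()

print(rating_counts)
print(n_users, most_common)
```

With matplotlib installed, rating_counts.plot.bar() would draw the corresponding bar plot.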

Amazing! Our Hypothesis 2 seems to be correct this time. Most users have given movies a rating of 4, i.e. they were mostly satisfied. Look closely and something strange appears: half-point ratings like 4.5, 3.5, 2.5 and 1.5 are sandwiched between 5, 4, 3, 2 and 1. One reading is that people prefer round figures when rating a movie; another is that only a few people rate so finely that they need decimals, perhaps because they get very attached to the movie.

Now let's see the top 10 movies of all time. Here we can have Hypothesis 3: “Movies like Godfather, Schindler’s List or Shawshank Redemption will make it to the top 10.”

Before that we need to merge both datasets, movie and rating, so that we can use the ratings to see which movies make it to the top 10.

figure: 14
The merged data-frame looks like this.

The approach is to make a smaller dataframe containing only rating, movieId and title, then group it by movieId and title to get the total number of ratings a movie got.

figure: 15
Top 10 movies
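The merge and grouping steps might look like the sketch below. It assumes the ranking uses the mean rating per movie, which the text does not state explicitly, and it uses hypothetical sample rows.

```python
import pandas as pd

# Hypothetical sample frames
movie = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Movie A", "Movie B"],
})
rating = pd.DataFrame({
    "userId": [1, 2, 3, 1],
    "movieId": [1, 1, 1, 2],
    "rating": [4.0, 5.0, 4.0, 5.0],
})

# Join every rating to its movie through the shared movieId column
merged = pd.merge(movie, rating, on="movieId")

# Keep only the columns needed for the ranking
small = merged[["movieId", "title", "rating"]]

# Mean rating per movie, highest first, limited to 10 rows
mean_rating = small.groupby(["movieId", "title"])["rating"].mean()
top10 = mean_rating.sort_values(ascending=False).head(10)

print(top10)
```

In this sample, Movie B ranks first on the strength of a single 5.0 rating, which is why a ranking by mean alone deserves a second look.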

Wow, so here we have the Top 10 movies of all time. But wait, where are Godfather and Shawshank Redemption? I haven't seen a single movie from this list. Something is fishy. Let's look at the number of people who rated these movies, using movieId.

figure: 16

So that's the reason. Only one person rated each of these movies, with 5.0, and that is why they reached the top 10. To solve this problem, we first need to count the number of ratings each movie got, and then put a constraint on that count, such as more than 100 or more than 200, so that only movies with more than 100 or 200 user ratings are considered.

figure: 17
figure: 18
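A minimal sketch of the count-and-filter step follows. The column name n_ratings and the variable min_ratings are hypothetical, and the threshold is set to 1 only so that the tiny sample has something to filter; the article uses 100.

```python
import pandas as pd

# Hypothetical merged frame (one row per rating)
merged = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 3, 3],
    "title": ["Movie A", "Movie A", "Movie A", "Movie B", "Movie C", "Movie C"],
    "rating": [4.0, 5.0, 4.0, 5.0, 3.0, 4.0],
})

# Number of ratings each movie received
counts = (
    merged.groupby(["movieId", "title"])["rating"]
    .count()
    .reset_index(name="n_ratings")
)

# Keep only movies rated more often than the threshold
min_ratings = 1  # the article uses 100 on the full dataset
popular = counts[counts["n_ratings"] > min_ratings]

print(popular)
```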

Here we have a clear picture of the movies that got more than 100 ratings from users, as constrained in our code. Now we can see the real Top 10 movies of all time.

figure:19
figure:20

Voila! Here our Hypothesis 3 comes true. We can see some of the beautiful movies we had anticipated making it into the list. Note that we merged our datasets again to get the mean rating of each movie. “rating_x” is the average rating and “rating_y” is the number of ratings that movie got.
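The “rating_x” and “rating_y” names come from pandas itself: when two merged frames share a column name that is not a join key, merge appends the default suffixes _x and _y. The sketch below reproduces that with hypothetical sample rows and a threshold lowered to 1 for the tiny sample.

```python
import pandas as pd

# Hypothetical merged frame (one row per rating)
merged = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 3, 3],
    "title": ["Movie A", "Movie A", "Movie A", "Movie B", "Movie C", "Movie C"],
    "rating": [4.0, 5.0, 4.0, 5.0, 3.0, 4.0],
})

keys = ["movieId", "title"]
mean_df = merged.groupby(keys, as_index=False)["rating"].mean()
count_df = merged.groupby(keys, as_index=False)["rating"].count()

# Both frames have a "rating" column, so merge renames them to
# rating_x (mean, from the left frame) and rating_y (count, from the right).
stats = pd.merge(mean_df, count_df, on=keys)

min_ratings = 1  # the article uses 100 on the full dataset
top = stats[stats["rating_y"] > min_ratings].sort_values("rating_x", ascending=False)

# Sorting by the number of ratings instead gives a different ordering
by_count = stats.sort_values("rating_y", ascending=False)

print(top)
print(by_count)
```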

If we sort our list by the number of ratings a movie got, we get a different list. See below:

figure: 20

Now let's see in which year the maximum number of movies was released. This will be a bit interesting, because here we will learn a new concept.

In the rating dataset we have a column named “timestamp”, in seconds. It is the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970. So first we need to convert it to a date/month/year format using the “datetime” library and then proceed with our EDA.

figure: 21
figure: 22

Datetime converts the timestamp to a date/month/year format, and from there we extract the year and month of each rating, which we treat as the year and month in which the movie was released.
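The article uses the “datetime” library for the conversion; the sketch below swaps in pandas' own to_datetime with unit="s", which does the same epoch-seconds conversion for a whole column at once. The sample timestamps and the new column names date, year and month are hypothetical.

```python
import pandas as pd

# Hypothetical sample ratings with Unix timestamps in seconds
rating = pd.DataFrame({
    "userId": [1, 2, 3],
    "timestamp": [946684800, 975628800, 978307200],
})

# Interpret the integers as seconds since 1970-01-01 00:00:00 UTC
rating["date"] = pd.to_datetime(rating["timestamp"], unit="s")

# Pull out the year and month of each rating
rating["year"] = rating["date"].dt.year
rating["month"] = rating["date"].dt.month

# Number of ratings per year, in chronological order
ratings_per_year = rating["year"].value_counts().sort_index()

print(rating[["date", "year", "month"]])
print(ratings_per_year)
```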

figure: 23

2000 is the year that saw the most movies released. Such a large surge in 2000 can have 2 explanations: either a really large number of movies was released that year, or people simply rated more movies in 2000. We are assuming that people rated a movie in the same year it was released. Let's look month-wise:

figure: 24

From here we can say that people are more inclined to watch and rate movies in November and December (the beginning of winter), when they are reluctant to go out on trips. People also watched and rated more movies in April and June (summer).

EDA can go a long way, but to keep the article readable we will look at one last piece of information from this dataset: which user rated the highest number of movies, earning the title of “Movie-Buff”. :P

We are going to look at the Top 10 users who rated the most movies.

figure: 25
figure:26
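Counting ratings per user takes a single value_counts call. The sketch below uses hypothetical sample rows; the user ids echo the ones named in the article, but the counts are made up.

```python
import pandas as pd

# Hypothetical sample ratings (counts are illustrative only)
rating = pd.DataFrame({
    "userId": [547, 547, 547, 564, 564, 15],
    "movieId": [1, 2, 3, 1, 2, 1],
})

# Number of ratings per user, most active first, limited to 10 users
top_users = rating["userId"].value_counts().head(10)

print(top_users)
```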

Haha! So we have our “Movie-Buff”. It is user number 547, followed by 564, who rated the most movies.

Many more conclusions and insights can be drawn, like the Top 10 movies in each genre, the Top 10 movies of each decade or the highest rated movies of the last 5 years. I have to bring this to an end here and will continue the thread in an upcoming post.

By now you must have realized how indispensable EDA is to our machine learning model. It gives you the power to understand your data and channel its inferences into a more accurate, more interpretable and stronger machine learning model for the audience. If you liked it, please give it a CLAP! :)
