Why you should NEVER skip EDA in your ML Projects: End-to-End ML (Part 1/3)

Published in Analytics Vidhya · 7 min read · Sep 10, 2024

Let me tell you about the time I learned the hard way how skipping exploratory data analysis (EDA) can lead to a data disaster. I was working on a customer churn prediction model, eager to jump straight into model building. I thought, “Who needs EDA? Let’s get to the fun part!” At first, everything seemed perfect: accuracy, precision, all the metrics looked great.

But soon, weird patterns emerged. Customers who were high churn risks weren’t being flagged, and safe customers were marked as risky ones. Something was clearly wrong, and it wasn’t the model. Swallowing my pride, I went back to do the EDA I’d ignored. Sure enough, the data was a mess — missing values, mislabeled columns, and outliers everywhere. My shiny ML model had been built on a swamp of bad data.

After a proper EDA and cleaning the data, the model performed far better. Lesson learned: EDA isn’t optional. It’s the groundwork for a solid, reliable model. Skip it, and you’re flying blind.

Project Introduction

The project I’ve taken on this time is, you guessed it, another end-to-end ML project. I have been obsessed with end-to-end projects ever since I had the thought, “Hey, why just stop at building a model? Why not let people try out your fancy model?”, and I haven’t looked back since.

This time around I’m building a movie recommendation algorithm. The final product will be a website where you can enter a movie that you love and get recommendations of movies loved by other users who also loved the movie that you love. If that was too many “love”s to describe the method known as “Item-based Collaborative Filtering”, try the visualization below instead.
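For readers who prefer code to pictures, here is a minimal sketch of the item-based idea. It is not the algorithm from this series (that comes later); the ratings, the movie names, and the `recommend` helper are all made up for illustration, and treating a missing rating as 0 is a simplifying assumption.

```python
import numpy as np
import pandas as pd

# Made-up ratings: users 1-3 like Movies A and B, users 4-5 like Movies C and D.
ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5],
    "title":  ["Movie A", "Movie B", "Movie C", "Movie A", "Movie B",
               "Movie A", "Movie B", "Movie C", "Movie C", "Movie D",
               "Movie C", "Movie D"],
    "rating": [5, 5, 1, 4, 5, 5, 4, 2, 5, 5, 4, 4],
})

# One row per movie, one column per user; unrated cells become 0 (an assumption).
item_user = ratings.pivot_table(index="title", columns="userId", values="rating").fillna(0.0)

# Cosine similarity between every pair of movie rows.
vectors = item_user.to_numpy()
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
similarity = pd.DataFrame(unit @ unit.T, index=item_user.index, columns=item_user.index)


def recommend(title, k=1):
    """Return the k movies whose rating patterns are closest to `title` (hypothetical helper)."""
    scores = similarity[title].drop(title)  # a movie should not recommend itself
    return scores.sort_values(ascending=False).head(k).index.tolist()


print(recommend("Movie A"))
print(recommend("Movie C"))
```

Movies that were rated similarly by the same users end up with a similarity close to 1, so they get recommended to each other's fans.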

Dataset Introduction

For this project, we will not be scraping data from the internet as we did last time around. Instead, we are using a dataset already compiled by GroupLens, a research lab in the Department of Computer Science and Engineering at the University of Minnesota. The dataset selected for this application has 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. To make sure I don’t make the same mistake twice, I started my ML project with EDA this time around. I’ve been humbled.

“Mistakes are meant for learning, not for repeating” — someone wise.

Exploratory Data Analysis (EDA) on MovieLens-100k

Movies in the dataset span from 1902 to 2018 — 116 years. The film industry has gone through multiple eras since it began its journey in the late 19th century. Hence it is worth looking at how users perceived each era to see if there are any interesting insights coming from the data.

Silent Era (1902–1929): This period covers the early years of cinema before synchronized sound was introduced. It’s marked by silent films and early experiments with film techniques.

Golden Age of Hollywood (1930–1959): This era includes the peak of the studio system, major film genres like musicals and film noir, and iconic stars. It’s also the time of significant advancements in film technology and storytelling.

New Hollywood (1960–1979): Marked by the rise of innovative directors and films that challenged traditional norms. This era saw the emergence of independent films and new narrative styles.

Blockbuster Era (1980–1999): Defined by the rise of high-budget, high-grossing films and the dominance of blockbuster franchises. It’s also a time of significant technological advancements like CGI.

Digital Revolution Era (2000–2009): Digital filmmaking emerged, with CGI-heavy films like Avatar and the rise of DVDs transforming home viewing. Early digital distribution and file-sharing began disrupting traditional movie consumption.

Streaming Age (2010–2018): Streaming services like Netflix dominated, shifting audiences away from physical media. Digital effects became standard in blockbusters, further solidifying the digital age in both production and consumption.
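The eras above can be attached to each movie with a little feature engineering. Here is a rough sketch, assuming the release year sits in parentheses at the end of the title (as it does in the MovieLens movies file). The tiny `movies` frame and the `ERA_BINS` / `ERA_LABELS` names are stand-ins of my own for illustration, not the notebook's exact code.

```python
import pandas as pd

# Tiny stand-in for the movies table (the real one has roughly 9,000 rows).
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5, 6],
    "title": [
        "Trip to the Moon, A (Voyage dans la lune, Le) (1902)",
        "Casablanca (1942)",
        "Star Wars: Episode IV - A New Hope (1977)",
        "Forrest Gump (1994)",
        "Departed, The (2006)",
        "Inception (2010)",
    ],
})

# Pull the four-digit year out of the trailing "(YYYY)" in each title.
year_text = movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
movies["year"] = pd.to_numeric(year_text, errors="coerce")

# pd.cut bins are right-inclusive by default, so (1901, 1929] covers 1902-1929, and so on.
ERA_BINS = [1901, 1929, 1959, 1979, 1999, 2009, 2018]
ERA_LABELS = [
    "Silent Era",
    "Golden Age of Hollywood",
    "New Hollywood",
    "Blockbuster Era",
    "Digital Revolution Era",
    "Streaming Age",
]
movies["era"] = pd.cut(movies["year"], bins=ERA_BINS, labels=ERA_LABELS)

print(movies[["title", "year", "era"]])
```

Titles without a parseable year end up with a missing `year` and a missing `era`, which is itself something worth counting during EDA.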

Next, we looked at the number of ratings movies received across their release years and across the eras introduced above, which we added to the data through feature engineering. As the visualization below shows, spreading the movies from 1902 to 2018 gives a left-skewed distribution. This is to be expected: newly released movies need more time to be rated by users, and movies released before 1980, well, I think that’s pretty self-explanatory. The spread shows that movies released in the 90s and early 2000s were rated the most by users.
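Counting ratings per release year is a one-line group-by once ratings and movies are joined. The sketch below uses a made-up miniature of that joined table (the `rated` frame and its values are illustrative only), so the numbers mean nothing beyond showing the mechanics.

```python
import pandas as pd

# Made-up miniature of ratings joined to movies, one row per (user, movie) rating.
rated = pd.DataFrame({
    "userId":  [1, 2, 3, 1, 2, 1, 3, 2],
    "movieId": [10, 10, 10, 20, 20, 30, 40, 40],
    "year":    [1994, 1994, 1994, 1977, 1977, 2010, 1942, 1942],
    "rating":  [5.0, 4.0, 4.5, 4.0, 5.0, 4.5, 4.0, 5.0],
})

# Number of ratings received by movies from each release year.
ratings_per_year = rated.groupby("year")["rating"].count().sort_index()

# Release year with the most ratings.
busiest_year = ratings_per_year.idxmax()

# Sample skewness of release years across all ratings; a negative value
# means the long tail is on the left (older movies).
year_skew = rated["year"].skew()

print(ratings_per_year)
print(busiest_year, year_skew)
```

On the real data, plotting `ratings_per_year` as a bar chart is what produces the kind of distribution described above.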

What does this say about the performance of our future model? Since movies released in the 90s and early 2000s have the most ratings, we should expect users of our application to get the most accurate results when they enter a movie from that period, compared to a movie from the 80s or 2010s. These are the kinds of insights that EDA can surface. Knowing the limitations of the model at this stage, we can find ways to overcome them or make users of the application aware of them beforehand.

How to decide what to explore during EDA

Since we also have data on movie titles and genres, we can use those to find interesting patterns in the dataset as well. This is the time to take your notebook, go on a walk, and ask yourself this simple question: Given the data at hand, what are some questions I can ask of this data? Below are some that came to my mind:

1. Which movies were most rated each era?

2. Which movies were rated highest each era?
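Both questions boil down to the same group-by with a different column to maximize. A minimal sketch on made-up ratings follows (the rows, the "Hypothetical Obscure Film" title, and the `MIN_RATINGS` threshold are my own assumptions for illustration). The threshold is there because a movie with a single 5-star rating would otherwise beat everything.

```python
import pandas as pd

# Made-up ratings that already carry the era of each movie.
rows = [
    ("Blockbuster Era", "Forrest Gump (1994)", 4.0),
    ("Blockbuster Era", "Forrest Gump (1994)", 4.5),
    ("Blockbuster Era", "Forrest Gump (1994)", 4.0),
    ("Blockbuster Era", "Shawshank Redemption, The (1994)", 5.0),
    ("Blockbuster Era", "Shawshank Redemption, The (1994)", 4.5),
    ("Blockbuster Era", "Hypothetical Obscure Film (1990)", 5.0),
    ("Streaming Age", "Inception (2010)", 4.0),
    ("Streaming Age", "Inception (2010)", 4.0),
    ("Streaming Age", "Inception (2010)", 4.5),
    ("Streaming Age", "Toy Story 3 (2010)", 4.5),
    ("Streaming Age", "Toy Story 3 (2010)", 5.0),
]
rated = pd.DataFrame(rows, columns=["era", "title", "rating"])

# Number of ratings and mean rating for every movie within its era.
stats = (
    rated.groupby(["era", "title"])["rating"]
    .agg(n_ratings="count", mean_rating="mean")
    .reset_index()
)

# Question 1: the movie with the most ratings in each era.
most_rated = stats.loc[stats.groupby("era")["n_ratings"].idxmax()]

# Question 2: the movie with the highest mean rating in each era,
# keeping only movies with at least MIN_RATINGS ratings (assumed threshold).
MIN_RATINGS = 2
eligible = stats[stats["n_ratings"] >= MIN_RATINGS]
highest_rated = eligible.loc[eligible.groupby("era")["mean_rating"].idxmax()]

print(most_rated)
print(highest_rated)
```

On the full dataset a larger threshold is probably more sensible; the right value depends on how many ratings a typical movie has.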

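Before we dive into genres, you might notice that genres are not presented in the data in the most useful way: all the genres of a movie sit side by side in a single row. Therefore, we need to convert that horizontally organized data to long format, with one row per movie and genre pair, before we can analyze it.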
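One way to do that conversion, assuming genres arrive as a single pipe-separated string per movie (which is how the MovieLens movies file stores them), is to split and explode. The two-row `movies` frame here is only a stand-in; if your genres are one-hot encoded columns instead, `DataFrame.melt` would be the tool to reach for.

```python
import pandas as pd

# Stand-in for the movies table with pipe-separated genres.
movies = pd.DataFrame({
    "movieId": [1, 6],
    "title": ["Toy Story (1995)", "Heat (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Action|Crime|Thriller",
    ],
})

# Split each genre string into a list, then give every genre its own row.
genres_long = (
    movies.assign(genre=movies["genres"].str.split("|"))
    .explode("genre")
    .drop(columns="genres")
    .reset_index(drop=True)
)

print(genres_long)
```

Each movie now appears once per genre, so any group-by on `genre` counts a multi-genre movie in every genre it belongs to.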

3. Which genres were most rated each era? (Similar code as above)

4. Which genres were rated highest each era? (Similar code as above)
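For the genre questions, the only new step is joining ratings to the long-format genre table before grouping. Here is a sketch with made-up movie IDs, eras, and ratings (none of these values come from the dataset).

```python
import pandas as pd

# Made-up long-format genre table: one row per (movie, genre).
genres_long = pd.DataFrame({
    "movieId": [1, 1, 2, 3, 4, 4, 5],
    "era": ["Blockbuster Era"] * 4 + ["Streaming Age"] * 3,
    "genre": ["Comedy", "Drama", "Drama", "Documentary", "Action", "Sci-Fi", "Action"],
})

# Made-up ratings.
ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2, 3, 4, 4, 5],
    "rating":  [4.0, 3.5, 4.0, 4.5, 4.0, 5.0, 4.5, 4.0, 3.0],
})

# Each rating is repeated once for every genre of the rated movie.
rated_genres = ratings.merge(genres_long, on="movieId", how="inner")

genre_stats = (
    rated_genres.groupby(["era", "genre"])["rating"]
    .agg(n_ratings="count", mean_rating="mean")
    .reset_index()
)

# Question 3: most rated genre per era. Question 4: highest rated genre per era.
most_rated_genre = genre_stats.loc[genre_stats.groupby("era")["n_ratings"].idxmax()]
highest_rated_genre = genre_stats.loc[genre_stats.groupby("era")["mean_rating"].idxmax()]

print(most_rated_genre)
print(highest_rated_genre)
```

Notice that in this toy data the top-rated genre wins on a single rating, so a minimum-count filter like the one used for movies is worth considering here too.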

Since the most rated movies are the ones the most users chose to rate, it seems fair to loosely treat them as the “most popular” movies of their respective eras: Casablanca (1942), Star Wars: Episode IV (1977), Forrest Gump (1994), The Lord of the Rings: The Fellowship of the Ring (2001), and Inception (2010). Thinking along the same lines for genres, drama, comedy, and action seem to be the most popular across the eras.

Now that we have looked at the most popular (most rated) kids in school, let’s take a look at the kids with the highest grades (highest rated): Rear Window (1954), Lawrence of Arabia (1962), The Shawshank Redemption (1994), The Departed (2006), and Toy Story 3 (2010). As expected, the highest rated genres across eras were documentary and film-noir: the nerdy ones.

That wraps up part 1 of a three-part series on building a movie recommendation system from scratch. Next time, we will go into the workings of the algorithm, so stay tuned! Till then, don’t forget to do EDA at the start of every ML project.

Click here to access the complete notebook 💻📝