Data Prep for Machine Learning Checklist

20 tasks every data scientist should check off BEFORE modeling

Alice Zhao
Published in Learning Data
6 min read · Aug 8, 2023


Photo by Glenn Carstens-Peters on Unsplash

You’ve just learned how to run lr = LinearRegression() in Python. Congrats! You’re excited to start building machine learning models…

But you run into some issues.

Your data is dirty. You realize you can’t input text or date fields into a model. The list goes on and on.

The reality is that data scientists do a ton of data prep work BEFORE modeling. By a commonly cited estimate, up to 80% of their time goes to data prep tasks.

I’ve compiled a checklist of 20 data prep steps you should check off before building machine learning models.

Scoping a Project

Before diving into modeling, it’s important to take a step back and think about why you’re modeling in the first place.

Step 1. Think like an end user
Step 2. Brainstorm problems & solutions
Step 3. Solidify the ML techniques & data requirements
Step 4. Summarize the scope & objectives

Photo by Parabol | The Agile Meeting Toolbox on Unsplash

You want to start by thinking of your end user, or the person / team that’s going to benefit from your analysis. What are their goals? Then together with your end user, you’ll want to dig into the problems that they’re having and brainstorm solutions.

At this point, you may discover that while machine learning can be one solution to the problem, there might actually be a better alternative and machine learning is not needed at all!

If you do decide to proceed with ML, then it’s important to solidify all the technical details at this point — are you going to use supervised or unsupervised learning techniques, where are you going to get the data, etc. Finally, you’ll want to summarize your project goal and scope into a few sentences to be your guiding star throughout your project.

Gathering Data

Now it’s time to get our hands on some data!

Step 5. Locate data from multiple sources
Step 6. Read data into Pandas DataFrames
Step 7. Quickly explore the DataFrames

Photo by Markus Spiske on Unsplash

Once you’ve identified your data sources and scope, the next step is to actually find that data, read it into Python as Pandas DataFrames and quickly explore the data using methods like .describe() and .info() to make sure the data was read in correctly.
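Here is a minimal sketch of Steps 6 and 7. The CSV contents and column names are made up for illustration, and io.StringIO stands in for a real file path so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for a real file on disk.
csv_text = """customer_id,signup_date,plan,monthly_spend
1,2023-01-15,basic,20.5
2,2023-02-03,premium,99.0
3,2023-02-20,basic,
"""

# pd.read_csv accepts a file path or any file-like object.
customers = pd.read_csv(io.StringIO(csv_text))

print(customers.head())      # first rows: do the columns look right?
print(customers.shape)       # (number of rows, number of columns)
customers.info()             # data type and non-null count per column
print(customers.describe())  # summary statistics for numeric columns
```

In this toy data, .info() would reveal that monthly_spend has one missing value and that signup_date was not parsed as a date, which are exactly the kinds of things to note for the cleaning stage.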

Keep in mind that you don’t have to gather all your data upfront. You can always start with one or two data sources, and continue to include more data as you’re cleaning and modeling.

Cleaning Data

This is where data scientists spend the majority of their time.

Step 8. Convert data to the correct data types
Step 9. Identify and handle missing data
Step 10. Identify and handle inconsistent text & typos
Step 11. Identify and handle duplicate data
Step 12. Identify and handle outliers
Step 13. Create new fields from existing fields

Photo by JESHOOTS.COM on Unsplash

One of the first things I like to do when reading in new data is to review the data types of the fields. If anything seems unusual (e.g. a numeric field was read in as a text field), it’s nice to resolve it upfront before running into issues down the line.
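As a rough sketch of Step 8, assuming a small made-up orders table where every column arrived as text, the conversions could look like this:

```python
import pandas as pd

# Hypothetical raw data: every column was read in as text.
raw = pd.DataFrame({
    "order_id": ["1001", "1002", "1003"],
    "amount": ["19.99", "5.00", "n/a"],
    "order_date": ["2023-08-01", "2023-08-02", "2023-08-05"],
})
print(raw.dtypes)

typed = raw.copy()
typed["order_id"] = typed["order_id"].astype(int)
# errors="coerce" turns unparseable values such as "n/a" into NaN
# instead of raising an error.
typed["amount"] = pd.to_numeric(typed["amount"], errors="coerce")
typed["order_date"] = pd.to_datetime(typed["order_date"])
print(typed.dtypes)
```

Coercing bad values to NaN is one option among several. It keeps the column numeric and leaves the gap to be handled with the rest of the missing data.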

Next, I go into the main part of data cleaning — dealing with messy data issues. I’ve included four of the most common issues that I’ve seen in practice — missing data, inconsistent data, duplicate data and outliers.

It’s important to resolve these issues before modeling because a model is only as good as its data.
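The four issues above can be sketched on a tiny made-up table. The choices here (median fill, dropping rows without a city, the 1.5 x IQR rule for outliers) are illustrative assumptions, not the only reasonable ways to handle each issue.

```python
import pandas as pd

# Hypothetical messy data.
df = pd.DataFrame({
    "city": ["Chicago", "chicago ", "New York", None, "Boston", "BOSTON", "Denver", "Denver"],
    "age": [34, 34, 29, 52, 41, 38, 45, 230],
    "income": [72000.0, 72000.0, None, 58000.0, 61000.0, 55000.0, 80000.0, 64000.0],
})

# Missing data: count nulls per column, then fill or drop.
print(df.isna().sum())
df["income"] = df["income"].fillna(df["income"].median())
clean = df.dropna(subset=["city"]).copy()

# Inconsistent text: trim whitespace and standardize capitalization.
clean["city"] = clean["city"].str.strip().str.title()

# Duplicates: count fully identical rows, then keep the first of each.
n_duplicates = clean.duplicated().sum()
clean = clean.drop_duplicates()

# Outliers: flag ages outside 1.5 * IQR of the middle 50% of values.
q1, q3 = clean["age"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (clean["age"] < q1 - 1.5 * iqr) | (clean["age"] > q3 + 1.5 * iqr)
print(clean[is_outlier])  # review flagged rows before deciding what to do
clean = clean[~is_outlier]
```

Order matters here: the text is standardized before checking for duplicates, because "Chicago" and "chicago " only show up as the same record once they are cleaned.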

Finally, before moving on to exploratory analysis, I like to create new fields based on existing fields, such as extracting years and months from date fields or combining fields with a calculation or concatenation.
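A short sketch of Step 13, using a made-up orders table (all column names are hypothetical):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2023-01-15", "2023-06-30", "2024-02-01"]),
    "first_name": ["Ada", "Grace", "Alan"],
    "last_name": ["Lovelace", "Hopper", "Turing"],
    "quantity": [2, 1, 5],
    "unit_price": [9.99, 250.0, 4.5],
})

# Extract parts of a date with the .dt accessor.
orders["order_year"] = orders["order_date"].dt.year
orders["order_month"] = orders["order_date"].dt.month

# Combine fields with a calculation.
orders["total_price"] = orders["quantity"] * orders["unit_price"]

# Combine fields with a concatenation.
orders["full_name"] = orders["first_name"] + " " + orders["last_name"]

print(orders)
```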

Exploratory Data Analysis

Once the data is mostly clean (remember that it’s rare to have perfectly clean data!), this is where the fun begins.

Step 14. View the data from multiple angles
Step 15. Visualize the data to quickly identify trends & patterns

Photo by Myriam Jessier on Unsplash

With Exploratory Data Analysis (EDA), you can start to discover insights by viewing your data in different ways: filtering, sorting and grouping it, and visualizing it with histograms, scatter plots and pair plots.

Visualizations are useful for both finding patterns that can be shared as insights and also finding anomalies that can lead to further data cleaning.
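Steps 14 and 15 might look something like the sketch below on a made-up sales table. The plotting helper is a hypothetical function that sticks to pandas’ built-in plotting (which needs matplotlib installed), and scatter_matrix is used as a pandas stand-in for a pair plot.

```python
import pandas as pd

# Hypothetical sales data.
sales = pd.DataFrame({
    "region": ["East", "West", "East", "South", "West", "East"],
    "units": [10, 3, 7, 12, 8, 1],
    "revenue": [200.0, 45.0, 150.0, 260.0, 170.0, 15.0],
})

# Filter: only the larger orders.
big_orders = sales[sales["units"] >= 8]

# Sort: the three highest-revenue rows.
top_revenue = sales.sort_values("revenue", ascending=False).head(3)

# Group: summary statistics per region.
by_region = sales.groupby("region")["revenue"].agg(["count", "sum", "mean"])
print(by_region)


def plot_overview(df):
    """Draw a histogram per numeric column, a scatter plot, and a scatter matrix."""
    numeric = df.select_dtypes(include="number")
    numeric.hist()
    df.plot.scatter(x="units", y="revenue")
    pd.plotting.scatter_matrix(numeric)
```

Calling plot_overview(sales) in a notebook draws the charts. A point far from the rest in the scatter plot is often a cue to go back and revisit the cleaning steps.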

Preparing for Modeling

The final step before modeling is to get your data into a very specific format that you can input into a machine learning model.

Step 16. Create a single table
Step 17. Set the correct row granularity
Step 18. Ensure each column is non-null and numeric
Step 19. Engineer new features
Step 20. Split the data into training, validation & test sets

Photo by Markus Winkler on Unsplash

First, you’ll want to create a single table that holds all of your data, including both the features and the target variable.
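One way to build that single table is a merge. This is only a sketch with made-up customer and purchase tables, where churned plays the role of the target variable:

```python
import pandas as pd

# Hypothetical source tables.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "plan": ["basic", "premium", "basic"],
    "churned": [0, 1, 0],  # target variable
})
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [20.0, 35.0, 99.0, 5.0, 12.5, 7.5],
})

# A left join keeps every customer, even those with no purchases.
# validate raises an error if customer_id is unexpectedly duplicated
# in the customers table.
combined = customers.merge(
    purchases, on="customer_id", how="left", validate="one_to_many"
)
print(combined)
```

Note that the result has one row per purchase, not one row per customer, so the table still needs its granularity fixed before modeling.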

From there, you need to determine what one row of data should look like. If you’re making predictions about a customer, then one row of data should represent a customer instead of each one of their purchases. This is where a .groupby() comes in handy!
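As a sketch of that customer example, a made-up purchases table can be rolled up to one row per customer like this (the summary column names are my own):

```python
import pandas as pd

# Hypothetical purchase-level data: several rows per customer.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [20.0, 35.0, 99.0, 5.0, 12.5, 7.5],
})

# Aggregate to one row per customer with named aggregations.
per_customer = (
    purchases.groupby("customer_id")
    .agg(
        n_purchases=("amount", "count"),
        total_spend=("amount", "sum"),
        avg_spend=("amount", "mean"),
    )
    .reset_index()
)
print(per_customer)
```

Each aggregated column is also a candidate feature, since it summarizes a customer’s purchase history in a single number.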

Once you have the correct row granularity, you’ll need to make sure that your data is non-null and numeric, by potentially imputing data, creating dummy variables, etc. It’s also important to think about additional features or columns that you can add to your table that could be good predictors or differentiators for your model.
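Here is a rough sketch of Steps 18 and 19 on a made-up customer table. Median imputation and the spend_per_purchase feature are illustrative choices. In a real project you would typically compute fill values from the training data only, so nothing from the validation or test sets leaks in.

```python
import pandas as pd

# Hypothetical customer-level table with a text column and a missing value.
table = pd.DataFrame({
    "plan": ["basic", "premium", "basic", "premium"],
    "total_spend": [55.0, 99.0, None, 120.0],
    "n_purchases": [2, 1, 3, 4],
})

model_ready = table.copy()

# Impute: fill the missing value with the column median.
median_spend = model_ready["total_spend"].median()
model_ready["total_spend"] = model_ready["total_spend"].fillna(median_spend)

# Dummy variables: turn the text column into 0/1 columns.
model_ready = pd.get_dummies(model_ready, columns=["plan"], dtype=int)

# Engineer a new feature from existing columns.
model_ready["spend_per_purchase"] = (
    model_ready["total_spend"] / model_ready["n_purchases"]
)

print(model_ready)
print(model_ready.dtypes)
```

For models such as linear regression, passing drop_first=True to pd.get_dummies avoids having dummy columns that are perfectly redundant with each other.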

Finally, if you’re applying a supervised learning algorithm, you’ll want to split your data into training, validation and test sets so that you can fit the model on the training data, assess it on the validation data, and finally score it on the test data.
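A three-way split can be sketched by calling scikit-learn’s train_test_split twice. The 60/20/20 proportions and the toy data are assumptions for illustration, not a recommendation for every project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical model-ready table: numeric features plus a target column.
data = pd.DataFrame({
    "feature_a": range(100),
    "feature_b": [i % 7 for i in range(100)],
    "target": [i % 2 for i in range(100)],
})
X = data.drop(columns="target")
y = data["target"]

# First hold out 20% of the rows as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then split the remaining 80% into training and validation sets.
# 0.25 of the remaining 80% is 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Setting random_state makes the split reproducible, so results can be compared across runs.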

And that is your 20-step checklist of things to do to prepare your data for modeling.

I’ve created a course called Data Science in Python: Data Prep & EDA that goes through each of these data prep steps in detail. You can find it on the Maven Analytics platform and on Udemy.

If you’re an aspiring data scientist, this is a great course to kick off your data science journey and ensure that you have the foundational data skills for modeling.

Happy learning!



Hi! 👋 I'm a data scientist & author of the SQL Pocket Guide (O’Reilly). Check out my Data Science in Python series on Maven / Udemy & my blog, A Dash of Data.