Data Science Secrets

Guide to Data Cleaning for Data Science

How to clean data like a professional

Vinita Silaparasetty
Feb 4 · 3 min read
Photo by fabio on Unsplash

Data cleaning is often overlooked as it is the least interesting part of being a Data Scientist. However, data cleaning is the most important part of the Machine Learning process and Data Scientists spend a lot of their time going through all of the data within a database. They then either remove or update information that is incomplete, incorrect, improperly formatted, duplicated, or irrelevant.

Advantages of Data Cleaning:

  1. Improves efficiency.
  2. Simplifies the decision making process.
  3. Increases productivity.
  4. Improves the quality of the data.
  5. Ensures that data is upto date.
  6. Prevents ‘garbage in garbage out’ error.

Data Sanity Check

Sanity checks are often used in the context of software, but it is an accurate term when it comes to data cleaning. Your data needs to make sense and be of use for the problem that you are trying to solve. A sanity check helps you ensure that the data is suitable for your analysis.

Framework for Data Cleaning

Step 1: Remove duplicates at id level, that is, the level at which the rows should be unique.

Step 2: Transform qualitative data into quantitative data by mapping strings to integers.

Eg: for a hotel, they offer packages for 2 days, 5 days and 10 days. We can encode the data as: 1=2 days, 2= 5 days and 3 = 10 days

Step 3: Handle outliers

Check outliers on all key variables, especially the computed ones.

Step 4: Handle missing values, columns etc.

Check for blank columns, large % of blank data, high % of same data

  • Look for columns which are entirely blank. This can happen in case some join fails or in case there is some error in data extraction.
  • Check the % of blank cases by each column and frequency distributions to find out if the same data is being repeated in more cases than expected.

Step 5: Handle missing values

Follow the link above for my guide on handling missing values.

Quality Check:

Check the quality of the cleaning tha has been done, by conduting one or both of the following tests:

  1. Synchronisation Test

Check whether all columns they are in sync with each other. That is, check if they are in chronological order.

2. Log Test

If your data is perfectly clean, a simple query, such as displaying logs of the variables, should return the right result. If not, you may have to go back and check what you missed.

Data Cleaning Checklist

  1. Remove HTML characters.
  2. Decode encoded data.
  3. Remove or substitute NULL values
  4. Handle zero values
  5. Handle negative values
  6. Handle date values
  7. Remove unnecessary values
  8. Remove stop-words
  9. Remove punctuation
  10. Remove expressions
  11. Split words that are attached
  12. Check min and max for each column to ensure that they make sense
  13. Remove URLs
  14. Check Grammar
  15. Check Spellings
  16. Incorrect entries
  17. Geographic coordinates must be within -180 to 180 degrees latitude or longitude.

Note: Credit values can be shown as negative numbers

These are the most common data cleaning methods used. However, every dataset is unique and the end use of the dataset varies greatly from case to case. So the cleaning process depends on what you plan to do with the dataset and what outcome you hope to achieve.

Connect With Me:

Facebook, Twitter, Quora, Youtube and Linkedin.#AI

Vinita Silaparasetty

Written by

Data Scientist | AI | ML |Neural Networks| Author |Speaker | Data Science Trainer|

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade