Data Science Secrets
Guide to Data Cleaning for Data Science
How to clean data like a professional
Data cleaning is often overlooked because it is seen as the least interesting part of being a Data Scientist. However, it is arguably the most important part of the Machine Learning process, and Data Scientists spend a lot of their time going through the data within a database. They then either remove or update information that is incomplete, incorrect, improperly formatted, duplicated, or irrelevant.
Advantages of Data Cleaning:
- Improves efficiency.
- Simplifies the decision-making process.
- Increases productivity.
- Improves the quality of the data.
- Ensures that data is up to date.
- Prevents ‘garbage in, garbage out’ errors.
Data Sanity Check
"Sanity check" is a term most often used in the context of software, but it applies just as well to data cleaning. Your data needs to make sense and be of use for the problem that you are trying to solve. A sanity check helps you ensure that the data is suitable for your analysis.
Framework for Data Cleaning
Step 1: Remove duplicates at id level, that is, the level at which the rows should be unique.
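Here is a minimal sketch of Step 1 using pandas. The table and the column name `booking_id` are made up for illustration; swap in whatever id your rows should be unique on.

```python
import pandas as pd

# Hypothetical bookings table; booking_id is assumed to be the unique key.
bookings = pd.DataFrame({
    "booking_id": [101, 102, 102, 103],
    "guest": ["Ana", "Ben", "Ben", "Cy"],
})

# Count rows whose key has already appeared earlier in the table.
n_dupes = int(bookings.duplicated(subset="booking_id").sum())

# Keep the first row seen for each key and drop the rest.
deduped = (
    bookings.drop_duplicates(subset="booking_id", keep="first")
    .reset_index(drop=True)
)
```

Counting the duplicates before dropping them is worth the extra line: a surprisingly high count often points to a problem upstream, such as a join that fanned out.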
Step 2: Transform qualitative data into quantitative data by mapping strings to integers.
E.g., a hotel offers packages for 2 days, 5 days and 10 days. We can encode them as: 1 = 2 days, 2 = 5 days and 3 = 10 days.
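The hotel example could be sketched in pandas like this. The DataFrame and the column names `package` and `package_code` are assumptions for illustration; only the 1/2/3 encoding comes from the example above.

```python
import pandas as pd

# Hypothetical column holding the package each guest bought.
stays = pd.DataFrame({"package": ["2 days", "10 days", "5 days", "2 days"]})

# Encoding from the hotel example: 1 = 2 days, 2 = 5 days, 3 = 10 days.
package_codes = {"2 days": 1, "5 days": 2, "10 days": 3}

stays["package_code"] = stays["package"].map(package_codes)

# Labels that are missing from the mapping come back as NaN,
# so counting NaNs is a quick check that nothing was missed.
unmapped = int(stays["package_code"].isna().sum())
```

Because the codes follow the length of the stay, their order is meaningful here. For categories with no natural order, integer codes can mislead a model into seeing a ranking that does not exist.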
Step 3: Handle outliers
Check outliers on all key variables, especially the computed ones.
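One common way to flag outliers is the interquartile range (IQR) rule. The article does not prescribe a method, so treat this as one possible sketch; the helper `iqr_outliers`, the multiplier `k = 1.5` and the sample values are all assumptions.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k * IQR, Q3 + k * IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical computed variable: nightly rate per booking.
nightly_rate = pd.Series([90, 95, 100, 105, 110, 1000])
flags = iqr_outliers(nightly_rate)
outliers = nightly_rate[flags]
```

Flagging is only the first half of handling. Whether you drop, cap or keep a flagged value depends on whether it is an error or a genuine extreme.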
Step 4: Handle missing values, blank columns, etc.
Check for blank columns, a large % of blank data and a high % of repeated data.
- Look for columns which are entirely blank. This can happen when a join fails or when there is an error in data extraction.
- Check the % of blank cases in each column, and check frequency distributions to find out whether the same value is being repeated in more cases than expected.
Follow the link above for my guide on handling missing values.
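The checks in Step 4 can be sketched in pandas as follows. The sample table and the helper `top_value_share` are hypothetical, and what counts as "too blank" or "too repetitive" is left for you to decide.

```python
import pandas as pd

# Hypothetical extract; loyalty_tier is blank everywhere, as if a join failed.
df = pd.DataFrame({
    "guest_id": [1, 2, 3, 4],
    "country": ["IN", "IN", "IN", "US"],
    "loyalty_tier": [None, None, None, None],
    "email": ["a@x.com", None, "c@x.com", "d@x.com"],
})

# Columns that are entirely blank.
all_blank = [col for col in df.columns if df[col].isna().all()]

# Percentage of blank cells in each column.
pct_blank = df.isna().mean() * 100

def top_value_share(col: pd.Series) -> float:
    """Share of non-blank rows taken by the most frequent value."""
    counts = col.value_counts(normalize=True)  # blanks are excluded by default
    return float(counts.iloc[0]) if not counts.empty else 0.0

# A share close to 1.0 means the same value is repeated almost everywhere.
dominant_share = df.apply(top_value_share)
```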
Check the quality of the cleaning that has been done by conducting one or both of the following tests:
1. Synchronisation Test
Check whether all columns are in sync with each other. That is, check that they are in chronological order.
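One way to read the synchronisation test is that date columns should follow their expected order within every row. The sketch below works under that assumption; the columns `booked_on`, `check_in` and `check_out` and their ordering are invented for illustration.

```python
import pandas as pd

# Hypothetical stays table with three date columns stored as text.
stays = pd.DataFrame({
    "booked_on": ["2023-01-02", "2023-01-05", "2023-02-01"],
    "check_in": ["2023-01-10", "2023-01-04", "2023-02-03"],
    "check_out": ["2023-01-12", "2023-01-06", "2023-02-08"],
})

# Assumed chronological order of the columns, earliest first.
ordered_cols = ["booked_on", "check_in", "check_out"]
dates = stays[ordered_cols].apply(pd.to_datetime)

# A row is in sync when every date is on or before the next one.
in_sync = pd.Series(True, index=stays.index)
for earlier, later in zip(ordered_cols, ordered_cols[1:]):
    in_sync &= dates[earlier] <= dates[later]

# Rows that break the order and need a second look.
out_of_sync = stays[~in_sync]
```

In this sample the second row was booked after its check-in date, so it is the only one flagged.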
2. Log Test
If your data is perfectly clean, a simple query, such as displaying logs of the variables, should return the right result. If not, you may have to go back and check what you missed.
Data Cleaning Checklist
- Remove HTML characters
- Decode encoded data
- Remove or substitute NULL values
- Handle zero values
- Handle negative values
- Handle date values
- Remove unnecessary values
- Remove stop-words
- Remove punctuation
- Remove expressions
- Split words that are attached
- Check min and max for each column to ensure that they make sense
- Remove URLs
- Check grammar
- Check spelling
- Fix incorrect entries
- Check geographic coordinates: latitude must be within -90 to 90 degrees and longitude within -180 to 180 degrees
Note: Credit values can be shown as negative numbers
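A few of the checklist items can be sketched with nothing but the Python standard library. The helpers `clean_text` and `valid_coordinates`, the regular expressions and the tiny stop-word list are illustrative assumptions; a real project would likely use a fuller stop-word list and rules tuned to its own data.

```python
import html
import re
import string

# Tiny illustrative stop-word list, not a complete one.
STOP_WORDS = {"a", "an", "the", "is", "at", "of", "and"}

TAG_RE = re.compile(r"<[^>]+>")                  # HTML tags such as <p>
URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # simple URL pattern

def clean_text(raw: str) -> str:
    """Decode entities, strip tags, URLs, punctuation and stop-words."""
    text = html.unescape(raw)        # decode entities such as &amp;
    text = TAG_RE.sub(" ", text)     # remove HTML tags
    text = URL_RE.sub(" ", text)     # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return " ".join(words)

def valid_coordinates(lat: float, lon: float) -> bool:
    """Latitude spans -90 to 90 degrees, longitude -180 to 180."""
    return -90 <= lat <= 90 and -180 <= lon <= 180

cleaned = clean_text(
    "<p>The hotel is great &amp; clean!</p> Visit https://example.com/rooms now."
)
```

The order of the steps matters: URLs are removed before punctuation, because stripping punctuation first would break the URL pattern and leave fragments behind.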
These are the most common data cleaning methods used. However, every dataset is unique and the end use of the dataset varies greatly from case to case. So the cleaning process depends on what you plan to do with the dataset and what outcome you hope to achieve.