8 common techniques for data cleaning — Python

Ethan Duong
7 min read · Jan 11, 2023


Data cleaning is one of the most time-consuming, yet mandatory, steps in the data analysis process. Steve Lohr of The New York Times wrote:

“Data scientists, according to interviews and expert estimates, spend 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”

Without the data cleaning process, further analysis is disrupted and the results may be biased. Data preparation is unarguably a vital step that requires a critical mind and a strong command of cleaning tools and techniques.

This article lays out how I used several Python techniques to clean a CSV dataset and ready it for analysis.

Layout:

  • Why Python?
  • The data cleaning process
  • Python techniques covered:

1. Remove multiple irrelevant columns from the dataset

2. Remove multiple rows if they contain specific substrings

3. Handle null and missing values

4. Transform a column value if it contains a specific substring

5. Remove a substring from a string column

6. Split a string column by a separator and convert the result to int type

7. Strip characters from the end of a string column

8. Count rows where a substring appears in a column, with conditions

Why Python?

I have worked with datasets using various tools, such as R, Excel, OpenRefine, and Trifacta Wrangler, and Python is my top option because:

  • Simple, English-like syntax that is easy to share.
  • Supported by Jupyter Notebook, which lets you break your code into smaller parts and run each part separately.
  • Useful libraries for working with data: NumPy, Pandas, and Matplotlib.
  • Can handle much larger volumes of data, and therefore more analysis.
  • Can easily automate mundane tasks.

Cleaning process

For reference, see my other article: Data Science for beginners

Data cleaning is the process of identifying and fixing incorrect data. Below are the common steps in a data cleaning process:

  1. Remove irrelevant data
  2. Remove duplicates
  3. Standardize capitalization
  4. Convert data types
  5. Handle outliers
  6. Fix errors
  7. Translate language
  8. Handle missing values

Cleaning techniques by Python

Example dataset: Data Scientist Jobs Dataset


Business objective: the data needs to be cleaned and readied for a dashboard that compares and informs stakeholders about data science employment.

Import libraries and read file

import pandas as pd    # fast data structures for analysis (DataFrame, Series)
import numpy as np     # support for multi-dimensional arrays

# Read and assign the csv file
alldata = pd.read_csv('DataScientist.csv')

1. Remove multiple irrelevant columns from the dataset

Based on the business objective, we will focus on employment-related information and drop the company-related columns:
index, Rating, Headquarters, Size, Founded, Type of ownership, Revenue, Sector, Easy Apply, Competitors, Unnamed: 0

# List of columns to drop
to_drop = ['index', 'Rating', 'Headquarters', 'Size', 'Founded',
           'Type of ownership', 'Revenue', 'Sector', 'Easy Apply',
           'Competitors', 'Unnamed: 0']
alldata.drop(to_drop, inplace=True, axis=1)

The drop function has many parameters that modify how pandas drops the data. You should be aware of them, as they can be useful in other situations.

Check out parameters for drop function: pandas.DataFrame.drop
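As a quick illustration of those parameters, here is a toy sketch (the frame and column names are made up, not from the jobs dataset) showing the columns= keyword and errors='ignore':

```python
import pandas as pd

# Small illustrative frame
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# columns= is equivalent to axis=1 and reads more clearly
df = df.drop(columns=['c'])

# errors='ignore' skips labels that are not present instead of raising a KeyError
df = df.drop(columns=['not_there'], errors='ignore')
print(list(df.columns))  # ['a', 'b']
```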

2. Remove multiple rows if they contain specific substrings

Using a spreadsheet, I found 7 rows in the Salary Estimate column that measure salary by the hour, while most rows measure it by the year. Since dropping 7 rows will not affect our analysis, we will remove them from the dataset.

# Drop rows whose Salary Estimate contains the substring 'Per Hour'
alldata = alldata.drop(
    alldata[alldata['Salary Estimate'].str.contains('Per Hour', case=False)].index)

You can apply this technique to any substring. A for loop can also be used if multiple substrings need to be removed (similar to section 5).

Check out parameters for str.contains function: pandas.Series.str.contains
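For multiple substrings, one alternative to a loop is a single regex alternation passed to str.contains. A minimal sketch on invented salary strings ('Employer Provided' is a hypothetical second substring for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Salary Estimate': ['$50K-$90K (Glassdoor est.)',
                                       '$20-$30 Per Hour (Employer est.)',
                                       '$15-$25 per hour']})

# One pass with a regex alternation instead of looping over substrings
pattern = 'Per Hour|Employer Provided'   # add more alternatives with |
mask = df['Salary Estimate'].str.contains(pattern, case=False, regex=True)
df = df[~mask]   # keep only rows that match none of the substrings
print(len(df))  # 1
```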

3. Handle null and missing values

Count the total null or NaN values

alldata.isna().sum().sum()
alldata.isnull().sum().sum()

Luckily, this dataset contains no null or NaN values. This is very unusual: in real-world situations, data is messy and includes multiple missing or NaN values.

# Count rows where Industry contains the placeholder '-1'
len(alldata[alldata['Industry'].str.contains('-1')])

However, 548 rows contain ‘-1’ in the Industry column, which should hold strings; these can be considered missing values.

Dropping that many rows is not an option. Instead, we could fill with a constant value, fill with an aggregate value, replace with the previous or next value, or use another data frame. I will cover these solutions in another blog; for now, we will leave these values in the dataset.
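As a quick preview of those options, here is a minimal sketch on made-up Industry values: convert the '-1' placeholder to NaN first, then fill it three different ways.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Industry': ['Tech', '-1', 'Finance', '-1', 'Tech']})

# Treat the '-1' placeholder as a real missing value first
df['Industry'] = df['Industry'].replace('-1', np.nan)

# Option 1: fill with a constant label
filled_const = df['Industry'].fillna('Unknown')

# Option 2: fill with an aggregate (here, the most frequent value)
filled_mode = df['Industry'].fillna(df['Industry'].mode()[0])

# Option 3: carry the previous valid value forward
filled_ffill = df['Industry'].ffill()

print(filled_mode.tolist())  # ['Tech', 'Tech', 'Finance', 'Tech', 'Tech']
```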

4. Transform a column value if it contains a specific substring

As we can see, the Job Title column contains many different titles, which makes further analysis hard for a data analyst. Therefore, we will group them based on keywords in the title:

  • Job title contains ‘analyst’ -> assign ‘Data Analyst’
  • Job title contains ‘scientist’ -> assign ‘Data Scientist’
  • Job title contains ‘engineer’ -> assign ‘Data Engineer’

alldata.loc[alldata['Job Title'].str.contains('analyst', case=False), 'Job Title'] = 'Data Analyst'
alldata.loc[alldata['Job Title'].str.contains('scientist', case=False), 'Job Title'] = 'Data Scientist'
alldata.loc[alldata['Job Title'].str.contains('engineer', case=False), 'Job Title'] = 'Data Engineer'

How to use loc to alter data values: pandas.DataFrame.loc

Only 386 of the 3,888 rows (counted using a spreadsheet) have job titles outside these categories. We will focus on the 3 roles by dropping the other job titles.

# Number of unrelated roles
len(alldata) - (
    len(alldata[alldata['Job Title'] == 'Data Analyst']) +
    len(alldata[alldata['Job Title'] == 'Data Scientist']) +
    len(alldata[alldata['Job Title'] == 'Data Engineer']))

# Drop unrelated roles
alldata = alldata.drop(alldata[
    (alldata['Job Title'] != 'Data Analyst') &
    (alldata['Job Title'] != 'Data Scientist') &
    (alldata['Job Title'] != 'Data Engineer')].index)
# Now the 'Job Title' column only includes our targeted roles.
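A more compact alternative to chaining != conditions is boolean indexing with isin. A small sketch on made-up job titles:

```python
import pandas as pd

df = pd.DataFrame({'Job Title': ['Data Analyst', 'Manager',
                                 'Data Scientist', 'Data Engineer']})

keep = ['Data Analyst', 'Data Scientist', 'Data Engineer']
# isin builds one boolean mask covering all targeted roles at once
df = df[df['Job Title'].isin(keep)]
print(len(df))  # 3
```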

5. Remove a substring from a string column

This section covers removing unnecessary substrings from the Salary Estimate column so that we can split it and convert it to integers later.

Remove and transform substring
# Remove substrings from the string column
# regex=False treats '$', '(' and ')' as literal text, not regex metacharacters
remove = ['(Glassdoor est.)', '$', '(Employer est.)', ')', '(']
for string in remove:
    alldata['Salary Estimate'] = alldata['Salary Estimate'].str.replace(string, '', regex=False)

# Replace 'K' with '000' so we can convert to int type later
alldata['Salary Estimate'] = alldata['Salary Estimate'].str.replace('K', '000')
# Now our column looks like the figure above.

There are many ways to remove substrings from a string column; replace is just one of them. Check out Giorgos Myrianthous’s article for more methods.

Please also be aware of the other parameters of the replace function: pandas.Series.str.replace
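If you prefer a single pass over the loop above, the substrings can be escaped and joined into one regex alternation (the sample salary values below are invented for illustration):

```python
import pandas as pd
import re

df = pd.DataFrame({'Salary Estimate': ['$56K-$97K (Glassdoor est.)',
                                       '$71K-$123K (Employer est.)']})

# Escape each literal substring, join with |, and replace everything in one pass
remove = ['(Glassdoor est.)', '(Employer est.)', '$', '(', ')']
pattern = '|'.join(re.escape(s) for s in remove)
cleaned = df['Salary Estimate'].str.replace(pattern, '', regex=True).str.strip()
print(cleaned.tolist())  # ['56K-97K', '71K-123K']
```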

6. Split a string column by a separator and convert the result to int type

Now we will split the Salary Estimate column into an Estimate Minimum Salary column and an Estimate Maximum Salary column based on the ‘-’ separator, then convert those two columns to integer type so a data analyst can easily work with the numbers in a salary analysis.

Split and convert data type
# Split the Salary Estimate column into two
alldata[['Estimate Minimum Salary', 'Estimate Maximum Salary']] = \
    alldata['Salary Estimate'].str.split('-', expand=True)

# The Salary Estimate column becomes unnecessary
alldata = alldata.drop('Salary Estimate', axis=1)

# Convert the two new columns to int type
alldata['Estimate Minimum Salary'] = alldata['Estimate Minimum Salary'].astype(int)
alldata['Estimate Maximum Salary'] = alldata['Estimate Maximum Salary'].astype(int)
# Now we have two new int columns, as in the figure above.

How to use split: pandas.Series.str.split
How to change column type: pandas.DataFrame.astype
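One caveat: astype(int) raises an error if any value is not numeric. A more forgiving alternative, sketched on made-up values, is pd.to_numeric with errors='coerce', which turns bad values into NaN instead of failing:

```python
import pandas as pd

s = pd.Series(['56000', '97000', 'N/A'])

# astype(int) would raise on 'N/A'; to_numeric coerces it to NaN
nums = pd.to_numeric(s, errors='coerce')
print(nums.tolist())  # [56000.0, 97000.0, nan]
```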

7. Strip characters from the end of a string column

We can see that the Company Name and Location columns were not in the desired format, so I will use string slicing to get the right format.

String Slicing
# Slice off the last 4 characters of the Location and Company Name strings
alldata['Location'] = alldata['Location'].str[:-4]
alldata['Company Name'] = alldata['Company Name'].str[:-4]
# Now we get the desired result, as in the figure above

There are many ways to strip a string column; string slicing is just the method I find most convenient. Check out this article for more methods.

Find out more about string slicing: note.nkmk.me
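If the trailing text is not always exactly 4 characters, a regex can drop everything from a separator onward instead. A sketch, assuming (hypothetically) that the raw Company Name values end with a newline plus a rating; the sample values are invented:

```python
import pandas as pd

# Invented values mimicking a "name\nrating" format
s = pd.Series(['Hopper\n3.5', 'Acme Inc\n4.1'])

# Drop the newline and everything after it, whatever its length
cleaned = s.str.replace(r'\n.*$', '', regex=True)
print(cleaned.tolist())  # ['Hopper', 'Acme Inc']
```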

8. Count rows where a substring appears in a column, with conditions

We will work with Job Description column in this section.

To define which skills are necessary for different data science roles, we need to count the appearances of different skills in the Job Description for each of the three roles.

Let’s assume these are the skills whose frequency we want to find in each Data Analyst job: SQL, Python, R, Power BI, Tableau, and Excel.

# Skills we want to count
skills_set = ['sql', 'python', 'power bi', 'tableau', 'excel', ' r ']
# Use a for loop to print the frequency of each skill
for skill in skills_set:
    print('Number of jobs that require', skill, ':', len(alldata[
        (alldata['Job Description'].str.contains(skill, case=False)) &
        (alldata['Job Title'] == 'Data Analyst')]))
[Output]: 
Number of jobs that require sql : 585
Number of jobs that require python : 274
Number of jobs that require power bi : 98
Number of jobs that require tableau : 278
Number of jobs that require excel : 543
Number of jobs that require  r  : 64

We can use this technique to find the skills frequency for the other two roles.
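To cover all three roles at once, the same counting logic can be folded into a small role-by-skill table. A sketch on an invented miniature dataset (the real counts would come from the full jobs data):

```python
import pandas as pd

# Tiny invented sample standing in for the cleaned jobs data
df = pd.DataFrame({
    'Job Title': ['Data Analyst', 'Data Scientist',
                  'Data Engineer', 'Data Analyst'],
    'Job Description': ['SQL and Excel', 'Python and R',
                        'SQL, Python', 'Tableau and SQL'],
})

skills_set = ['sql', 'python', 'tableau', 'excel']
roles = ['Data Analyst', 'Data Scientist', 'Data Engineer']

# Count, for every role/skill pair, how many rows match both conditions
counts = pd.DataFrame(
    {role: {skill: int(((df['Job Title'] == role) &
                        df['Job Description'].str.contains(skill, case=False)).sum())
            for skill in skills_set}
     for role in roles})
print(counts)
```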

Writer’s note:

Thank you for being here; I hope this helps you in some way.
These are some of the techniques I have been using; I believe there are many other techniques that are also helpful for the data cleaning process.

If you have any concerns, you can contact me through email: ethan.duong1120@gmail.com

Have a great day !
