Basic Steps When Cleaning a Data Set Using Pandas

Helpful tips to make data cleaning a breeze

Will Newton
6 min readJul 27, 2020

There are a few things in this world that will never happen: you will never find a straight line in nature, you will never run a 2-minute mile, and you will never work with a data set that has been perfectly cleaned. Missing values, incorrect data types, and redundant columns can be a time consuming blocker for a budding data scientists project.

My instructor at Flatiron School was fond of saying that when building your own data set, “80% of your time will be spent cleaning and performing EDA (exploratory data analyst)”. Hopefully this post can help solve some common issues that many data scientists have when just starting out.

Importing Data and Inspecting First Few Elements

Let’s start with a simple example and show how you can clean a data set thoroughly with just a few lines of code. The .csv file and code for this blog post can be found on my GitHub at the link below:
https://github.com/will4865/Data-Cleaning-Pandas

We’ll start off by importing the pandas library and loading the data set into a pandas dataframe and inspecting the first 5 rows using the pd.head() method.

import pandas as pd
df = pd.read_csv('tn.movie_budgets')
df.head()

We now have a data frame with 6 columns (id, release_date, movie, production_budget, domestic_gross, worldwide_gross). Let’s take a look at how many rows are in the dataframe and what data type each column is. We can do this by using the pd.info() method.

df.info()

Dropping Unnecessary Columns

Looks like 5 out of the 6 columns are listed as non-null object (usually string format) and the id column is listed as an integer column. Let’s go ahead and drop the id column since it seems unnecessary for our purposes today. We can use the pd.drop() method and pass the parameters columns and the name of the column to drop. We also set the inplace parameter to True so that the method saves the new version of the dataframe over the old one.

df.drop(columns = ['id'], inplace = True)

Changing String Format to Date Format

Next up, we have the release_date column which is right now classified as a string datatype. It would be helpful to convert this column to a data type where we could sort the rows from newest to oldest to assist our EDA. The pd.to_datetime() method is a handy way to do that. In order to show the difference, let’s add a new column to this dataframe where we can see the converted datetime format. The pd.to_datetime() method will automatically parse the column to pass through it and convert it into the new datetime format.

df['datetime_format'] = pd.to_datetime(df['release_date'])

Now that we have this new column added, we can use the pd.sort_values() method to sort the entire data frame from newest release date to oldest. In order to do this we must pass in the column we wish to sort on, and set the ascending parameter to False so that the dataframe will be sorted from newest release date to oldest.

df.sort_values('datetime_format', ascending = False)

Looks good! Let’s drop the original release_date column and reorder the datetime_format column to the front.

df = df[['datetime_format', 'movie', 'production_budget', 'domestic_gross', 'worldwide_gross']]
df.head()

Using For Loop to Clean Multiple Columns at Once

Now let’s take a look at the last 3 columns of the dataframe: production_budget, domestic_gross, worldwide_gross. We know from using the pd.info() method earlier in the article that these columns are listed as string data type. If we want to be able to visualize this numeric data down the line, it will need to be converted to numeric values. Before we can do this though, we will need to remove $ and ‘,’ from every value in these columns. We can accomplish this using a for loop.

A for loop will iterate through every element of a list or series and make adjustments based on the function you use in the for loop. In this instance, we want to remove the dollar sign and comma for all 5,782 rows in this dataframe. Let’s start with the dollar sign. We will start by defining a list in Python of the columns that we want to clean and then write a for loop that will iterate through all the rows we defined and remove the $ using the .replace() method. This method requires the first value to be a tuple where the first element of the tuple is what is to be removed and the second element is what will replace it.

cols_to_clean = ['production_budget', 'domestic_gross', 'worldwide_gross']
for col in cols_to_clean:
df[col] = df[col].map(lambda x: x.replace('$',''))

Looks like it worked great! Let’s move on to the commas. As we saw in a previous example, some of the numeric values in the last 3 columns were just the number 0. We we write a similar loop like the one above once the for loop gets to an element without a comma, it will throw up an error. We will need to include some if statements that will check to make sure the comma occurs in each element before it uses the .replace() method. Using list comprehension with an if/else statement makes the code cleaner.

for col in cols_to_clean:
df[col] = [x.replace(',', '') if ',' in x else x for x in df[col]]

Lastly, we can use the same for loop structure to convert all three columns to int64 format. Running the pd.info() method again will show us our updated column data types.

for col in cols_to_clean:
df[col] = df[col].astype('int')
df.info()

Now that we have our columns in the proper format, we can use the pd.describe() method to explore their distributions and some basic measures of central tendency.

df.describe()

There are many, many more steps and possible variables when it comes to data cleaning, but hopefully this has been a good start on your data cleaning journey.

--

--