5 Minute EDA: Data Wrangling in Python

Aya Spencer · Published in 5 Minute EDA · 4 min read · Feb 19, 2022

Basic tips on how to wrangle any dataset in Python

Photo by Fahrul Razi on Unsplash

When working with a new dataset, one of the first steps is to clean the data. This article is a step-by-step guide to wrangling any dataset. But first, let’s start with some definitions:

What is Data Wrangling?

This blog post from Harvard Business School provides a very thorough explanation of what data wrangling is:

Data wrangling — also called data cleaning, data remediation, or data munging — refers to a variety of processes designed to transform raw data into more readily used formats. The exact methods differ from project to project depending on the data you’re leveraging and the goal you’re trying to achieve.

In other words, it’s how you clean the data before you can run any analysis on it. Most raw data in the real world is extremely messy, so it is very important to first spend a good chunk of time wrangling it into a digestible, usable format.

Here are the basic steps to wrangling any dataset:

Step 1: Ingest the data

I will be using a dataset from Kaggle covering startups in the Crunchbase database. Let’s pull in the dataset (stored as a CSV) and take a look at the first 10 rows:

import pandas as pd

df = pd.read_csv("CAX_Startup_Data.csv")
df.head(10)
[Snapshot: first 10 rows of the dataframe]

Immediately, I notice something: some fields are missing but labeled “No Info” instead of the default NaN. This discrepancy is going to be a problem if I want to isolate missing values, because strings such as “No Info” are treated as populated values unless I explicitly tell pandas to read them as missing.

Step 2: Take an initial look at your dataframe to spot obvious inconsistencies
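One quick way to search for such placeholder strings is to count them directly across every column; a minimal check:

# Count cells equal to the literal placeholder "No Info", column by column,
# to gauge how widespread the problem is
(df == "No Info").sum().sort_values(ascending=False).head(10)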

After a thorough search of the dataframe, I find more terms that denote missing values. Let me re-ingest the file, this time making sure all of these string variants default to missing:

# String variants in this dataset that actually denote missing values
missing = ["none", "None", "No Info"]
df = pd.read_csv("CAX_Startup_Data.csv", na_values=missing)
df.head(6)
[Snapshot: sample of the re-ingested dataframe]

Perfect.

I don’t see any more obvious data issues at first glance, so I proceed to the next step.

Step 3: Rename variables for easier referencing

In this dataframe, the variable names are extremely long! I want to shorten them now so that the rest of my cleaning is easier (I won’t have to repeat long variable names over and over in my code). Passing inplace=True makes the renaming permanent by modifying the dataframe directly instead of returning a copy.

df.rename(columns={
    'Capital intensive business e.g. e-commerce, Engineering products and operations can also cause a business to be capital intensive': "Capital_intensive",
    # ... (remaining long-to-short mappings omitted)
    'Survival through recession, based on existence of the company through recession times': "Survival",
}, inplace=True)
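A quick sanity check that the shorter names are in place:

# List the column names after renaming
df.columns.tolist()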

Step 4: Check that each field’s datatype matches the information it stores

You can run into errors if your fields are stored with the wrong datatype. For example, if you have a numeric column that is stored as an “object,” you can get errors when you try to run numeric operations on that field.
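As a hedged illustration of the fix (the column name here is hypothetical, not necessarily one from this dataset), pd.to_numeric coerces such a column, turning unparseable entries into NaN instead of raising:

# Hypothetical example: convert a numeric column stored as object to a
# numeric dtype; values that can't be parsed become NaN
df["Employee Count"] = pd.to_numeric(df["Employee Count"], errors="coerce")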

Let’s check the datatype of each of the fields.

# Collect each column's dtype into a small summary table
datatypes = pd.DataFrame(df.dtypes, columns=["Data Types"])
datatypes
[Snapshot: variables and their datatypes]

Fields like “Est. Founding Date” and “Last Funding Date” are dates and should not be stored as objects. Instead, they should be stored in datetime format.

df["Est. Founding Date"]=pd.to_datetime(df["Est. Founding Date"])
df["Last Funding Date"]=pd.to_datetime(df["Last Funding Date"])

Step 5: Check for missing values

A very important step in the data wrangling process is to survey the landscape of the data you collected: what percentage of it is actually usable?

# Count missing values per column, then express them as a percentage of rows
null = pd.DataFrame(df.isna().sum(), columns=["Null Values"])
null["% Missing Values"] = df.isna().sum() / len(df) * 100

Depending on your preferences, you may want to remove fields in which a majority of the data is missing. In my case, I only kept fields where the missing frequency was 30% or under.
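A minimal sketch of that filter, assuming the 30% cutoff above:

# Keep only the columns where at most 30% of the values are missing
df = df.loc[:, df.isna().mean() <= 0.30]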

Step 6: Final look-through of the dataset

As a final step, I consolidate entries with inconsistent formatting into a single form. For example, if a variable has the response “yes” written in various ways (ex. YES, Yes, yes, YeS, etc.), I consolidate them all into one format.

df["Local or global player"]=df["Local or global player"].replace({"local":"Local","local":"Local","LOCAL":"Local","global":"Global","GLObaL":"Global","GLOBAL":"Global"})df["Has the team size grown"]=df["Has the team size grown"].replace({"yes":"Yes","YES":"Yes"})

This concludes the data wrangling basics.

This is part of my 5-minute EDA series, where I run quick exploratory data analysis on an interesting dataset. Thanks for reading!
