Python & pandas: serving data cleaning realness. You better wrangle!

Pyariksha Tiluk
Published in The Startup · 5 min read · Jul 6, 2019

#RuPaul 💕

I come from a business background, and more specifically a financial accounting and auditing background, which means I have undoubtedly encountered extraneous manual processes and data analysis methods (insert flashback to a cluttered office desk and little, frazzled me buried behind stacks of files and paperwork fighting the urge to run to my car, race to the nearest beach and create a bonfire whilst using the whole lot as kindling - just kidding, it’s illegal in Cape Town to create beach fires without getting a permit first).

In one of my first gigs, I worked as a financial accountant, which was quite eye-opening for a recently graduated millennial: I was surprised to see that actual printed paper and Excel spreadsheets were used for all analysis and data storage! How retro and anti-earth/life… This is probably when my interest in using technology for enterprise automation and process improvement first began. Good old, first-hand experience!

SQL was the natural evolution from Excel for me, and once I had up-skilled and subsequently transitioned into a business analyst role, I became more advanced at manipulating structured data. This inevitably led me to seek more robust methods and tools over time as the data requirements became more complex and the data itself became larger. Data wrangling skills became mandatory for me.

Fast-forward to the present day, and all hail the holy library in Python for data analysis: pandas! I am a believer, which is why I want to share some of the basics that have added so much value to my analysis.

Now that I have your attention (one can only hope)…

Data Wrangling basics in Pandas:

There are two important basics in data wrangling using pandas which I would like to touch on, as they have formed the foundation of most of my data pre-processing.

  1. Dealing with missing data
  2. Data Binning

1. Missing data

There are a few options for dealing with missing values in pandas; however, the right choice will largely depend on the context of your data and business requirements.

Firstly, why is this important? Null values make your data mucky and difficult to work with further down the line when you are, for example, building statistical models or even just viewing basic descriptive statistics.

Initially, we would need to check if there is missing data and if so, in which columns.

The .isnull() method gives us a Boolean dataframe of the same shape as our data, with all True values marking the nulls (NaN); chaining .sum() onto it counts the nulls per column, showing us which columns in the dataframe are affected.
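A minimal sketch of that detection step, using a small hypothetical dataframe (the column names and values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with a couple of missing values
df = pd.DataFrame({
    "price": [13495.0, np.nan, 16500.0, 13950.0],
    "num_doors": ["two", "four", np.nan, "four"],
})

# Boolean mask of the same shape: True wherever a value is NaN
missing_data = df.isnull()
print(missing_data)

# Summing the Booleans counts the NaNs per column
print(missing_data.sum())
```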

We can also use the .info() method and check whether the non-null counts of certain columns are less than the length of the dataframe given by the len() function.
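That check could be sketched like this (again with a hypothetical dataframe; .count() returns the number of non-null entries per column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [13495.0, np.nan, 16500.0],
    "make": ["alfa-romero", "audi", "bmw"],
})

df.info()  # prints a "Non-Null Count" per column to compare against len(df)

# Columns whose non-null count falls short of the dataframe length have NaNs
incomplete = [col for col in df.columns if df[col].count() < len(df)]
print(incomplete)
```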

It is important to note that your data may have null values showing up differently, such as “?” or “0”. It is advisable to scan your data, and if the null values are not NaN you can standardize them with the .replace() method.
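For example, a sketch of converting “?” placeholders to NaN so pandas' missing-data tools can see them (the column name is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data where missing values arrive as the string "?"
df = pd.DataFrame({"horsepower": ["111", "?", "154"]})

# Standardize the placeholder to NaN, then cast to a numeric dtype
df["horsepower"] = df["horsepower"].replace("?", np.nan)
df["horsepower"] = df["horsepower"].astype(float)
print(df["horsepower"].isnull().sum())
```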

Now that we have spotted the missing data, made sure the placeholders are all NaN, and know which columns they are in, we can decide how to deal with them.

Broadly, we can do the following to deal with missing data:

Drop the data:

  1. Drop the whole row (dropna with axis = 0)
  2. Drop the whole column (the same, but with axis = 1)

Replace the data:

  1. Replace NaN with the mean (numerical data)
  2. Replace NaN with the most frequent value (categorical data)
  3. Replace with another function (here we can use a function with np.vectorize() or the .apply() method)
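The drop and replace strategies above could be sketched as follows, on a hypothetical dataframe with one numerical and one categorical column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [13495.0, np.nan, 16500.0, 13950.0],
    "num_doors": ["two", "four", np.nan, "four"],
})

# 1. Drop every row that contains any NaN
dropped_rows = df.dropna(axis=0)

# 2. Drop every column that contains any NaN
dropped_cols = df.dropna(axis=1)

# 3. Replace NaN with the column mean (numerical data)
mean_price = df["price"].mean()
df["price"] = df["price"].fillna(mean_price)

# 4. Replace NaN with the most frequent value (categorical data)
most_common = df["num_doors"].value_counts().idxmax()
df["num_doors"] = df["num_doors"].fillna(most_common)

print(df)
```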

2. Data Binning

This is probably what I use the most when pre-processing my data because an important aspect of data preparation is ensuring the data is meaningful and easy to understand.

Data binning allows you to split continuous numerical data into bins for ease of grouping and visualization.

Categorical data can also be binned in a way that adds meaning to data with arbitrary values that stand for something else based on a business requirement specification or index. I would refer to this as data mapping or flagging.
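As a quick illustrative sketch of such mapping (the codes and labels here are invented, not from any real spec), pandas' .map() method translates arbitrary codes into meaningful labels:

```python
import pandas as pd

# Hypothetical business-spec mapping: arbitrary codes -> meaningful labels
df = pd.DataFrame({"risk_code": [1, 3, 2, 1]})
risk_labels = {1: "low", 2: "medium", 3: "high"}

df["risk_level"] = df["risk_code"].map(risk_labels)
print(df)
```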

The simplest way to create evenly spaced bin edges for continuous, numerical data is to use np.linspace().

We then use pd.cut() to create a new column that places each value into one of those bins.
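Put together, the two steps might look like this (prices and labels are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [5118, 7898, 13950, 16500, 31600, 45400]})

# Three evenly spaced bins need four edges
bin_edges = np.linspace(df["price"].min(), df["price"].max(), 4)
labels = ["low", "medium", "high"]

# pd.cut assigns each price to a bin; include_lowest keeps the minimum value
df["price_binned"] = pd.cut(df["price"], bins=bin_edges,
                            labels=labels, include_lowest=True)
print(df)
```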

For completeness, this new binned column can be visualized using matplotlib.
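One way to sketch that visualization is a bar chart of the bin counts (the data is the same hypothetical price column as above; the Agg backend is used so the script runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [5118, 7898, 13950, 16500, 31600, 45400]})
bin_edges = np.linspace(df["price"].min(), df["price"].max(), 4)
df["price_binned"] = pd.cut(df["price"], bins=bin_edges,
                            labels=["low", "medium", "high"],
                            include_lowest=True)

# Count how many records fall into each bin, then plot as a bar chart
counts = df["price_binned"].value_counts().reindex(["low", "medium", "high"])
counts.plot(kind="bar")
plt.xlabel("price bin")
plt.ylabel("count")
plt.savefig("price_bins.png")
```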

Final thoughts…

Data wrangling is an extremely broad topic and there are a multitude of ways to clean and work with data to get it from “raw” to “information”.

Pandas, in combination with numpy, which it is built upon, allows us to seamlessly transform our data. The benefits far surpass the Excel spreadsheets I used back in the day (approx. 2 years ago 😂) and the manual methods of data analysis: results come back in milliseconds, visualizations can be produced immediately, and we can work with large data.

Furthermore, models such as linear regression will be built on the wrangled data, so this is a vital step and worth the effort. Not every data analysis gig will focus on model building, but all will require data wrangling and data manipulation in some form or another.

You better wrangle!
