Data Cleansing using Pandas in Python

Shan Jaffry
Analytics Vidhya
Published in
4 min readSep 10, 2020

In this post, we will be using the Pandas library with Python to demonstrate how to clean the real-world data so that it is ready-to-be-used for any processing task. Frankly, most of the data obtained from real-world sensors contain a lot of garbage and null values. Oftentimes, data are in a format that is not supported by data analysis or machine learning algorithms. Hence data cleansing is almost always the first step in most data analytics/ML/AI jobs.

Pandas is an immensely powerful library for data manipulation. Earlier, I was a huge Matlab fan and thought that there is no match for Matlab when it comes to data analysis. But ever since I have moved to Pandas (Python actually), I hate going back to Matlab again.

Anyways, keep Matlab aside now, and let’s begin with Pandas.

The data that I will be using is the real-world call data record (CDR) that was made public by Telecom Italia as part of a Big Data competition in 2014. You can find and download the complete dataset here (I should warn that this is a very huge data set. Each file is over 300 MB, and there are 62 files in total — one file per day). I am sharing one file only that I will be using in this blog via my google drive. Click here to get that file.

The dataset captures the calls, SMS, and internet usage of Telecom Italia’s users in the city of Milan, Italy for two whole months. Each day is recorded as a single file. However, for this blog, I will only be using the data for a single day (i.e. single file) from this CDR.

Let’s first start with importing all the necessary packages

import pandas as pd
from pandas import read_csv

After importing all the necessary packages, let’s do the real stuff. Pandas provide a function read_csv(…) (which we have imported earlier) to read different kinds of data files. In our files, the data is stored in a tab-delimited format. Hence we will use <delimiter = ‘\t’> argument to specify during the reading process to break whenever a tab (\t) exist in the file.

I always prefer not to mess with the actual variable that stores data. Hence, we will clone data in another dataframe and call it df. The command df.head() will display the first few members of the dataset.

dataset = pd.read_csv('D:/Research/dataset/sms-call-internet-mi-2013-11-1.txt', delimiter='\t')
df = dataset
df.head()

Observe that the dataset is in a raw format (See Fig. 1 below). Even the names of the columns are not mentioned.

Fig. 1: Raw data from Telecom Italia

First of all, we will give appropriate names to all the columns using df.columns. In this particular case, the dataset provider (i.e. Telecom Italia) has given all the information about the columns. Hence we will use this information to appropriately name each column.

df.columns = ['Grid ID', 'Time Stamp','Caller ID','SMS in', 
'SMS out','Call in','Call out','Internet']

Data Cleansing

Data cleansing is very crucial and almost always-needed (as mentioned earlier) step when working with real-world data as captured data may have a lot of discrepancies, missing values, etc.

For example, observe that in Figure 1 above that there are several NaN values within the raw dataset. These values indicate the data acquiring sensors could not get any values for whatever reasons. We will replace all NaN values with Zero (0). For this purpose, pandas provide a simple function fillna(…). We will use this function to replace NaN with Zeros (0). Also, note that using inplace = True is equivalent to stating df = df.fillna(0). This is another strong feature in pandas that allow a cleaner and shorter version of code.

The time unit for each record entry is given in milliseconds. Hence we will also change the time unit into minutes. Finally, we will display the formatted data in Figure 2 below.

# Fill all the NaN values with 0
df.fillna(0,inplace = True)

#The time is in milli-seconds. We will change it to minutes.
df['Time Stamp'] = df['Time Stamp']/(1000*60)

df.head()
CDR after cleansing

Observe that all NaN values are replaced by 0. The timestamp has also been changed into minutes.

Finally, we will display the Internet activity to observe what kind of data we have (You can play around with other activities such as Call in, Call out, etc as well). You can try an insert other activities as well.

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.plot(df['Internet'])
plt.grid()
plt.xlabel('Time (in 10-minute interval)', fontsize =12)
plt.ylabel('Activity', fontsize = 12)

We will stop here this week. Next week, we will use this cleaner version of data to predict cellular traffic using Deep Neural Network (Recurrent Neural Net). Till then…

The complete code is present here.

--

--