Data analysts and data scientists spend most of their time cleaning and preprocessing their data. This step involves getting the right data, understanding the data, exploring the data for patterns, and cleaning or preprocessing the data before building any models.
In this article, I will explain how a data analyst goes by analyzing the data using Pandas which is a widely used data analysis library for Python.
I will go through a dataset made by Quentin Caudron . It’s a time-series dataset, describing the total number of coffees made by an espresso machine by a certain date.
You can find the dataset here: https://github.com/QCaudron/pydata_pandas
Before proceeding, make sure you download and import the following libraries into your notebook :
Importing and Understanding the Data
Once you have the dataset in CSV format, you can read the file into a data frame using the
.read_csv() method in pandas. This allows users to import a CSV file and convert it to a data frame. Pandas offers many different ways to create a DataFrame by reading different file types and the list of functions can be found here.
Once you have imported the data, you can look at the data to see how it looks and get a feel for the data. You can see the columns using the
df.columns method which prints out a list of column names.
Also, you can get a descriptive statistics summary of the data using
df.describe() . Here you can see the number of rows, the mean of the column, the standard deviation, the min value, the max value, and the percentiles on each column. This is a very handy tool as instead of generating these values repeatedly for all columns, you get all the values in just one call of a function.
Cleaning the Data
Most datasets we get in the real world are not clean and contain some missing values or random garbage values. Hence, analysts spend most of their time cleaning the data and converting it into something useful and meaningful.
In this section, I will explain the following tasks :
- Incorrect data types for columns
- Handling missing values
Incorrect Data Types
Once you understand the data, you know what data types your columns should be in. You can check the data types using
dtypes . This will return a series object with a data type for each column present in the dataset. Columns having object datatype are strings and other datatypes are the well known integer and float.
As you can see in the dataset, the coffees column is an object datatype but that doesn’t make sense as coffees should just be the number of coffees and should be int or float. Thus we will investigate this column further to see why it is showing an object datatype.
If you print the first 5 rows, you will see a value for ‘testing’ in the coffees columns. This is a string and that is why the data type is object. Hence you can drop this row and convert the column to the right datatype using
Also, the timestamp column should be of datetime type and not a string since it will allow us to apply datetime functions on it. When a CSV file is imported, pandas reads the datetime columns as a string and you need to convert it to datetime using the
Handling Missing Values
There are many different ways to handle missing values in your dataset. The approach you use will depend on your problem statement and the dataset you have. But, the main techniques used are to impute some value such as mean or median in place of a missing value or drop the rows containing missing values. You can also drop the columns having missing values. Refer the documentation to learn about how to handle missing values.
Here, I can check the number for missing values in the coffees column using
df[‘coffees’].isnull().sum() . While checking and looking for missing values, you should understand why values are missing. Here, the machine broke a few times hence the values are missing. Thus we can drop the null value rows using the
Visualizing and Analyzing the Data
For visualizing time-series data, it is always recommended to use line charts to understand the trends over a period of time. Hence, you can do the same using this time-series dataset. Pandas has a handy way of plotting graphs immediately using the
df.plot() method. You can look at the documentation here.
We see that in the line plot, in the end, the data is recorded at very irregular intervals and we can remove this from our dataset. This will result in a much uniform time-series plot.
We can also look at the number of contributions made by each person by using the
value_counts() method and then using the
plot() method with kwarg
kind = ‘bar’ . Value_counts() is just like grouping the data on a column then applying the
count() method to it. It just returns the count of each unique value in the column.
Another useful chart would be to see how many contributions were made each day. For this, we first have to get the weekday names from the timestamp column. Since the data type is now datetime, we can apply
.dt.day_name() to get the day names. We can then group this by the new weekday column and get a count and then plot a bar plot.
Analyzing Time Series
One of the most important tips while working with time-series data is to set the datetime column as the index of the dataset. This helps us to analyze things quickly and allows us to index the dataframe using the timestamp as the index. A useful resource to learn about basic time series is this pydata talk by Ian Ozsvald.
Now, we can assign the timestamp column to be the index of the DataFrame using the following piece of code :
data.index = data[‘timestamp’]# Let’s drop the timestamp column, as we no longer need itdata.drop(columns=’timestamp’, inplace=True)data.head()
Now, one important observation we can make is that the timestamp is not a fixed interval i.e. the data is not recorded at a particular time every day. Instead, there might be multiple records for each day. Thus, to make our analysis easier, we can instead transform the dataset such that we get only 1 value per day and preferably at midnight of each day. We can then interpolate the values where a value might be NaN and move further with our analysis.
We will now create a
date_range() from start till the end of the timestamp column. This will return a range of dates every day at midnight. We can then union this with this original index. We can then
reindex() the DataFrame to this new index we created.
# pd.date_range, with daily frequency, and normalisationmidnights = pd.date_range(data.index, data.index[-1], freq='D', normalize=True)new_index = midnights.union(data.index)# .reindex() the dataframe to get an upsampled dataframeupsampled_data = data.reindex(new_index)
reindex() does is that it will match the values at the index. If there is a row of data for that index, it will return the data and where the index doesn’t match, it will return NaNs.
We can now fill the NaN values by interpolating the nearest time value. So for example, if you have a value at 8:22AM on 3rd March and a value at 11:48AM on 4th March, then the value at 4th March 12:00AM (since we have dates normalized to midnights) will be nearest to the value in the latter. This is how interpolation basically works. Read this discussion to understand more about time-series interpolation.
# .interpolate the upsampled_data using the time methodupsampled_data = upsampled_data.interpolate(method=’time’)upsampled_data.head(10)
Now, we can resample the data daily at each day midnight. This ensures that we have a fixed interval between readings.
# .resample() the upsampled dataframe,# using .asfreq() to get only exactly daily valuesdaily_data = upsampled_data.resample('D').asfreq()# Drop the contributor column, we no longer need itdaily_data = daily_data.drop(columns='contributor')daily_data.head()
Thus, as you can see, we now have an index for each day and have interpolated the values as well. This helps us make much more meaningful insights from the dataset.
Now, we can look at how many coffees are made each day. The coffees column is the number of coffees made up to that date. But to get the number of coffees made each day, we can subtract the current row from the next row. This can be done using the
.diff() method. Furthermore, we use
.shift() here because if we look at the
.diff() between a Monday and a Tuesday, those coffees are attributed to the Tuesday. However, what we want to say is "these many coffees were made at some point on the Monday", so we shift the entire series up one.
# Use .diff() on the coffees column; follow up with .shift()coffees_made = daily_data['coffees'].diff().shift(-1)# Add this as a column to the dataframedaily_data["coffees_made_today"] = coffees_made
Now, we can group by weekday and get the average of coffees made each day. This can be helpful in knowing on which days the most average coffee is made. Since this machine is in a math department, I’m predicting that coffees will be the least on weekends as it is usually off for people and the most coffees would be on days of seminars when someone comes over and gives a talk.
# .groupby weekdays, take the mean, and grab the coffees_made_today columncoffees_by_day = daily_data.groupby('weekdays')['coffees_made_today'].mean()# Plot a bar chartcoffees_by_day.plot(kind='bar')
As you can see from the graph, the least number of coffees are made on an average weekend and the most coffees are made on Wednesday which was a seminar day.
This was just a brief introduction to how data analysts approach time-series data. There is a lot more to it and you should check out this full in-depth tutorial by Quentin Caudron.
You can find the full code for this on my Github : https://github.com/hemantrattey/pydata_pandas
Please feel free to give me feedback regarding the same.
Thank you for reading and I hope it helps motivate you to go ahead and do your own analysis.
Come say Hi! to me on Twitter.
 Pydata : Introduction to data analytics with pandas
 Q. Caudron : Github Repository, https://github.com/QCaudron/pydata_pandas