Introduction to Data Wrangling

When we perform data wrangling, we are taking our input data from its original state to a format where we can perform meaningful analysis on it. Data manipulation is another way to refer to this process. There is no set list or order of operations; the only goal is that the data post-wrangling is more useful to us than when we started.

In practice, there are three common tasks involved in the data wrangling process:

  • Data cleaning
  • Data transformation
  • Data enrichment

This article is an excerpt from the book Hands-On Data Analysis with Pandas by Packt Publishing, written by Stefanie Molin. The book covers how data analysts and scientists gather and analyze data, how machine learning (ML) algorithms can be used to identify patterns, and much more. In this article, we will learn the basics of data wrangling and its process.

Data cleaning

An initial round of data cleaning on our dataframe will often give us the bare minimum we need to start exploring our data. Some essential data cleaning tasks to master include the following:

  • Renaming
  • Sorting and reordering
  • Data type conversions
  • Deduplicating data
  • Addressing missing or invalid data
  • Filtering to the desired subset of data

Data cleaning is the best starting point for data wrangling since having the data stored as the correct data types and easy-to-reference names will open up many avenues for exploration and wrangling opportunities, such as summary statistics, sorting, and filtering.
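
As a minimal sketch of a few of these tasks (the file, dataframe, and column names here are hypothetical, for illustration only):

import pandas as pd

df = pd.read_csv('data/raw.csv')  # hypothetical input file

# Renaming: give the columns easy-to-reference names
df = df.rename(columns={'Max Temp (C)': 'TMAX', 'Min Temp (C)': 'TMIN'})

# Data type conversions: parse the date column as datetimes
df['date'] = pd.to_datetime(df['date'])

# Deduplicating: drop rows that are exact duplicates
df = df.drop_duplicates()

# Addressing missing data: one of several strategies is to drop incomplete rows
df = df.dropna(subset=['TMAX'])

# Sorting and filtering to the desired subset of data
df = df.sort_values('date')
df = df[df['date'] >= '2018-10-01']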

Data transformation

Frequently, we will reach the data transformation stage after some initial data cleaning, but it is entirely possible that our dataset is unusable in its current shape, and we must restructure it before attempting to do any data cleaning. In data transformation, we focus on changing our data’s structure to facilitate our downstream analyses; this usually involves changing which data goes along the rows and which goes down the columns.

Most data we will find is either in a wide format or a long format; each of these formats has its merits, and it’s important to know which one we will need for our analysis. Often, people will record and present data in the wide format, but there are certain visualizations that require the data to be in the long format.

The wide format is preferred for analysis and database design, while the long format is considered poor design because each column should be its own data type and have a singular meaning. However, in cases where new fields will be added to (or old ones removed from) a table in a relational database, rather than having to alter all the tables each time, the database’s maintainers may decide to use the long format. This allows them to provide a fixed schema for users of the database, while being able to update the data it contains as needed.

When building an API, the long format may be chosen if flexibility is required. Perhaps the API will provide a generic response format (date, field name, and field value) that can support various tables from a database. This may also have to do with making the response easier to form, depending on how the data is stored in the database the API uses. Since we will find data in both of these formats, it’s important we understand how to work with both of them and go from one to the other.
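
For instance, such a generic long format response might look something like the following (a hypothetical illustration; the values are made up):

# Hypothetical generic API response: one record per (date, field) pair,
# regardless of which table the fields come from
response = [
    {'date': '2018-10-01', 'datatype': 'TMAX', 'value': 21.1},
    {'date': '2018-10-01', 'datatype': 'TMIN', 'value': 8.9},
    {'date': '2018-10-01', 'datatype': 'TOBS', 'value': 13.9},
]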

Let’s read in both versions of the same temperature data:

import matplotlib.pyplot as plt
import pandas as pd

wide_df = pd.read_csv('data/wide_data.csv', parse_dates=['date'])
long_df = pd.read_csv(
    'data/long_data.csv', usecols=['date', 'datatype', 'value'], parse_dates=['date']
)[['date', 'datatype', 'value']]  # sort columns

The wide data format

With wide format data, we represent measurements of variables with their own columns, and each row represents an observation of those variables. This makes it easy for us to compare variables across observations, get summary statistics, perform operations, and present our data; however, some visualizations don’t work with this data format because they may rely on the long format to split, size, and/or color the plot content.

Let’s look at the top six observations from the wide format data in wide_df:

wide_df.head(6)

Each column contains a specific class of temperature data in degrees Celsius (maximum temperature, TMAX; minimum temperature, TMIN; and temperature at time of observation, TOBS) recorded at a daily frequency.

When working with wide format data, we can easily grab summary statistics on this data (saved as the wide_df variable) by using the describe() method:

wide_df.describe(include='all')

With hardly any effort on our part, we get summary statistics for the dates, maximum temperature, minimum temperature, and temperature at time of observation.

As we discussed previously, this summary data is easy to obtain and informative. This format can easily be plotted with pandas as well, provided we tell it exactly what we want to plot:

wide_df.plot(
    kind='line', y=['TMAX', 'TMIN', 'TOBS'], x='date',
    title='Temperature in NYC in October 2018', figsize=(15, 5)
).set_ylabel('Temperature in Celsius')
plt.show()

Pandas plots the daily maximum temperature, minimum temperature, and temperature at time of observation as their own lines on a single line plot.

Don’t worry about understanding the visualization code right now; it’s here just to illustrate how each of these data formats can make certain tasks easier or harder.

The long data format

We can look at the top six rows of the long format data in long_df to see the differences between wide format and long format data:

long_df.head(6)

Long format data will have a row for each observation of a variable; this means that, if we have three variables being measured daily, we have three rows for each day we record observations. The long format setup can be achieved by turning the variable column names into the entries of a single column (here, datatype) and putting their values in a separate column (value).
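
In pandas, this reshaping can be done with the melt() method; as a minimal sketch, using the wide format data from earlier:

# Turn the TMAX/TMIN/TOBS columns into (datatype, value) pairs per date
melted = wide_df.melt(id_vars=['date'], var_name='datatype', value_name='value')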

Notice how we now have three entries for each date, with the datatype column telling us what the data in the value column means for that row. If we try to get summary statistics on long_df, like we did with the wide format data, we don’t get useful information back; it’s not helpful to know the average across all of the minimum, maximum, and other temperature observations:

long_df.describe(include='all')

The value column shows us summary statistics, but these summarize the daily maximum temperatures, minimum temperatures, and temperatures at time of observation all at once. The maximum will be the maximum of the daily maximum temperatures, and the minimum will be the minimum of the daily minimum temperatures, which makes this summary data useless.
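
If we do need meaningful statistics from long format data, one option (a sketch, not covered in this excerpt) is to summarize each variable separately, either by grouping or by pivoting back to the wide format:

# Option 1: summary statistics per temperature variable
long_df.groupby('datatype')['value'].describe()

# Option 2: pivot back to the wide format first
long_df.pivot(index='date', columns='datatype', values='value').describe()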

This format is not very easy to digest and certainly shouldn’t be how we present data; however, it makes it easy to create visualizations where our plotting library can color lines by the name of the variable, size the dots by the values of a certain variable, and perform splits for faceting.

import seaborn as sns

sns.set(rc={'figure.figsize': (15, 5)}, style='white')

ax = sns.lineplot(data=long_df, hue='datatype', y='value', x='date')
ax.set_ylabel('Temperature in Celsius')
ax.set_title('Temperature in NYC in October 2018')
plt.show()

Seaborn is able to subset based on the datatype column to give us individual lines for the daily maximum temperature, minimum temperature, and temperature at time of observation.

Seaborn lets us specify the column to use for hue, which colors the lines by temperature type. We aren’t limited to this, though; with the long data, we can easily facet our plots:

sns.set(rc={'figure.figsize': (20, 10)}, style='white', font_scale=2)

g = sns.FacetGrid(long_df, col='datatype', height=10)
g = g.map(plt.plot, 'date', 'value')
g.set_titles(size=25)
g.set_xticklabels(rotation=45)
plt.show()

Seaborn is able to use the long format data to create subplots for each distinct datatype.

While it is possible to create a similar plot to the preceding one with pandas and matplotlib using subplots, more complicated combinations of facets will make using seaborn infinitely easier.

Data enrichment

When we’re looking to enrich the data, we can either merge new data with the original data (by appending new rows or columns) or use the original data to create new data. The following are ways to enhance our data using the original data, with a brief sketch after the list:

  • Adding new columns: Using functions on the data from existing columns to create new values
  • Binning: Turning continuous data or discrete data with many distinct values into range buckets, which makes the column discrete while letting us control the number of possible values in the column
  • Aggregating: Rolling up the data and summarizing it
  • Resampling: Aggregating time series data at specific intervals
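
As a minimal sketch of these four operations, reusing the wide format temperature data from earlier (the bins, labels, and intervals are illustrative assumptions):

df = wide_df.set_index('date')

# Adding new columns: compute the daily temperature range from existing columns
df = df.assign(temp_range=df.TMAX - df.TMIN)

# Binning: bucket the maximum temperature into three labeled ranges
df['tmax_bin'] = pd.cut(df.TMAX, bins=3, labels=['cold', 'mild', 'warm'])

# Aggregating: roll the data up by bin and summarize it
df.groupby('tmax_bin')['TMAX'].agg(['min', 'mean', 'max'])

# Resampling: weekly mean of the daily maximum temperature
df.TMAX.resample('W').mean()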

To summarize, in this article, we learned the basics of data wrangling and its process. To learn about data wrangling in more detail, check out the book Hands-On Data Analysis with Pandas by Packt Publishing.
