Quick Guide to Tidy Data with Pandas

Stetson Done, MSDA

Follow

Published in

Data Digest

4 min readOct 26, 2022

--

“Like families, tidy datasets are all alike but every messy dataset is messy in its own way.” — Hadley Wickham

Abstract graphic of people holding magnifying glasses looking at scribbled colorful lines

What is tidy data?

Tidy data is a theory of the best practices for how to structure datasets. For an in depth guide, read the original “Tidy Data” paper written by Hadley Wickham of R Studio, published in the Journal of Statistical Software.

Otherwise, read on for a simplified overview you can put into practice right away.

Main Tenants of Tidy Data

Tidy data is centered around three integral rules.

Every column is a variable.
Every row is an observation.
Every cell is a single value.

Three visual examples of how each column is a variable, each row is an observation, and each cell is a single value — Image source: Hadley Wickham’s Tidy Data - https://r4ds.had.co.nz/tidy-data.html

As with most concepts, I find this easier to understand through examples.

Example of Messy Data

dataframe showing messy data. caption contains more info — In this example, the variable for population is contained across several columns, breaking the “every column is a variable” rule. A Tidy Data approach would be the variable ‘population’ in one column with rows representing the observations each year.

Example of Tidy Data

Here the data has been consolidated to follow the rules of Tidy Data with ‘Population’ contained in a single column and ‘Year’ contained in another column. Each column now contains only one variable and each row represents an observation.

Why is Tidy Data important?

Perhaps the best benefit of using tidy data is that it’s easiest to work with. As it’s often said, data cleaning takes around 80% of a data scientist’s time on every project — so why not make things easier on yourself?

Additionally, having one consistent way of structuring the datasets you work with will save you time. Every time you start a new project you will already have an idea of how you want to structure your data.

Using Pandas to Tidy Your Data

As you might have suspected or experienced already, most data found “in the wild” will not already be stored in a Tidy Data format. That’s where you and your Python + Pandas skills come in.

Though there are many built-in Pandas functions you may use to tidy your data, we will focus on the function most commonly used — pd.melt( ).

pd.melt( )

The Pandas pd.melt( ) function is used to change data from a wide to long format. This is one of the most common steps necessary for transforming messy data into Tidy Data.

Key parameters:

id_vars : The columns you want to keep as identifiers (anything you don’t want melted/consolidated).

value_vars : The columns you want to melt (you can leave this out if you specified what not to melt with id_vars ).

value_name : Name for the value you are observing — this will default to ‘value’ if no parameter is given.

var_name : Name for the variable you are observing — this will default to ‘variable’ if no parameter is given.

Example #1

Before .melt()

Code

# applying pd.melt to consolidate the year + population columnsdf.melt(id_vars=['Country'], value_name='Population', var_name='Year')

After .melt()

Example #2