Quick Guide to Tidy Data with Pandas
“Like families, tidy datasets are all alike but every messy dataset is messy in its own way.” — Hadley Wickham
What is tidy data?
Tidy data is a theory of the best practices for how to structure datasets. For an in depth guide, read the original “Tidy Data” paper written by Hadley Wickham of R Studio, published in the Journal of Statistical Software.
Otherwise, read on for a simplified overview you can put into practice right away.
Main Tenants of Tidy Data
Tidy data is centered around three integral rules.
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
As with most concepts, I find this easier to understand through examples.
Example of Messy Data
Example of Tidy Data
Why is Tidy Data important?
Perhaps the best benefit of using tidy data is that it’s easiest to work with. As it’s often said, data cleaning takes around 80% of a data scientist’s time on every project — so why not make things easier on yourself?
Additionally, having one consistent way of structuring the datasets you work with will save you time. Every time you start a new project you will already have an idea of how you want to structure your data.
Using Pandas to Tidy Your Data
As you might have suspected or experienced already, most data found “in the wild” will not already be stored in a Tidy Data format. That’s where you and your Python + Pandas skills come in.
Though there are many built-in Pandas functions you may use to tidy your data, we will focus on the function most commonly used — pd.melt( ).
pd.melt( )
The Pandas pd.melt( ) function is used to change data from a wide to long format. This is one of the most common steps necessary for transforming messy data into Tidy Data.
Key parameters:
id_vars : The columns you want to keep as identifiers (anything you don’t want melted/consolidated).
value_vars : The columns you want to melt (you can leave this out if you specified what not to melt with id_vars ).
value_name : Name for the value you are observing — this will default to ‘value’ if no parameter is given.
var_name : Name for the variable you are observing — this will default to ‘variable’ if no parameter is given.
Example #1
Before .melt()
Code
# applying pd.melt to consolidate the year + population columnsdf.melt(id_vars=['Country'], value_name='Population', var_name='Year')
After .melt()
Example #2
Before .melt()
Code
# renaming columns before melting to be used as the 'region' variabledf.rename(columns={'na':'North America', 'eu':'European Union','jp':'Japan','other':'Other'}, inplace=True)# applying pd.melt() to the regions/sales columnsdf.melt(id_vars=['game', 'release_year', 'genre'], value_name='sales', var_name='region')
After .melt()
By following these examples, you’ll be well on your way to tidying any dataset you come across.
Thank you for reading. If you found this helpful, please like and subscribe for more content.