Pandas Tutorial: Melting

Jimmy Aidoo
Analytics Vidhya
Published in
4 min readMay 14, 2020

In his 2014 paper ‘Tidy Paper’, Statistician Hadley Wickham explained that though data comes in various shapes and formats, there was always the need to have a formal, univocal way of describing the shape of data.

He believed, among other things, that keeping data in such a format gives the Data Scientist an easier path to cleaning and analysing his/her data. Keeping data ‘Tidy’ peels off a few layers of analysis and helps us better understand how the shape of our data fits into the various components of the data analysis.

To better understand what he meant, I will be introducing you to a simple dataset I created using the English Premiership Golden Boot Race 2018/2019 The values may not be entirely accurate but please follow the argument I am developing and point I am trying to make.

DATASET I

      Name      Goals     Assists
0 Mo Salah 25 8
1 Aguero 22 12
2 Aubameyang 22 11

DATASET II

           0          1          2
Name Mo Salah Aguero Aubameyang
Goals 25 22 22
Assists 8 12 11

These two datasets are similar to examples given in Hadley Wickham’s paper. The two datasets represent the same thing but they have been presented differently. Both styles present unique difficulties to the data analyst or scientist who has been tasked to draw insights from the data. Some data formats are better for Reporting and others to make Analysis easier. So for our purposes, it is imperative that we reorganise either of these datasets so that they look tidy and help with our analysis.

There are three principles of ‘Tidy Data’. These are the rules of thumb and generally accepted as industry best practices.

  • Columns represent separate Variables
  • Rows represent individual Observations
  • Observational Units form Tables

That is, in the Tidy format, each observation will now contain the Name of Player, Goals scored and Assists provided. Making data tidy this way makes it easier to fix common problems along the way.

The problem we want to solve with the dataset provided above is to convert all columns to represent separate variables and make the rows represent individual observations.

We will be doing this using a Pandas method called MELTING.

Melting data is a process of turning columns of your data into rows of data. However, if the columns are already in the right places, Melting it will make your data untidy and you will run into new problems in your analysis.

Here, the Melting method, as the name suggests, ‘melts’ the Goals and Assists columns into one Column we will conveniently call ‘Stats’ and then creates a new column ‘Numbers’ where the numbers for each of those football statistics will be stored.

Assuming the Golden Boot racedataset is stored as ‘df’, we will first specify the dataframe we want to use and which columns we want to hold constant.

For our dataset, we want to keep ‘Name’ constant and ‘Melt’ Goals and Assists to one single column. We do this by specifying within the pd.melt() method, id_vars = 'Name' parameter. The other default parameter of pd.melt() is value_vars . We pass a list of the columns we want to melt to that parameter. value_vars = ['Goals' , 'Assists']

In [1]: pd.melt(frame=df, id_vars ='Name', 
value_vars=['Goals, 'Assists'])
Out[1]: Name variable Value
0 Mo Salah Goals 25
1 Aguero Goals 22
2 Aubameyang Goals 22
3 Mo Salah Assists 8
4 Aguero Assists 12
5 Aubameyang Assists 11

Please note that if we don't specify the columns as we did, pd.melt will use all the columns not specified in the id_vars parameter.

Also, note that the default Variable column name is not always what we want. When melting Dataframes, it makes better sense to have column names more meaningful than Variable and Value. The default names may work in certain situations but is best to always have data that is self-explanatory. We can rename the new melted columns with var_name and value_name

In [2]: pd.melt(frame=df, id_vars='Name',
value_vars=['Goals', 'Assists']
var_name ='Stats', Value_name ='Numbers'
Out[2]: Name Stats Numbers
0 Mo Salah Goals 25
1 Aguero Goals 22
2 Aubameyang Goals 22
3 Mo Salah Assists 8
4 Aguero Assists 12
5 Aubameyang Assists 11

This method seems very handy but that is not always the case. Real-world data, as you imagine, you will be working with in the future are often time very messy and untidy and nothing like the manicured dataset used in this example. Real-world data could be very cumbersome so getting a grip of this simple technique will do you a world of good when the time comes to get your hand dirty with personal projects.

I highly recommend you get some raw data and play with them. The next tutorial will be on the inverse of this method: Pivoting.

I hope you had fun reading and following this tutorial. Until next time, Happy Coding!

--

--

Jimmy Aidoo
Analytics Vidhya

CyberSecurity. Machine Learning. Bread & Circuses.