A beginner’s guide to melting dataframes

Published in

Afro AI

6 min readMay 18, 2021

As we’ve all come to learn, data preparation is a very important step in exploratory data analysis (EDA) before proceeding to any kind of visualization. An ill-defined data preparation procedure poses the risk of not getting sufficient insights from the data being investigated. Have you ever looked at your dataframe and thought, I have to melt these columns here? Well, for me it was not so when I started working on the dataset from Kaggle State of Data Science survey. But I hope it will be intuitive for you after after you’ve had your dose of the melt function in this article. The article is inspired by my first publication titled Comparing The State Of Data Science in Africa to Other Continents. I provide a beginner’s guide to melting dataframes with categorical features, when and how to melt dataframes and what I learnt in the process.

UNDERSTANDING THE DATA STRUCTURE

The data structure is the key determinant of whether or not we’re going to need melting. For reference, let’s consider the dataset from the Kaggle State of Data Science Survey. On scrutiny, we find that it comprises categorical data in which the columns Q1-Q6 were responses to multiple choice questions where only a single choice could be selected — these were more of demographic questions. For example, question 6 which asked, ‘For how many years have you been writing code and/or programming?’, one could only choose one answer because they can only belong to one group in these choices: I have never written code, <1 year, 1–2 years, 3–5 years, 5–10 years, 10–20 years and 20+ years.

However, from questions 7–39, the respondents were allowed to select more than one answer from the multiple choice questions . For example, question 7 which asked, ‘What programming language do you use on a regular basis?’, had 12 options including Python, R, SQL, C, C++, Java, Javascript, Julia, Swift, Bash, MATLAB, None, Other — one could select any of the choices that applied to them. Each choice was split into multiple columns — with one column per answer choice and were thus labelled Q7_Part_1, Q7_Part_2 until Q7_OTHER to accommodate all answer choices. See the head of the dataframe below which demonstrate the described data structure:

ZOOMING INTO THE MULTIPLE ANSWER QUESTIONS

When a dataframe is structured like this, one approach for drawing a bar plot with all the categories (answer choices) requires generating a list of all the columns for the survey question of interest. And then put them together with another desired column using a groupby function, then transpose the resulting dataframe (see diagram below). This will achieve the expected bar plot, however notice that you still need to rename the corresponding xtick labels (thus, make another list of the unique values in each column).

This approach is tedious and definitely not time effective as it requires long lines of code. It may limit the manipulation one can do for a particular trend and the nature of relationships one may desire to have in one plot. The pandas melt function comes in very handy when tackling a situation like this.

THE MELT FUNCTION

To remedy the multiple choice answers, I used a melt function which helped to put all the answer choices in one column next to their corresponding question label.

By definition a pandas melt function is used to unpivot dataframes from a wide long format, optionally leaving identifier variables set (see docs). More information about the melt operation can be found in the publication by the renowned data guru Hadley Wickham titled tidy dataset.

Some parameters of the melt function are frame, id_vars, value_vars, var_name, value_name. Frame is the DataFrame you wish to unpivot. Id_vars are columns to use as identifier variables. For instance, these are columns from your DataFrame which will remain unchanged. Value_vars are columns to unpivot i.e. columns with similar information that you want to put into one column, such as the 12 programming languages of question 7 which were split into 12 unique columns as seen earlier. The function will put Q7_Part_1, Q7_Part_2 until Q7_OTHER in one column and the programming languages (Python, R, SQL, C etc) in their own column next to the corresponding question label. Thus, value_vars introduces two new columns which by default are labelled ‘variable’ and ‘value’. To customize variable and value column labels you can specify var_name for the variable column and value_name for the value column. If value_vars parameter is not specified the function will unpivot every other column that is not set in id_vars.

The diagram below illustrate how I used the melt function in code. The id_vars are the single choice responses of the questions and value_vars is a list of the questions I wanted to unpivot. It is important to note that id_vars and value_vars takes None variables — either a tuple, list or ndarray.

Take a look at the head of the DataFrame showed earlier and compare with this new one. Notice that the columns set as identifier variables are just as they were in the original DataFrame and a list of the columns specified in value_vars are unpivoted to row axis, leaving two non-identifier columns, variable and value.

This new format of a DataFrame allowed me to draw multi-index bar plots easily and more effectively such as the one below.

DRIVE THE MELT KNOWLEDGE HOME WITH A SIMPLE EXAMPLE

In the simple example below, we have a DataFrame with four columns: Group_ID, Female, Male and Other. Since Female, Male and Other are gender groups containing similar data, we can put them into one column instead of having them in separate columns. Group_ID is set as an identifier variable and the gender groups are the value variables to be unpivoted. The .melt() function from pandas library puts columns specified in the value_vars into one column (Variable) and the corresponding values next to it (Value). The variable and value name have been renamed to gender_group and count respectively.

In the new format, we have a tidy dataset, which is much easier to work with as similar data is put into the same column. One can now draw a plot directly from the dataframe without having to make lists or use a groupby function as seen earlier.

WHAT I LEARNT AND A TAKE HOME FOR YOU

The questions I asked for my analysis required separating data into Africa and continental data. Melting the columns allowed more flexibility with how I could ask questions, and thus what I could extract insights from the data. The melt function made it easier to interact with the DataFrame directly without further manipulations. In addition, most of my time was saved to having more fun making different plots and telling a story from them.

The most important lesson to take home from this article is that, when working with data, it’s important to have the data in a format that will allow you flexibility in asking different question and making different visualization. Your data structure as a data scientist should never be the limit to the problems you want to solve, or the message you want to get out the world. Have fun while you’re at it! Furthermore, it is important to respect the tidy dataset format and understand the flexibility it affords in your analysis.

Resources

2020 Kaggle Machine Learning & Data Science Survey | Kaggle

pandas.melt — pandas 1.2.4 documentation (pydata.org)

Hadley Wickham

Tidy Data (had.co.nz)