Unpivot Delimited Data — Python Data Analysis Series Part 4

José Fernando Costa
Jan 28 · 6 min read
Cover image
Cover image (source)

Today we reach the fourth part of this series, the last part about writing code. The fifth and last part will be about moving our notebooks to the cloud, to an Azure DataBricks workspace.

But before that, we need to analyse the programming languages used by the respondents of the 2020 Stack Overflow Developer survey. This column comes in a bad format, as the choices of each developer are put in the same column, separated by a semicolon (;). We’ll need to split those individual languages into different columns, and then put them all back in a single column, in different rows!

Before we get into the code, here are the usual handy links to get around the series:

Analyse programming language usage

First things first, this is what we are working with today:

Preview of the programming languages original format
Preview of the programming languages original format

As you can see, the first respondent chose C#, HTML/CSS and JavaScript. The second chose JavaScript and Switft. We can tell the options chosen by looking at it manually, but it is not sustainable. We need to put this data in a better format in which it is possible to analyse programmatically.

Hence what I brought up in the introduction, we’ll split each individual option into its own column, and then bring the individual languages back into a single column, each on its own row. We’ll go into further detail about each step as we go through the code.

Analyse programming language usage

The first 10 lines are the same lines you’ve seen in the previous parts of the series: import libraries, load the data and keep only the relevant columns.

As you probably noticed in the preview from earlier, there are blanks in this column. As usual, we’ll take the easy way out and simply remove them (line 12).

On line 14 we do that splitting I spoke of before.

split_languages = data["LanguageWorkedWith"].apply(lambda languages_row: pd.Series(languages_row.split(";")))

In case you’re not familiar with string splitting functions, these are functions to which you give a string of data and the delimiter, and the function returns a list of all the individual items. For example, for the string "C#;HTML/CSS;JavaScript", the call

"C#;HTML/CSS;JavaScript".split(";")

would return the list ["C#", "HTML/CSS", "JavaScript"]". This is the base operation of what we are doing on line 14 , but in the context of a DataFrame. This splitting will be applied to each row, and then the results are spread across different columns, so that each item is in its own column. The pd.Series bit is also important, as making the result a Series lets pandas spread the values into different columns, otherwise it would still be a single column, but with lists of programming languages. At this point this is what the DataFrame looks like for the first five rows:

DataFrame after splitting languages
DataFrame after splitting languages

Most of the columns are blank, but apparently someone in the survey uses 24 programming languages and so we need to have that many columns. In the case of the first respondent for example, only the first three columns are filled. That’s why we need to get everything back into a single column with the stack method (line 17).

languages = split_languages.stack()

Basically, stack goes row by row and puts all values back into the same column, ignoring the blanks (NaN). The result is a multi-level index Series where the first level is still the respondent id (not the “Respondent” column from the dataset, but the index pandas gave it when we loaded the file) and the second level is the order of each language chosen by that person.

Unpivoted delimited data
Unpivoted delimited data

This way we could still relate back to the original dataset to perform deeper analysis of the results, but that is out of the scope for this demo. Be sure to read my tutorial on the matter if you want to learn how to do that.

Now, we need to manipulate those indices, not because we will analyse them, but because they will be useful to count the languages’ frequencies. A simple reset_index will solve that matter, making the two-level index two columns for the now DataFrame (line 19). Oh, and we can also rename the columns to avoid the automatic names.

languages = languages.reset_index()
languages.columns = ["Respondent ID", "Language Reported Order", "Languages"]

At this point, this is what the DataFrame looks like:

Resulting DataFrame
Resulting DataFrame

I did not show screenshots in the previous parts of the series because I thought the transformations were easy to accompany. However, splitting and unpivoting (stack) data is harder to visualize mentally, so I decided to show each step of the transformations.

The last thing we need before plotting the visualization is to calculate the frequencies. This time we’ll finally use the groupby function to perform the aggregation (line 22).

languages_counts = languages[["Languages", "Respondent ID"]]\     
.groupby(by=["Languages"])\
.count()\
.sort_values("Respondent ID", ascending=False)

After getting the frequencies with count, we also sort the values in descending order, so the bar chart shows the most frequent languages on the left side.

languages_counts = languages_counts.head(n=15)

The very last thing we need is to grab only the top 15 most frequent languages (line 27). Plotting all languages would not be very helpful for someone reading the visual, so it’s better to keep only the most relevant ones. You could even argue 15 is too much. At any rate, the head method is enough to get the top 15 rows from the DataFrame.

Finally we plot the bar chart on line 29 and give it some formatting afterwards.

Most used programming languages
Most used programming languages

And with this we’ve finished not only part 4 of the series but also all the notebooks and/or writing code. The next part will be completely focused on moving the four notebooks to an Azure Databricks workspace and making just a few tweaks in the code to make it work in the cloud.

As review of today’s demo, we took survey replies that were stored in the same cell, split them and put them back in a single column, all while making sure the indices could be related back to the original indices! Technically speaking, we split delimited data and unpivoted it back into a single column.

As per custom, I leave you with links to the other parts of the series and the notebooks:

The Startup

Get smarter at building your thing. Join The Startup’s +793K followers.

Sign up for Top 10 Stories

By The Startup

Get smarter at building your thing. Subscribe to receive The Startup's top 10 most read stories — delivered straight into your inbox, once a week. Take a look.

By signing up, you will create a Medium account if you don’t already have one. Review our Privacy Policy for more information about our privacy practices.

Check your inbox
Medium sent you an email at to complete your subscription.

The Startup

Get smarter at building your thing. Follow to join The Startup’s +8 million monthly readers & +793K followers.

José Fernando Costa

Written by

I write about data science to help other people who might come across the same problems

The Startup

Get smarter at building your thing. Follow to join The Startup’s +8 million monthly readers & +793K followers.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store