Today we reach the fourth part of this series, the last part about writing code. The fifth and final part will be about moving our notebooks to the cloud, to an Azure Databricks workspace.
But before that, we need to analyse the programming languages used by the respondents of the 2020 Stack Overflow Developer survey. This column comes in a bad format, as the choices of each developer are put in the same column, separated by a semicolon (;). We’ll need to split those individual languages into different columns, and then put them all back in a single column, in different rows!
Before we get into the code, here are the usual handy links to get around the series:
- Part 1: Analyse the distribution of ages
- Part 2: Plot a histogram of annual compensations
- Part 3: Analyse the education level of respondents
- Part 5: Move your Jupyter notebooks to an Azure Databricks workspace
- Jupyter notebook for this article
Analyse programming language usage
First things first, this is what we are working with today:
As mentioned in the introduction, we'll split each individual option into its own column, and then bring the individual languages back into a single column, each on its own row. We'll go into further detail about each step as we go through the code.
The first 10 lines are the same lines you’ve seen in the previous parts of the series: import libraries, load the data and keep only the relevant columns.
As you probably noticed in the preview from earlier, there are blanks in this column. As usual, we’ll take the easy way out and simply remove them (line 12).
On line 14 we do that splitting I spoke of before.
split_languages = data["LanguageWorkedWith"].apply(lambda languages_row: pd.Series(languages_row.split(";")))
In case you’re not familiar with string splitting functions, these are functions to which you give a string of data and a delimiter, and which return a list of the individual items. For example, splitting the string "C#;HTML/CSS;JavaScript" on ";" would return the list ["C#", "HTML/CSS", "JavaScript"].
The pd.Series bit is also important: making the result a Series lets pandas spread the values into different columns; otherwise we would still have a single column, just holding lists of programming languages. At this point this is what the DataFrame looks like for the first five rows:
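As a runnable stand-in for that preview, here is the same split-and-spread step on made-up rows (only the column name matches the dataset; the values are invented):

```python
import pandas as pd

# Toy stand-in for the survey column (invented data)
data = pd.DataFrame({
    "LanguageWorkedWith": ["C#;HTML/CSS;JavaScript", "Python", "Python;SQL"]
})

# Split each row on ";" and let pd.Series spread the pieces into columns;
# shorter rows are padded with NaN
split_languages = data["LanguageWorkedWith"].apply(
    lambda languages_row: pd.Series(languages_row.split(";"))
)
print(split_languages)
#         0         1           2
# 0      C#  HTML/CSS  JavaScript
# 1  Python       NaN         NaN
# 2  Python       SQL         NaN
```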
Most of the columns are blank, but apparently someone in the survey uses 24 programming languages, so we need that many columns. In the case of the first respondent, for example, only the first three columns are filled. That’s why we need to get everything back into a single column with the stack method (line 17).
languages = split_languages.stack()
stack goes row by row and puts all values back into the same column, ignoring the blanks (NaN). The result is a multi-level index Series where the first level is still the respondent id (not the “Respondent” column from the dataset, but the index pandas gave it when we loaded the file) and the second level is the order of each language chosen by that person.
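Continuing the toy example from before (invented values), you can see stack collapse the wide frame into one column with a two-level index:

```python
import pandas as pd

# Wide frame as produced by the split step (toy, invented values)
split_languages = pd.DataFrame({
    0: ["C#", "Python", "Python"],
    1: ["HTML/CSS", None, "SQL"],
})

# stack puts every value back into a single column, keyed by a
# two-level index: (row index, column the value came from).
# Classic pandas drops the NaN padding by default; on very recent
# versions you may need an explicit .dropna() to get the same result.
languages = split_languages.stack()
print(languages)
```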
This way we could still relate back to the original dataset to perform deeper analysis of the results, but that is out of scope for this demo. Be sure to read my tutorial on the matter if you want to learn how to do that.
Now we need to manipulate those indices, not because we will analyse them, but because they will be useful for counting the languages’ frequencies. A simple reset_index solves that, turning the two-level index into two regular columns of the resulting DataFrame (line 19). Oh, and we can also rename the columns to avoid the automatic names.
languages = languages.reset_index()
languages.columns = ["Respondent ID", "Language Reported Order", "Languages"]
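On toy data (invented values), the same reset_index and renaming behave like this:

```python
import pandas as pd

# Toy multi-level Series, as produced by stack (invented values)
languages = pd.Series(
    ["C#", "HTML/CSS", "Python"],
    index=pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 0)]),
)

# reset_index turns both index levels into regular columns,
# then we replace the automatic column names
languages = languages.reset_index()
languages.columns = ["Respondent ID", "Language Reported Order", "Languages"]
print(languages)
#    Respondent ID  Language Reported Order Languages
# 0              0                        0        C#
# 1              0                        1  HTML/CSS
# 2              1                        0    Python
```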
At this point, this is what the DataFrame looks like:
I did not show screenshots in the previous parts of the series because I thought the transformations were easy to follow. However, splitting and unpivoting (stack) data is harder to visualize mentally, so I decided to show each step of the transformations.
The last thing we need before plotting the visualization is to calculate the frequencies. This time we’ll finally use the groupby function to perform the aggregation (line 22).
languages_counts = languages[["Languages", "Respondent ID"]]\
    .groupby("Languages")\
    .count()\
    .sort_values("Respondent ID", ascending=False)
After getting the frequencies with count, we also sort the values in descending order, so the bar chart shows the most frequent languages on the left side.
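On the toy long-format frame (invented values), the groupby/count/sort chain produces the counts we want:

```python
import pandas as pd

# Toy long-format frame (invented values)
languages = pd.DataFrame({
    "Respondent ID": [0, 0, 1, 2, 2],
    "Languages": ["C#", "HTML/CSS", "Python", "Python", "SQL"],
})

# Group by language, count the respondents in each group,
# then sort so the most frequent language comes first
languages_counts = languages[["Languages", "Respondent ID"]]\
    .groupby("Languages")\
    .count()\
    .sort_values("Respondent ID", ascending=False)
print(languages_counts)
# Python comes out on top with a count of 2; the rest have 1 each
```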
The very last thing we need is to grab only the top 15 most frequent languages (line 27). Plotting all languages would not be very helpful for someone reading the visual, so it’s better to keep only the most relevant ones. You could even argue 15 is too many. At any rate, the head method is enough to get the top 15 rows from the DataFrame.
languages_counts = languages_counts.head(n=15)
Finally we plot the bar chart on line 29 and give it some formatting afterwards.
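The notebook's exact plotting and formatting calls are not reproduced in this article, but a minimal sketch using pandas' built-in .plot on invented counts might look like this (the Agg backend makes it runnable without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Toy counts (invented) standing in for languages_counts
languages_counts = pd.DataFrame(
    {"Respondent ID": [5, 3, 2]},
    index=pd.Index(["Python", "SQL", "C#"], name="Languages"),
)

# Bar chart: most frequent language on the left, since the
# DataFrame is already sorted in descending order
ax = languages_counts.plot(kind="bar", legend=False)
ax.set_ylabel("Respondents")
plt.tight_layout()
plt.savefig("languages.png")
```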
And with this we’ve finished not only part 4 of the series but all the code-writing as well. The next part will be completely focused on moving the four notebooks to an Azure Databricks workspace and making just a few tweaks so the code works in the cloud.
As a recap of today’s demo, we took survey replies that were stored in the same cell, split them apart and put them back into a single column, all while making sure the indices could be related back to the original dataset. Technically speaking, we split delimited data and then unpivoted it back into a single column.
As usual, I leave you with links to the other parts of the series and the notebooks: