Unpivot a column of delimited data with R
Previously, I’ve explained how to take a column of delimited data and extract the individual values into their own rows in Power Query (Excel and Power BI) and in Python (pandas).
Today I am expanding this mini-series by explaining how these data transformations can be achieved in R.
Once again I’ll make use of this social network usage sample data to demo the transformations.
The objective is to take the above inital data (loaded from a CSV file) and transform it to the following form:
As an extra, I will also show you how to visualize the frequency of the social networks in a bar chart, with Plotly.
If you wish to skip the explanations and jump directly to the code, feel free to visit my GitHub repository where I have all the code and sample data.
Split and unpivot data
The main focus of this demo is splitting and unpivoting the delimited data.
We can see the “Used Social Networks” column can have multiple social networks in each row (maybe it was a multiple choice question in a survey), separated by semicolons (;). This isn’t a suitable format for data analysis, as we can’t count the frequency of each individual social network.
So, the logic for extracting the individual social networks and putting them on their own rows (unpivot) is as follows:
- Split the values of each row into their own column (e.g. Facebook;Instagram are split into two columns, one for Facebook, another for Instagram)
- Take those columns with individual options and put them in a single column (unpivot those columns)
(Notice how the data in the “Respondent ID” and “Gender” columns is repeated to make sure the social networks are still respective to their respondent)
And now the R code:
We start by loading the CSV file and creating two helper variables, max_split_cols
and sep_into_cols
.
max_spit_cols
is the maximum number of socials networks chosen by a single respondent, that is, the number of columns we will need to make sure all individual options of that respondent are in their own column. This value is calculated by counting the maximum number of semicolons found in a row, plus one (if a row has two semicolons, then it means the respondent chose two plus one, three, social networks). Oh, and na.rm = TRUE
ignores NA/null values.
sep_into_cols
is the generation of names for the columns that will hold the individual options. The names are the concatenation of the string “Col” with a number. mapply
is like a for loop in the sense that it applies a function (paste
) to all values of a sequence (the numbers between 1 and sep_into_cols
). sep = ""
is an extra argument passed to paste
. In other words, mapply
will create a list of strings by concatenating “Col” with a number, leaving no sep
aration between the two (an empty string, ""
). unname
extracts the actual strings out of the list returned by mapply
(mapply
actually returns a list of named strings, similar to a dictionary in Python).
In case the textual explanation was not clear enough, the above visual should be clearer.
We can now split the delimited values into separate columns (lines 13 and 14 of the code gist). The %>%
operator passes (pipes) the preceding value to the following function call as its first argument. For example,
some_value %>% f(y)
some_value
will be interpreted as the first argument to the function call of f
, and y
as the second argument (please note this operator is provided by the dplyr
library, it’s not a built-in R operator). In our case, it passes the data
loaded from the CSV as the first argument of the separate
function. This function is the one responsible for splitting the delimited data into multiple columns.
Besides the DataFrame, we also specify the name of the column with delimited data, “Used Social Networks”, the names of the columns created for the extracted values (sep_into_cols
), the separator used to split values (a semicolon), and how to fill columns when not enough values are split. This last argument needs some attention as it specifies how to handle rows where the number of split values is less than the number of columns available (that is, the majority of rows).
If four columns are created to contain the split values (four was the highest number of social networks chosen by a single respondent), then what happens when someone only chooses one, two or three? Well the remaining columns are filled with NA, but that last argument of separate
specifies if we first fill
the columns with the extracted values (right
) or with NAs (left
). Since we chose right
, the result is this:
But if we had chosen left
this would be the result:
Okay, now that we have the values split, we just need to unpivot those columns and we have the transformations finalized. That’s what line 19 is for, using the pivot_longer
function. Aside from the split data passed as the first argument, we also need to specify the names of the columns to unpivot (having those names stored in the sep_into_cols
variable comes in handy). When unpivoting multiple columns, the names of those columns are put in one column and their values in a second column. That’s what the names_to
and values_to
arguments are used for, respectively. Lastly, we don’t want to keep those extra NA created during the splitting, hence we specifiy values_drop_na = TRUE
.
Since we don’t need the column with the names of those intermediary columns that had split values, we remove it on line 28 by “unselecting” (rather, by select
ing it with a minus sign). Hence, the resulting DataFrame is what we were looking for:
Extra: count the frequency of social networks and plot it
Now that the data is in a suitable format, it is easy to plot the frequency of each social network.
To count the social network frequencies, we simply need to group the DataFrame by that column, “Used Social Network”, and count the frequency of each unique option. Insummarize(Frequency = n())
, Frequency is the name of the column with the frequencies and n()
is the aggregate function that for counting. Unfortunately, the returned DataFrame only has two columns, one for the social networks (“Used Social Networks”) and another for the frequencies (“Frequency”), but that is enough for the plot.
On line 6 of the second code gist we order the DataFrame by descending order of frequencies.
On line 11 we finally start creating the bar chart plot with Plotly. plot_ly
receives the DataFrame of frequencies and then we pipe that function to the add_bar
function which creates a Plotly bar chart. The tildes (~) preceding the column names are used to identify those as columns of the DataFrame specified. Afterwards, we pipe yet another function, to update the formatting of the plot. We give it a title, change the axes’ titles and order the X axis’ data by the order of data in the “Frequency” column. In other words, the social networks of the X axis are ordered by descending order of their frequency!
Plotly charts are interactive, that is, you can zoom in/out of the chart, see extra information by hovering, etc., but it was only possible to add a screenshot in this article.
Conclusion
And this it the end of the article. We’ve gone through the following:
- How to split data with delimeters in a column
- How to unpivot columns
- How to count the frequency of values in a column
- How to create a Plotly bar chart
As usual, all the code for this article is available on GitHub.
Lastly, here’s the complete script :)