Unpivot a column of delimited data with R

Published in

Analytics Vidhya

7 min readJul 15, 2020

Previously, I’ve explained how to take a column of delimited data and extract the individual values into their own rows in Power Query (Excel and Power BI) and in Python (pandas).

Today I am expanding this mini-series by explaining how these data transformations can be achieved in R.

Once again I’ll make use of this social network usage sample data to demo the transformations.

The objective is to take the above inital data (loaded from a CSV file) and transform it to the following form:

As an extra, I will also show you how to visualize the frequency of the social networks in a bar chart, with Plotly.

If you wish to skip the explanations and jump directly to the code, feel free to visit my GitHub repository where I have all the code and sample data.

Split and unpivot data

The main focus of this demo is splitting and unpivoting the delimited data.

We can see the “Used Social Networks” column can have multiple social networks in each row (maybe it was a multiple choice question in a survey), separated by semicolons (;). This isn’t a suitable format for data analysis, as we can’t count the frequency of each individual social network.

So, the logic for extracting the individual social networks and putting them on their own rows (unpivot) is as follows:

Split the values of each row into their own column (e.g. Facebook;Instagram are split into two columns, one for Facebook, another for Instagram)
Take those columns with individual options and put them in a single column (unpivot those columns)

(Notice how the data in the “Respondent ID” and “Gender” columns is repeated to make sure the social networks are still respective to their respondent)

And now the R code:

Split and unpivot data

We start by loading the CSV file and creating two helper variables, max_split_cols and sep_into_cols.

max_spit_cols is the maximum number of socials networks chosen by a single respondent, that is, the number of columns we will need to make sure all individual options of that respondent are in their own column. This value is calculated by counting the maximum number of semicolons found in a row, plus one (if a row has two semicolons, then it means the respondent chose two plus one, three, social networks). Oh, and na.rm = TRUE ignores NA/null values.

sep_into_cols is the generation of names for the columns that will hold the individual options. The names are the concatenation of the string “Col” with a number. mapply is like a for loop in the sense that it applies a function (paste) to all values of a sequence (the numbers between 1 and sep_into_cols). sep = "" is an extra argument passed to paste. In other words, mapply will create a list of strings by concatenating “Col” with a number, leaving no separation between the two (an empty string, ""). unname extracts the actual strings out of the list returned by mapply (mapply actually returns a list of named strings, similar to a dictionary in Python).

In case the textual explanation was not clear enough, the above visual should be clearer.

We can now split the delimited values into separate columns (lines 13 and 14 of the code gist). The %>% operator passes (pipes) the preceding value to the following function call as its first argument. For example,

some_value %>% f(y)

some_value will be interpreted as the first argument to the function call of f, and y as the second argument (please note this operator is provided by the dplyr library, it’s not a built-in R operator). In our case, it passes the data loaded from the CSV as the first argument of the separate function. This function is the one responsible for splitting the delimited data into multiple columns.

Besides the DataFrame, we also specify the name of the column with delimited data, “Used Social Networks”, the names of the columns created for the extracted values (sep_into_cols), the separator used to split values (a semicolon), and how to fill columns when not enough values are split. This last argument needs some attention as it specifies how to handle rows where the number of split values is less than the number of columns available (that is, the majority of rows).

If four columns are created to contain the split values (four was the highest number of social networks chosen by a single respondent), then what happens when someone only chooses one, two or three? Well the remaining columns are filled with NA, but that last argument of separate specifies if we first fill the columns with the extracted values (right) or with NAs (left). Since we chose right, the result is this:

But if we had chosen left this would be the result:

Okay, now that we have the values split, we just need to unpivot those columns and we have the transformations finalized. That’s what line 19 is for, using the pivot_longer function. Aside from the split data passed as the first argument, we also need to specify the names of the columns to unpivot (having those names stored in the sep_into_cols variable comes in handy). When unpivoting multiple columns, the names of those columns are put in one column and their values in a second column. That’s what the names_to and values_to arguments are used for, respectively. Lastly, we don’t want to keep those extra NA created during the splitting, hence we specifiy values_drop_na = TRUE.

Since we don’t need the column with the names of those intermediary columns that had split values, we remove it on line 28 by “unselecting” (rather, by selecting it with a minus sign). Hence, the resulting DataFrame is what we were looking for:

Extra: count the frequency of social networks and plot it

Now that the data is in a suitable format, it is easy to plot the frequency of each social network.

Count frequencies and plot

To count the social network frequencies, we simply need to group the DataFrame by that column, “Used Social Network”, and count the frequency of each unique option. Insummarize(Frequency = n()), Frequency is the name of the column with the frequencies and n() is the aggregate function that for counting. Unfortunately, the returned DataFrame only has two columns, one for the social networks (“Used Social Networks”) and another for the frequencies (“Frequency”), but that is enough for the plot.

On line 6 of the second code gist we order the DataFrame by descending order of frequencies.

On line 11 we finally start creating the bar chart plot with Plotly. plot_ly receives the DataFrame of frequencies and then we pipe that function to the add_bar function which creates a Plotly bar chart. The tildes (~) preceding the column names are used to identify those as columns of the DataFrame specified. Afterwards, we pipe yet another function, to update the formatting of the plot. We give it a title, change the axes’ titles and order the X axis’ data by the order of data in the “Frequency” column. In other words, the social networks of the X axis are ordered by descending order of their frequency!

Plotly charts are interactive, that is, you can zoom in/out of the chart, see extra information by hovering, etc., but it was only possible to add a screenshot in this article.

Conclusion

And this it the end of the article. We’ve gone through the following:

How to split data with delimeters in a column
How to unpivot columns
How to count the frequency of values in a column
How to create a Plotly bar chart

As usual, all the code for this article is available on GitHub.

Lastly, here’s the complete script :)

Complete script

Unpivot a column of delimited data with R

Split and unpivot data

Extra: count the frequency of social networks and plot it

Conclusion

Written by José Fernando Costa