Three Routes: Convert Comma-Separated Column to Rows

Simi Talkar · Published in The Startup · Sep 9, 2020

Relevant Scenario

Consider a table with one column holding customer names and another holding a comma-separated list of the items each customer bought. This structure of data is extremely common when you ask for information from the field. It could be a Deal Id in the first column and a list of Subscriptions in the next, a Country in the first and a list of States in the next, or a Parent Company as the key and its child (subsidiary) companies in the next column.

To work with the data in any meaningful way, though, we need to separate the comma-separated list into individual rows, each identified by the key element.

Data table

The data we will consider here is a table called ‘checkout’. The first column is the name of a customer and the second is a comma-separated list of order items. We want to go from Fig 1 below to Fig 2.

Fig 1: the checkout table as stored, one row per customer with a comma-separated orders column (the exact rows appear in the Setup data section below).

Fig 2: the target shape, one row per customer per individual order item.

Technique in SQL

PostgreSQL
In Postgres we have a couple of options to split and expand the list into rows. One is a combination of converting the value to an array by splitting it along a supplied delimiter and then unnesting it. The second is the less efficient but straightforward method of using regexp_split_to_table.

Postgres Query 1
select customer,
       unnest(string_to_array(orders, ',')) as "order"
from checkout
order by customer;

Postgres Query 2
select customer,
       regexp_split_to_table(orders, ',') as "order"
from checkout
order by customer;

While in Postgres these convenient functions make short work of breaking apart the comma-separated columns, in Oracle the methodology is a little more complicated but worth understanding, since we use a similar pattern in Power BI.

Oracle
In Oracle there are several ways to split a comma-separated column into rows, as described in the article linked below, but the one I dissect and explain in detail uses a pattern of operation that we will also use in the Power BI DAX formulas:

The pattern essentially involves:
1. Detect the number of items in the comma-separated list.
2. Expand the original table (virtually) by adding on rows to accommodate the newly separated-out items.
3. Separate out the individual items, picking each out of the list and placing it against the appropriate key.

Article with several options provided to split comma-separated strings:

Oracle Query
select t.customer,
       trim(regexp_substr(t.orders, '[^,]+', 1, lines.column_value)) as "Order"
from checkout t,
     table(cast(multiset(
             select level from dual
             connect by level <= regexp_count(t.orders, ',') + 1
           ) as sys.odciNumberList)) lines
order by customer,
         lines.column_value;

To understand the query, we will break it apart and look at the intermediate results. We first get the number of commas in each order list for every customer; adding 1 to the count of commas gives us the number of orders in each comma-separated list.
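A quick way to see those counts on their own (a sketch; the num_items alias is mine, and the results assume the sample data from the Setup data section):

select customer,
       regexp_count(orders, ',') + 1 as num_items
from checkout;
-- e.g. Sari -> 3, Susan -> 4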

We then create a list of numbers, one per individual order item, for each customer using the pseudo-column LEVEL and the pseudo-table DUAL. We need to multiply the row for each customer by the number of individual order items, and so we generate a series of numbers of the required length for each customer.

The LEVEL is used to generate a series of numbers as in

select level from dual connect by level < 5;
1
2
3
4

The 5 seen above will be replaced by the number of order items in each row as we go through every row. This list of LEVEL numbers can be extracted from the column_value of the TABLE created as seen below. We use the built-in sys.odciNumberList collection type for convenience.
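Run on its own, the row-multiplication step looks like this (a sketch extracted from the full query above, minus the regexp_substr step):

select t.customer,
       lines.column_value
from checkout t,
     table(cast(multiset(
             select level from dual
             connect by level <= regexp_count(t.orders, ',') + 1
           ) as sys.odciNumberList)) lines
order by customer,
         lines.column_value;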

And so for each customer, such as Sari, who has three items in her list, there are now three rows, with column_value ranging from 1 to 3.

Now that we have multiplied each row by the number of order items, we can separate out the items and store each against the customer in a column using regexp_substr, which picks out each item from the list using the column_value. We now have our separated-out order list keyed by customer name.
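To see how regexp_substr picks out the nth item, here is an isolated call against one of the sample order lists (the pattern '[^,]+' matches a run of non-comma characters, and the fourth argument selects which occurrence to return):

select regexp_substr('Hacksaw, Rope, Gravel', '[^,]+', 1, 2) from dual;
-- returns ' Rope', with a leading space that trim() removes in the full query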

Technique in Pandas

We start out with a dataframe created from the output of an SQL statement sent to the Postgres database where we created the table (see the Setup data section below).

df = (df.assign(orderssplit=df['orders'].str.split(','))
        .drop("orders", axis=1)
        .rename(columns={"orderssplit": "orders"}))

The one-line command above splits the orders into a list of individual items, drops the comma-separated orders column, and renames the split column back to orders in its place.

We then “explode” this split column and we are done!!
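A minimal sketch of that step (explode is available in pandas 0.25 and later; the str.strip call is my addition, to drop the leading spaces left behind by splitting on ','):

df = df.explode("orders").reset_index(drop=True)
df["orders"] = df["orders"].str.strip()  # remove leading spaces left by the split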

Technique in Power BI

Setup details to reuse the data from Postgres are provided below in the section titled Setup data.

The pattern used in Power BI DAX is similar to the one described for Oracle database above:

The pattern essentially involves:
1. Detect the number of items in the comma-separated list.
2. Expand the original table (virtually) by adding on rows to accommodate the newly separated-out items.
3. Separate out the individual items, picking each out of the list and placing it against the appropriate key.

Here, the number of items in each order list is determined by subtracting the length of the comma-separated string without the commas from the length of the full string (which gives the number of commas) and then adding 1. How do we get the string without the commas? We use SUBSTITUTE, as seen below:
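A sketch of that calculated column in DAX (the column name Num Items is my assumption; checkout[orders] is the imported column):

Num Items =
LEN ( checkout[orders] )
    - LEN ( SUBSTITUTE ( checkout[orders], ",", "" ) )
    + 1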

We now need a Dummy table with a column Sequence, which can be created using GENERATESERIES (DAX) or by importing an Excel workbook with a series of whole numbers. How many numbers? As many as the maximum number of individual items you expect in a comma-separated list, since this table helps ‘expand’ the original query by adding on rows for each item of the list.
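One way to build it (a sketch: GENERATESERIES returns a column named Value, which SELECTCOLUMNS renames to Sequence; the upper bound of 10 is an assumed maximum item count):

Dummy =
SELECTCOLUMNS ( GENERATESERIES ( 1, 10, 1 ), "Sequence", [Value] )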

We then create a calculated table using DAX’s CROSSJOIN to join every row in the checkout query to each of the rows in the Dummy table. This multiplies the number of rows in the original query radically, but we will soon trim off the rows we do not need, limiting each customer to their number of order items by filtering on the item count of each row.
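In DAX the unfiltered cross join could look like this (a sketch; the calculated-table name Expanded is my assumption):

Expanded =
CROSSJOIN ( checkout, Dummy )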

And now for the filtering….

The formula below limits the ‘Sequence’ for each customer to range from 1 to the number of items in their order list. We have now effectively multiplied the number of rows for every customer by the number of items in their list, making room for each item to appear in its own row.
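Putting the cross join and the filter together (a sketch, reusing the item-count expression and the table names assumed above):

Expanded =
FILTER (
    CROSSJOIN ( checkout, Dummy ),
    Dummy[Sequence]
        <= LEN ( checkout[orders] )
            - LEN ( SUBSTITUTE ( checkout[orders], ",", "" ) )
            + 1
)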

With PATHITEM we pick out each item by specifying the “Sequence” number. And voilà …
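PATHITEM walks a pipe-delimited path, so one way to apply it here is to first SUBSTITUTE the commas with pipes (a sketch; the column lives on the Expanded table assumed above, and TRIM strips the leading spaces left after each comma):

Order =
TRIM (
    PATHITEM (
        SUBSTITUTE ( Expanded[orders], ",", "|" ),
        Expanded[Sequence]
    )
)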

Setup data

To reduce the effort of duplicating data in each application, I started out with the database. The commands used for Postgres were tweaked to create a similar table with data in Oracle. The advantage was that I could then connect to this data from a Jupyter Notebook as well as import it into Power BI.

CREATE TABLE public.checkout
(
    customer character varying(50) COLLATE pg_catalog."default",
    orders character varying(200) COLLATE pg_catalog."default"
);

insert into checkout (customer, orders)
values ('Jignesh', 'Potato, Carrots, Turnips');
insert into checkout (customer, orders)
values ('Susan', 'Sugar, Butter, Eggs, flour');
insert into checkout (customer, orders)
values ('Sari', 'Hacksaw, Rope, Gravel');
insert into checkout (customer, orders)
values ('Masaba', 'Eyeliner, Lotion, Turmeric paste');
insert into checkout (customer, orders)
values ('Chris', 'Candles, Card');

To bring this data into a Jupyter notebook:

import sqlalchemy
import pandas as pd

%load_ext sql

# set up credentials that you will use below
credentials = "postgresql://<username>:<user pwd>@localhost/<name of database>"
engine = sqlalchemy.create_engine(credentials)

df = pd.read_sql("""
    select * from checkout
""", con=engine)

Setting up the data in Power BI can be efficiently handled by using Get Data -> PostgreSQL database.

I provided localhost as the server along with the database name, was prompted for a user name and password, and was then presented with the list of tables in the database. I selected the checkout table for this exercise.


Certified DS Associate (DP-100, DA-100), pursuing Masters in University Of Michigan’s Applied Data Science Program https://www.linkedin.com/in/simi-talkar/