Sitemap
Dunder Data

Expert python data science training — take a course at dunderdata.com

Press enter or click to view image in full size

Use the Pandas String-Only get_dummies Method to Instantly Restructure your Data

3 min readOct 24, 2022

--

In this post, you’ll learn how to use the fantastic str.get_dummies Pandas Series method to instantly restructure trapped data within a string. We begin by reading in a small sample dataset containing people's favorite fruits.

import pandas as pd
df = pd.read_csv('data/fruits.csv')
df
png

Notice that the fruit column has multiple fruit names in each cell separated by a pipe character. While this format compactly represents the data, it isn’t the most suitable for answering basic questions such as:

  • What is the number of fruit per person?
  • How many people enjoy banana?
  • Which people enjoy both oranges and bananas?
  • How any fruits are in-common with each person?

Attempt to answer questions in current form

It’s possible to answer these questions in the current format using Pandas, though, we will see how these ways are sub-optimal. Here we find the number of fruit per person by adding one to the count of the pipe characters.

s = df.set_index('name')['fruits']
s.str.count(r'\|') + 1
name
Ana 5
Bill 2
Calvin 2
Dean 2
Elias 3
Felicia 2
George 5
Henry 2
Name: fruits, dtype: int64

The number of people who enjoy banana.

s.str.contains('banana').sum()3

The people that enjoy both oranges and banana.

s.str.contains('(?=.*orange)(?=.*banana)')name
Ana True
Bill False
Calvin False
Dean False
Elias False
Felicia False
George True
Henry True
Name: fruits, dtype: bool

Finding the number of fruits in-common with both people is particularly difficult and is a clear case for reformatting the data.

Better formatting of the data

All of these questions can be bettered answered if the data is in a different format. The get_dummies string-only method will split all values in a single cell into their own columns creating 0/1 indicator variables. Here, we pass in the pipe character to get_dummies, producing the following DataFrame.

df1 = s.str.get_dummies('|')
df1
png

We can now answer the same questions as before. Here, we count the number of fruit for each person.

df1.sum(axis=1)name
Ana 5
Bill 2
Calvin 2
Dean 2
Elias 3
Felicia 2
George 5
Henry 2
dtype: int64

We sum up a single column to count the total number of people who enjoy bananas.

df1['banana'].sum()3

Here, we use the query method to select each person who likes both oranges and bananas.

df1.query('orange + banana == 2')
png

Finding the number of fruits in-common with each person is where the largest gain from restructuring comes from. Here, we use the matrix multiplication operator to multiply the DataFrame to itself.

df1 @ df1.T
png

Master Data Analysis with Python

If you enjoyed this tip and are looking to become an expert with Pandas, then check out my extremely comprehensive book, Master Data Analysis with Python. It is the most comprehensive Pandas book available, comes with 500+ exercises, video tutorials, and certification exams.

--

--

Dunder Data
Dunder Data

Published in Dunder Data

Expert python data science training — take a course at dunderdata.com

Ted Petrou
Ted Petrou

Written by Ted Petrou

Author of Master Data Analysis with Python and Founder of Dunder Data

No responses yet