More Patterns for ETL and EDA

David Masse
3 min readDec 6, 2018

--

https://github.com/davidmasse/blog-tips-two

This post for my own reference continues the themes of and earlier blog post. Please see basic Python imports there as well.

Column of Lists

Why would one have a column of lists? The lists might contain the various categorical values an observation takes on. A shirt could be ‘blue’, ‘green’ or ‘beige’ or any combination of these colors, so the row for a particular shirt could have [‘green’, ‘beige’] in the ‘color(s)’ column. But what if we want a column for each color so that, in our green-and-beige shirt’s row, the ‘beige’ column and the ‘green’ column indicate True or 1 while the ‘blue’ column indicates False or 0.

The normal Pandas .pivot() method to “un-melt” or “cast” the dataframe (as in R’s reshape package) will not work because it would make a new column for each distinct list in the original column. Instead we need a new column for each distinct term in the union of all the lists in the original column. It could be handled using sklearn’s MultiLabelBinarizer, but to keep things simpler, we use pandas.Series.str.get_dummies. We first join each list into one string with list items separated by ‘|’, the separator that str.get_dummies() then assumes when it assigns names to the new columns: cast_df = df[‘lists’].str.join(‘|’).str.get_dummies()

A cleaning step that might precede the above: use the ternary operator to change any list containing an empty string (and nothing else) to an empty list: df[‘lists’] = df[‘lists’].map(lambda one_list: (one_list, [])[one_list == [‘’]])

HTML ETL Vignette: Variable-Length Data with Alternate Labels for the Same Concept/Column

If you are extracting values one by one from a website’s code (or any data file that is flexibly organized), sometimes, for each record, there could be a variable number of values given (those omitted from the maximal set are presumed ‘NA’). To make matters worse, the keys for the values, which are harvested at the same time, may be inconsistent, e.g. the key of interest sometimes appears as the singular ‘Job’ and other times as the plural ‘Jobs.’

The function below would run many times — once for each row of the dataframe to be compiled. It first makes a dictionary that will be output to form a row in the dataframe. It then makes another dictionary (this could be defined globally or passed into the function) called switcher that translates both ‘Jobs’ and ‘Job’ to ‘jobs’ and standardizes other possible keys. The HTML is parsed with Selenium as the website in question does not accept the get requests that Beautiful Soup works with.

def get_properly_labeled_list_items():
output = {'hours': 'NA',
'jobs': 'NA',
'earned': 'NA'}
switcher = {'Total earned': 'earned',
'Jobs': 'jobs',
'Job': 'jobs',
'Hours worked': 'hours'}
# list_of_list_items is a Python list of HTML 'li' (list item)
# elements containing the values and labels we want (Selenium
# used here):
list_of_list_items = \
driver.find_element_by_class_name('cfe-aggregates') \
.find_element_by_tag_name('ul') \
.find_elements_by_tag_name('li')
# The first element was always the same in this example, and
# extracted elsewhere, so we start with the second
for li in list_of_list_items[1:]:
# Pick out the key:
label = li.find_element_by_class_name('text-muted').text
# Look up the label in the switcher dictionary to find a key in the
# dictionary called 'output', then pick out the value in the HTML
# (h3 text), and assign it to the key just looked up:
output[switcher[label]] = \
li.find_element_by_tag_name('h3').text
return output

Fancy Grouped Boxplot

Here’s a way (adapted from examples found online) to make a grouped boxplot (on a common vertical scale), with the width of the box proportional to the number of samples in that category. It also displays the number of samples in each category along the horizontal axis.

# Here we assume a two-column dataframe (box_df) with just
# the group assignment ('category') and 'measurement' as columns.
dfg = box_df.groupby('category')
# Iterate through the categories, counting the number of
# observations in each:
counts = [len(v) for k, v in dfg]
total = float(sum(counts))
categories = len(counts)
# Scale widths to the number of observations in a category as a percentage of the number of all observations:
widths = [c/total for c in counts]
cax = box_df.boxplot(column='measurement',
by='category',
figsize=(15,10))
cax.set_xticklabels(['%s\n$n$=%d'%(k, len(v)) for k, v in dfg])
# Range set to make sense for the data:
cax.set_ylim([0, 200])

--

--