Creating non-numeric pivot tables with Python Pandas

If you ever tried to pivot a table containing non-numeric values, you have surely been struggling with any spreadsheet app to do it easily. 
There is, apparently, a VBA add-in for excel. Or you’ll have to use MS Access, which should be fine for these kind of operations. But I didn’t test these options myself so anything could be.

The problem with spreadsheets is that by default they aggregate or sum your data, and when it comes to strings there usually is no straightforward workaround.

However, if you wanna do it with 9 (nine!) lines of code, then a panda is your friend :)

In my case, the raw data was shaped like this:

id, key, value
id1, size, 55
id1, material, wood
id2, size, 100
id2, material, iron
id2, shape, round

And I wanted to pivot it to:

id, size, material, shape
id1, 55, wood, ,
id2, 100, iron, round

Here’s the code.

import pandas as pd
import csv
df = pd.read_csv(‘data_raw.csv’)
pivot_table = df.pivot_table(index=[‘id’],
aggfunc=lambda x: ‘ ‘.join(str(v) for v in x))

The big point is the lambda function. To return strings it’s usually set as:

lambda x: ‘ ‘.join(x)

But this will return a boolean. In the aggfunc field you’ll need to use that small loop to return every specific value.

I reckon this is cool (hence worth sharing) for three reasons:

  • Works with big amounts of data
  • You don’t have to worry about heterogeneity of keys (it will just be a column more in your results!)
  • It’s fast and open-source! :)

Hanging out with the big guys?

If you’re working with large datasets this method will return a memory error. You can avoid it (I used it on a 15gb dataset) reading your dataset chunk by chunk, like this:

### Beware of the Panda. 
### 'You have no idea how tidy this is gonna get'
from pandas import *
import csv
import numpy as np
df = pandas.read_csv(‘data_raw.csv’, sep=” “, chunksize=5000)
appended_data = []
for chunk in df:
pivot_table = chunk.pivot_table(index=[‘id’],
aggfunc=lambda x: ‘ ‘.join(str(v) for v in x))
appended_data = pandas.concat(appended_data, axis=0).reset_index()
appended_data.to_csv('data_clean.csv', sep=",")
#and if you wanna clean it a little bit where the chunk trunks it:
appended_data_clean = appended_data.groupby('id', sort=True).agg(np.sum)
appended_data_clean.to_csv('dati_clean_crunched.csv', sep=",")

(sometimes I write on my blog, here)