Notebooks for Spreadsheet Users
Dreading pivot tables? Reach for a Python Notebook + PixieDust instead.
I’m not a power spreadsheet user. I just use them to:
- Load some data (from a CSV/TSV file)
- Do a few calculations (sum, average, etc.)
- Draw some charts
Spreadsheets are perfect for these tasks, but they quickly become ungainly. Try loading a data set with one million rows, or even 100,000, into Excel, Google Sheets, or OpenOffice Calc. With data sets larger than a few thousand rows, it makes little sense for all of the data to be on screen all of the time.
In this post, I’ll show you how to use Jupyter Notebooks with the PixieDust library to replace spreadsheet tasks for medium and large data sets. The “top-to-bottom” readability of notebooks lets you show your working, explaining to the reader what each step is doing. Your notebook is proof that your data analysis works and, used well, it is a more compelling, more readable document than a spreadsheet where the “working” is hidden.
Get started
First we’ll need a notebook service. You have two choices:
- Install locally using these instructions, then run jupyter notebook once everything is installed.
- Sign up for a notebook-as-a-service such as IBM’s Data Science Experience to run notebooks with just a web browser.
Notebooks are arranged as a vertical sequence of cells. Each cell can contain either Markdown content (text, images, bullet-points, etc.) or code. When you “run” a code cell, the code is executed by the notebook’s kernel and the resultant output is displayed beneath your code.
Load data
Spreadsheets are stored on disk in their own file format (e.g., accounts.xlsx) or in the cloud, but data is commonly distributed in a format called “CSV,” which stands for Comma-Separated Values. It is a simple text file format that all spreadsheets and most databases understand. Using Google Sheets as an example:
- Export a sheet as a CSV file by choosing File → Download as → Comma Separated Values from the menu.
- Import a CSV file by choosing File → Import → Upload.
With a notebook, it’s all handled programmatically. We can use PixieDust, whose sampleData function can load CSV files from a remote URL or a local file. First, import the library into your notebook:
!pip install pixiedust
import pixiedust
Then we can use pixiedust to load our CSV data from a local file:
homes = pixiedust.sampleData('/path/to/housesales.csv',forcePandas=True)
Or from a remote file:
homes = pixiedust.sampleData('http://internet.com/my.csv',forcePandas=True)
If you don’t have a CSV file on hand, try this cell in your notebook:
url = 'https://raw.githubusercontent.com/glynnbird/sampledata/master/housesales.csv'
homes = pixiedust.sampleData(url,forcePandas=True)
The Python variable homes is now a Pandas DataFrame (the forcePandas=True parameter sees to that). But what is a Pandas DataFrame?
What is a Pandas DataFrame?
The Pandas documentation describes a Pandas DataFrame as a:
“Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes.”
Which is a bit much, but really means “a spreadsheet”: data arranged in rows and columns.
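To make that concrete, here is a tiny sketch (the column names and values are made up) showing a DataFrame built from plain Python data:

```python
import pandas as pd

# A DataFrame is rows and columns, like a small worksheet
df = pd.DataFrame({
    "price": [250000, 310000, 1250000],
    "town":  ["LEEDS", "YORK", "LONDON"],
})

print(df.shape)           # (3, 2): three rows, two columns
print(df["price"].sum())  # 1810000
```

Each column behaves like a named range in a spreadsheet, with sums, averages, and other operations available as method calls.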
Pandas is a data manipulation toolkit, so let’s look at some simple spreadsheet-like operations you can achieve in your notebook.
Counting things
The first question to ask is “how many rows do we have?” In a spreadsheet, we could scroll down to the bottom of the sheet and see which row number we get to or create a formula to count the number of rows used in the ‘A’ column:
=COUNTA(A:A)
20000
In our notebook, we can simply call the count function on our homes DataFrame:
homes.count()
_id 19999
price 19999
date 19999
postcode 19999
a 19999
b 19999
c 19999
house_num 19999
flat_num 19999
street 19999
addr1 19999
district 19999
town 19999
county 19999
type 19999
The discrepancy between the spreadsheet’s answer and the notebook is because the notebook has correctly surmised that the first row of the CSV file was the column headings, and it used those headings as column headings in the DataFrame. The spreadsheet has included the header row in the answer (!). So if there are around 20,000 rows of data, where are they?
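You can see this header handling in isolation with a miniature, made-up CSV fed straight to pandas:

```python
import io
import pandas as pd

# Two data rows plus a header line, like a miniature CSV export
csv_text = "price,town\n250000,LEEDS\n310000,YORK\n"

# pandas treats the first line as column headings, not as data
df = pd.read_csv(io.StringIO(csv_text))
print(len(df))   # 2: the header row is not counted
print(list(df))  # ['price', 'town']
```

The three-line file yields two rows of data, exactly the off-by-one we saw between the spreadsheet and the notebook above.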
Where is the data?
The data is there; you just need to run some commands to bring it to the surface. We can call homes.head() to show a table of the first few rows and get a glimpse of the data:
To see the column headings, call the list function:
list(homes)
['_id',
'price',
'date',
'postcode',
'a',
'b',
'c',
'house_num',
'flat_num',
'street',
'addr1',
'district',
'town',
'county',
'type']
Or to get more information about each column, we can use the info function:
homes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 15 columns):
_id 19999 non-null object
price 19999 non-null int64
date 19999 non-null object
postcode 19999 non-null object
a 19999 non-null object
b 19999 non-null object
c 19999 non-null object
house_num 19999 non-null object
flat_num 19999 non-null object
street 19999 non-null object
addr1 19999 non-null object
district 19999 non-null object
town 19999 non-null object
county 19999 non-null object
type 19999 non-null object
dtypes: int64(1), object(14)
memory usage: 2.3+ MB
Now that you’ve got a feel for the column names and data types, you can really get to work and ask questions of the data.
Asking questions of our data
If we are only interested in “expensive” house price transactions (those over £1m), then we can create a new DataFrame that contains only that slice of data:
expensive_homes = homes.loc[homes.price > 1000000]
Here, the loc operation is used to define the subset of the data we wish to deal with. We then use our new expensive_homes DataFrame in the same way as we used our homes frame:
len(homes)
19999

len(expensive_homes)
296
We can apply multiple loc operations in sequence:
expensive_homes_in_london = homes.loc[homes.price > 1000000].loc[homes.district == 'LONDON']
len(expensive_homes_in_london)
This narrows our filter even further: only expensive sales in the London area. The result is stored in a third DataFrame called expensive_homes_in_london.
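Chaining two loc calls works, but the conditions can also be combined into a single loc with the & operator. A self-contained sketch on made-up data:

```python
import pandas as pd

# Toy data standing in for the real homes DataFrame
sales = pd.DataFrame({
    "price":    [1500000, 800000, 2000000, 1200000],
    "district": ["LONDON", "LONDON", "LEEDS", "LONDON"],
})

# Each condition needs its own parentheses when combined with & or |
expensive_in_london = sales.loc[(sales.price > 1000000) & (sales.district == "LONDON")]
print(len(expensive_in_london))  # 2
```

The single-loc form scans the frame once, and it avoids the intermediate copy that chaining creates.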
DataFrames are like worksheets in a spreadsheet. You can have as many as you like.
Formulae
In a spreadsheet, it is easy to add an additional column that is based on a mathematical formula calculated from other columns, e.g.:
=A2 + B2 / 100
You can then “fill down” the formula through the spreadsheet, and it will calculate a fresh value for each row. The equivalent of this operation in a data science notebook is to first create a Python function that is analogous to a spreadsheet formula:
import re

def extractSector(postcode):
    m = re.findall(r"^[A-Z]+", str(postcode))
    if len(m):
        return m[0]
    else:
        return ''
In this case, the extractSector function takes a postcode string and extracts the leading letters (e.g., 'HR4 0LU' becomes 'HR'). We can test our code in the notebook by trying some example values:
extractSector('W1A1AA')
'W'

extractSector('WC17XU')
'WC'
To create a new column (‘sector’) that contains the postcode column fed through the extractSector function, we can do:
homes['sector'] = homes.postcode.apply(extractSector)
Now if we look at our data, we have a new sector column:
You may also notice that our expensive_homes frame doesn't have the sector column, because it is a copy of a portion of the original homes data, taken before the sector column was added.
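As an aside, pandas' built-in string methods can do the same job without a hand-written function. A sketch on made-up postcodes:

```python
import pandas as pd

# Made-up postcodes to demonstrate the vectorised alternative
df = pd.DataFrame({"postcode": ["HR4 0LU", "W1A1AA", "WC17XU"]})

# str.extract pulls out the leading capital letters in one call,
# with no apply() or custom Python function needed
df["sector"] = df["postcode"].str.extract(r"^([A-Z]+)", expand=False)
print(df["sector"].tolist())  # ['HR', 'W', 'WC']
```

For large frames the vectorised form is usually faster than apply, though the custom-function approach is easier to test in isolation.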
Next, let’s start to summarise our data set by performing grouping and aggregation.
Aggregation
In a spreadsheet I can create the count or sum of a column pretty easily with a formula such as:
=SUM(A2:A20000)
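The pandas equivalent is a single method call on the column; here on a tiny made-up series standing in for column A:

```python
import pandas as pd

# A made-up price column standing in for A2:A4 in the sheet
prices = pd.Series([250000, 310000, 440000])

print(prices.sum())    # 1000000, like =SUM(A2:A4)
print(prices.count())  # 3, like =COUNT(A2:A4)
```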
Oftentimes, however, we really want to do grouped aggregations. For example, “Find me the top 20 postcode sectors that have the highest median house price”. To do this query, I need to calculate the median house price for each postcode sector and sort the list by value. This is where spreadsheets lose me: pivot tables.
Pivot tables allow you to choose the field to group by (district) and the field to aggregate (price), and which aggregation to use (min, max, mean, median, etc.). They are pretty powerful, but things get ugly quickly as the size of the data set increases.
In my opinion, the Pandas DataFrame library has a more elegant solution.
homes_by_sector = homes.groupby("sector").aggregate({"price":"median"}).reset_index().sort_values("price", ascending=False).head(20)
I’m creating a new DataFrame homes_by_sector here, based on our original DataFrame homes. There is a chain of operations going from left to right:
- group by the sector field
- aggregate the price field using the median operation
- reset_index() (we'll come back to this later)
- sort the new data set by the price, descending
- prune the resultant set to leave only the top 20
If we print the result, we get:
sector price
96 W 675000
98 WC 640000
29 EC 588425
85 SW 575000
62 NW 549950
56 N 475000
94 TW 395000
44 KT 395000
1 AL 390000
35 HA 385000
9 BR 372500
28 E 370000
77 SL 359250
74 SE 349997
42 IG 347500
99 WD 325000
70 RH 321000
34 GU 318000
30 EN 310000
78 SM 310000
The top six postcode sectors with the highest median price are, not surprisingly, in Greater London.
What are the numbers in the left-most column? They are effectively equivalent to row numbers in a spreadsheet. They were created by the reset_index() call in the middle of our chain, and their order changed after the sort_values operation because we sorted by price. The reset_index() call isn't required to show the DataFrame as text, but we need it when we get to the visualization process.
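The chain also reads more clearly written one operation per line. Here it is re-run on a small synthetic frame so the sketch is self-contained:

```python
import pandas as pd

# A small synthetic frame standing in for the real homes data
toy = pd.DataFrame({
    "sector": ["W", "W", "WC", "N", "N", "N"],
    "price":  [700000, 650000, 640000, 500000, 450000, 475000],
})

# The same chain as before, one operation per line
homes_by_sector = (
    toy.groupby("sector")
       .aggregate({"price": "median"})
       .reset_index()
       .sort_values("price", ascending=False)
       .head(20)
)
print(homes_by_sector)
```

Wrapping the chain in parentheses lets you break it across lines without backslashes, which makes each step easy to comment out while experimenting.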
Visualization
Visualization is where PixieDust can really save you some time. Simply call display, passing in the DataFrame you wish to render:
display(homes)
Here, we’ll take our homes_by_sector frame and use PixieDust's chart drop-down menu to configure a bar chart, setting the keys to our "sector" column and the values to "price". Other settings, such as sort order and the legend toggle, can be adjusted once the chart has rendered:
display(homes_by_sector)
The PixieDust display function allows tables and charts to be presented within notebooks by choosing the chart type, aggregation, and sorting options from the graphical user interface, with no additional code required. It's very similar to how charts work in spreadsheets.
Believe it or not, without PixieDust you would have to write a ton of Python code to generate a pretty plot like this.
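For comparison, a plain-Matplotlib version of a similar bar chart might look like this (a sketch with made-up numbers; the real chart would use the homes_by_sector frame):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up medians standing in for the homes_by_sector frame
data = pd.DataFrame({
    "sector": ["W", "WC", "EC"],
    "price":  [675000, 640000, 588425],
})

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(data["sector"], data["price"])
ax.set_xlabel("Postcode sector")
ax.set_ylabel("Median price (£)")
ax.set_title("Top postcode sectors by median house price")
fig.savefig("homes_by_sector.png")
```

Even this minimal version needs labels, sizing, and a save step by hand, which is exactly the boilerplate PixieDust's display call takes care of for you.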
Conclusion
If you’re a spreadsheet user and your worksheets are becoming unwieldy — and you prefer writing Python to Excel macros & formulae — then you can switch to Jupyter Notebooks to crunch the numbers.
A notebook’s top-to-bottom flow makes it easy for humans to read, and allows annotations, code, and visualizations to be combined to tell a narrative.
The Pandas library has a wealth of data filtering, grouping, and aggregation features, and PixieDust can help to get the data in from source files and render the results as tables, maps, or charts.
Spreadsheet to Notebooks cheat sheet