nbwhat? A primer on notebooks for data science

Blair Hudson · Published in nbunicorn · Feb 12, 2019

Data science is widely recognised as a highly complex field.

After all, the fabled unicorn data scientist must have a deep understanding of big data technologies, advanced programming skills, and a knack for applied mathematics, statistical modelling and machine learning. They must be an expert in whichever niche sub-domain their current problem demands.

On top of all that they must understand systems engineering to be able to architect solutions, and be a great communicator to explain what they are doing and why it’s important. Then they need to be able to sell their vision and execute it to make a real impact.

This laundry list of skills requirements might sound absurd — and to many it is — but as a data scientist, I’m all too familiar with the steep learning curve to success.

Whether you are an aspiring or experienced data scientist, run a data science team or are thinking about jumping in the deep end, there’s one great tool you definitely need to know to help you along the way:

Notebooks!

What’s a notebook?

Popularised most recently by Project Jupyter, a notebook is a single document that allows you to capture and run code, display example output, document your motivations, explanations and findings, visualise data and create interactive elements.

A Jupyter Notebook demonstrating equations, Python code and interactive visuals (via Project Jupyter)

A notebook is typically structured as a sequence of cells, each of which can be edited and executed individually, in any order.

This makes them a great utility for any stage of the data science project lifecycle:

  • exploratory data analysis, which can easily be shared and iterated
  • designing complex machine learning pipelines in a logical flow
  • communicating and visualising results effectively
  • code review, especially when comments don’t provide enough detail
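
To make that cell structure concrete, here's a minimal sketch of building a tiny notebook programmatically with the nbformat library (which defines the .ipynb document format). The cell contents and file name are purely illustrative:

    # A minimal sketch: a notebook is just a sequence of cells.
    # The markdown cell holds documentation; the code cell holds runnable Python.
    import nbformat
    from nbformat.v4 import new_notebook, new_markdown_cell, new_code_cell

    nb = new_notebook()
    nb.cells = [
        new_markdown_cell("## Exploratory data analysis\n\nNotes, motivations and findings go here."),
        new_code_cell("import pandas as pd\n\ndf = pd.read_csv('data.csv')\ndf.head()"),
    ]

    # Save the document as a regular .ipynb file, ready to open in Jupyter
    with open("analysis.ipynb", "w") as f:
        nbformat.write(nb, f)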

What languages can you use?

With many notebook alternatives available, almost every programming language is covered by Project Jupyter, Apache Zeppelin, R Notebooks, nteract and Google Colaboratory (to name a few):

  • Python
  • R
  • Scala
  • Julia
  • SQL
  • C++
  • Node.js
  • bash
  • and pretty much any other language you can think of.

Many of these notebook tools allow you to write different languages in different code cells of the same document, and of course you can use the vast collections of open source libraries from PyPI, CRAN and other package indexes within your notebook code.
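
If you're curious which of these languages your own Jupyter installation can already run, here's a small sketch (assuming Jupyter is installed) that lists the installed kernels via the jupyter_client library:

    # A small sketch: list the Jupyter kernels installed on this machine.
    # Each kernel runs a single language (Python, R, Julia, bash and so on).
    from jupyter_client.kernelspec import KernelSpecManager

    for name, info in KernelSpecManager().get_all_specs().items():
        # every kernelspec advertises a human-readable display name
        print(f"{name}: {info['spec']['display_name']}")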

What else is great about notebooks?

Here are some more reasons why you should be using notebooks:

  • All of your documentation and code is in one place — this is great for sharing with readers and convenient for authoring
  • Cells can be executed, modified and executed again, which means you can iterate your exploration and solutions very quickly
  • They are a great tool for education — it’s easy to write tutorials with partially completed code cells for students to fill in, and projects like nbgrader can be used to automate marking
  • They can be parameterised and executed at scale — papermill is a project that does just this (see the sketch below)
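
To illustrate that last point, here's a hedged sketch of what parameterised execution with papermill looks like. The notebook names and parameter values are made up; the one requirement is that the input notebook contains a cell tagged "parameters" for papermill to inject values into:

    # A sketch of parameterised execution: papermill runs the input notebook
    # top to bottom with the supplied parameters and saves the executed copy.
    import papermill as pm

    pm.execute_notebook(
        "model_training.ipynb",                # illustrative input notebook
        "output/model_training_lr_0.1.ipynb",  # executed copy, outputs included
        parameters={"learning_rate": 0.1, "n_estimators": 200},
    )

Run in a loop over different parameter sets, this turns a single notebook into many executed reports.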

So what’s the catch?

Ok, so notebooks aren't perfect, and sadly they aren't the best tool for the job every time. Because of their complex document structure, they don't always play nicely with version control systems; being able to run individual cells in any order doesn't encourage the best programming practices; and they are notoriously difficult to put into production.

Keep these things in mind and you’ll almost certainly improve your data science workflows with notebooks.

How do I get started?

If you’re new to notebooks and want to try them out easily, you can run the notebook from the example above by following this link to Binder (a free service for running interactive Jupyter Notebooks in your browser — no login or installation required).

Stay tuned for more! In a future post I'll show you how simple it is to install and configure notebook software on your own computer.
