Add value faster from Data Science workflows

Donal Simmie
Data at NewDay
Sep 28, 2023 · 11 min read

Infuse quality into notebook-based workflows using nbdev & scilint

To copy a turn of phrase often wheeled out by a British record label boss and global household television personality: “I don’t like notebooks, I love them”.

This wasn’t always the case, however. To share my enthusiasm and reflections on moving to a completely notebook-driven Data Science workflow, it will help to give a little historical context as to how I got to this point in the first place…

Photo by Clay Banks on Unsplash

I have worked in applied research for more than a decade and have moved through various technologies and personal workflows during that time, but I feel the deepest contentment with the one I use now. As I hinted at with the opening line, you probably won’t be too surprised to hear that it is based entirely on Jupyter notebooks - but let me come back to that.

I will start from the pain that drove me to search for a better workflow.

I have always felt somewhat constrained by titles and the blinkered expectations that can come with some tech roles; it is safe to say I enjoy the space between things — Academia and Industry, Data Science and Software Engineering, Research and Production. I moved to a Data Science role from a Software Engineering background and saw many issues but also many opportunities: things like clean code, quality and re-use were just not practiced in the codebases I was working on then.

Working in those first teams I was learning a lot about how to use Jupyter notebooks to explore and discover patterns in data and how to develop and evaluate ML models. I was also bringing to the team some knowledge of how to write code that could be depended upon, and this all felt pretty good initially.

Fast-forward a few years and many projects later, and I ended up where I believe many of you might have been at one point or another in your Data Science career, having worked on several projects that did not see the light of day — beyond maybe a presentation and a collective pat on the back.

Photo by Elisa Ventur on Unsplash

There are various reasons why projects might get dropped or not make it into production, many of these are not fixed by workflow optimisation. However, as a Lean and Agile advocate, I believe if you can release faster then you can (in many cases) find out if you are solving a worthwhile problem sooner.

Notebook to IDE workflow

My production ML workflow at that point was to do initial discovery in a notebook environment, then to switch to Engineering mode and extract the important functional components of that initial blueprint into a “proper Python” project. This would involve creating reasonable class structure, modular components, unit tests, known-good test data, a frozen Python environment and a bespoke Docker container on which to run things in production.

In some smaller teams people can adopt Science and Engineering roles interchangeably but, from my experience at least, it is more common that the work is passed from one person to another (typically Scientist to ML Engineer). Either way it still fits the same type of workflow — you write the exploration code in an environment suitable for exploration and then re-write or productionise the code using an approach suitable for production deployment. The image below outlines some issues with this workflow (in either the handoff or the switch-role mode).

Data Science to Production Code workflow
Sample exploration to production flow [source: author image]

There are a few practical problems with this workflow (in either mode):

  1. Data Scientists have variable exposure to the techniques that can allow you to structure and create high quality re-usable code.
  2. There is a lot to learn as a Data Scientist and the game keeps changing at an accelerated pace. Why would you read about Design Patterns, Unit Testing and Clean Code when you need to learn about Model Explainability, Deep Learning Architectures or keep up with the latest open-source LLM implementation (there will probably be two more of these by the time you have read this article)?
  3. If you pass discovery code from a Data Scientist to an ML Engineer to productionise the workflow, this process usually takes one big up-front investment and then smaller subsequent updates as the method evolves. The one-time port is a process a team can get quite good at; the iterative updates are harder to optimise. There are a few reasons for this: the ML Engineer may now be busy with other “higher-priority” work, so the changes start queuing up; a change in model choice or data-prep approach might necessitate a wider revision of the structure of the codebase, meaning that previously extracted test data is invalid; so relatively small changes in the process from the Scientist’s perspective can require larger changes to the productionised framework. In essence, keeping the axis of change (Data Science discovery) and the engineering of those changes in sync is challenging.
  4. Even if you can occupy both roles (and hence lose nothing in translation) the needs of both these operating environments still create tension. Exploration wants flexibility and speed of iteration whereas your production environment wants stability, known interactions between components and generally a higher level of certainty under more circumstances. Also the more time you spend engineering the solution the less time you are spending improving the target.

Enter nbdev

Can we do better than this? There is a new approach emerging in the notebook space that allows us to shift-left in the Data Science quality conversation. The library that has enabled the paradigm shift of the notebook being promoted from prototyping tool to the driving seat of change is nbdev. According to their Github page:

“nbdev is a notebook-driven development platform. Simply write notebooks with lightweight markup and get high-quality documentation, tests, continuous integration, and packaging for free!”

There are many benefits of using nbdev as part of a Production Data Science team. I now lead the Data Science area at NewDay and here are some of the features of nbdev we have found to have the most impact:

  1. Explicit separation of exploration code from what is fundamental for the workflow to execute, using the export directive (see the short sketch after this list).
  2. Introducing a fit-for-purpose testing approach for notebooks.
  3. In-flow documentation of a notebook that is focused on the reader and powerfully expressive thanks to Quarto Markdown (aids building towards published reproducible research).
  4. Git friendly workflow via pre-commit hooks and notebook metadata management.
  5. Being able to build a modular notebook workflow. It is easy to import functions from notebooks in your project — this puts shared reusable functions within reach of the team.
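To make the export directive concrete, here is a rough sketch of what a couple of cells in an nbdev notebook look like. The module name and function are made up for illustration; the #| default_exp and #| export directives are the actual nbdev markup, and the plain assert cell doubles as a test because nbdev_test simply runs every cell.

#| default_exp preprocess
# ^ first code cell of the notebook: names the module this notebook exports to

#| export
def drop_missing_target(rows, target_key="target"):
    "Remove records that have no value for the target field."
    return [r for r in rows if r.get(target_key) is not None]

# A plain cell (no directive) stays in the notebook only: it is exploration,
# and because it asserts on the output it also acts as a test under nbdev_test.
sample = [{"target": 1}, {"target": None}, {"target": 0}]
assert len(drop_missing_target(sample)) == 2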

If you are still on the fence about having notebooks drive serious production changes then I suggest watching the following video by Jeremy Howard, who will do a much better job of convincing you than I can. If you’re already sold then you can check out the nbdev intro tutorial or video tutorial to learn how to use it.

Now that we have introduced the idea of letting notebooks be used outside just the exploration setting, it is a good time to say 👋 to scilint.

Introducing 🧐 scilint

scilint aims to bring a style and quality standard into notebook-based Data Science workflows. Defining what makes a quality notebook is difficult and somewhat subjective. It can have the obvious meaning of being free of bugs, but legibility and ease of comprehension are important too.

If you prefer to move out of notebook-based workflows, post-exploration to an IDE+Python mix, I encourage you to have another ponder on the benefits of staying in a notebook-based workflow. Notebooks have a strong visual emphasis and proximity to data. They are also the primary axis of change within Data Science — new ideas are gained from diving into data. So instead of packing up your code, re-writing it for elsewhere and the waste that entails, you can bring quality to your exploration workflow and spend more time building stuff that matters.

scilint is a quality inspection tool, but when used with nbdev it is also a build tool for notebook-based projects. You will find you end up using it in this context often — especially once your notebooks are up to scratch. This “no-decisions” build tool (a wrapper over the nbdev & nbQA combo) gives you style consistency and awareness of broken notebooks. Crucially, it also lets you start to benefit from CI/CD workflows within exploratory Data Science team environments.

Check it out over at Github, and to help you get started there are two example repos (one using nbdev, the other not).
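Getting started locally is a single install inside an existing nbdev project. The package name below is an assumption that scilint is published under the same name on PyPI, so check the README if the install fails:

pip install scilint   # provides the scilint_build, scilint_lint and scilint_ci commands used below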

What is the new workflow?

Now back to that personal workflow, the one that prompted the change in feeling about how I view notebooks and just where they can fit in.

The workflow is based around two environments, Jupyter notebooks and a terminal shell, plus the two libraries we have already mentioned: nbdev & scilint. You use the Jupyter environment to edit, save and run your notebooks and the CLI to build your notebook project alongside.

There are four parts to the workflow:

  1. Set up your quality standard (spec files)
  2. Learn how to use scilint_build
  3. Set up a pre-commit hook
  4. Create a CI job to enforce the quality standard across the codebase.

(1) Set up your quality standard

We follow the progressive consolidation approach to Data Science code quality within our team. You start with a minimal quality bar as you iterate quickly to find ideas that work, then progressively add more quality as you come to depend on the workflow components more frequently. What this means in practice is that we don’t have a binary pass/fail standard but a laddered approach with multiple quality gates.

Within scilint we have created the Quality Specs feature, which lets you define multiple quality specifications that together make up an overall standard. The first thing you need to set up for the scilint workflow is your quality standard (or you can just copy the example version provided with scilint and pictured below).

Example Quality Standard — composed of multiple Quality Specs

The spec config yaml files let you set different values for the quality indicators and hence create a quality standard that includes exploration but also gets progressively more demanding the closer you get to scaled production workloads. Here is an example of a quality spec configuration:

---
exclusions: ~
fail_over: 1
out_dir: "/tmp/scilint/"
precision: 3
print_syntax_errors: false
evaluate: true
warnings:
  lt:
    calls_per_func_median: 1
    calls_per_func_mean: 1
    in_func_pct: 20
    tests_func_coverage_pct: 20
    tests_per_func_mean: 0.5
    markdown_code_pct: 5
  gt:
    total_code_len: 50000
    loc_per_md_section: 2000
  equals:
    has_syntax_error: true

The warnings here are threshold values for quality violations. Two examples are the fraction of cells that are markdown vs code (markdown_code_pct) and the number of tests per function (tests_per_func_mean) defined in the notebook. For instance, markdown_code_pct sits under lt with a value of 5, so a notebook where that indicator drops below 5 would raise the warning. The scilint README has all the details on what exactly these values mean and what the quality indicators are, so head over to Github if you’d like to dive into the detail.

(2) Using scilint_build

The main command you use is scilint_build which (when using nbdev):

  • Triggers an automated conversion of your notebook code into Python module code using nbdev_export.
  • Runs tests across all of your notebooks using nbdev_test.
  • Runs the scilint_lint command.
  • Removes superfluous metadata from notebooks using nbdev_clean.

nbdev_export

Converts the important parts of your notebook into a Python library composed of modules that you can easily publish as a package to PyPI/conda or an internal registry such as Artifactory.

Hang on, I thought you loved notebooks… this tool rewrites your notebooks into modules — what is up with that? Hopefully this doesn’t feel hypocritical, but this approach from nbdev seems to be the best of both worlds. The notebook remains the driver of changes, but the end-product that you publish, integrate with other services, etc. is a Python package, which is much easier to consume downstream.
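To make that concrete, the exported cells from the earlier sketch end up in an ordinary module inside your package, roughly like the following. The file path and header comment here are illustrative; nbdev generates them from settings.ini and your notebook names.

# AUTOGENERATED! DO NOT EDIT! File to edit: ../nbs/00_preprocess.ipynb.

__all__ = ['drop_missing_target']

def drop_missing_target(rows, target_key="target"):
    "Remove records that have no value for the target field."
    return [r for r in rows if r.get(target_key) is not None]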

nbdev_test

Tests your notebooks by running all cells in order. You can mark certain tests as slow using an nbdev directive, which is very helpful for long-running ML training notebooks; see here for more details about testing nbdev notebooks. We will publish an article very soon on testing long-running ML notebooks as it deserves its own treatment.
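As a rough sketch of the slow-test pattern: you register a custom flag such as slow under tst_flags in settings.ini, mark the long-running cells with it, and those cells are then skipped unless you opt in (e.g. nbdev_test --flags slow). This is how I remember the nbdev docs describing it, so double-check against the current release.

#| slow
# This cell is skipped by a plain nbdev_test run and only executes when the
# "slow" flag is passed. The stub below stands in for a long-running training step.
import time

def slow_training_stub():
    "Stand-in for a model training step that takes a long time."
    time.sleep(5)
    return 0.9

assert slow_training_stub() > 0.8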

scilint_lint

The build command then runs the notebook linter and you will get a pass/fail based on whether any of your quality specs caught a violation. A markdown formatted report is displayed to the console and a CSV is written to disk with all the details of the report run.

nbdev_clean

nbdev_clean strips out superfluous metadata and is very helpful for working collaboratively on notebooks, as you can actually parse diffs! It also stops you pushing your kernel output data into source control, which is another plus, especially for sensitive customer data.

(3) Add a pre-commit hook

We recommend using the scilint_build command as a pre-commit hook, which you can do with git natively by editing .git/hooks/pre-commit in your repository or by using the pre-commit project.
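A minimal native git hook is just a few lines; the sketch below assumes scilint_build is on your PATH, and the hook file needs to be executable (chmod +x):

#!/bin/sh
# .git/hooks/pre-commit: abort the commit if the notebook export, tests or lint fail
scilint_build || exit 1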

scilint CLI precommit workflow
Output from using scilint as a pre-commit hook

(4) Set up a CI job

That is the essence of the iterative development workflow. The last thing you need to add is a CI build job; we use the scilint_ci command for that, which builds documentation for your nbdev project and ensures that your notebooks work and meet the quality standard before they are committed to your main branch.

Here is what that looks like as a Github action:

name: scilint_ci
on:
  push:
    branches: [main]
  workflow_dispatch:
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: fastai/workflows/quarto-ghp@master
      - name: ci
        run: |
          scilint_ci

I really enjoy programming with this workflow. The best part of it, for me, is that it keeps the excitement of discovery (exploring ideas) close to the ability to integrate those ideas into your business.

However, I would like to call out one challenge that is still a little painful, in the hope that some of you may shine some light on a potential solution. Multi-file refactoring within Jupyter is painful (I mostly miss find-in-files and refactor/extract-method). Are there any good options here? Bonus points for solutions that would also fit SageMaker Studio!

Wrapping up

As I said at the beginning, I enjoy the space between things. This space is explored less, so it can often reveal valuable insights. The space between exploration and production within Data Science workflows is starting to offer some hints as to how it could be navigated better.

I hope and believe that nbdev will have a transformative effect on the view of the notebook within the Data Science workflow. If scilint can start the conversation of what quality means within a Data Science notebook workflow that would be amazing.

At NewDay we are developing tools and frameworks geared towards the needs of our Data Science community, letting their environment become the axis of change for both exploratory and production use-cases. A platform such as nbdev and tools like scilint will help drive a higher frequency of changes and ultimately add more value from Data Science in less time.

We intend to release more tools with this same ethos in the coming months. At the risk of adding Yet Another Buzzword into the ML zeitgeist we use the term NBOps to refer to these libraries that drive production workloads from the notebook environment.

In the meantime I encourage you to give both nbdev and scilint a try.

Note: scilint is a relatively early release; we have tested extensively, but on a narrow set of applications. There may be issues, but we will endeavour to fix them quickly!

Also, please let me know if you agree or disagree with this approach of driving quality into the notebook workflow. Maybe there is already a better answer out there?

Donal Simmie
Head of Data Science & Machine Learning, Data at NewDay. Passionate about making better decisions using data and iterative experimentation. @dsimmie