Jupyter — A Working Quant Engineer's Thoughts

Nick Wienholt quantivfy.eth
9 min read · Aug 13, 2022


The praises of Jupyter have been sung in a previous post, and it's a truly revolutionary product that fully supersedes the majority of Excel use cases. Excel is to data analysis as MSPaint is to digital artwork production. Compared to other ecosystems like Java and .NET with established IDEs and well-structured use cases, Jupyter is still very much finding its feet, and there are a lot of exploratory efforts that are likely to be dead ends. This article will look at the good and bad of the current Jupyter world, and what best practices are emerging.

The Basics

Jupyter is a multi-language computational document framework that attaches a document/lightweight IDE to an execution kernel. The default experience is an HTML page hosted in a web browser that communicates over WebSockets with the Jupyter server, which in turn talks to an external kernel process over ZeroMQ sockets; the kernel manages the execution and programmatic state of the document. The default kernel is the IPython kernel, which reflects the heritage of the project focussing on the Python language. The front end is written in TypeScript (a statically typed superset of JavaScript which is compiled to standard JavaScript for browser execution), and the kernel is written in Python.

Messages between the kernel and the UI are delivered in a JSON format, and can be inspected easily using the browser's developer tools:

Jupyter Executing in the Browser
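The same message flow can be driven outside the browser with the jupyter_client package, which makes it easier to see what the UI is actually sending. The sketch below is a minimal example, assuming the standard python3 kernelspec from ipykernel is installed locally; it sends an execute_request and prints part of the JSON-style reply.

```python
# Minimal sketch of the kernel message flow, assuming the standard
# "python3" kernelspec from ipykernel is installed locally.
from jupyter_client import KernelManager

km = KernelManager(kernel_name="python3")
km.start_kernel()                      # launch the kernel process
kc = km.client()
kc.start_channels()                    # open the ZeroMQ channels
kc.wait_for_ready(timeout=30)

msg_id = kc.execute("1 + 1")           # send an execute_request
reply = kc.get_shell_msg(timeout=10)   # execute_reply as a JSON-style dict
assert reply["parent_header"]["msg_id"] == msg_id
print(reply["msg_type"], reply["content"]["status"])

kc.stop_channels()
km.shutdown_kernel()
```

The execute_reply printed here is the same message that shows up in the browser's WebSocket traffic when a cell is run.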

The complete architecture is shown below and is described in much greater detail in the Jupyter documentation. There are also dozens of active extension projects in the wider ecosystem.

Jupyter Architecture (image from https://docs.jupyter.org/en/latest/projects/architecture/content-architecture.html)

As the diagram above suggests, the design of Jupyter is very well-factored and extensible, allowing different UIs and kernels to be slotted in. Any language with some measure of adoption is supported in some way by Jupyter, and the project is open-sourced on GitHub, allowing an essentially unlimited degree of extension and customization.

The actual format of a Jupyter document is text-based JSON, and many platforms such as GitHub and Confluence (via a plug-in) support the rendering of a document from the raw JSON to a user-friendly form. The underlying JSON format is relatively complex, but in nearly all Jupyter use cases it can be treated as opaque. Where it does bite is with two authors making simultaneous changes to the same document: a raw merge conflict resolution is very difficult to achieve. An excellent article on TDS provides some options on merge tooling as of two years ago, and since that article was published nbmerge has become the most common and accepted way of combining two versions of a document.
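For the rare occasions where the raw JSON does matter (for example, stripping output before a commit to keep merges tractable), the nbformat package reads and writes the format directly. A minimal sketch, with the filename analysis.ipynb used purely as a placeholder:

```python
# Minimal sketch of inspecting the notebook JSON with nbformat.
# The filename "analysis.ipynb" is just a placeholder.
import nbformat

nb = nbformat.read("analysis.ipynb", as_version=4)
print(nb.nbformat, nb.nbformat_minor)            # schema version of the document

for cell in nb.cells:
    print(cell.cell_type, repr(cell.source[:60]))

# Strip all outputs (a common pre-commit step) and write the file back.
for cell in nb.cells:
    if cell.cell_type == "code":
        cell.outputs = []
        cell.execution_count = None
nbformat.write(nb, "analysis.ipynb")
```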

Authoring Experience

For simple work, the default Notebook experience is fine. Limited AutoComplete is available, as shown below; it is activated by hitting the Tab key in the browser as code is being written, and parameter names can also be suggested. However, there is a lot of noise in the suggested options, and for developers used to the level of sophistication in modern IDEs, the experience will be underwhelming.

AutoComplete in Jupyter Notebook

A secondary browser-based experience called JupyterLab is available via pip or an Anaconda install; if the default Notebook experience could be described as an IDE circa 1989, JupyterLab takes the DeLorean all the way to 1995, giving the author multiple panels and consoles. One interesting feature is the ability for multiple parties to simultaneously edit a document, which will be a familiar undertaking for most users of GSuite or Office365 tools. As a developer, I have always lacked the capacity to pair program well: the deep computational task of maintaining a logical call stack of what is happening and what I am trying to achieve is incompatible with maintaining even a passing level of human interaction, so the joint editing feature doesn't look particularly compelling.

The Anaconda Distribution ships with the free and open-source Spyder IDE, which is a deeply adequate authoring experience, akin to the LibreOffice alternative to Microsoft Office that I personally use for any legacy non-cloud document and spreadsheet tasks. However, over the last few years the investment that Microsoft has made in Visual Studio Code, and its cross-platform abilities, have made it the premier offering for Jupyter development. The global Intellisense in the IDE is fantastic ("oh — I see you are authoring an ipynb document — here are the extensions you need, and once they are installed, let me spin up the appropriate kernel and allow you to choose which conda environment it should use"), and the early clunkiness of Jupyter development, where an external kernel needed to be launched independently and then attached to in the IDE, is gone. From a freshly installed machine, the following steps will enable productive development in 15 minutes:

  • Install Anaconda, and optionally create a new Conda environment.
  • Install git and clone or create an appropriate repo.
  • Install Visual Studio Code, open a Jupyter document, and follow the prompts to install the required plug-ins.
  • Begin development.
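Once the environment is in place, the kernels that an IDE such as VS Code can offer broadly correspond to the registered Jupyter kernelspecs. A quick sanity check, as a minimal sketch run from the relevant conda environment, is to list them with jupyter_client:

```python
# Minimal sketch: list the Jupyter kernelspecs visible from this environment,
# roughly the set of kernels an IDE such as VS Code can offer in its picker.
from jupyter_client.kernelspec import KernelSpecManager

ksm = KernelSpecManager()
for name, path in sorted(ksm.find_kernel_specs().items()):
    spec = ksm.get_kernel_spec(name)
    print(f"{name:20s} {spec.display_name:30s} {path}")
```

If a freshly created conda environment does not show up, installing ipykernel into it and registering a kernelspec with python -m ipykernel install is the usual fix.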

As shown in the screenshot below, the quality of the AutoComplete prompt is radically better:

Visual Studio Code AutoComplete prompt

In addition to being a very, very good IDE in its own right, and improving monthly at an impressive rate, VS Code has two incredible benefits that basically exclude other IDEs from consideration: GitHub Codespaces and GitHub Copilot (Copilot is supported in a number of other IDEs through plug-ins too, but VS Code feels very much like the target use case). Both of these technologies will be examined in more depth in future articles, but the key takeaway is that Codespaces is VS Code in the cloud, attached to a hosted virtual environment that allows full execution and debugging with zero local installation. GitHub Copilot is AutoComplete on steroids (and a bit of HGH, with heaps of vitamins and a sprinkling of peptides). It uses AI trained on the massive collection of GitHub repos to provide very detailed code suggestions, up to a full method body:

Github Copilot in Action

For the screenshot, the only context provided was the method name. The rest of the project's codebase and GitHub's massive history of similar code were used to suggest both the parameters and the implementation.

Codespaces and Copilot are both paid services, but come in at roughly the Netflix subscription pricing level, making them fairly accessible to most professional-grade engineers.

Productionisation

Once a Jupyter notebook is complete, the end use case for the code and markdown varies. There are several common use cases:

  • Notebook Publication. The notebook itself can be published, typically by pushing it to a Git repository where platforms like GitHub render it directly; the main trap is credentials being committed alongside the code (screenshot: Git and Python Setup to Prevent Credential Publication).
  • Book Publication. Similar to notebook publication, but at a larger scale. The JupyterBook project provides tooling and examples of producing a complete book based on Jupyter. The Turing Way is a great and beautiful read, and eloquently demonstrates the power of Jupyter publishing.
  • Production Execution. The code developed in the context of the notebook may be destined for life as a component or service in a software system. The easiest way to achieve this is a DevOps pipeline that uses nbconvert to generate a script with all the markdown and magic cells removed, and to deploy that script to an appropriate hosting environment where it can serve requests (see the nbconvert sketch after this list).
  • Hybrid Parameterized Execution. Netflix has developed and open-sourced a technology called Papermill to allow Jupyter notebooks to be parameterized and executed at a massive scale (see the Papermill sketch after this list). As their blog post describing Papermill states, the use cases are varied:
    - Data Scientist: run an experiment with different coefficients and summarize the results
    - Data Engineer: execute a collection of data quality audits as part of the deployment process
    - Data Analyst: share prepared queries and visualizations to enable a stakeholder to explore more deeply than Tableau allows
    - Software Engineer: email the results of a troubleshooting script each time there's a failure
  • Dashboarding. Using either Jupyter Widgets or Voilà Dashboards, it is possible to build interactive notebooks with traditional UI elements like sliders, buttons, and textboxes, supporting the creation of very complex and visually impressive data dashboards (a small widget sketch also follows this list). However, these dashboards are a pain to work with at scale: the Jupyter kernel is spun up and lives on the server side, and in a standard deployment two users attached to the same document share the same kernel. The use of Voilà and deployment to Binder allows a dashboarded notebook to be used in a typical web experience, where the state is unique for each user. This is accomplished through a dedicated Docker image instance created for each user, but it is still a poor man's dashboard compared to a dedicated product like PowerBI or Tableau. The main use case for a Jupyter dashboard is where a notebook has organically grown into an application through expanding business requirements, and is not a core enough part of a company's BI to justify a re-write in a dedicated reporting/BI tool.
  • Model Training. One of the most common uses of Jupyter in the quant community is training a model in a technology like scikit-learn, XGBoost, or Keras/TensorFlow. In these scenarios, the result is some model representation that captures both the architecture and the trained parameters of the model, and this trained model can be exported and executed in some production environment. All of the major cloud vendors have a machine learning platform that can offer model execution as a web service, and this is the recommended, orthodox way of serving a model for scenarios that don't have extreme low-latency requirements. Where a data scientist or a quant is creating a model training notebook, the notebook should be integrated into a DevOps pipeline where the training is re-run in a clean environment and a versioned endpoint is automatically created on the target cloud platform. For small-scale operations, aliasing the endpoint to a version-independent /latest may make sense so production execution automatically flicks over, but a more nuanced approach of gradually moving asset allocation to the new model version and monitoring the results is best; more detailed thoughts on this approach will be covered in a future article.
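The nbconvert step referenced in the Production Execution item can be run as the jupyter nbconvert --to script command, or driven from Python inside a pipeline. The following is a minimal sketch only, with research.ipynb as a placeholder filename:

```python
# Minimal sketch of the nbconvert step in a deployment pipeline.
# The filename "research.ipynb" is a placeholder.
from nbconvert import PythonExporter

exporter = PythonExporter()
body, resources = exporter.from_filename("research.ipynb")

# body is the generated Python source: markdown cells become comments and
# IPython magics are rewritten as get_ipython() calls, which a pipeline
# would typically strip or guard before deployment.
with open("research.py", "w", encoding="utf-8") as f:
    f.write(body)
```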
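The Papermill workflow from the Hybrid Parameterized Execution item looks roughly like the sketch below. The notebook paths and parameter names are placeholders, and the input notebook is assumed to have a cell tagged parameters for Papermill to inject into:

```python
# Minimal sketch of parameterized notebook execution with Papermill.
# "backtest.ipynb" and the parameter names are placeholders; the input
# notebook needs a cell tagged "parameters" for injection to work.
import papermill as pm

for symbol in ["BTC-USD", "ETH-USD"]:
    pm.execute_notebook(
        "backtest.ipynb",
        f"output/backtest_{symbol}.ipynb",
        parameters={"symbol": symbol, "lookback_days": 90},
    )
```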
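For the dashboarding case, the interactive elements are ordinary ipywidgets that Voilà then serves as a standalone page with the code cells hidden. A minimal sketch, with the recomputation logic reduced to a placeholder print:

```python
# Minimal sketch of a notebook widget that Voilà can serve as a dashboard.
# The recomputation logic is an illustrative placeholder.
import ipywidgets as widgets
from IPython.display import display

lookback = widgets.IntSlider(value=30, min=5, max=250, description="Lookback")
output = widgets.Output()

def on_change(change):
    # Re-run whatever analysis the dashboard exposes when the slider moves.
    with output:
        output.clear_output()
        print(f"Recomputing stats for a {change['new']}-day lookback...")

lookback.observe(on_change, names="value")
display(lookback, output)
```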

Wrap Up

The art around Jupyter authoring and productionisation has become more established, and an orthodox way of doing things has emerged over the last few years. One of the final frontiers in getting Jupyter complete is the scalability front: default Python is ineffective at scaling even on a single machine due to the Global Interpreter Lock (GIL), and a lot of notebooks will sit on a machine with 16 cores available for hours on end hitting that magical 6% CPU utilization. Jupyter has a lot of efforts dedicated to scaling, and there are competing notebook platforms built around technologies like Spark that have scalability at their core. How this is all unfolding will be the subject of the next article.
