Jupyter & Python in the corporate LAN

TL;DR

We (an investment bank in the Eurozone) are deploying Jupyter and the Python scientific stack in a corporate environment to provide employees and contractors with an interactive computing environment to help them leverage their data, automate their processes and more generally to promote collaboration and sharing of ideas, services and code.
We provide some configuration samples and example notebooks (follow the links).

A — Objectives

In large organisations, people who need to automate tasks typically struggle to find, install and configure software because they work in constrained environments, are not specialists (nor interested in becoming one) and want results fast. This results in the same effort being duplicated in multiple departments of the same organisation, sometimes even on the same floor. In aggregate a lot of time is wasted and people usually end up with a mildly satisfactory patchwork of tools, more or less up to date.

So we wanted to provide a widely used, open source, cloud based, versatile and interactive computing environment to a vast and diverse population of engineers, researchers, traders, quants, sales marketers and trainees.

Why cloud based?

  • Because there is no setup whatsoever from a user standpoint, and it remains up to date. It is accessible from any connected computer with a browser, including those in meeting rooms. Even authentication is managed by the existing architecture.

Why versatile?

  • Because a diverse population has very different needs, from data crunching to numerical simulations, web scraping or visualisation. Basically there is no way to predict what users will actually do with the tools.

Why interactive?

  • Because business users are always in a hurry (under pressure) and won’t spend time on long setups. Interactive is less intimidating and more intuitive because of the instant feedback, and it allows fast iterations.

Why open source?

  • Because it is free, and it can be improved, adjusted or contributed to if necessary. It is safe to say that corporations increasingly recognise the value proposition of open source software.

Why widely used?

  • Because documentation and examples are all over the internet, and skills are in good supply. In this field, whatever you want to do, it is likely that somebody has already faced the same question and posted a solution, and Google will find you the answer.

B — Decision

So after some years of prototyping and trials, we have picked Jupyter and the Python scientific stack. The architecture is based on the Anaconda Python distribution from Continuum Analytics, the Artifactory package repository, GitHub Enterprise and the Docker platform.

C — Why Jupyter?

1 — The Notebook

First because of its notebook.

The Jupyter notebook is the friendliest, most convenient, powerful and stable interactive computing environment we know. It mixes rich text cells (markdown, LaTeX and raw html), code cells and rich output (containing the computation results). The output can be anything a web page can display from simple text to dynamic visualisations. So it is intuitive to run and as clear to read as a research document.
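
To make the mix of cell types concrete: a notebook file is a JSON document. The sketch below builds a minimal one with the standard library only. The cell contents are made up for illustration, and the layout follows the public nbformat version 4 schema as we understand it, so treat it as an approximation rather than a reference.

```python
import json


def make_notebook(cells):
    """Wrap a list of cells into a minimal nbformat-4 notebook dictionary."""
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


def markdown_cell(text):
    """A rich text cell: markdown, LaTeX or raw HTML all go in `source`."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(code):
    """A code cell that has not been executed yet, hence no outputs."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": code,
    }


# Illustrative content only: one text cell followed by one code cell.
notebook = make_notebook([
    markdown_cell("# Discount factor\nWe compute $e^{-rT}$."),
    code_cell("import math\nmath.exp(-0.02 * 5)"),
])

# Saving this text under a name ending in .ipynb gives a file Jupyter can open.
text = json.dumps(notebook, indent=1)
print(text)
```

Once the code cell is run in Jupyter, its result is stored in the `outputs` list of that cell, which is why a saved notebook reads like a document with the results included.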

We think one pillar of its success is the key design decision to use a web page rather than build yet another proprietary container. In effect this put the Jupyter notebook in a position to stand on the shoulders of giants (the big tech companies effectively building the internet).

2 — The Ecosystem

The notebook is the flagship of the Jupyter ecosystem which has many other critical components:

  • JupyterHub: Server delivering Jupyter notebooks to remote users — designed for organisations — after authentication.
  • Multi Language: Beyond Python there are many different kernels, like R or Julia.
  • IPywidgets: Web widgets (sliders, checkboxes, drop downs, text areas, etc) in the notebook open a new avenue in terms of input. It is not code only any more. The interactive experience nears that of a web page, which improves sharing potential.
  • NBViewer: Notebook viewer is a service to render a standalone notebook file as a web page, for static sharing. More generally NBConvert, which powers NBViewer, makes it possible to create HTML or LaTeX documents from notebooks.
  • GitHub integration: GitHub Enterprise (like github.com) renders notebooks natively.
  • JupyterLab: The new interface which will contain and expand the traditional notebook into a full-fledged IDE. It is still in beta but the deployment will come at no cost on the same infrastructure.
  • Slides: Jupyter notebook can be converted to reveal.js slides. So the narrative of a notebook can go to the meeting room screen.
  • Extensions: The notebook and its Python kernel are extensible. It is relatively easy to customise them. See this repo for a practical exploration of the topic. For the near future, JupyterLab is designed to be modular at its core.

In short, the Jupyter ecosystem is a powerful enabler for the mildly IT-literate user.

3 — Is it a bold decision?

Well by all means, no! :-)
I would even say it is the most natural choice.

Major cloud service providers such as Google Cloud, Microsoft Azure, Rackspace have commercial deployments of Jupyter in their offering. Bloomberg funds several Jupyter core developers and is said to be working to release a similar product via the Bloomberg terminal this year.

This year’s JupyterCon (Aug 2017, NYC) is a large event, 4 days long, heavily sponsored by O’Reilly, an influential technology publisher.

D — Why Python?

Python is easy, very versatile, and everybody already knows it.

1 — Simple syntax

Its syntax is compact, forgiving, easy to read, well documented, and there are examples all over the internet.
So it is ideal to write simple scripts — hence perfect for business users.

Note: To be future proof, we encourage Python 3.x but also provide Python 2.7.

2 — Vast Ecosystem

Talking of ecosystem, Python’s is boundless. It can tap a vast amount of resources — because people have written wrapper code for just about anything over the years.

It is the glue language par excellence.

The benefit of using the same top level language is that the entry cost to leveraging another part of its ecosystem is low. For example if you are used to doing scientific computing with numpy and scipy, you will quickly pick up the syntax of machine learning packages like scikit-learn or keras (over tensorflow).

Here are some examples of the areas and packages we have used or are exploring, with a few sample notebooks:

  • Scientific Computing: numpy, scipy, numexpr are well known proven tools in this domain. As an example, this notebook compares 3 human longevity models and can be read as a scientific article or downloaded and run and modified.
  • Dataframes and Time Series: pandas is the indispensable library. Time series data are ubiquitous in finance and pandas’ capabilities for manipulating them and managing calendars are outstanding. This is not so surprising since pandas was started by Wes McKinney about 10 years ago while he was working for the hedge fund AQR.
  • Compiled Python: cython, numba, xtensor-python. As convenient as Python is, we sometimes feel the need for more speed when running intensive numerical simulations. There are several ways to get it. In our experience, the simplest is to use numba decorators to accelerate numerical Python code to quasi C speed. Alternatively you can slightly amend your Python code to cython syntax and compile it with the notebook cython magic. See how you can gain ~x40 speed ups in this example. But for more elaborate cases, it is usually more robust to go back to traditional compiled languages such as C++ or C#. For C++ we particularly like xtensor-python which leverages pybind11 and also xtensor. In short, pybind11 makes interoperability between Python and C++ easy, and xtensor brings to C++ the convenience of numpy array manipulation. Here is an example notebook that uses pybind11 and xtensor-python (and ipywidgets). For C# we are considering a similar wrapper, but don’t have anything yet.
    The main point is that there are various ways to run serious numerical computation from Python.
  • Parallel Computing: ipyparallel. Another, complementary, way of getting more speed is to go parallel where possible. ipyparallel makes this easy: it enables you to leverage multicore machines that share access to a common drive. In our experience the most convenient way is to have each process write its results to the shared drive and have one process that gathers them once all have finished, as shown in this example notebook. Let’s say you have 3 relatively idle desktops on the LAN, each with 8 cores. You don’t want to clog them so you pick 6 cores on each. That is already a nominal x18 speedup. After overhead, in our experience you get about 70% of the nominal multiplier, so roughly x12 to x13 here. Obviously you can get larger speed gains if you can access cloud VMs.
  • Big Data: pyspark. Apache Spark is gaining traction as a major analysis suite for big data. Spark has a rich API for Python and several very useful built-in libraries like MLlib for machine learning and Spark Streaming for real time analysis. Jupyter is a convenient interface to perform exploratory data analysis and all kinds of other analytic tasks using Python (with pyspark), as described in this article. We are in the process of aggregating all the bank’s raw data in a massive, loosely structured ‘data lake’, so Jupyter can be useful there too.
  • Widgets: ipywidgets. IPywidgets are interesting because they enable a quasi web app user experience. For example here is a simple BlackScholes calculator web page and what it looks like as a notebook (github repo). Note that the nbviewer rendering of the notebook is static (it would become live with a JupyterHub). The two are not yet on par in terms of user experience, but you see the direction. Visualisation tools such as altair come with interactive widgets. Similarly dataframes can be explored with widgets.
  • Documentation: readthedocs, sphinx, nbconvert. Python has robust and widespread documentation tools like sphinx and equally widespread services like readthedocs hosting sphinx-built documentation. So we plan to integrate links to notebooks in sphinx documentation so that a user clicking on such a link would open a live notebook (powered by JupyterHub) and immediately experiment with the examples in the docs. Users who need to produce reports typically do not want the code cells to appear, only the text and output cells, and probably want to remove the cell numbers and polish the overall format a bit. So we can write a notebook extension (like these) to enable them to hide or remove selected cells (using metadata) before conversion to HTML or PDF.
  • Machine Learning: scikit-learn, xgboost, keras over tensorflow. Machine learning is hot and much talked about. Whether or not it has many direct applications in finance, beyond the hype, is an open question. The best way to start answering is probably to explore and prototype. Here the popular and very well documented scikit-learn library is the obvious starting point. For more advanced researchers, tensorflow is another candidate, but it is complex for the non-specialist. Fortunately keras offers a higher level API which should ease the approach to neural networks.
  • Web Servers: flask and its ecosystem, django. Within this list, the web server frameworks stand apart as they are independent of Jupyter notebooks. The 2 major frameworks we use are Flask and Django. Flask is simple to write and has many convenient extensions, which makes it surprisingly powerful and possibly accessible to the business user. Django is the industrial workhorse everybody knows.
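
Going back to the parallel computing item above, the "each worker writes its result, one process gathers them" pattern can be sketched as follows. This is not ipyparallel code: to stay self-contained it uses a thread pool from the standard library as a stand-in for the remote engines and a temporary folder as a stand-in for the shared drive. The function names and the toy workload are ours.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_task(task_id, out_dir):
    """Do some work and write the result to its own file in the shared folder."""
    # Toy workload, chosen only to have something to compute.
    result = {"task": task_id, "value": sum(i * i for i in range(task_id * 1000))}
    (Path(out_dir) / f"result_{task_id}.json").write_text(json.dumps(result))
    return task_id


def collect(out_dir):
    """Read every result file; meant to be called once all workers are done."""
    files = sorted(Path(out_dir).glob("result_*.json"))
    return [json.loads(path.read_text()) for path in files]


# A temporary folder stands in for the shared network drive.
with tempfile.TemporaryDirectory() as shared_dir:
    # A thread pool stands in for the engines running on other machines.
    with ThreadPoolExecutor(max_workers=4) as pool:
        finished = list(pool.map(lambda t: run_task(t, shared_dir), range(1, 7)))
    results = collect(shared_dir)

print(len(results), "results collected")
```

Because every worker writes to its own file, no locking is needed, and a failed task simply shows up as a missing file when the results are gathered.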

3 — Taught in universities

Last but not least, Python is often the default computer language taught at science universities (probably because you can express concepts concisely and experiment quickly — have a look at Peter Norvig’s notebooks for a demonstration!)
So it means that fresh trainees (the corporate dark matter that holds it all together ;-) arrive with that skill.

4 — Weakness: Speed

Now its many advantages come at a cost: Speed.
Fortunately Python’s ‘wrap-it-all’ capabilities bring the solution, as already mentioned in the section above. Let us develop a bit.

Very often the critical quantity is not computer time but human time, meaning that you would rather write slow code quickly than fast code slowly. Here Python excels.

But occasionally you really need speed, e.g. for numerical simulations. If the computations are structured array manipulations (column operations, or matrix multiplication), then numpy (C inside) or pandas (Cython inside) has your back and enables you to write concisely. But we have cases where we need to simulate path dependent trajectories, for example, and this generally cannot be done with array manipulations. Then there is no way around compiled code.
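
To illustrate what "path dependent" means here, consider the toy rule below, where the exposure at each step depends on the wealth reached at the previous step. The rule (a constant-proportion portfolio insurance style allocation) and all parameter values are picked by us purely for illustration. The point is only that the loop cannot be replaced by a single column operation, because each iteration needs the result of the one before.

```python
def cppi_path(returns, start=100.0, floor=80.0, multiplier=3.0):
    """Simulate wealth step by step under a simple floor-protection rule.

    All default values are arbitrary illustration values.
    """
    wealth = start
    path = [wealth]
    for r in returns:
        # The cushion is what we can afford to lose before hitting the floor.
        cushion = max(wealth - floor, 0.0)
        # Risky exposure is a multiple of the cushion, capped at total wealth.
        exposure = min(multiplier * cushion, wealth)
        # Only the exposed part earns the period return r.
        wealth = wealth + exposure * r
        path.append(wealth)
    return path


print(cppi_path([0.05, -0.10, 0.02]))
```

In plain Python this loop is slow over millions of steps or scenarios, which is exactly the situation where the compiled options below become useful.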

Python offers several ways to do that and I mention 3:

  • Cython: Pseudo Python language with extra type information that is compiled where possible (it silently reverts back to Python where not) and callable directly from Python without the user having to deal with low level configuration. The benefit is the integration in Python, the full control over the code, and the C-like execution speed.
  • Numba: Just-in-time compiler that turns plain Python code into compiled code with the help of a few annotations at the function definition level. It can be asked either to silently revert to Python if compilation fails, or to fail explicitly so that a successful run guarantees compiled speed. The big benefit is the very small change required in the code to compile it and the absence of any low level configuration, while the drawback is the total lack of control over what goes on under the hood.
  • xtensor-python: Wrapper for C++ code extended with xtensor. In short xtensor is a sort of numpy for C++ which makes it easy to write array-wide operations, with numpy-like broadcasting rules and lazy evaluation, notably useful for large data sets. And xtensor-python is a wrapper that makes it trivial to expose C++ functions and package them as Python modules, as shown in the cookie cutter. This approach makes it possible to leverage complex and/or existing C++ libraries. We think this library has a lot of potential. Apparently it may even power next gen pandas…

Note: These solutions are tedious to set up in the context of a Windows desktop as compiler installation is quite cumbersome on Windows while it becomes simple in the context of a cloud deployment based on Linux Docker images.
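
As a minimal sketch of the Numba route, the function below is ordinary Python with one decorator added. `njit` is Numba's decorator for the strict mode that fails explicitly rather than reverting to Python. So that the snippet also runs where Numba is not installed, we add a fallback that leaves the function uncompiled; that fallback and the toy recurrence are our own additions.

```python
try:
    from numba import njit  # compiles the decorated function on first call
except ImportError:
    def njit(func):
        """Stand-in used when Numba is not installed: leave the function as is."""
        return func


@njit
def logistic_end(x0, r, n_steps):
    """Iterate x -> r * x * (1 - x); each step needs the previous value."""
    x = x0
    for _ in range(n_steps):
        x = r * x * (1.0 - x)
    return x


print(logistic_end(0.5, 2.0, 10))
```

The first call pays the compilation cost, so timings should be taken on a second call. The body of the function is unchanged whether or not Numba is present, which is the main appeal of this approach.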

E — Architecture

The architecture rests on several other building blocks: A Python distribution, a package manager, a shared code repository, and a Docker platform with RedHat Linux based images.

1 — Anaconda

Anaconda is the leading Python distribution and comes with the very convenient package manager conda.
This distribution contains over 200 packages and their dependency trees. This is immensely helpful, and reliable now that Continuum thrives and millions have downloaded and use Anaconda.

Beyond the packages included in the Anaconda distribution, all PyPI packages can be installed, as conda manages pip packages seamlessly as well. Also conda gives access to conda-forge, an open source conda channel which enables complex install procedures (beyond pure Python), is growing fast, and is officially recommended by Continuum.

Package management may sound like a secondary topic, but only until it becomes an issue. Because then it is a crippling issue that slows down an advanced user to a crawl and completely stops business users who, again, are always in a hurry and don’t want to be bothered. It is important that ‘it just works’.

2 — Artifactory

Artifactory is a binary repository that allows:

Safe access to public packages

It enables easy access to community packages by mirroring remote repositories, such as Pypi.org and conda. Thus anyone inside the corporate LAN can access those packages as if they were outside. Obviously having a single point of contact with the internet also greatly simplifies security management.

Publishing of proprietary packages

Packages created and built internally can be published and stored in Artifactory, with a process similar to that of PyPI. They become available to users via package managers like conda or pip. Besides, access control can also be tightened to restrict access to specific users, to keep some modules relatively private.
While we should try to publish the packages publicly by default, those that contain IP will stay indoors.

3 — GitHub Enterprise

GitHub.com is ubiquitous. GitHub Enterprise is the on-premise version of github.com, which makes the very well known source code management tool available within the corporate LAN, with all the good that comes with it:

  • Version control
  • Reuse and collaboration via Pull Request
  • Easy interface to modify

As such, it will be used as the primary notebook repository, in the context of JupyterHub. This also leverages the fact that Jupyter notebooks are rendered natively in GitHub. It also provides OAuth authentication, which is very useful to control access to JupyterHub based on the internal user directory.

4 — Docker platform

Docker is a recent technology for building, running and deploying self contained applications in isolated containers. It has become the leading software container platform, maybe because it is very easy to make such containers and deploy them on a cloud, public or private.

Docker Swarm is a way to aggregate multiple machines in one cluster. It automatically manages the distribution of containers on the set of machines, thereby simplifying cluster management and enabling large scale deployment.

Docker Datacenter is Docker’s enterprise solution. It has two main components:

  • Docker UCP (Universal Control Plane), which is basically a big Docker Swarm with access control along with enterprise features for security, scaling, etc…
  • Docker Trusted Registry, an on-premise Docker Hub that stores Docker images internally.

For JupyterHub, we are using Docker containers to provide an isolated workspace for each user, pre-loaded with various sets of modules. We have 4 base images:

  • basic (miniconda)
  • std (anaconda + proprietary packages)
  • custom machine learning (std + keras & tensorflow)
  • custom speed (std + xtensor-python & g++ & xtensor)

5 — RedHat Adaptation

A lot of companies use RedHat as their main Linux distribution, because of support contracts. It is not always so in the community though, because of licence constraints and because outdated packages are very common in RedHat.

As a result, we had to adapt the JupyterHub images from Debian to work on RedHat, the only Linux OS available inside the corporate LAN.

You can start from the Debian base notebook Dockerfile provided by the Jupyter team.

From there the first obvious modification is to switch from apt (Debian package manager) to yum :

Second, retrieve all your binaries from a local mirror. You can use Artifactory for this: create two generic repositories pointing to the Anaconda repo and the Miniconda repo, and check hashes from here.

Then you will need to configure your install to point to your internal mirrors for the PyPI and conda repos. You’ll need a pip.conf file to configure pip, and a .condarc file to configure conda.
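
As a rough idea of what the pip side looks like, the sketch below generates the text of a pip.conf with the standard library. `index-url` and `trusted-host` are regular pip options. The host name is a placeholder, and the URL layout is the one we believe Artifactory exposes for PyPI repositories, so check it against your own instance.

```python
import configparser
import io


def render_pip_conf(host, repo):
    """Return the text of a pip.conf that sends pip to an internal mirror."""
    config = configparser.ConfigParser()
    config["global"] = {
        # Assumed Artifactory URL layout for a PyPI repository.
        "index-url": f"https://{host}/artifactory/api/pypi/{repo}/simple",
        # Lets pip accept the internal host, e.g. behind a corporate certificate.
        "trusted-host": host,
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Placeholder host and repository name, for illustration only.
print(render_pip_conf("artifactory.example.internal", "global-pypi"))
```

The resulting text is what you would copy into the image as pip.conf. The .condarc file plays the same role for conda, with its own syntax.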

Once done, you must install the JupyterHub packages below, as well as all those you need for your default environment.

Finally you can copy the launch scripts and run them.

F — Easy Sharing

The single most important feature we aim at is easy sharing.
Ideally a user should be able to

  • open a notebook in JupyterHub from a link on a github repo
  • open a notebook from a link received by email
  • give access to a notebook by storing it on a github repo
  • give access to a notebook via a cryptic url
  • access network drives to read and write data

We are working with 3 methods to share notebooks.

Sharing via Git

We use the on-premise GitHub Enterprise solution, so sharing using git is pretty straightforward: pull requests, repositories (with access control) and a basic GitHub workflow work well to share modifications.

It does imply a basic knowledge of git though : how to pull, commit, push, use branches. This is not a given for all the user population: The business users have never heard of it.

Sharing via NFS

Sharing via a shared file system is also a solution. A working prototype is in place but needs to be improved. Access through Docker containers is possible, and would make it possible to read and write from a notebook and access the results from the Windows File Explorer.

Sharing via Docker NetApp plugin

Docker’s netapp plugin is another way to have Docker Volumes shared between containers, and stored on the Docker Datacenter. We don’t have a working prototype yet, but we are investigating.

G — Try on your desktop

This repo contains sample scripts for the architecture described above.

You will need to have a recent version of Docker installed.

In short, it spawns (1) a JupyterHub instance that uses GitHub for authentication and (2) an Artifactory instance that acts as a proxy to the PyPI and Continuum repos.

1 — Create a GitHub application

The first step is to register a new OAuth application with your GitHub account, and get a client ID and secret.

To do this, fill this form with the following info:

  • Application Name : Whatever
  • Homepage URL : Whatever
  • Application Description : Whatever
  • Authorization callback URL : https://0.0.0.0/hub/oauth_callback

The Authorization callback URL is the URL to which the user is redirected after authentication. Here, we’re pointing to the future URL of your JupyterHub instance, running locally.
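
To show where these values end up, here is a hedged sketch of the kind of settings a jupyterhub_config.py carries for GitHub login and Docker spawning. The option names are those of the oauthenticator and dockerspawner packages as we know them. The environment variable names for the GitHub credentials are our assumption, and a plain namespace object stands in for the real configuration object so that the snippet runs on its own.

```python
from types import SimpleNamespace


def apply_hub_settings(c, env):
    """Set authentication and spawning options on a JupyterHub config object."""
    # Log users in through GitHub (or GitHub Enterprise) OAuth.
    c.JupyterHub.authenticator_class = "oauthenticator.GitHubOAuthenticator"
    c.GitHubOAuthenticator.client_id = env["GITHUB_CLIENT_ID"]          # assumed name
    c.GitHubOAuthenticator.client_secret = env["GITHUB_CLIENT_SECRET"]  # assumed name
    c.GitHubOAuthenticator.oauth_callback_url = env["OAUTH_CALLBACK_URL"]
    # Start one Docker container per user from the chosen notebook image.
    c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
    c.DockerSpawner.image = env["DOCKER_NOTEBOOK_IMAGE"]
    c.DockerSpawner.network_name = env["DOCKER_NETWORK_NAME"]
    return c


# Stand-in for the object JupyterHub provides through get_config().
fake_config = SimpleNamespace(
    JupyterHub=SimpleNamespace(),
    GitHubOAuthenticator=SimpleNamespace(),
    DockerSpawner=SimpleNamespace(),
)

# Placeholder values; never commit a real client secret.
sample_env = {
    "GITHUB_CLIENT_ID": "your-client-id",
    "GITHUB_CLIENT_SECRET": "your-client-secret",
    "OAUTH_CALLBACK_URL": "https://0.0.0.0/hub/oauth_callback",
    "DOCKER_NOTEBOOK_IMAGE": "custom_gcc_miniconda",
    "DOCKER_NETWORK_NAME": "jupyterhub-network",  # hypothetical network name
}

apply_hub_settings(fake_config, sample_env)
print(fake_config.GitHubOAuthenticator.oauth_callback_url)
```

In a real jupyterhub_config.py the same function would be called with `c = get_config()` and `os.environ`. The callback URL must match the one registered in the GitHub form exactly, otherwise GitHub refuses the redirect.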

2 — Make your setup

Next we will create a file containing the variables to be used in our sample.

Create a file .env in the jupyterhub-sample folder, as follows:

The important variables here are your GitHub client ID and secret from step 1, and HOST_ARTIFACTORY_HOME which should be a folder on your machine that the Artifactory instance will use to store data and to which you have easy access. This is mostly to look at application logs in case of problems. For example, mine is: /home/christophe/Dev/artifactory_home.

The other variables are :

  • DOCKER_NETWORK_NAME : Name of the Docker network, to have everything on a subnet.
  • DOCKER_NOTEBOOK_IMAGE : Name of the notebook image to be spawned by JupyterHub. We create it right after.
  • Miscellaneous utilities (starting point, work dir, volume conf, etc…)
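
Before launching anything it can help to check that the file is complete. The small helper below is our own suggestion, not part of the sample repo: it parses KEY=VALUE lines and reports the variables that are missing or empty. The names used for the GitHub credentials are assumptions, so adapt them to the names actually used in your .env file.

```python
REQUIRED = (
    "GITHUB_CLIENT_ID",       # assumed variable name
    "GITHUB_CLIENT_SECRET",   # assumed variable name
    "HOST_ARTIFACTORY_HOME",
    "DOCKER_NETWORK_NAME",
    "DOCKER_NOTEBOOK_IMAGE",
)


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_variables(values, required=REQUIRED):
    """Return the required names that are absent or empty."""
    return [name for name in required if not values.get(name)]


# Placeholder content: the credentials are deliberately left empty.
sample = """
# placeholder values only
GITHUB_CLIENT_ID=
GITHUB_CLIENT_SECRET=
HOST_ARTIFACTORY_HOME=/home/christophe/Dev/artifactory_home
DOCKER_NETWORK_NAME=jupyterhub-network
DOCKER_NOTEBOOK_IMAGE=custom_gcc_miniconda
"""

print(missing_variables(parse_env(sample)))
```

With the placeholder content above, the two empty credentials are reported, which is the reminder to paste in the client ID and secret from step 1.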

3 — Build a custom notebook image

Next, we need a notebook Docker image ready to be spawned by the JupyterHub. For this, run the following command:

This will build the image described in this Dockerfile, based on CentOS (the RedHat-compatible distribution) and with gcc installed for cython and similar tools. As you can see, we use the name custom_gcc_miniconda, the same as in the .env file.

4 — Get an Artifactory licence

You need a trial licence with Artifactory, which you can get here. Paste the licence you get in this file: jupyter-sample/artifactory/artifactory.lic.

5 — Run

Finally launch the stack from the folder jupyter-sample by running:

You should see a lot of logs about Artifactory and JupyterHub, and you should be able to connect to both instances:

You will notice three existing repos in Artifactory:

  • A proxy to Pypi.org, ext-pypi
  • A local repository for Pypi packages, local-pypi
  • A virtual repo containing the two others: global-pypi

Our custom notebook image is configured to point to the virtual repo global-pypi. Check the pip.conf file.

Now you should have a working stack! Connect to JupyterHub, spawn your own kernel, install any packages you need via pip and work/test as you wish.

Conclusion

We share this corporate development first as a token of gratitude to the Jupyter developers. The tool is just great. The vision is just right.

We also want to contribute back to the ecosystem, even modestly.

At a time when corporations increasingly recognise the value proposition of open source software, how to best leverage it is still very much an open question. So experiments must be made. Well, the Jupyter and Python ecosystems are a safe bet!

Finally we would like to ask for recommendations or opinions on all the points mentioned above, particularly the sharing features. As I pointed out it really is critical in our view. So feedback is welcome.

Christophe Lecointe, Olivier Borderies
