Grey new world

There is a grey area between open-source and proprietary platforms, and in it may lie the sweet spot for data science platforms.

Theo McCaie
Met Office Informatics Lab
5 min read · Sep 18, 2020


This research was supported by the ERDF through the Impact Lab.

The Pangeo community is pushing forward geospatial data science with a holistic suite of open-source projects. Similarly, cloud vendors are developing increasingly sophisticated data science platforms. Embracing a middle ground between these two worlds could unlock a number of benefits with minimal downsides.

Open or closed?

To be truly cloud-agnostic you are forced to the lowest common denominator: you can only build on products and services available across a multitude of providers, and must avoid all the ‘secret sauce’ that makes any one platform sing.

Open source versus proprietary is a well-worn debate that I won’t wade too far into, but I’ll characterise two of the most important arguments:

Total Cost of Ownership (TCO): Open source doesn't always offer the best long-term value. Whilst open source may be ‘free’, given the resources needed to manage and maintain it, paying for a closed-source option might work out cheaper.

Vendor lock-in: While a proprietary system might be the most cost-effective solution now, it’s easy to find that you are “locked in” and that the cost of leaving the system is prohibitive, even if better value can now be found elsewhere.

I’ve ignored many arguments and there is important nuance missing from those I’ve included. However, this sets the scene for the advantage of this middle ground.

Going grey gracefully

The vision for a ‘grey’ platform is one made up of multiple subsystems, open or closed source, that all conform to some open and consistent APIs. Through this design any one component can be swapped out for another that conforms to the same API. The goal is to get the benefits of proprietary software (support, service levels, lower cost of ownership, latest features, etc.) whilst minimising lock-in.

This is not a new idea; it’s a software and platform design that has existed for years. The tricky part is getting everyone to agree on the ‘components’ and the APIs between them.

Pangeo — the dominant design?

XKCD: Standards — https://xkcd.com/927

To have the best chance of success in this model we need to work with the emerging dominant design: the APIs and components on which the community is reaching a consensus. The Pangeo community provides an excellent vehicle for doing this.

Through the Pangeo community we can see that the dominant design is based around the Jupyter ecosystem, with a strong preference towards Dask for distributed workloads. Informed by this, I’ll try to lay out what I think the Lego bricks of this grey new world are.

The Lego bricks

Compute Instance (CI) — An interactive compute environment exposed through a web browser. JupyterLab is the preferred flavour, but arguably other offerings would be suitable if they conformed to the API. The JupyterHub Spawner is the API to which a compute instance must conform to qualify as part of our grey environment; that is, you must be able to control the CI through the JupyterHub Spawner API.
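
To make that contract concrete, here is a minimal sketch of what a spawner for some cloud-hosted compute service could look like. The class name and the provisioning helpers are hypothetical; only the start/poll/stop methods belong to the real JupyterHub Spawner API.

```python
from jupyterhub.spawner import Spawner


class RemotePlatformSpawner(Spawner):
    """Hypothetical sketch: any service that can implement start/poll/stop
    this way can act as the Compute Instance behind JupyterHub."""

    async def start(self):
        # Ask the backing platform (Azure ML, a VM, Kubernetes, ...) to launch
        # a single-user Jupyter server, then tell JupyterHub where it listens.
        ip, port = await self._provision_compute()  # hypothetical helper
        return ip, port

    async def poll(self):
        # None means "still running"; an integer exit status means it stopped.
        return None if await self._is_running() else 0  # hypothetical helper

    async def stop(self, now=False):
        # Tear the compute instance back down when the user is finished.
        await self._deprovision_compute()  # hypothetical helper
```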

Cluster — A distributed set of compute resources that can be orchestrated to perform a task. Dask Cloud Provider would be the API through which this is exposed.
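
The value of agreeing on this interface is that analysis code never needs to know which implementation sits behind it. A minimal sketch, using Dask’s built-in LocalCluster as the stand-in; a cloud-provider cluster class would slot into the same place:

```python
import dask.array as da
from dask.distributed import Client, LocalCluster

# Every implementation of the cluster contract exposes the same interface, so
# this LocalCluster could be swapped for, say, a dask-cloudprovider cluster
# without changing any of the analysis code below.
cluster = LocalCluster(n_workers=4)
client = Client(cluster)

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())  # work runs on whichever cluster backs `client`

cluster.scale(8)  # scaling is exposed the same way across implementations
```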

Orchestration — Provisioning CIs, routing users to their CIs, housekeeping, etc. JupyterHub is envisioned as the tool to do this. The other services should interact well with JupyterHub where applicable.
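
JupyterHub’s configuration is where the swapping happens: the spawner class named there is whatever implements the CI contract. A sketch of the relevant lines of a jupyterhub_config.py (the AzureMLSpawner module path is hypothetical):

```python
# jupyterhub_config.py (sketch)
# The orchestrator stays the same; only the spawner plugged in behind it changes.
c.JupyterHub.spawner_class = "azureml_spawner.AzureMLSpawner"  # hypothetical package/class
c.Spawner.default_url = "/lab"     # drop users straight into JupyterLab
c.Spawner.start_timeout = 600      # cloud back-ends can take a while to provision
```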

Authentication — The ability to control who has what access to the platform and ensure secure login. OAuth2 would be the API through which this is delivered.
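
On the JupyterHub side this already exists in the form of the oauthenticator package; the snippet below is a sketch with placeholder URLs and client details for whichever identity provider sits behind it.

```python
# jupyterhub_config.py (continued sketch)
from oauthenticator.generic import GenericOAuthenticator

c.JupyterHub.authenticator_class = GenericOAuthenticator
c.GenericOAuthenticator.client_id = "my-client-id"          # placeholder
c.GenericOAuthenticator.client_secret = "my-client-secret"  # placeholder
c.GenericOAuthenticator.oauth_callback_url = "https://hub.example.org/hub/oauth_callback"
c.GenericOAuthenticator.authorize_url = "https://idp.example.org/oauth2/authorize"
c.GenericOAuthenticator.token_url = "https://idp.example.org/oauth2/token"
c.GenericOAuthenticator.userdata_url = "https://idp.example.org/oauth2/userinfo"
```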

The above are emerging strongly as the dominant design within the Pangeo community. Other components are much more in the pre-dominant-design phase. That said, I will speculate on them:

Experiment orchestration — In the field of Machine Learning (ML) there is frequently a need to run the same code many hundreds or thousands of times with different hyperparameters and understand the performance or results. I consider this one example of ‘experiment orchestration’: the process of running some process repeatedly with many different inputs and evaluating the outputs. There is no clear dominant design here, with a range of libraries and cloud providers competing in this space. It is possible the Pangeo community could lean into one design or implementation to speed up its emergence, but the counter to this is the XKCD comic above.
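
In the absence of a dominant design, one way to express this today is with plain Dask, fanning the same function out over a grid of inputs. This is only a sketch; train_and_score stands in for whatever code the experiment actually runs.

```python
from dask.distributed import Client


def train_and_score(learning_rate, n_layers):
    # Stand-in for real training code: return a validation metric.
    return -abs(learning_rate - 0.01) - 0.001 * n_layers


client = Client()  # or Client(cluster) to run on a remote cluster

grid = [(lr, n) for lr in (1e-3, 1e-2, 1e-1) for n in (2, 4, 8)]
futures = [client.submit(train_and_score, lr, n) for lr, n in grid]
results = dict(zip(grid, client.gather(futures)))

best = max(results, key=results.get)
print("best hyperparameters:", best, "score:", results[best])
```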

Environment management — Ensuring the libraries and tools necessary are available to your CI, your experiment, or code someone shared with you is a constant struggle. Whilst Conda is perhaps the dominant design in this space, the tools for sharing these requirements between CIs, experiments, teams, platforms and so on are either not there or not reliable. The other contender in this space would be containers. Containers ensure a consistent environment, but many users find them troublesome to create and use. Furthermore, containers do too much: if you have your app (JupyterLab) in container image A but your environment in image B, you will struggle.
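
For completeness, this is the kind of Conda specification the paragraph refers to: a small, shareable file that can recreate the same environment on a CI, a cluster worker, or a colleague’s laptop. The names and versions here are purely illustrative.

```yaml
# environment.yml (illustrative)
name: grey-platform
channels:
  - conda-forge
dependencies:
  - python=3.8
  - jupyterlab
  - dask
  - distributed
  - xarray
  - zarr
  - intake
  - fsspec
```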

Homespace — I have written a previous post, “Homespace, the missing as a service?”, arguing for this. What we need is some common API (that plays well with the API for a CI) to ‘mount’ a range of file systems. FUSE is an important player in this space, as perhaps is fsspec.
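
fsspec already gives a flavour of what such a common API could look like: the same calls work against local disk, S3, GCS, Azure and more, without a FUSE mount. A small sketch (the remote URL is a placeholder):

```python
import fsspec

# The same filesystem interface regardless of the backing store: swap "file"
# for "s3", "gcs" or "abfs" (with the matching extras installed) and the
# calls below stay the same.
fs = fsspec.filesystem("file")
print(fs.ls("/tmp"))

# Open a remote object as if it were a local file, no mount required.
with fsspec.open("https://example.org/data/some-file.txt") as f:  # placeholder URL
    print(f.read())
```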

Data access — The ability to access and easily perform complex analysis on large datasets. This needs to support ‘universal access’: access from anywhere in the world (if authorised to do so). Equally important is parallel access: the ability to access the data simultaneously from many thousands of processes. Intake, alongside technologies like Zarr and TileDB, is perhaps emerging as a dominant design here. A pattern that we see frequently, but hope is waning, is “fake POSIX”: treating datasets the same way we did on local storage (same file types, same access patterns) and using FUSE or other technologies to mount them as if they were POSIX file systems. Ignoring the nature of cloud storage results in poor performance and higher costs, for the dubious advantage of not having to think hard about the problem now.
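
A sketch of what catalog-driven, cloud-native access looks like with Intake, xarray and Zarr. The catalog URL and entry names are placeholders; the point is that the data is opened lazily and read in parallel straight from object storage rather than through a “fake POSIX” mount.

```python
import intake

# Placeholder catalog and entry name: a real project would publish its own,
# with the entry assumed here to be a Zarr-backed xarray source.
cat = intake.open_catalog("https://example.org/catalogs/ocean.yaml")
ds = cat["sea_surface_temperature"].to_dask()  # lazy xarray Dataset backed by Zarr

# Chunked, parallel reads: Dask workers pull only the chunks they need,
# concurrently, directly from object storage.
monthly_mean = ds["sst"].resample(time="1M").mean().compute()
```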

Making it happen

In collaboration with Microsoft AI for Earth and The Alan Turing Institute, we are putting this theory to the test. We are currently building an Azure Machine Learning Spawner so that Azure ML can fulfil the contract of a CI. This will be deployed in a Jupyter data platform using Azure Active Directory for authentication and the dask-cloudprovider AzureMLCluster for distributed clusters.

We invite you to join us in this experiment, and we would be keen to see other cloud providers and data engineers embrace the grey area.

Update

Inspired by some of the responses to this post, here is a video showing one of our early prototypes. It is very early days, but I hope it demonstrates the concept.


Theo McCaie
Met Office Informatics Lab

Head of the Met Office Informatics Lab. Leading a team exploring new ideas in and between Science, Design, and Technology, creating beautiful, useful solutions.