The Intake Server Service

Exposing Intake Server as a service on Pangeo

Peter Killick
Met Office Informatics Lab
6 min read · Dec 20, 2019

One of the main areas of focus in the Informatics Lab at the moment is provenance, broadly defined as knowing the history of some object. We recently published an article that defined provenance in greater detail and laid out some of our thinking on it, including why we believe this is an important area of research. Here we will explore just one small, technical piece of work that contributes to the wider provenance effort.

Cataloguing what you’ve got (data, analysis routines and more; collectively, artifacts) is an important part of provenance tracking. Within a catalogue we can record not only these artifacts but also extra information about each one, such as when it was created, who created it, how it was created, its processing history, and more besides. This information records the provenance of each artifact and also allows us to build up a provenance chain for each artifact.

For example, this extra information about a given artifact, artifact_b, might allow us to determine that it was produced in an experiment run on June 15, by user21, from a script called my_processing_script, which applied a statistical collapse operation to artifact_a to produce artifact_b. This is a useful history in itself, but it also allows us to assess the value of artifact_b. To further the example, it could be that the origin of artifact_a (as recorded in its own extra information) is a model run with a known bias. This bias may therefore be present in artifact_b as well, a fact we can derive by looking at its provenance chain (as user21 should perhaps have done before choosing to use artifact_a to create artifact_b).
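
For illustration, this kind of history can be recorded as structured metadata against a catalogue entry. The sketch below uses the YAML catalogue format of Intake (introduced in the next section); the storage path and the field names under metadata are hypothetical, chosen to match the example above.

    sources:
      artifact_b:
        description: "Statistical collapse of artifact_a"
        driver: netcdf
        args:
          urlpath: "s3://our-artifact-store/artifact_b.nc"
        metadata:
          created: "2019-06-15"
          created_by: "user21"
          created_with: "my_processing_script"
          operation: "statistical collapse"
          derived_from: "artifact_a"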

In Python, we can produce catalogues of data and more using the Intake library. Intake can also be used to stand up a server that serves catalogues over a network. Here we will explore how we exposed the Intake server as a service running on a kubernetes cluster.

Intake In Detail

Intake is an open-source Python library that aims to make it easy to find and load data. This is achieved via Intake catalogues, which record both where datasets are stored (a filepath or a web link, for example to an S3 bucket) and how to load them (a driver in Intake parlance). This means that you do not need to remember where on a filesystem the dataset you want to use is stored (this is recorded in the Intake catalogue), nor the specific code needed to load the dataset into your Python session (a reference to the driver is also recorded in the intake catalogue). As such, the pain of finding and loading data is taken away, and you are free to focus on using the data, rather than trying to access it.
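
As a minimal sketch of what this looks like in practice (the catalogue filename and the entry name here are hypothetical):

    import intake

    # Open a catalogue file; this could equally be a URL.
    cat = intake.open_catalog("provenance-catalogue.yaml")

    # List what the catalogue contains, without worrying about where anything lives.
    print(list(cat))

    # Load an entry: the catalogue already records the location and the driver,
    # so no dataset-specific loading code is needed here.
    ds = cat.artifact_b.read()

    # The extra information recorded against the entry is also available.
    print(cat.artifact_b.metadata)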

Intake catalogues are designed with data in mind, but in principle there’s nothing that limits them to data alone. This means we could explore storing more of the artifacts of provenance in Intake catalogues, such as analysis routines described in Jupyter notebooks. Tools such as papermill provide functionality for automatically executing Jupyter notebooks, so we could produce an Intake driver that executes a notebook when it is loaded. Together, this gives us a simple provenance system built entirely on Intake catalogues.
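
We haven’t built such a driver here, but a minimal sketch of the idea, assuming Intake’s DataSource base class and papermill’s execute_notebook function, might look something like this:

    import papermill as pm
    from intake.source.base import DataSource, Schema

    class NotebookSource(DataSource):
        """Hypothetical Intake driver that executes a Jupyter notebook on load."""

        name = "jupyter_notebook"
        container = "python"
        version = "0.0.1"
        partition_access = False

        def __init__(self, urlpath, parameters=None, metadata=None):
            self._urlpath = urlpath
            self._parameters = parameters or {}
            super().__init__(metadata=metadata)

        def _get_schema(self):
            # Notebooks don't have a tabular or array schema, so return a trivial one.
            return Schema(datashape=None, dtype=None, shape=None,
                          npartitions=1, extra_metadata={})

        def read(self):
            # Execute the notebook with papermill and return the path of the
            # executed copy, which records exactly what analysis was run.
            executed = self._urlpath.replace(".ipynb", ".executed.ipynb")
            pm.execute_notebook(self._urlpath, executed, parameters=self._parameters)
            return executed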

Intake Server

Intake server serves up one or more Intake catalogues over a network, accessible by connecting to the server’s endpoint rather than by reading an Intake catalogue from a file. This means you could run intake-server as a discrete service on a kubernetes cluster, which is precisely the aim here. Other services running on the cluster, such as user notebooks in Pangeo, can then connect to the Intake server’s endpoint and read the catalogues being served by the intake server.
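
From a client’s point of view (a user notebook on Pangeo, say), the only change is the address you open. The hostname, port and entry name below are placeholders:

    import intake

    # Connect to a running Intake server instead of reading a catalogue file.
    # "intake-server" would typically be the DNS name of the kubernetes Service.
    cat = intake.open_catalog("intake://intake-server:5000")

    # From here the remote catalogue behaves much like a local one.
    print(list(cat))
    ds = cat.artifact_b.read()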

Making the Intake Server Service

To create an Intake server service we followed these steps:

  • Create a conda env that includes Intake and all the add-on drivers that we use (specifically intake-iris and intake-xarray).
  • Put the conda env in a docker container that runs intake-server as the container’s entrypoint.
  • Create a kubernetes Deployment that itself creates a kubernetes pod running the docker container.
  • Add a kubernetes Service that exposes the pod on the network via DNS for easy discoverability (a simplified sketch of the Deployment and Service follows this list).
  • Write a helm chart that wraps up all of the kubernetes elements for easy, customisable deployment on a kubernetes cluster.
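
To give a flavour of the kubernetes pieces, here is a much-simplified sketch of the Deployment and Service. The image name, labels and port are placeholders; in practice these manifests are templated by the helm chart.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: intake-server
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: intake-server
      template:
        metadata:
          labels:
            app: intake-server
        spec:
          containers:
            - name: intake-server
              image: <our-intake-server-image>   # the docker container described above
              ports:
                - containerPort: 5000
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: intake-server
    spec:
      selector:
        app: intake-server
      ports:
        - port: 5000
          targetPort: 5000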

What’s the benefit of this?

The main thing that the Intake server provides is simplicity. You can use intake-server as a single service that provides access to all required catalogues, without needing to know or remember how to access each catalogue individually. The running Intake server then acts as a single location from which you can explore and load all of your catalogued data.

By default the Intake server’s endpoint can be accessed by all the other services running on the same kubernetes cluster, so it can be used to serve catalogues to every service on the cluster. With a little more setup (and due security consideration) it could also be exposed outside the cluster, which means it could be used to share catalogues of data with partner institutions, and to serve data that would otherwise not be visible on the cluster.

On a purely personal level, I learned a lot while creating the Intake server service. Despite having used kubernetes and helm a lot while creating and administering Informatics Lab Pangeos, I felt that my understanding of how these technologies actually work was lacking. Having to set up the kubernetes resources myself and then encapsulate that setup in a helm chart means I now understand much better how both technologies function.

Any Downsides?

The biggest challenge to the concept of the Intake server is whether it is really necessary. It’s already pretty easy to point Intake at one or more catalogues and access the data in these catalogues. In many ways doing this via a server rather than via the filesystem is just added complexity, as there’s now also a server to maintain.

The server functionality in Intake is still reasonably new and immature, so it is limited and occasionally does not behave as expected. For example, one benefit we hoped to get from using intake-server was that we would not need to add extra Intake drivers to conda envs (particularly user-defined envs) on Pangeo. Unfortunately this is not currently possible with intake-server: all drivers need to be present on both the server and the client. Hopefully this and other limitations will be removed as the Intake server matures.

How do I use it?

At the moment this helm chart is only available in a git repo in the Informatics Lab organisation. You can install the Intake server helm chart onto your cluster as follows:
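
The commands will look something along these lines. The repo URL and chart path are placeholders, and the flags shown use Helm 3 syntax (with Helm 2 you would pass --name <deployment-name> instead):

    # Fetch the chart from the Informatics Lab git repo, then install it with
    # your customisations. <repo-url> and <path-to-chart> are placeholders.
    git clone <repo-url>
    cd <path-to-chart>
    helm install <deployment-name> . \
      --namespace <my-namespace> \
      --values customisations.yaml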

Here it’s assumed that you have already created a yaml file called customisations.yaml for customising the install of the helm chart onto your kubernetes cluster. You will also need to choose a namespace (in place of <my-namespace>) and a name for the deployment (in place of <deployment-name>).

Next Steps

The Intake server offering is a little immature at the moment. Once it gains more functionality it could become a useful mechanism for interacting with Intake catalogues, on Pangeo and on other kubernetes clusters.

As we saw in the previous section, the Intake server helm chart is currently not particularly accessible, as it is not available via any of the standard helm repos. We intend to improve this by submitting the chart to Helm Hub. We will also need to provide documentation for the helm chart, particularly for its configurable parameters.

On a technical level, the helm chart currently only allows you to specify a single catalogue file for the server to serve. While you can nest Intake catalogues so that all required catalogues are available from this single catalogue, it would be preferable to be able to specify multiple catalogues in the helm chart. This should be entirely doable; it just hasn’t been implemented yet!
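
For reference, nesting currently looks something like this: a single top-level catalogue whose entries are themselves catalogues, using Intake’s yaml_file_cat driver (the paths and entry names here are hypothetical):

    sources:
      observations:
        driver: yaml_file_cat
        args:
          path: "catalogues/observations.yaml"
      model_runs:
        driver: yaml_file_cat
        args:
          path: "catalogues/model_runs.yaml"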

Intake server uses server-specific drivers, called remote drivers, to load data from catalogues. Remote drivers are provided alongside the standard drivers that ship with Intake itself, but provision is patchier for third-party drivers. For example, Iris and Xarray are Python libraries used extensively for analysing earth system science datasets, and are thus of particular relevance to us. Both have standard Intake drivers, but Iris does not have a remote driver at all, and Xarray’s remote driver (at the time of testing) did not behave equivalently to its standard driver. A further technical improvement that’s needed, then, is better provision of third-party remote drivers for Intake.

Finally, it would be really interesting to explore further the possibility of sharing catalogues and data via the Intake server. This could open up many opportunities to share access to data that would otherwise be inaccessible.

Peter Killick
Met Office Informatics Lab

Cloud Platform Architect, open-source software engineer and technology researcher in the UK Met Office Informatics Lab. I tend to blog on these themes.