Data science is often like exploring the mountainous wilderness with a baby on your back

Designing a Data Science Platform upon Kubernetes

Ray Yamamoto Hilton
Eliiza-AI

--

While Eliiza is still a relatively young company, we have quickly found some common patterns in the problems we regularly need to solve. We want to provide a platform for our team of Data Scientists, as well as clients, that offers a Simple, Affordable, Scalable and Available set of tools.

Motivations

Obligatory photo of shipping containers to represent Kubernetes

Simple: The platform needs to be relatively simple to maintain and deploy, ideally using as few different configuration tools as possible. It should also be simple to use: single sign-on, simple dashboards, and so on.

Affordable: Taking advantage of things like spot pricing to get cheaper, opportunistic compute. Auto-scaling should also bring the cluster size down when it's not in use.

Scalable: Able to scale up to support very large data and long-running, highly parallel compute tasks.

Available: We want to run everything in Kubernetes to give us the best possible chance of being able to run our platform anywhere Kubernetes can run (Google Cloud, AWS, Azure, on-premises and even on one’s own laptop).

These tools fall into three main categories: Code, Storage & Compute. Here are some that we’re using:

Spark-backed dplyr dataframes in R

Code

Data science is often conducted in interactive coding and visualisation environments such as Jupyter Notebooks, RStudio or Apache Zeppelin.

These environments provide support for writing code in R or Python (amongst others) for a variety of purposes, such as querying, transforming and visualising information. While there are many powerful features built into these languages and environments, we hit system limitations (CPU, memory, disk space, etc.) fairly quickly when dealing with larger datasets.

To work around those limits, we can augment these environments with external storage and compute tools:

Cloud storage is a bit like a library, but without the books or the building.

Storage

There are a few types of cloud storage available to us.

Object Stores

We can store large amounts of data in object stores like AWS S3 or Google Cloud Storage. Unlike traditional filesystems, these are oriented towards availability and redundancy at the expense of certain kinds of performance. However, by using a columnar format like Parquet and caching parts of the data in memory, we can achieve pretty good query performance over large datasets, e.g. 3 seconds to run a group-by query across 7 GB of data.
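
As a rough sketch of what this looks like from a notebook, here is a minimal PySpark example. The bucket, path and event_date column are hypothetical, and it assumes a Spark session with S3 credentials and the hadoop-aws connector available (see the Compute section below for where that session comes from):

```python
from pyspark.sql import SparkSession

# Assumes S3 credentials and the hadoop-aws connector are configured so
# that s3a:// paths resolve; bucket and prefix are hypothetical.
spark = SparkSession.builder.appName("parquet-s3-example").getOrCreate()

events = spark.read.parquet("s3a://example-bucket/events/")

# A typical aggregation over the whole dataset; Parquet's columnar layout
# means only the referenced columns are actually scanned.
daily_counts = events.groupBy("event_date").count()
daily_counts.show()
```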

AWS EFS

AWS’s Elastic File System is a networked filesystem that offers effectively unlimited storage, supports high aggregate throughput (faster than EBS) and can be mounted onto multiple machines at once using the NFS protocol. This comes at a higher cost than S3 or EBS, but it has been invaluable for creating portable, persistent home and shared directories across the various coding environments.

However, AWS EFS support in Kubernetes isn’t official, and the existing solutions are a little out of date and didn’t work smoothly for me (I had to create new Kubernetes service accounts with the required permissions, as well as manually add our subnets to the provisioned filesystem). I based our solution on this project.
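
For illustration, here is a minimal sketch of the simpler, static alternative to the provisioner: registering the EFS filesystem directly as an NFS-backed PersistentVolume using the official Kubernetes Python client. The filesystem DNS name and capacity are hypothetical (EFS capacity is effectively unlimited, but Kubernetes still wants a nominal figure); the provisioner automates this, but the underlying NFS mount is the same:

```python
from kubernetes import client, config

config.load_kube_config()

pv = client.V1PersistentVolume(
    api_version="v1",
    kind="PersistentVolume",
    metadata=client.V1ObjectMeta(name="efs-home"),
    spec=client.V1PersistentVolumeSpec(
        capacity={"storage": "100Gi"},          # nominal; EFS doesn't enforce it
        access_modes=["ReadWriteMany"],         # NFS lets many pods mount it at once
        persistent_volume_reclaim_policy="Retain",
        nfs=client.V1NFSVolumeSource(
            # Hypothetical EFS DNS name
            server="fs-12345678.efs.ap-southeast-2.amazonaws.com",
            path="/",
        ),
    ),
)

client.CoreV1Api().create_persistent_volume(pv)
```

Pods (RStudio, Jupyter, etc.) can then share the home directory by binding a ReadWriteMany PersistentVolumeClaim to this volume.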

This is almost exactly how computers don’t work

Compute

Kubernetes supports running arbitrary compute workloads out of the box. The industry standard for processing large datasets through a programmatic interface is Spark. While Spark 2.3 introduced native Kubernetes support, it doesn’t yet include R or Python bindings for that mode. For the time being, we are deploying Spark Standalone into Kubernetes (a subject for another post) so we can continue to use our usual RStudio and Jupyter workflows.
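
From a notebook, connecting to that standalone cluster is just a matter of pointing the Spark master URL at the right Service. A minimal PySpark sketch, assuming a hypothetical spark-master Service on the default port and a notebook pod running in the same cluster:

```python
from pyspark.sql import SparkSession

# "spark-master" is a hypothetical Kubernetes Service name for the
# Spark Standalone master, exposed on the default port 7077.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .appName("notebook-session")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)

# Work is now distributed across the standalone workers rather than
# running inside the notebook pod itself.
print(spark.range(100_000_000).selectExpr("sum(id)").first())
```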

We’re also looking into Pachyderm as a general storage and compute solution for data science as it has some compelling features such as end-to-end data & code traceability (which they call provenance).

For TensorFlow training, we’re also looking at Kubeflow. It allows simple provisioning of TensorFlow clusters within Kubernetes, so we can make use of spare capacity or simply scale out training compute.

Conclusion

While Kubernetes is rapidly becoming the de facto standard platform for cloud computing, many tools are not quite there yet. Once Spark provides Python and R bindings for its native Kubernetes mode and EFS becomes better supported, I think we will have a great ecosystem to build upon.
