Achieving flexible, scalable and cost-effective data science workloads

Weston Bassler
The Emburse Tech Blog
Jul 28, 2022

How we built a best-practice architecture and tech stack


Transforming an organization into a data-driven one requires making data science tools self-service and collaborative across all engineering disciplines, from image processing to workflow automation. However, data science is complicated. It requires unique workloads composed of many different components that usually demand heavy compute and storage, which gets extremely expensive very quickly. From experimentation in notebooks to automated machine learning pipelines, data science workloads need to be flexible in order to meet the needs of different frameworks and engineer preferences.

In this post we will discuss our Data Science Platform here at Emburse and how we achieve flexible, scalable and cost-effective data science workloads. We will describe the architecture that makes this possible, along with the technology stack surrounding it.

Emburse Data Science Platform

Architecture Overview

We are a very heavy AWS shop at Emburse. Very early on we decided, like most organizations, to take the plunge into Kubernetes. Not because it’s what all the cool kids are doing, but because it has so much to offer us. Being an AWS shop, it’s no surprise that we went with EKS instead of rolling our own variation of Kubernetes.

I won’t be giving a primer here on “What is Kubernetes” because that can be found on countless sites across the web. What I want to focus on are the benefits we reap from the Kubernetes architecture, and from using EKS in particular.

Scalability and Flexibility

Node Groups

Our Data Science team currently needs to run large Spark clusters for training, data analysis and experimentation (PySpark). We also need GPU nodes for training Deep Learning models. We additionally provide the ability to run other frameworks, such as Dask, to help with analysis and training, although most engineers currently prefer Spark.

We are currently able to handle the scaling requirements for these workloads by using a Node group for each of our different workloads. Node groups, paired with Cluster Autoscaler (discussed more later), allow us to provide a diverse set of Nodes, each serving a specific set of workloads. It’s within these Node groups that we define hardware requirements (CPU, memory, GPU, disk, etc.), whether instances are On-Demand or Spot, Node labels and, most importantly, taints and tolerations. Taints and tolerations let us get the most out of these nodes by allowing ONLY a specific workload to run on them.

Let’s take a look at our Spark Executor Node group as an example:

apiVersion: v1
kind: Node
metadata:
  labels:
    node.kubernetes.io/instance-type: m6i.4xlarge
    owner: datascience
    role: spark-executor
spec:
  taints:
  - effect: NoSchedule
    key: dedicated
    value: spark-executor

Above is a snippet of a Spark Executor node. As you can see, this particular node is given a specific “role” label and a “NoSchedule” taint with a specific key and value. You can also see the instance-type label, which fulfills the hardware requirements. With this Node group, we can then use a nodeSelector along with a matching toleration within a pod, deployment, statefulset, job, etc. This ensures that only the Spark Executor workload runs on that particular node when a pod is created. Also important to note is what you specify for resource requests and limits, but more on that later when we discuss Cluster Autoscaler. Below is an actual podTemplate we use for this Spark Executor Node group:

apiVersion: v1
kind: Pod
metadata:
  labels:
    role: spark-executor
spec:
  nodeSelector:
    role: spark-executor
  tolerations:
  - key: "dedicated"
    operator: Equal
    value: "spark-executor"
    effect: "NoSchedule"

High Availability and Pod Autoscaling

As a team, we are responsible not only for building and training models but also for serving them. This can be done in batch or by a long-lived API service. As a lean team, we are also responsible for building any API that needs to sit in front of our models, since some preprocessing is sometimes needed. These APIs need to be highly available and reliable under heavy load. Kubernetes gives us two key components for high availability: self-healing pods and Horizontal Pod Autoscaling.

If a Node or a pod goes into an “Unhealthy” state, Kubernetes handles redeploying our service until a “Healthy” state is reached, automatically and with no interaction from a user. This applies during deployments as well: Kubernetes will never allow too many of our pods to be unavailable before terminating the older ones.

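The Deployment fields behind this behavior are worth calling out. Below is a minimal, hypothetical Deployment sketch (the image, port and health endpoint are placeholders, not our actual service): a readiness probe so traffic only reaches healthy pods, and a rolling-update strategy that refuses to take an old pod away before its replacement is ready.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model
  namespace: ml-models
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-model
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never remove an old pod before its replacement is ready
      maxSurge: 1
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
      - name: ml-model
        image: ml-model:latest        # hypothetical image
        ports:
        - containerPort: 8080
        readinessProbe:               # gates traffic until the pod reports healthy
          httpGet:
            path: /healthz            # hypothetical health endpoint
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
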
When our model services are under heavy load and need additional pods to handle it, we rely on the Horizontal Pod Autoscaler (HPA). We scale these pods on CPU utilization because of the computation required when our models make predictions. Below is an example HPA for one of our models:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model
  namespace: ml-model
  labels:
    app: ml-model
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model
  minReplicas: 3
  maxReplicas: 15
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

When average CPU utilization approaches 70%, the HPA deploys additional pods to handle the CPU load and scales back down once utilization drops.

IAM Roles for Service Accounts

As mentioned, we are a heavy AWS shop. Our team utilizes not only EKS but also S3, EFS and ECR, and we write data to services like DynamoDB and RDS. EKS makes this a very simple interaction and grants permissions to these services through IAM Roles for Service Accounts (IRSA). Each EKS cluster gets its own OIDC provider, and by creating IAM policies and attaching them to IAM roles, Kubernetes service accounts are able to assume those roles. The roles then provide the necessary permissions to the other AWS components. This removes the need to create AWS secrets and access keys whenever a pod needs permission to a different AWS service.

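Concretely, the wiring is just an annotation on the Kubernetes service account. The sketch below is illustrative only; the role name and account ID are placeholders:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: ml-model
  namespace: ml-models
  annotations:
    # IRSA: pods that use this service account can assume the annotated IAM role
    eks.amazonaws.com/role-arn: arn:aws:iam::<account-id>:role/ml-model-irsa

Pods then reference the service account through serviceAccountName in their spec and pick up temporary AWS credentials automatically.
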
Cost Savings

Scale from 0 and Spot Instances

These heavy Spark workloads and GPU nodes only need to be running as long as they are needed. With the EKS node groups mentioned above, we can scale up from 0 instances and back down to 0 once the workload is done. This is a big win for us because our biggest costs come from infrastructure, and it ensures we only use resources when they are needed.
Some of our workloads, such as Spark Executors, can even run on AWS Spot instances instead of On-Demand. If a Spot node gets terminated, Spark can recover the executor and respawn the pod on another node without killing the workload.

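Our Ops team manages these node groups with the EKS Terraform module, so the sketch below is only an eksctl-style illustration of the settings involved (names, region and sizes are hypothetical): scale-to-zero comes from minSize: 0, Spot capacity from spot: true, and the labels and taints match the Node group shown earlier.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ds-platform            # hypothetical cluster name
  region: us-east-1            # hypothetical region
managedNodeGroups:
- name: spark-executor-spot
  instanceTypes: ["m6i.4xlarge"]
  spot: true                   # run executors on Spot capacity
  minSize: 0                   # scale to zero when no Spark jobs are running
  maxSize: 20
  desiredCapacity: 0
  labels:
    owner: datascience
    role: spark-executor
  taints:
  - key: dedicated
    value: spark-executor
    effect: NoSchedule
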
Jupyterhub

Our team has been working diligently to move off of Databricks and onto Jupyterhub for experimentation and Notebooks. Jupyterhub has proven time and time again to provide the most flexible, customizable and reliable Notebooks for our team. We create individual profiles that include different frameworks for running things like Spark clusters and Deep Learning, and for interacting with our experiment tracking system. We even place the Jupyterhub instances inside a Node group and scale from 0 instances so the Notebooks are only online when users are using them. From a security perspective, this also keeps all of our code and data in house.

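For reference, both the profiles and the dedicated-node scheduling can be expressed through the Zero to JupyterHub Helm chart values. The snippet below is a hedged sketch rather than our exact configuration; the images are placeholders and the jupyter label/taint is an assumed naming convention following the pattern above.

singleuser:
  nodeSelector:
    role: jupyter                    # assumed label on the dedicated Jupyterhub node group
  extraTolerations:
  - key: dedicated
    operator: Equal
    value: jupyter
    effect: NoSchedule
  profileList:
  - display_name: "PySpark"
    description: "Notebook image with Spark-on-Kubernetes client libraries"
    kubespawner_override:
      image: <registry>/notebooks/pyspark:latest    # hypothetical image
  - display_name: "Deep Learning (GPU)"
    description: "CUDA-enabled image scheduled onto the GPU node group"
    kubespawner_override:
      image: <registry>/notebooks/gpu:latest        # hypothetical image
      extra_resource_limits:
        nvidia.com/gpu: "1"
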
Bin Packing

Because we use Cluster Autoscaler, we are able to “bin pack” as many pods onto worker nodes as possible. If the cluster becomes starved for resources, Cluster Autoscaler handles adding additional nodes to meet the resource needs. It adds ONLY the amount of resources needed in that moment and removes any nodes that are no longer needed.

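This is also where the resource requests and limits mentioned earlier come in: the scheduler and Cluster Autoscaler plan entirely around requests, so setting them realistically is what makes tight bin packing possible. Below is a hedged sketch of the Spark Executor podTemplate from earlier with resources added; the container name, image and numbers are illustrative, not our production values.

apiVersion: v1
kind: Pod
metadata:
  labels:
    role: spark-executor
spec:
  nodeSelector:
    role: spark-executor
  tolerations:
  - key: "dedicated"
    operator: Equal
    value: "spark-executor"
    effect: "NoSchedule"
  containers:
  - name: spark-executor              # illustrative container name
    image: <registry>/spark:latest    # hypothetical image
    resources:
      requests:                       # what the scheduler and Cluster Autoscaler plan around
        cpu: "3500m"
        memory: 14Gi
      limits:
        cpu: "4"
        memory: 16Gi

With requests sized like this, roughly four executors fit on an m6i.4xlarge (16 vCPU, 64 GiB), and the autoscaler only adds a node when a pending pod truly cannot fit anywhere.
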
Open Source Software

All components and products throughout our stack are open source. The only cost we incur is the infrastructure needed to run our workloads. Most vendors charge a fortune for comparable products, but we are able to achieve all of our work by utilizing and contributing to open source software.

Our Support Model

It would be a disservice not to mention the support model we have here at Emburse, as it contributes greatly to our success. We have a very skilled Operations team that is responsible for infrastructure and infrastructure-level components running on EKS. For example, the Ops team uses the EKS Terraform Module from AWS to manage our EKS clusters; the repo for this is maintained and deployed by the Ops team. This team also maintains cluster-wide components such as Cluster Autoscaler and Gloo Edge.

This support structure is really awesome for our team and takes so much of the burden away from us. We are able to focus only on the things necessary for our team, without having to worry about the underlying infrastructure, logging, metrics gathering and so on; we utilize their offerings instead of providing our own. We would need a much larger team if it were not for them.

Components Overview

In this section we will discuss the Data Science stack and the components we currently use for experiment tracking, training pipelines, the model registry, the feature store, deployment and more. All of these components were carefully chosen for simplicity of management and based on lessons learned. We have been developing the stack for more than a year now, and the way we work is constantly changing as we gain experience. The biggest thing we have learned for certain is that changing the way we work has been easy thanks to the flexibility of the platform.

Deployment

We take a very heavy “GitOps” approach to managing all of our components and the custom apps that we build. Code is, and should be, the source of truth for everything. To help us with this we chose ArgoCD. With ArgoCD we create application definitions that tie to a git repository and watch for changes on a specified branch. When ArgoCD detects a change in the desired state (the application becomes “OutOfSync”), it deploys those changes automatically. Most of our manifests are written with Kustomize so that we can easily reuse common files across all environments and only modify the files specific to each environment. Below is an example application manifest for one of our models:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-models
  namespace: argo-system
  finalizers:
  - resources-finalizer.argocd.argoproj.io
spec:
  project: datascience
  source:
    repoURL: git@github.com:emburse/ModelRepo.git
    targetRevision: HEAD
    path: $OVERLAYS/production
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-models
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: true
    syncOptions:
    - Validate=true
    - CreateNamespace=true
    - PrunePropagationPolicy=foreground
    - PruneLast=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

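For context, the $OVERLAYS path above points at a Kustomize overlay. A minimal, hypothetical production overlay might look like this (the patch file name and image tag are placeholders):

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: ml-models
resources:
- ../../base                   # shared manifests reused by every environment
patchesStrategicMerge:
- replica-count.yaml           # hypothetical production-only override
images:
- name: ml-model               # hypothetical image name referenced in the base
  newTag: "1.4.2"
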
Training Pipelines

Keeping with the same ecosystem, we decided to use Argo Workflows as our tool for automated pipelines, building Docker images, uploading scripts and files to S3, and anything else that comes up that needs to be automated. After a short period of research it was easy to see why organizations choose Argo Workflows; it also has close ties to Kubeflow, a popular tool that data science teams use for training and serving models. Below is an image of an example training pipeline:

Argo Workflows Pipeline

From the image you can see that our pipeline takes the following steps (a simplified Workflow sketch follows the list):

  1. Preprocess our training data.
  2. Train. In this case we train multiple models at the same time.
  3. Evaluate the best model and register it to the Model Registry.
  4. Build a Docker image with the new features and code changes.
  5. Deploy the new code and model from the Model Registry to our development environment, based on the feature branch being used (via ArgoCD).
  6. Run some tests.

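To make the shape of the pipeline concrete, here is a simplified, hypothetical Argo Workflow covering the first three steps (preprocess, parallel training, evaluate/register). The template names, images and commands are illustrative only and do not reflect our actual pipeline code:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: model-training-
  namespace: argo-workflows                    # hypothetical namespace
spec:
  entrypoint: train-and-register
  templates:
  - name: train-and-register
    dag:
      tasks:
      - name: preprocess
        template: preprocess
      - name: train-model-a
        template: train
        dependencies: [preprocess]
        arguments:
          parameters:
          - name: model
            value: model-a
      - name: train-model-b
        template: train
        dependencies: [preprocess]
        arguments:
          parameters:
          - name: model
            value: model-b
      - name: evaluate-and-register
        template: evaluate
        dependencies: [train-model-a, train-model-b]
  - name: preprocess
    container:
      image: <registry>/pipeline:latest        # hypothetical image
      command: [python, preprocess.py]
  - name: train
    inputs:
      parameters:
      - name: model
    container:
      image: <registry>/pipeline:latest        # hypothetical image
      command: [python, train.py, "--model", "{{inputs.parameters.model}}"]
  - name: evaluate
    container:
      image: <registry>/pipeline:latest        # hypothetical image
      command: [python, evaluate.py]
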
R&D and Experimentation

As mentioned above, we have been working on migrating from Databricks to our own self-hosted Jupyterhub. Jupyterhub provides our team with customized and reliable Notebooks, with all of the Python libraries and integrations into our other tools as necessary. Many users have already migrated and now use Jupyterhub for all of their experiments, and they have been very satisfied with its reliability and ease of use. Below is an image of some of our provided profiles:

Jupyterhub Server Options

Experiment Tracking

Recently we made the decision to move away from MLFlow in favor of Aimstack. The Aimstack architecture fits much better with our Kubernetes architecture and is much more portable. Aimstack comes with a snappy, nice-looking UI that lets users query runs for metrics, and it exposes the same functionality through an easy-to-use SDK. We found that users learned to query metrics much more easily with Aimstack than with MLFlow, and it has native support for running on Kubernetes. We host both a remote tracking server and the UI in the same instance. I highly recommend checking out the live demos on their site and this repo for trying it out yourself.

Model Registry

Making the decision to move away from MLFlow means we also move away from using it as a Model Registry. We have had several issues over the last year when upgrading versions of MLFlow where certain APIs broke, and it became too much of a task to manage. We were not able to easily keep our dev and prod MLFlow registries in sync and became fearful of not being able to recover easily from a disaster. We also learned that MLFlow is likely more than what we need: we don’t need to register hundreds of models a month across dozens of teams and users.

As a replacement, we decided to build our own Model Registry on top of Artifactory, which is managed by our Ops team. This allows greater flexibility and control over how we version control our models. For most of our models, we have learned that the easiest way to version them is to bake them directly into the Docker image. This is, of course, dependent on the size of the model, because we would like to keep our Docker images as light as possible.

Feature Store

This is still under investigation and had not been an immediate need until recently. We are finding that, as we work with other teams to train models, there are differences in the sources of data being used. A feature store would ensure that our features are the same across teams as well as across newer versions of models.

The main feature store being considered is Feast. Feast integrates with many of the products we already use, such as Snowflake and S3.

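If we do adopt Feast, its configuration lives in a single feature_store.yaml. The sketch below is purely illustrative, assuming the Snowflake offline store and a DynamoDB online store; the project, registry path and Snowflake names are placeholders, and credentials would be supplied through our own secrets handling:

project: emburse_ds                        # hypothetical project name
registry: s3://<bucket>/feast/registry.db
provider: aws
offline_store:
  type: snowflake.offline
  account: <snowflake-account>
  user: <user>
  role: <role>
  warehouse: <warehouse>
  database: <database>
online_store:
  type: dynamodb
  region: us-east-1
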
Automated ML

Automating the training of new models is something our team is pushing to accomplish this year. This of course becomes much easier with a feature store. The idea is to kick off a training pipeline on a schedule or as new features are added. Our training pipelines would then train new models with new data and automatically deploy a new version for us to begin testing.

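Since the pipelines already run on Argo Workflows, the time-based trigger could be a CronWorkflow that references the training pipeline as a WorkflowTemplate. The sketch below is an assumption about how we might wire this up rather than something we run today; the names and schedule are placeholders:

apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: nightly-model-training
  namespace: argo-workflows          # hypothetical namespace
spec:
  schedule: "0 2 * * *"              # 02:00 UTC daily
  concurrencyPolicy: Replace         # never run two trainings at once
  workflowSpec:
    workflowTemplateRef:
      name: model-training           # hypothetical WorkflowTemplate for the pipeline above
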
Conclusion

The Data Science Platform at Emburse is a suite of centrally managed open source applications and tools that are widely accessible across the engineering organization. Although the data science team focuses much of its effort on data products and services that delight and empower customers, the platform gives engineers across the organization access to tools, compute, and existing frameworks and machine learning approaches, accelerating innovation and the Emburse mission of humanizing work everywhere.

Sound interesting? Learn more about us and check out our open positions at emburse.com/careers.
