Encourage Data Scientists to Smile

Hybrid ML Ops with HashiCorp Nomad

--

The ubiquity of technology has made professionals with skills in data science, artificial intelligence (AI), machine learning (ML), and deep learning (DL) highly sought after. Whether due to its novelty, or perhaps because its career paths are more business-centric than those in AI, ML, and DL, we see the job of a data scientist as the marquee position to hold in the 21st century [1, 2].

As it turns out, solving hard problems, working with smart people, and getting paid well to do so are attractive perks for those looking to break into data science! However, the lack of infrastructural support and best practices around operationalization, compounded by unrealistic expectations of an individual’s capabilities and the immaturity of the hiring organization, is a powerful force that pushes data scientists out the door before they begin to make any impact [1].

ML Ops intends to fortify the infrastructure and CI/CD surrounding ML models and, ultimately, aims to automate the deployment of those models to production [3, 4, 5, 6, 10, 11]. Organizations that hire — or are looking to hire — data scientists to mine, engineer, and analyze data, write model code, and implement the operational processes that get that code to end users should establish team requirements for effective data practice [4, 6, 10]. Furthermore, it is imperative for organizations to understand where data scientists can drive the most value — represented by the green box in Figure 1. Once a capable team is in place, organizations need tools and workflows to drive value with ML Ops implementations.

Figure 1: the sweet spot for data science.

HashiCorp’s Nomad is a federated workload scheduler whose intrinsic flexibility makes it an asset for building ML pipelines — especially in hybrid environments. In the spirit of this post’s title, we see it as a tool capable of bringing smiles to the faces of data scientists everywhere. We will explore Nomad in this role, and as a component of a fully automated and integrated ML Ops practice.

The Project

As best practices in ML Ops are still emerging [3, 4, 6, 10], we make no attempt to build an end-to-end (E2E) solution here. Instead, we want to demonstrate where Nomad can fit amidst the moving parts of a fully automated E2E ML pipeline. To do this, we need an initial project (represented in Figure 2) we can dirty our hands with: a pipeline for training and deploying model versions.

Figure 2: a pipeline for training and deploying model versions.

The subtasks of this project are:

  • Pull data from an external source
  • Transform the data into a usable form
  • Build a model in accordance with our overall ML goal
  • Train the model
  • Test the model against a subset of our data to perform a pseudo integration test on the model’s effectiveness.
  • Store the model (versioned) for future reference and use

Although we capture the predictions on the subset, we are most interested in saving a versioned model to a cloud bucket so we can leverage it from other sources in the future. The circle at the far right of Figure 2 represents our desire to iterate on this design as we move forward with an E2E solution.

The pipeline for our project is orchestrated via Nomad (this is why Nomad sits between the data and code layers of Figure 2). The reference architecture is a helpful guide for the system requirements of a cluster. For this project, we used the smallest recommended n1 series VMs in GCP. It is up to you whether you point and click the cluster into existence or leverage a provisioning tool. As an example, we used the Terraform configuration here.
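
For those following the Terraform route, the commands below are a rough sketch of what the provisioning step could look like; the directory name and variable names are illustrative placeholders, not the actual layout of the linked configuration.

# A minimal provisioning sketch, assuming the linked Terraform configuration
# has been cloned locally; adjust the variables to your own project.
cd nomad-gcp-cluster
terraform init
terraform apply -var="project=<gcp-project-id>" -var="machine_type=n1-standard-1"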

The model we trained in our example takes a historical sample space of closed sales data and scores new sales opportunities on their likelihood of closing, on a scale from 0 to 1. We use Salesforce to capture data about opportunities that closed in the past; the code for interacting with the Salesforce API leverages jsforce and is hosted in GCP Cloud Functions.

The specific model does not matter here because we are ultimately concerned with the operations required to deploy it. It is important to note that we used TensorFlow.js and the Node.js runtime to build the necessary feature engineering, model training, and model code. We deployed these components into a Docker container available here. You can read more about Docker as it pertains to ML modeling and testing in [7, 8, 9].

The Nomad Config

The file we used to define the Nomad job specification is as follows:

job "sales-close-prediction" {  datacenters = ["dc1"]
type = "batch"
// Send payload parameters that are passed in
// Could also use a file here
parameterized {
meta_required = [
"save_path",
"data_path",
"prediction_path",
"data_url",
"gcp_creds",
"gcp_bucket_path",
"model_version"
]
meta_optional = ["batch_size", "epochs"]
}
group "trainAndSaveModel" {
count = 1
volume "models" {
type = "host"
source = "models"
}
task "train" {
driver = "docker"
// Expose the versioned model data via expected directory
// See dockerfile for specifications
volume_mount {
volume = "models"
destination = "/usr/src/app/models"
}
resources {
cpu = 3054
memory = 1024
}
lifecycle {
hook = "prestart"
sidecar = false
}
env {
VERSION = "${NOMAD_META_model_version}_${NOMAD_ALLOC_ID}"
SAVE_PATH = "${NOMAD_META_save_path}"
DATA_PATH = "${NOMAD_META_data_path}"
DATA_URL = "${NOMAD_META_data_url}"
PREDICTION_PATH = "${NOMAD_META_prediction_path}"
}
config {
image = "joshuanjordan/tensornode:0.7.1"
}
}
task "await_Model" {
driver = "docker"
volume_mount {
volume = "models"
destination = "/tmp"
}
// Waiting for the shared volume to be ready before we move on
config {
image = "busybox:1.28"
command = "sh"
args = [
"-c",
"echo -n 'Waiting for models at \"${env["NOMAD_META_save_path"]}\"'; until [ -d /tmp/\"${env["NOMAD_META_save_path"]}_v${env["NOMAD_META_model_version"]}_${env["NOMAD_ALLOC_ID"]}\" ] 2>&1 >/dev/null; do ls /tmp; sleep 2; done"
]
}
resources {
cpu = 200
memory = 128
}
lifecycle {
hook = "prestart"
sidecar = false
}
}
// Storage related activity
task "storeModel_GCP" {
leader = true
driver = "exec" env {
TARGET_DIR = "${NOMAD_META_save_path}_v${NOMAD_META_model_version}_${NOMAD_ALLOC_ID}"
}
volume_mount {
volume = "models"
destination = "/tmp"
}
config {
command = "/bin/bash"
// Store the versioned model
args = [
"-c",
"gsutil cp /tmp/${env["TARGET_DIR"]}/model.json gs://${env["NOMAD_META_gcp_bucket_path"]}/${env["TARGET_DIR"]}.json"
]
}
}
}
}

The Nomad job specification reference is helpful for those unfamiliar with the stanza definitions. In this example, we can see that the “sales-close-prediction” job is of type batch and consists of a group (trainAndSaveModel) of three tasks: train, await_Model, and storeModel_GCP.

Each task runs in an execution environment known as a task driver. The train and await_Model tasks run with the docker driver, and storeModel_GCP runs with the exec driver. As a prerequisite, Docker was installed on our Nomad clients during the provisioning process. Because GCP VMs come with the Google Cloud SDK (and therefore gsutil) baked in, there is no reason to pre-install it. However, if some other binary is needed by the exec driver, that binary must be on the Nomad client beforehand.
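
As a quick sanity check, the snippet below (our addition, not part of the original provisioning scripts) confirms on a Nomad client that both drivers’ dependencies are in place:

# Run on a Nomad client before dispatching the job
docker --version   # required by the train and await_Model tasks (docker driver)
gsutil version     # required by the storeModel_GCP task (exec driver)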

Under normal circumstances, the only guarantee Nomad offers us — beyond any placement preferences we express with the affinity parameter — is that all of the tasks grouped together will run on the same VM when the job is deployed, meaning the tasks run in parallel once the job is placed. We need these tasks to run in sequence, so we’ve added lifecycle hooks to achieve that ordering.

For train and await_Model, we’ve leveraged the prestart hook without a sidecar. This stanza tells Nomad that both of these tasks will run in parallel before the main task starts and that they will not restart once finished. The main task, storeModel_GCP, is identifiable to Nomad via the leader = true parameter in its task specification. Nomad places no limit on how many tasks can carry lifecycle hooks, but, as of now, there is no way to specify an order among tasks that share a hook. Therefore, we configure one of our prestart tasks, await_Model, to wait for a model file to exist before it exits, which in turn gates the start of the storeModel_GCP task.

Sequential behavior is glued together by the use of shared volumes and volume mounts, which are specified at the group and task levels, respectively. At the group level, we are pointing to a pre-existing directory path on the Nomad client where we can share files across tasks. We can do similar things with Nomad’s built-in local and allocation task directories. However, since we are using the docker driver for two of our tasks, the ability to mount a shared volume into a container, use it, and then use it elsewhere, outside of a container, affords us less complexity and more consistency across various driver runtimes (i.e. the exec task copies files shared in the volume and stores them in a GCP bucket). The value of this capability is even greater at hybrid scale across cloud environments.
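
For the group-level volume to resolve, each Nomad client also needs a matching host_volume block in its configuration. The sketch below is ours; the host path and config file location are assumptions for illustration, not the values used in our cluster.

# Create the shared directory and register it as a host volume named "models",
# then restart the Nomad agent so the client picks up the new config
sudo mkdir -p /opt/nomad/models
cat <<'EOF' | sudo tee /etc/nomad.d/volumes.hcl
client {
  host_volume "models" {
    path      = "/opt/nomad/models"
    read_only = false
  }
}
EOF
sudo systemctl restart nomad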

The last thing to note about the Nomad configuration is the parameterized stanza. This tells Nomad that, when we run the job, we do not want it to execute immediately; rather, we want it queued so it can be invoked later with arguments — much like a function call. When we actually call the job, we dispatch it and pass metadata along. The metadata is interpreted at runtime, so we can reference it in the job file via interpolation.

Because we want the ability to run more than one of these jobs in parallel (e.g., training and storing 1,000 versions of a model), we use the $NOMAD_ALLOC_ID variable as a unique identifier. If we change count = 1 to count = 2 in the group stanza, Nomad generates two allocation IDs, each referenced via the $NOMAD_ALLOC_ID variable and interpolated across that allocation’s tasks. Without going too deep into the details of the model code, know that the save path for the generated models is models/$VERSION_$NOMAD_ALLOC_ID/model.json. We do this because we do not want 1,000 executions of this job to store only one model version. Without a unique identifier, we cannot store multiple versions of the model, and we cannot meet the primary objective of our project.

The Steps

Now that we’ve uncovered the important options of our Nomad configuration file, we can perform the necessary steps to run the job defined. We already have an environment spun up and can use the UI (Figure 3) or the CLI (Figure 4) to verify all is well.

Figure 3: UI of Nomad
Figure 4: CLI verification of Nomad status
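
For reference, the CLI check in Figure 4 boils down to a pair of status commands:

nomad server members   # every server should report "alive"
nomad node status      # every client should report "ready"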

Now that we know Nomad is up and running, we can plan the job.

Figure 5: plan salesClosePrediction job and associated output
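
Assuming the job specification above is saved as sales-close-prediction.nomad (the file name is our choice), the plan step in Figure 5 looks like this:

nomad job plan sales-close-prediction.nomad
# The plan output ends with a suggested command of the form:
#   nomad job run -check-index <index> sales-close-prediction.nomad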

It’s not a requirement to run the job with the version verification (the check-index command suggested by the plan output), but it’s a reasonable copy/paste step to follow along with.

Figure 6: nomad run (index) with associated job output
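
The run in Figure 6 then uses the check-index printed by the plan; if the job has changed since it was planned, Nomad rejects the submission:

nomad job run -check-index <index> sales-close-prediction.nomad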

The job was planned successfully and is running. However, because it’s a parameterized job, it doesn’t execute immediately. Remember, when we leverage a parameterized job, we expect to pass arguments to the job when we dispatch it. We’ve already abstracted the dispatch into a shell script.

#!/usr/bin/env bash
VERSION=$1
CREDS=$(cat /path/to/gcp/creds)

nomad job dispatch \
  -meta save_path=trainedWinPrediction \
  -meta data_path=./opportunityHistory.json \
  -meta data_url=<gcp cloud function url> \
  -meta prediction_path=./predictions.txt \
  -meta model_version="${VERSION}" \
  -meta gcp_creds="${CREDS}" \
  -meta gcp_bucket_path=<gcp-bucket-name> \
  sales-close-prediction

This crude script takes a single argument: $VERSION. We do this because we may want to tag our dispatches with the version we are training. This is true even if we run 1000 parallel allocations — we still want to reference the version we ran the batches against later on.

Next, we can dispatch the job (Figure 7). Although we could monitor the allocations and the status of the job from the CLI, the Nomad UI (Figure 8) is a cleaner panel to view the progress of the job.

Figure 7: the dispatch of the job
Figure 8: the sales-close-prediction job running in the Nomad UI
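
Assuming the dispatch script above is saved as dispatch.sh (the file name is ours), kicking off a run for version 1 and checking on it from the CLI looks roughly like this:

./dispatch.sh 1
nomad job status sales-close-prediction   # lists the dispatched (child) jobs spawned by the parameterized parent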

Earlier we discussed how the $NOMAD_ALLOC_ID variable is used to identify separate runs of the same model version code. We can see this in action in Figure 9: both allocations run the same job group because they come from the same job file — we specified this behavior by changing count = 1 to count = 2 in the Nomad configuration before we planned, ran, and dispatched the job.

Figure 9: the two allocations associated with our job groups

By clicking on either of the allocations, we can peer into the lifecycle status, as well as the resource utilization, for that allocation. Remember that prestart tasks run before the main task, so the green shading on train and await_Model indicates that both are running — in parallel — before the storeModel_GCP task starts. Once the train task finishes, await_Model discovers the newly written model file in the shared volume and exits, which allows the storeModel_GCP task to copy the file into the GCP bucket specified in the dispatch script (Figure 11). On the utilization side, we can see that the memory footprint should be adjusted before subsequent runs of the job.

Figure 10: the task lifecycle (prestart) & main
Figure 11: the models deployed into GCP — same version, different allocation ids.
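
The same lifecycle and utilization details are available from the CLI if the UI isn’t handy; for example:

nomad alloc status <allocation-id>        # task states, lifecycle events, and resource usage
nomad alloc logs <allocation-id> train    # stdout from the train task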

Moving Forward

We have met our goal. The project isn’t a production-ready E2E pipeline, but it does show Nomad as a player in the ML Ops process. In Figure 13, we see that Nomad has a target area (pipelines) where simple orchestration is valuable and the flexibility to operate the same way across multiple environments is needed. Because federation is a core feature, Nomad can serve as the workload scheduler of choice — if the product develops the appropriate features.

Nomad is a simple, federated workload scheduler that scales across any environment. For an emerging field like ML Ops, where the underlying work is already difficult and tough to coordinate, it makes little sense to adopt tools that do not narrow the complexity of operations. The tooling and workflows that organizations adopt must support whatever best practice looks like in the future. It’s too early to say exactly where Nomad fits in the completed puzzle of ML Ops, but it is clear that its role is valuable if simplicity, flexibility, and smiling data scientists are sought after.

Figure 13: Where Nomad fits in the overall picture of ML Ops.

Other Considerations Regarding Nomad

CSI plugins are currently in beta, but they add to Nomad the ability to share volumes across jobs. This would be a tremendous value add if we took our example and shared its outputs across public and private clouds, alongside a myriad of client services (e.g., visualizations and dashboards) that need to consume them. Scaling, another beta feature, allows us to preemptively adjust our cluster’s scale via the job specification. Together, these enable auto-piloted hybrid ML Ops at any scale.
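
Both features were still in beta at the time of writing, so treat the commands below as a rough sketch of the CSI workflow rather than a settled interface; models.hcl is a hypothetical volume definition file.

nomad plugin status               # confirm the CSI plugin is registered and healthy
nomad volume register models.hcl  # register a volume backed by that plugin
nomad volume status               # verify the volume is available to jobs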

References

[1] 2018. Here’s why so many data scientists are leaving their jobs. Medium. Johnny Brooks-Bartlett. Retrieved from https://link.medium.com/PkHeYkYiF7.

[2] 2020. Data Scientists Are The New Investment Bankers. Medium. Chris I. Retrieved from https://link.medium.com/HFYMyjbpF7.

[3] 2020. The Emergence of ML Ops. Cognitive World. Ron Schmelzer. Retrieved from https://www.forbes.com/sites/cognitiveworld/2020/03/08/the-emergence-of-ml-ops/

[4] 2020. ML Ops: Machine Learning as an Engineering Discipline. Medium. Cristiano Breuel. Retrieved from https://link.medium.com/AgwHVMiM97

[5] MLOps: Continuous delivery and automation pipelines in machine learning. Google. Retrieved from https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

[6] Rules of Machine Learning: Best Practices for ML Engineering. Google. Martin Zinkevich. Retrieved from https://developers.google.com/machine-learning/guides/rules-of-ml

[7] 2019. Build a Docker Container With Your Machine Learning Model. Medium. Tina Bu. Retrieved from https://link.medium.com/OskIj8OEB7

[8] 2020. Deployment could be easy — A Data Scientist’s Guide to deploy an Image detection FastAPI API using Amazon. Medium. Rahul Agarwal. Retrieved from https://link.medium.com/Vt9u2CtM97

[9] 2020. Don’t Learn Machine Learning. Medium. Caleb Kaiser. Retrieved from https://link.medium.com/fCzikiCM97

[10] 2018. What is ML Ops: Best Practices for DevOps for ML. Google Next. Retrieved from https://youtu.be/_jnhXzY1HCw

[11] 2019. ML Ops Best Practice on Google Cloud. Cloud Next. Retrieved from https://youtu.be/20h_RTHEtZI
