What do DevOps and Machine Learning Have in Common?

James Dobson
Speechmatics
May 16, 2019 · 4 min read


What do DevOps and machine learning have in common?

  • Pipelines
  • Data storage/manipulation
  • Orchestration
  • Scaling
  • Packaging

All of these appear both in traditional microservices and in building/training a machine learning component. Any DevOps engineer who has dealt with these issues can relate well to the areas where ‘Big Data’ needs scalable solutions.

Data Warehousing/Data Lakes


There are many solutions to the problem of storing large volumes of data for machine learning; it’s no different from the storage problems faced by large-scale web applications, social media sites, or Twitter-like feeds. The problems may be explained in different terms, and different toolkits exist to help, but they are essentially the same.

“I need storage that scales, can perform well at task X, and that supports some kind of metadata.”

Traditional DevOps Problems

“How do I store all these images uploaded by our user base in a way that allows me to return/display them in the PB range?”

“How do I query all our metadata to work out what other people have bought when looking at this item?”

Traditional Machine Learning Problems

“How do I store all this information in a way that gives me relatively quick access in the PB range?”

“How do I transform my data so that I can train on it while in the PB range?”

Toolkits

Machine learning toolkits like TensorFlow, Spark, Torch and Kaldi can each help with specific parts of the pipeline, or excel at specific types of machine learning model. However, they can also restrict you, and you can end up maintaining many different workflows to produce a single model, or even several.

Some toolkits have scaling implemented on top of certain job engines (SGE, Kubernetes, Nomad). Even so, it may take some investigation to work out how to scale in these environments. Shared storage for job queues can be your biggest single bottleneck, yet it is often the simplest thing to set up.
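To make the shared-storage queue idea concrete, here is a minimal sketch of one: jobs are JSON files in a "pending" directory on the shared filesystem, and workers claim them with an atomic rename. The directory layout and job format are illustrative assumptions, not any particular toolkit's API.

```python
import json
import os
import uuid

QUEUE_DIR = "queue"  # assumed to live on the shared filesystem

def submit(payload, queue_dir=QUEUE_DIR):
    """Write a job file into pending/; the final rename makes it visible atomically."""
    pending = os.path.join(queue_dir, "pending")
    os.makedirs(pending, exist_ok=True)
    job_id = uuid.uuid4().hex
    tmp = os.path.join(pending, ".%s.tmp" % job_id)
    final = os.path.join(pending, "%s.json" % job_id)
    with open(tmp, "w") as f:
        json.dump(payload, f)
    os.rename(tmp, final)  # atomic on POSIX filesystems
    return job_id

def claim(queue_dir=QUEUE_DIR):
    """Claim one pending job by renaming it into claimed/; return (id, payload) or None."""
    pending = os.path.join(queue_dir, "pending")
    claimed = os.path.join(queue_dir, "claimed")
    os.makedirs(claimed, exist_ok=True)
    for name in sorted(os.listdir(pending)):
        if not name.endswith(".json"):
            continue
        src = os.path.join(pending, name)
        dst = os.path.join(claimed, name)
        try:
            os.rename(src, dst)  # only one worker wins the rename
        except OSError:
            continue  # another worker claimed it first
        with open(dst) as f:
            return name[: -len(".json")], json.load(f)
    return None
```

The rename is where the bottleneck shows up at scale: every worker hammers the same shared directory, which is exactly why this is both the simplest setup and the first thing to saturate.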

Some Examples

Nomad

This wraps a Nomad call to run a single normalisation task as a specific job (you could instead have a single job with multiple tasks). The idea is that you can submit thousands of these jobs, and if any specific ones fail you can investigate them, ideally without stopping the pipeline.

#!/usr/bin/env python3

import sys
import uuid
import subprocess

SUBMIT_JOB = ["nomad", "job", "run", "-address=http://127.0.0.1:4646", ""]

NRMLJOB = """job "<PARENT>_normalise_task_<HASH>" {
  datacenters = ["dc1"]
  type = "batch"

  group "<PARENT>" {
    count = 1
    task "normalise" {
      driver = "docker"
      config {
        image = "nomadal:10"
        command = "/bin/bash"
        args = ["-c", "python norm.py <DBL> <LANGUAGE> <WORKDIR>"]
        volumes = [
          "/nomad/workdir:/root/ME"
        ]
      }

      resources {
        memory = 512
        cpu = 200
      }
    }
  }
}
"""

parent, dbl, lang, workdir = sys.argv[1:5]

# Fill in the template; the unique ID keeps concurrent submissions from clashing
jobid = uuid.uuid4().hex
jobspec = (NRMLJOB
           .replace("<PARENT>", parent)
           .replace("<DBL>", dbl)
           .replace("<LANGUAGE>", lang)
           .replace("<WORKDIR>", workdir)
           .replace("<HASH>", jobid))

# Nomad job specifications are HCL, so use a .nomad extension rather than .yaml
jobfile = "%s.nomad" % jobid
with open(jobfile, "w") as jt:
    jt.write(jobspec)

SUBMIT_JOB[-1] = jobfile
print(SUBMIT_JOB)
try:
    subprocess.check_call(SUBMIT_JOB)
except subprocess.CalledProcessError as ex:
    # Log the failure but don't stop the wider pipeline
    print("submission failed: %s" % ex, file=sys.stderr)
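The point about submitting thousands of jobs and carrying on past individual failures can be sketched as a small driver pattern. Here `submit_one` is a stand-in for whatever actually submits a job (for example, a subprocess call to the wrapper above); the name and signature are assumptions for illustration.

```python
def submit_all(jobs, submit_one):
    """Submit every job; return a list of (job, error) pairs for the failures."""
    failures = []
    for job in jobs:
        try:
            submit_one(job)
        except Exception as ex:
            failures.append((job, ex))  # record the failure and carry on
    return failures
```

The failures list is what you come back to investigate afterwards; nothing in the loop stops the pipeline.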

SGE

The most obvious example is Kaldi’s queue.pl, which wraps qsub submissions to an SGE cluster.
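In the same spirit as the Nomad wrapper, a Python sketch of what such a wrapper builds is a qsub command line. The flags below are standard SGE options, but resource names like mem_free are site-specific and may differ on your cluster; this is a rough illustration, not queue.pl itself.

```python
def build_qsub_command(job_name, log_file, script, mem="512M", slots=1):
    """Build the qsub argv for a batch job; the caller passes it to subprocess."""
    return [
        "qsub",
        "-N", job_name,              # job name shown in qstat
        "-o", log_file,              # stdout log file
        "-j", "y",                   # merge stderr into stdout
        "-l", "mem_free=%s" % mem,   # memory request (site-specific resource name)
        "-pe", "smp", str(slots),    # parallel environment / slot count
        "-cwd",                      # run from the submission directory
        script,
    ]
```

A caller would then hand the result to subprocess.check_call, exactly as the Nomad example does with its own command list.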

Kubernetes

Kubernetes has a ‘batch’ Job resource that you can take advantage of. This is the example given in the official documentation.

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
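The templating trick from the Nomad wrapper carries over directly: render a per-task Job manifest with a unique name, write it to a file, and hand it to `kubectl apply -f`. The image and command below are illustrative, reusing the normalisation task from earlier.

```python
import uuid

JOB_TEMPLATE = """apiVersion: batch/v1
kind: Job
metadata:
  name: normalise-<HASH>
spec:
  template:
    spec:
      containers:
      - name: normalise
        image: nomadal:10
        command: ["python", "norm.py", "<DBL>", "<LANGUAGE>", "<WORKDIR>"]
      restartPolicy: Never
  backoffLimit: 4
"""

def render_job(dbl, language, workdir):
    """Return (job_name, manifest) with a unique name so Jobs don't collide."""
    job_hash = uuid.uuid4().hex[:8]
    manifest = (JOB_TEMPLATE
                .replace("<HASH>", job_hash)
                .replace("<DBL>", dbl)
                .replace("<LANGUAGE>", language)
                .replace("<WORKDIR>", workdir))
    return "normalise-%s" % job_hash, manifest
```

Writing the manifest to disk and running `kubectl apply -f <file>` via subprocess mirrors the Nomad submission step one-for-one.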

Final Words

What is the best solution? Well, none really!

Try to find the best path based on your toolsets, your engineers and the requirements. Start with a pipeline complex enough that you can prototype with confidence. It might take you longer to reach a final solution, but it will be one where all the issues are understood.

Another approach is to iterate, but you must accept that at any point you may drop a toolset or some code, or change and adapt the workflow. Although not entirely ‘agile’, this can fit an agile process better than one ‘big’ rewrite.


James Dobson is a Software Development Engineer at Speechmatics. Speechmatics is a machine learning company which specialises in automatic speech recognition.