The Shipyard — Part 1

Joseph Haaga
Interos Engineering
9 min read · Dec 15, 2021

Our north star is to map, model, and monitor global supply chains in real time. Given the sheer scale of the problem, it’s no surprise that a human, analyst-led approach faces limitations in scope and scale. Machine learning offers a viable [although challenging] means to surface insights at scale, and continuously over time, by ingesting, tagging, and scoring enormous amounts of structured and unstructured data. With the average global brand’s supply chain consisting of tens — if not hundreds — of thousands of suppliers, machine learning is essential for surfacing real-time risks and providing visibility across such complex networks.

Last year, we set out to build a best-in-class Machine Learning Platform, offering our engineers the flexibility to iterate on models quickly, and to package and publish those models in a way that lets our data ingestion pipeline easily leverage their predictive power.

As the ML Platform team approaches our one-year anniversary, we’d like to take a moment and reflect on the past twelve months, evaluating what we’ve built and sharing our lessons learned with the broader engineering community. This blog post is the first of many in a series we’re calling “The Shipyard”, a nod to our team’s maritime iconography. Each post in the series will detail a component of our ML Platform, describing what it does, how it works, and why we built it. To kick off the series, we’ll be exploring Project formats and automated compliance!

Our Guiding Light

Before diving into project formats, we’d like to introduce a concept that is central to our thinking: The Model Flywheel.

The Model Flywheel is the general cycle of training, testing, packaging, deploying, monitoring and maintaining models. Your enterprise buddies might call it a “model development lifecycle” (though it lacks any decommissioning/retirement step), and maybe they wouldn’t be wrong. However, “flywheel” connotes the momentum and continuous improvement required to transform developer and user experience.

We think it looks a little something like this:

By using The Model Flywheel as a starting point, we center our design around developer experience, and ensure design decisions are made with empathy for end users [even if it requires a little more Ops-work]!

This abstraction also makes it easier to categorize and group functionality into distinct “products”, many of which we’ll cover in future Shipyard posts:

  • Model Project Template (in blue)
  • Boarding Pass (in green)
  • Model Promotion Pipeline (in pink)
  • Data Feedback Loop (in teal)
  • Experimentation Pipeline (in yellow)

Our project format

The source code for every machine learning model project on the ML Platform looks like this:

$ tree . -I venv
.
├── README.md
├── .gitlab-ci.yml
├── model.py
├── train.py
├── evaluate.py
├── serve.py
├── openapi.yaml
├── requirements.txt
├── tests
│   ├── conftest.py
│   ├── test_evaluate.py
│   ├── test_serve.py
│   ├── test_model.py
│   └── test_train.py
├── datasets
│   ├── eval.csv
│   └── train.csv
└── weights
    ├── classifier
    │   ├── config.json
    │   ├── pytorch_model.bin
    │   └── tokenizer
    │       ├── README.md
    │       ├── config.json
    │       ├── model_hash.txt
    │       ├── pytorch_model.bin
    │       ├── tokenizer.json
    │       ├── tokenizer_config.json
    │       └── vocab.txt
    └── entity-extraction
        ├── config.json
        ├── config_ner.json
        ├── pytorch_model.bin
        └── tokenizer
            ├── README.md
            ├── config.json
            ├── model_hash.txt
            ├── pytorch_model.bin
            ├── tokenizer.json
            ├── tokenizer_config.json
            └── vocab.txt

README.md contains boilerplate snippets for running the model locally, along with links to useful resources, including:

  • the model weights and training datasets in S3
  • the built container images in ECR
  • the live deployments in each environment
  • the live deployments’ logs in Datadog
  • the live deployments’ APM dashboards in Datadog
  • the parent Jira ticket/epic and Confluence document for the model project

.gitlab-ci.yml contains a reference to our centralized CI pipeline (the Model Promotion Pipeline), as well as a path to the timestamped model weights in S3

  • This is how each Release of a model is associated with a given set of model weights

model.py contains the actual implementation code for the model. These definitions are imported into train.py, evaluate.py, and serve.py as-needed.
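
As a rough illustration (the framework, class name, and file layout below are assumptions based on the weights/ tree above, not our production code), model.py for a text classifier might look something like this:

# model.py -- a hypothetical sketch of a text classifier backed by the
# Hugging Face weights stored under weights/classifier in the tree above
from typing import List

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class Classifier:
    """Core model logic, imported by train.py, evaluate.py, and serve.py."""

    def __init__(self, weights_dir: str = "weights/classifier"):
        self.tokenizer = AutoTokenizer.from_pretrained(f"{weights_dir}/tokenizer")
        self.model = AutoModelForSequenceClassification.from_pretrained(weights_dir)
        self.model.eval()

    def predict(self, texts: List[str]) -> List[int]:
        inputs = self.tokenizer(texts, return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            logits = self.model(**inputs).logits
        return logits.argmax(dim=-1).tolist()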

train.py is the training script, run by the Experimentation Pipeline to generate a new set of model weights, provided the user supplies (see the sketch after this list):

  • A training dataset
  • An evaluation dataset
  • (Optionally) an existing set of model weights to fine-tune from
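
A minimal sketch of what that script might look like (the flag names, use of argparse, and output layout are illustrative assumptions, not our actual Experimentation Pipeline contract):

# train.py -- a hypothetical sketch; flag names and output layout are illustrative
import argparse
from datetime import datetime, timezone


def main() -> None:
    parser = argparse.ArgumentParser(description="Train a new set of model weights")
    parser.add_argument("--train-dataset", default="datasets/train.csv")
    parser.add_argument("--eval-dataset", default="datasets/eval.csv")
    parser.add_argument("--base-weights", default=None,
                        help="optional existing weights to fine-tune from")
    args = parser.parse_args()

    # ... fit the model defined in model.py on args.train_dataset ...

    # write weights to a timestamped directory, which later gets synced to S3
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    output_dir = f"weights/{timestamp}"
    print(f"Saving new weights to {output_dir}")


if __name__ == "__main__":
    main()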

evaluate.py is the evaluation script, run by the Experimentation Pipeline to evaluate a newly-trained model weight set.
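
Again, purely as a sketch (the CSV column names and the choice of accuracy as the metric are assumptions; real projects pick their own metrics):

# evaluate.py -- a hypothetical sketch; column names and metric are illustrative
import argparse
import csv

from model import Classifier  # the definitions from model.py


def main() -> None:
    parser = argparse.ArgumentParser(description="Evaluate a set of model weights")
    parser.add_argument("--eval-dataset", default="datasets/eval.csv")
    parser.add_argument("--weights", default="weights/classifier")
    args = parser.parse_args()

    classifier = Classifier(args.weights)
    with open(args.eval_dataset) as f:
        rows = list(csv.DictReader(f))
    predictions = classifier.predict([row["text"] for row in rows])
    accuracy = sum(
        str(p) == row["label"] for p, row in zip(predictions, rows)
    ) / len(rows)
    print(f"accuracy: {accuracy:.3f}")


if __name__ == "__main__":
    main()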

serve.py is the actual API implementation, built on kserve

  • Users subclass kserve.KFModel, and implement the load() and predict() methods
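
In practice that looks roughly like the sketch below (class and model names are made up, and the server start-up call at the bottom varies slightly between kfserving and kserve releases):

# serve.py -- a minimal sketch of the KFModel pattern described above
from typing import Dict

import kserve

from model import Classifier  # the definitions from model.py


class GreatBlogPostDetector(kserve.KFModel):
    def __init__(self, name: str):
        super().__init__(name)
        self.classifier = None
        self.ready = False

    def load(self):
        # deserialize the weights (pulled into weights/ from S3) into memory
        self.classifier = Classifier("weights/classifier")
        self.ready = True

    def predict(self, request: Dict) -> Dict:
        return {"predictions": self.classifier.predict(request["instances"])}


if __name__ == "__main__":
    model = GreatBlogPostDetector("great-blog-post-detector")
    model.load()
    kserve.KFServer().start([model])  # newer kserve releases call this ModelServer()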

openapi.yaml contains the OpenAPI spec for the API implemented in serve.py

requirements.txt lists all the dependencies required to run the project

  • Since serve.py is the application that runs in the target deployment environment, we encourage strict pinning
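
In other words, == pins rather than open-ended version ranges (the packages and versions below are purely illustrative):

kserve==0.7.0
ddtrace==0.57.0
transformers==4.12.5
torch==1.10.0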

tests/ contains the test suite

  • These tests are run in the Model Promotion Pipeline, where we can set minimum coverage gates

datasets/ is an empty folder that we .gitignore, but it helps keep local developer environments organized.

  • The README.md contains snippets for aws s3 cp-ing the latest dataset into datasets/

weights/ is an empty folder that we .gitignore, but it helps keep local developer environments organized.

  • The README.md contains snippets for aws s3 cp-ing the latest model weights into weights/

Standards

By now you may be thinking “wow, governance is an interesting place to start when talking about building an ML Platform”. And while it is by no means the most fun part of our job, the benefits can’t be overstated.

These standards act like an interface, promoting interoperability (an engineer on one project can understand/support other projects very easily), and enabling us (the Platform team) to design/implement/rollout new functionality that works seamlessly with all models on the Platform.

Documenting standards

The decisions made in designing our project template format are described in a Model Standards document. This document and the standards within are collectively maintained by all stakeholders and stewarded by the ML Platform team. They represent a collective agreement that “this is the way things should look”.

Some select items from our current Model Standards include:

1. Model weights must be stored in s3://ml-platform-bucket/<PROJECT_NAME>/models/<TRAINING_TIMESTAMP_IN_ISO_8601_FORMAT>

4. Inference logic must be implemented in serve.py by subclassing kfserving.KFModel or kserve.KFModel

6. Built container images must have values for the following labels defined:

author, version, model, org.label-schema.version, org.label-schema.name, org.label-schema.build-date, org.label-schema.vcs-ref, org.label-schema.vcs-url, org.label-schema.vendor, org.label-schema.schema-version

10. serve.py must call ddtrace.patch(tornado=True) to support Datadog APM Tracing
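
That last one amounts to a couple of lines at the very top of serve.py (patching before kserve is imported is our assumption about the safest ordering):

# top of serve.py -- patch tornado (which kserve serves over) before importing kserve,
# so every inference request shows up in Datadog APM
import ddtrace

ddtrace.patch(tornado=True)

import kserve  # noqa: E402  (imported after patching on purpose)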

Enforcing standards

Establishing standards is a great start, but the real challenge is enforcing adherence to those standards.

Just because we all agreed to do something doesn’t necessarily guarantee mistakes won’t get made, things won’t be forgotten, etc. This rings especially true when the agreed-upon mandates are small technical details among complex codebases.

As a team with Operational [read: On-call] and Support responsibilities, we know the cost of confusing error messages. Developers get tired of opening support tickets for issues they could ultimately resolve on their own. And Platform Engineers feel burned out when constantly context-switching between service desk tickets and their actual work.

Acknowledging the human elements of the problem, and the high ROI of automating compliance checks, we built snitch.

snitch is a small Python utility that assesses whether model projects adhere to all the standards in the Model Standards document.

Running the compliance checks is as easy as pip install-ing snitch (from our private PyPI) and running snitch -n {MODEL_NAME}:

$ snitch -n iris
Running snitch on iris model at /Users/josephhaaga/Documents/Code/Models/deploy-iris-model, using tests from /usr/local/lib/python3.8/site-packages/snitch
====================== test session starts ========================
platform darwin -- Python 3.8.2, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /Users/josephhaaga/Documents/Code
collected 35 items
src/snitch/test_standard_1.py .. [ 5%]
src/snitch/test_standard_10.py .. [ 11%]
src/snitch/test_standard_2.py ss [ 17%]
src/snitch/test_standard_3.py . [ 20%]
src/snitch/test_standard_4.py ..... [ 34%]
src/snitch/test_standard_5.py .... [ 45%]
src/snitch/test_standard_7.py ......... [ 71%]
src/snitch/test_standard_8.py s..... [ 88%]
src/snitch/test_standard_9.py .... [100%]
More information about these tests (and how to pass them) is available in the Model Standards document on Confluence
(https://interos.atlassian.net/wiki/spaces/MLOPS/pages/928186641/Model+Standards).
================== 32 passed, 3 skipped in 0.98s ==================

Under the hood, each standard is codified into our test suite, and a simple Click CLI runs pytest on the target directory.
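
A stripped-down sketch of that entry point (not the production code; in particular, how the model name is resolved to a project directory is simplified here):

# cli.py -- a simplified sketch of snitch's Click entry point
import os
import pathlib
import sys

import click
import pytest


@click.command()
@click.option("-n", "--name", "model_name", required=True, help="name of the model project to check")
@click.option("--project-dir", default=".", help="path to the model project")
def main(model_name: str, project_dir: str) -> None:
    """Run every Model Standard check against a model project."""
    tests_dir = pathlib.Path(__file__).parent  # the packaged test_standard_*.py modules
    project = pathlib.Path(project_dir).resolve()
    click.echo(f"Running snitch on {model_name} model at {project}, using tests from {tests_dir}")
    # the fixtures read files (serve.py, openapi.yaml, ...) relative to the project,
    # so run pytest from inside it; a non-zero exit code fails the CI job
    os.chdir(project)
    sys.exit(pytest.main([str(tests_dir)]))


if __name__ == "__main__":
    main()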

For example, a portion of src/snitch/test_standard_4.py looks like this:

# Inference logic should be implemented in serve.py by subclassing
# kfserving.KFModel or kserve.KFModel
import ast
import pathlib
from typing import List

import pytest

# fixtures omitted for brevity


def test_serve_py_defines_subclass_of_kfmodel(subclasses_of_kfmodel_in_serve_py: List[ast.ClassDef]):
    assert (
        len(subclasses_of_kfmodel_in_serve_py) > 0
    ), "No subclass of kfserving.KFModel or kserve.KFModel found in serve.py"


def test_serve_py_implements_load(subclasses_of_kfmodel_in_serve_py: List[ast.ClassDef]):
    assert any(
        [
            (isinstance(function, ast.FunctionDef) and function.name == "load")
            for cd in subclasses_of_kfmodel_in_serve_py
            for function in cd.body
        ]
    ), (
        "Subclass of kfserving.KFModel or kserve.KFModel in "
        "serve.py doesn't implement load()"
    )


def test_serve_py_implements_predict(subclasses_of_kfmodel_in_serve_py: List[ast.ClassDef]):
    assert any(
        [
            (isinstance(function, ast.FunctionDef) and function.name == "predict")
            for cd in subclasses_of_kfmodel_in_serve_py
            for function in cd.body
        ]
    ), (
        "Subclass of kfserving.KFModel or kserve.KFModel in "
        "serve.py doesn't implement predict()"
    )

And if I decide to break a rule (say, Model Standard #4) by deleting predict() in serve.py, I would see the following:

$ snitch -n iris
Running snitch on iris model at /Users/josephhaaga/Documents/Code/Models/deploy-iris-model, using tests from /usr/local/lib/python3.8/site-packages/snitch
======================= test session starts =======================
platform darwin -- Python 3.8.2, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /Users/josephhaaga/Documents/Code
collected 35 items
src/snitch/test_standard_1.py .. [ 5%]
src/snitch/test_standard_10.py .. [ 11%]
src/snitch/test_standard_2.py ss [ 17%]
src/snitch/test_standard_3.py . [ 20%]
src/snitch/test_standard_4.py ....F [ 34%]
src/snitch/test_standard_5.py .... [ 45%]
src/snitch/test_standard_7.py ......... [ 71%]
src/snitch/test_standard_8.py s..... [ 88%]
src/snitch/test_standard_9.py .... [100%]
============================= FAILURES =============================
_________________________________________ test_serve_py_implements_predict _________________________________________
subclasses_of_kfmodel_in_serve_py = [<_ast.ClassDef object at 0x1041103a0>]

    def test_serve_py_implements_predict(subclasses_of_kfmodel_in_serve_py: List[ast.ClassDef]):
>       assert any(
            [
                (isinstance(function, ast.FunctionDef) and function.name == "predict")
                for cd in subclasses_of_kfmodel_in_serve_py
                for function in cd.body
            ]
        ), "Subclass of kfserving.KFModel or kserve.KFModel in serve.py doesn't implement predict()"
E       AssertionError: Subclass of kfserving.KFModel or kserve.KFModel in serve.py doesn't implement predict()
E       assert False
E        +  where False = any([False, False, False])

src/snitch/test_standard_4.py:75: AssertionError
More information about these tests (and how to pass them) is available in the Model Standards document on Confluence
(https://interos.atlassian.net/wiki/spaces/MLOPS/pages/928186641/Model+Standards).
===================== short test summary info =====================
FAILED src/snitch/test_standard_4.py::test_serve_py_implements_predict - AssertionError: Subclass of kfserving.KF...
============= 1 failed, 31 passed, 3 skipped in 1.32s =============

snitch has become a fundamental part of our CI/CD pipeline (the Model Promotion Pipeline), and could easily be integrated into your CI/CD tool of choice. The portability of a pip-installable tool is an advantage, allowing users to run compliance checks locally (reducing “shotgun CI”), and even to integrate snitch into their existing nox or pre-commit workflows.

Fostering compliance

Now we’ve got standards and a way to enforce them, but how do we encourage adherence without irritating users? For this, we draw inspiration from high school chemistry (or, at least, our recollections of it):

Fortunately, it’s possible to lower the activation energy of a reaction, and to thereby increase reaction rate. The process of speeding up a reaction by reducing its activation energy is known as catalysis, and the factor that’s added to lower the activation energy is called a catalyst.

- Khan Academy’s AP Biology course

To that end, we built a cookiecutter template called the model-project-template, which renders a compliant model project repository structure and uses template variables to fill in the appropriate values as needed. This starts new projects off “on the right foot” and saves engineers valuable time that can instead be spent developing new models.

$ cookiecutter https://path.to/model-project-template/on/gitlab
# A nice name
[Project Name]: Great Blog Post Detector
# Brief description (for openapi.yaml and README.md)
[Project Description]: Detect whether a blog post is great
# Link to project page on Confluence
[Confluence URL]: https://path.to/project/page/on/confluence
# Link to project Story or Epic on Jira
[Jira URL]: https://path.to/project/epic/on/jira
# Email address of Directly Responsible Individual
[Author Email]: owners-email@interos.ai

By making adherence seamless and the “default” way of doing things, we can all but guarantee that the majority of projects will “play by the rules”. In fact, users would have to go out of their way to break the rules, since new projects are already compliant right out of the gate!

Summary

By first establishing the standards (and making it easy to comply with them), we’ve now decoupled the ML Platform engineering efforts from the model-development efforts. The ML Platform team is free to continue building ML Platform functionality, while ML Engineers can set forth and train, test, build, and deploy models, confident that any new ML Platform functionality will work with their models in a turn-key manner.

Keep an eye out for The Shipyard — Part 2 as we continue to detail the what, why, and how of our ML Platform. In our next post, we’ll discuss Onboarding a new model, taking a closer look at Boarding Pass and the Model Project Template. Stay tuned!

Interested in learning more? Ping us @interos on the ML Ops Community Slack! Want to apply your DevOps, machine learning, and DataOps skills to projects like snitch? Apply for our Senior Machine Learning Engineer, MLOps position!
