The Shipyard — Part 2

Stephan Brown
Interos Engineering
5 min read · Jan 16, 2022

Engineering teams are too often encumbered by rote deployment tasks, which limits the time available for designing and evaluating models. As part of our new ML Lifecycle workflow, we focused on addressing this challenge to streamline our processes and help our engineers spend more time on innovation. This blog is the second in our Shipyard series and introduces our Boarding Pass tool, which automates the onboarding of new models onto our platform, saving significant time and resources.

The Model Flywheel representation of our ML Lifecycle

Building the Boarding Pass

Previously, onboarding models involved running several different scripts, and updating our deployment manifest repos. These scripts created the project repository in GitLab, a container image repository in AWS Elastic Container Registry (ECR), and a space for model weights & datasets in S3. The final step, updating our deployment manifest repos, involved copy/pasting an existing model’s deployment manifests and using sed to replace model-specific values. This was an error-prone and time-consuming process.

Building the Boarding Pass mostly involved connecting several of those onboarding processes and running them automatically: creating resources in Amazon Web Services (AWS), creating a GitLab repo where the responsible scientist or engineer can iterate on their model, and scaffolding a project structure compliant with our model standards. The Boarding Pass integrates with many other platform tools, such as the Model Promotion Pipeline, Data Feedback Loop, and our Application Performance Monitoring (APM).

The ML Platform uses a GitOps approach wherever possible to better support traceability and change management: all of our deployments are stored in code. Boarding Pass integrates with the Model Promotion Pipeline by using the GitLab API to commit to our environment manifest repositories, which contain the deployment manifest for each model. Boarding Pass also interacts with other applications and tools in our environments, like Jira, ECR, and Amazon Simple Storage Service (S3). The process effectively has multiple stages, which run in a GitLab CI/CD pipeline every time there is a commit to the main branch.

Boarding Pass uses the GitLab API to create a repo and then commits a project structure seeded from the Model Project Template. The Model Project Template follows the naming convention deploy-model_name-model and includes the serving code, model code, and training code. This is the primary entry point for our ML Engineers to iterate on their model.
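A client such as python-gitlab can drive this step; the sketch below shows roughly how a project can be created and seeded in a single commit. The GITLAB_URL, GROUP_ID, token handling, and create_model_repo helper are illustrative assumptions rather than our exact implementation.

import os
from pathlib import Path

import gitlab

GITLAB_URL = "https://our.gitlab.host"  # illustrative
GROUP_ID = 1234  # illustrative namespace that holds our model repos

def create_model_repo(project_slug: str, rendered_dir: str):
    gl = gitlab.Gitlab(GITLAB_URL, private_token=os.environ["GITLAB_TOKEN"])
    project = gl.projects.create(
        {"name": f"deploy-{project_slug}-model", "namespace_id": GROUP_ID}
    )
    # Commit the project structure rendered from the Model Project Template in one commit
    actions = [
        {
            "action": "create",
            "file_path": str(path.relative_to(rendered_dir)),
            "content": path.read_text(),
        }
        for path in Path(rendered_dir).rglob("*")
        if path.is_file()
    ]
    project.commits.create(
        {
            "branch": "main",
            "commit_message": "Seed project from Model Project Template",
            "actions": actions,
        }
    )
    return project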

First, Boarding Pass uses boto3 to create a repository in ECR to store the Docker image, and sets the appropriate IAM resource policies and tag immutability.

import json

import boto3

def make_repo(project_name: str) -> None:
    client = boto3.client("ecr", region_name="us-east-1")
    print(f"Creating ECR repository for {project_name}")
    client.create_repository(
        repositoryName=project_name, imageTagMutability="IMMUTABLE"
    )
    print(f"Setting repo policy for {project_name}")
    client.set_repository_policy(
        # POLICY is the IAM resource policy document defined elsewhere in Boarding Pass
        repositoryName=project_name, policyText=json.dumps(POLICY)
    )

Next, Boarding Pass again uses boto3 to create folder-style prefixes in our ML Platform S3 bucket for storing the model weights and dataset.

def make_project(project_name: str) -> None:
    print(f"Creating {project_name} in s3://{BUCKET_NAME}")
    client = boto3.client("s3")
    # Zero-byte keys ending in "/" act as folder placeholders for the project
    client.put_object(Bucket=BUCKET_NAME, Key=(project_name + "/models/"))
    print(f"Created s3://{BUCKET_NAME}/{project_name}/models/")
    client.put_object(Bucket=BUCKET_NAME, Key=(project_name + "/data/"))
    print(f"Created s3://{BUCKET_NAME}/{project_name}/data/")

Lastly, the Boarding Pass uses Cookiecutter alongside the GitLab API to render the new model’s deployment manifests to the appropriate environment manifest repositories.

import logging

from cookiecutter.main import cookiecutter

log = logging.getLogger(__name__)

def template_repo() -> None:
    # get_model_info() returns the onboarding metadata parsed from values.yml (shown below)
    cookiecutter_context = get_model_info()
    log.info(cookiecutter_context)
    cookiecutter(
        "https://our.gitlab.host/model-project-template",
        no_input=True,
        overwrite_if_exists=True,
        output_dir="/tmp",
        extra_context=cookiecutter_context,
    )

def template_manifest(env: str, aws_env: str, directory: str) -> None:
    cookiecutter_context = get_model_info()
    project_slug = cookiecutter_context["project_slug"]
    context = {"project_slug": project_slug}
    template = "https://our.gitlab.host/manifest-template.git"
    cookiecutter(
        template,
        directory=directory,  # subdirectory of the manifest template repo to render
        no_input=True,
        overwrite_if_exists=True,
        output_dir=f"/tmp/manifest-{aws_env}-{env}",
        extra_context=context,
    )
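Once rendered, the manifests are committed to the matching environment manifest repository over the GitLab API. A minimal sketch of that hand-off with python-gitlab follows; the repository path, branch name, and push_manifest helper are assumptions for illustration.

import os
from pathlib import Path

import gitlab

def push_manifest(env: str, aws_env: str, project_slug: str) -> None:
    gl = gitlab.Gitlab("https://our.gitlab.host", private_token=os.environ["GITLAB_TOKEN"])
    # Illustrative project path; each environment has its own manifest repository
    repo = gl.projects.get(f"ml-platform/manifest-{aws_env}-{env}")
    rendered = Path(f"/tmp/manifest-{aws_env}-{env}") / project_slug
    actions = [
        {
            "action": "create",
            "file_path": f"{project_slug}/{path.relative_to(rendered)}",
            "content": path.read_text(),
        }
        for path in rendered.rglob("*")
        if path.is_file()
    ]
    repo.commits.create(
        {
            "branch": "main",
            "commit_message": f"Onboard {project_slug}",
            "actions": actions,
        }
    )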
A look at a successful GitLab pipeline run for onboarding a model.

Onboarding a new model is easy! Simply update the following fields in values.yml, and push the changes to the Boarding Pass repository’s main branch to kick off the pipeline.

project_name: My Cool Model
project_slug: my-cool-model
project_description: A model that does cool things for our customers
confluence_url: https://OUR_JIRA_SERVER/wiki/my-model
jira_url: https://OUR_JIRA_SERVER/browse/MLR-99999
author_email: *****@interos.ai

All of this data is parsed by Boarding Pass during a pipeline run and used in the pipeline stages.
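The get_model_info() calls in the snippets above are essentially a thin wrapper around this file. A minimal sketch, assuming PyYAML and a values.yml at the root of the Boarding Pass repository:

import yaml

def get_model_info(path: str = "values.yml") -> dict:
    # Load the onboarding metadata committed to the Boarding Pass repo
    with open(path) as f:
        values = yaml.safe_load(f)
    # Fields such as project_slug and project_name feed the cookiecutter contexts above
    return values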

The Benefits of a GitOps approach to Onboarding

This approach to onboarding new models has a lot of benefits, the biggest of which is traceability. We can simply look at our commit history to find when a model was onboarded, and we can see the associated pipeline run and the logs for each stage, which makes problems easy to identify and quickly resolve.

Challenges

One of the biggest challenges we experienced when building the Boarding Pass was deciding how “self-service” the process should be. Should we give our data scientists and ML Engineers the ability to onboard their own models directly, or should we leave it to the ML Platform team? For now, we decided to forgo self-serve onboarding to avoid potential problems, such as onboarded models not meeting our standards or accidental changes to the pipeline, but we may revisit this in the future as demand for onboarded models increases.

Future Work

The initial iteration of Boarding Pass has created great value for our team, and we plan on automating additional aspects of the onboarding process in the future. Upcoming features, such as creating a Service in JIRA Service Desk (for incident reporting) and Slack channels for publishing model predictions and APM alerts, are on the roadmap for v0.2.0.

The bigger value-add is supporting additional custom boilerplates for other common model architectures we see our scientists create, which means less time modifying the templated model code once it’s onboarded. We’d also like to push metadata, such as base model architecture and original author, from the model into our model registry at creation.

Though we’re not yet finished with Boarding Pass, we’ve already experienced a 20x reduction in time spent on model onboarding tasks. What previously took 60 minutes on average now takes three, and the potential for human-error has been reduced significantly.

Stay tuned for our next post in The Shipyard series, covering our approach to automating model deployments & promotions with the Model Promotion Pipeline.

Interested in learning more? Ping us @interos on the ML Ops Community Slack! Want to apply your DevOps, machine learning, and DataOps skills to projects like Boarding Pass? Apply for our Senior Machine Learning Engineer, MLOps position!
