Continuous Integration with AWS SageMaker and CircleCI

Kaiyi Huang
Broadlume Product Development
3 min read · Jul 23, 2019

At AdHawk, we use AWS SageMaker to train and deploy the data science models that send customers tips, or "insights," on optimizing their web presence. Company-wide, we strive to follow a continuous integration/continuous delivery model of software development, so it was natural to extend this rigorous process to our data science team as well.

The pattern we follow is very similar to the approach illustrated by AWS. The difference is that, rather than coupling ourselves entirely to AWS cloud-native services, we opted to leverage our existing CircleCI setup. More importantly, we introduced the concept of environments, which lets us deploy new versions and roll back to previous ones.

So, without further ado, here are the elements of our data science CI process.

Docker “Golden” Image

We use a single Docker "golden" image as our build environment, both for interacting with the SageMaker Python SDK and for serving models behind an nginx server during local testing. This lets us lock down versions and guarantee a common environment for local testing and deployment to AWS alike.
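
If the image follows SageMaker's standard serving contract (a /ping health check and an /invocations inference route on port 8080), a local smoke test can be as small as the sketch below. The payload and URLs here are illustrations, not our actual model input.

```python
# Hypothetical smoke test against the golden image running locally,
# e.g. after `docker run -p 8080:8080 <image> serve`.
import requests

BASE_URL = "http://localhost:8080"

def test_container_is_healthy():
    # SageMaker's health check simply expects a 200 from /ping.
    assert requests.get(f"{BASE_URL}/ping", timeout=5).status_code == 200

def test_container_returns_inference():
    # Payload shape is model-specific; this CSV row is only an example.
    response = requests.post(
        f"{BASE_URL}/invocations",
        data="5.1,3.5,1.4,0.2",
        headers={"Content-Type": "text/csv"},
        timeout=30,
    )
    assert response.status_code == 200
```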


Handling Multiple Environments and Artifact Versioning

Since SageMaker is a managed service with no built-in concept of separate deployment environments, we delineate environments by naming convention. We currently maintain a Staging and a Production environment, each with its own parallel set of Docker images, models, and SageMaker endpoints. Our commits in GitHub serve as the source of truth for versioning, so we suffix our image and model names with the Git commit hash, and the staging endpoint with "-staging."
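
As a rough illustration (the helper and model names here are hypothetical, not our production code), the naming convention boils down to something like this. CIRCLE_SHA1 is CircleCI's built-in commit-hash variable.

```python
import os
import subprocess

def current_commit() -> str:
    # Prefer CircleCI's built-in variable; fall back to git when running locally.
    sha = os.environ.get("CIRCLE_SHA1")
    if sha is None:
        sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    return sha[:7]

def artifact_names(model: str, environment: str) -> dict:
    sha = current_commit()
    return {
        "image": f"{model}:{sha}",        # Docker image tag pushed to ECR
        "model": f"{model}-{sha}",        # SageMaker model name
        "endpoint": f"{model}-staging" if environment == "staging" else model,
    }

print(artifact_names("insights-model", "staging"))
```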


CircleCI Deployment Steps

Upon a feature being accepted and merged into master:

  1. Determine which models were modified in the latest commit
  2. Bake each modified model into a Docker image and push it to Amazon Elastic Container Registry (ECR), with the commit hash as part of the name
  3. Call SageMaker to train the modified model and store the training output
  4. Deploy a web endpoint that serves inferences from the trained model (steps 3 and 4 are sketched below)
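
For steps 3 and 4, a condensed sketch using the SageMaker Python SDK might look like the following; the role ARN, bucket paths, image URI, instance types, and endpoint name are placeholders rather than our real configuration.

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
ROLE_ARN = "arn:aws:iam::123456789012:role/sagemaker-execution-role"  # placeholder
# Commit-tagged image pushed to ECR in step 2.
IMAGE_URI = "123456789012.dkr.ecr.us-east-1.amazonaws.com/insights-model:abc1234"

estimator = Estimator(
    image_uri=IMAGE_URI,
    role=ROLE_ARN,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://our-bucket/training-output/",  # where model artifacts land
    sagemaker_session=session,
)

# Step 3: train on the commit-tagged image and store the training output in S3.
estimator.fit({"train": "s3://our-bucket/training-data/insights/"})

# Step 4: stand up the HTTPS endpoint that serves inferences from the trained model.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.medium",
    endpoint_name="insights-model-staging",
)
```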

Promote to Production (upon a master commit being tagged):

  1. Find the image in ECR tagged with that commit hash
  2. Add the "latest" tag to that image (see the sketch below)
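
One way to implement the retag without pulling or rebuilding the image is boto3's ECR API: fetch the manifest of the commit-tagged image and register the same manifest under "latest." The repository name and tag below are placeholders.

```python
import boto3

ecr = boto3.client("ecr")

def promote_to_latest(repository: str, commit_tag: str) -> None:
    # Look up the manifest for the image tagged with the release commit.
    response = ecr.batch_get_image(
        repositoryName=repository,
        imageIds=[{"imageTag": commit_tag}],
    )
    manifest = response["images"][0]["imageManifest"]

    # Registering the same manifest under a new tag is effectively a retag.
    ecr.put_image(
        repositoryName=repository,
        imageManifest=manifest,
        imageTag="latest",
    )

promote_to_latest("insights-model", "abc1234")
```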

Areas for Improvement

Testing is conspicuously missing from our pipeline; it currently happens only locally, during model development. The obstacle is the non-determinism inherent in model training and inference, which makes outputs hard to compare against fixed expectations. One workaround might be to support deterministic sampling of the training data so that test results can be compared against expected outputs.
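
As a rough sketch of what that workaround could look like (function and file names are hypothetical), pinning the random state during sampling makes the sampled data, and therefore the test comparison, reproducible:

```python
import pandas as pd

def sample_training_data(path: str, fraction: float = 0.1, seed: int = 42) -> pd.DataFrame:
    df = pd.read_csv(path)
    # random_state makes the sample deterministic, so downstream assertions
    # can be checked against a stored expected result.
    return df.sample(frac=fraction, random_state=seed)

def test_sample_is_reproducible():
    first = sample_training_data("training_data.csv")
    second = sample_training_data("training_data.csv")
    assert first.equals(second)
```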

Conclusion

With this pipeline in place, the data science team's deployment tools and processes are aligned with those of the rest of the engineering team. Moreover, having multiple environments and syncing our deployed assets with version control means data science endpoints can be matched to the version and/or environment of the consuming application. All of this allows us to rapidly and consistently iterate on and serve our many data science models.

Flow diagram of the whole process (figure not shown).
