The Wehkamp ML platform

Harm Weites
Published in wehkamp-techblog
Jan 9, 2019

Last summer, we were looking for a tool to help us serve our Machine Learning models. Amazon SageMaker was a little too tightly tied to the AWS platform and we were looking for something slightly more flexible. We came across Clipper, which does exactly what we want: build and host containers. Read the original story in which we share what we uncovered and discovered as part of working out a simple proof of concept.

Today it’s time for a follow-up to share what we created so far: the Wehkamp ML platform.

The following diagram should give a fair overview of how we move from training to production. We specifically chose to keep workflows and tooling close to how regular software (like micro-services) is created and shipped, as the ML domain should really not be something special at all, which ultimately makes it a lot easier to adopt across the whole organisation :)

Wehkamp ML platform, general overview

At Wehkamp we’re heavy Databricks users. It’s our tool of choice when we need to work with vast amounts of data, so the creation of models usually begins in a Databricks Jupyter-like notebook. During the training phase we track how different iterations of a model behave using MLflow. Once we’re satisfied with the results, the code yields a model that we want to push out to the world. As with regular development workflows, we commit our code to GitHub and store the sha1sum of the resulting model as its version in the project’s MLproject file. This file is part of the MLflow specification and keeps track of all metadata.
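To make that versioning step concrete, here is a minimal sketch of what the training side roughly looks like: log parameters and metrics to MLflow, pickle the trained model and take its sha1sum, which then becomes the version recorded in the MLproject file. The toy dataset, estimator and metric names below are illustrative, not lifted from our actual notebooks.

import hashlib
import pickle

import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data standing in for the real Databricks dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Track how this iteration of the model behaves.
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", model.score(X_test, y_test))

    # Pickle the model and derive its sha1sum; that hash is the
    # version we record in the MLproject file.
    with open("model.pkl", "wb") as f:
        pickle.dump(model, f)
    with open("model.pkl", "rb") as f:
        model_version = hashlib.sha1(f.read()).hexdigest()

    mlflow.log_artifact("model.pkl")
    print(model_version)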

The Wehkamp micro-services platform is built around the principle of self-describing services. Services are configured through metadata that is injected into the platform (specifically: Consul) alongside a deployment, generally described in the Dockerfile of a service. This gives great flexibility and is a prime example of how we always strive to enable our teams for autonomy in their pursuit of world domination. As the ML domain should be treated in a similar fashion, we also apply this metadata concept to the models. This means we extended the MLproject YAML file to also hold some of the metadata specific to our platform.

name: blaze-visual-similarity-model
blaze:
  service:
    id: visual-similarity
    team: rnd
    description: "Returns a list of (visually) similar products"
    main-language: python
    routing:
      consumer:
        exposed: False
  ml:
    model:
      version: 4f1255f50f5f32e93b4fb69704d23f466e468ebd
      pickle: true
      input_type: doubles
      slo: 100000
      default_output: "0.00"
      pkgs:
        - scikit-learn
        - pandas

With the above we can instruct the routing layer of the platform (just like we do for regular services), provide input for dashboard creation, and define several model-specific parameters such as additional packages to install. All in a familiar way, in a familiar location.
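Consuming this metadata from tooling is plain YAML parsing. The sketch below shows roughly how such a file could be read with PyYAML; the field names follow the example above, while the rest of our internal glue is left out.

import yaml

# Load the MLproject file that sits in the repository root.
with open("MLproject") as f:
    project = yaml.safe_load(f)

model_meta = project["blaze"]["ml"]["model"]

# These values drive the deployment: which pickled model to fetch,
# the input type and SLO to register, and extra packages to install.
model_version = model_meta["version"]          # e.g. 4f1255f50f5f...
input_type = model_meta["input_type"]          # e.g. "doubles"
slo_micros = model_meta["slo"]                 # e.g. 100000
default_output = model_meta["default_output"]  # e.g. "0.00"
extra_pkgs = model_meta.get("pkgs", [])        # e.g. ["scikit-learn", "pandas"]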

Now imagine you’re done crunching your dataset and ready to build the container that will return some kind of pre-computed recommendation. The output of your model is a large JSON file that you want bundled in your Clipper pkl. Below is an example of doing just that:

from clipper_admin.deployers.deployer_utils import save_python_function
import pandas as pd
import glob
import os

# Location of the pre-computed similarities, read below as a folder of JSON files.
path_sim_pro = "list_of_similar_products.json"

def read_json_folder(path_folder):
    # glob already returns the joined paths, so read them directly.
    json_files = glob.glob(os.path.join(path_folder, "*.json"))
    df_subs = [pd.read_json(f, lines=True) for f in json_files]
    df_sim = pd.concat(df_subs, ignore_index=True)
    df_sim.set_index('productNumber', inplace=True)
    dict_sim = df_sim.to_dict('index')
    return dict_sim

def get_similar_list(prod_numbers):
    # The prediction function Clipper will call for each batch of queries.
    return [str(dict_pro[int(p)]['similarProducts']) for p in prod_numbers]

dict_pro = read_json_folder(path_sim_pro)

# Pickle the prediction function (and the dictionary it closes over) for Clipper.
save_python_function(None, get_similar_list)

The model that got built and is to be packaged up into a container is nothing more than a version field in the metadata, closely resembling something as simple as a library version.

So at this point we have a GitHub repository that contains some code and a pickled model sitting in S3. The model version to use is set in the MLproject file that contains all metadata describing the model.
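At build time the tooling only needs to pull that exact pickle down from S3 and check it against the pinned version. A minimal sketch of that step using boto3; the bucket name and key layout here are hypothetical, the real locations are internal to our platform.

import hashlib

import boto3

# Hypothetical bucket and key layout for stored model pickles.
MODEL_BUCKET = "wehkamp-ml-models"
MODEL_VERSION = "4f1255f50f5f32e93b4fb69704d23f466e468ebd"
MODEL_KEY = "blaze-visual-similarity-model/{}.pkl".format(MODEL_VERSION)

s3 = boto3.client("s3")
s3.download_file(MODEL_BUCKET, MODEL_KEY, "model.pkl")

# Verify that the downloaded pickle matches the version pinned in MLproject.
with open("model.pkl", "rb") as f:
    assert hashlib.sha1(f.read()).hexdigest() == MODEL_VERSION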

If this were a regular micro-service, Jenkins would respond to a commit to this repository and start building a Docker container. We have our own glue, called blaze-cli, to make building a container happen. What we did next was introduce a driver to this tooling layer that can take care of projects featuring an MLproject file, as these are machine learning projects. Since our tooling is mostly built around Python, implementing Clipper (also Python) was pretty straightforward. And doing it this way means we don’t need additional tooling apart from just the Clipper specifics. Developers working with this setup don’t need to change their workflows. As per usual, they’ll just do an ordinary blaze build to build their containers, or have Jenkins do that on their behalf. Clipper will create a container and ship it to Docker Hub, just like with any ordinary micro-service.

Once building is done, it’s time to ship. This too is abstracted away and is done by issuing blaze deploy. Here too our glue acts as it would for any software project, meaning it will deploy the container to the orchestration engine associated with the driver. Regular services go to Mesos, but this ML model will be deployed using Clipper (which is backed by EKS on Amazon). Static metadata is stored in containers as a Dockerfile LABEL, so it can easily be queried and extracted with our tooling, after which it is stored in Consul. Since we go through Clipper when building containers, we can’t do the LABEL trick easily without some additional patches, so for now we’re fine with having the metadata nicely decoupled in just the MLproject YAML file.
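Under the hood, a deploy like the one logged below boils down to a handful of clipper_admin calls: register the application, deploy the model container, and link the two. The sketch below is illustrative rather than our actual driver code; the connection setup, the toy prediction function and the hard-coded names are assumptions, and in practice these values come from the MLproject metadata.

from clipper_admin import ClipperConnection, KubernetesContainerManager
from clipper_admin.deployers import python as python_deployer

def predict(inputs):
    # Toy prediction function; in our setup this is the pickled closure
    # produced by save_python_function() in the earlier example.
    return [str(sum(x)) for x in inputs]

# Connect to the Clipper cluster running on Kubernetes (EKS in our case).
clipper_conn = ClipperConnection(KubernetesContainerManager())
clipper_conn.connect()

# Register the application if it does not exist yet, using the SLO and
# default output from the MLproject metadata.
clipper_conn.register_application(
    name="blaze-clipper-model",
    input_type="doubles",
    default_output="0.00",
    slo_micros=100000,
)

# Build and deploy the prediction function as a model container.
python_deployer.deploy_python_closure(
    clipper_conn,
    name="blaze-clipper-model",
    version="56-1621f4d",
    input_type="doubles",
    func=predict,
    pkgs_to_install=["scikit-learn", "pandas"],
    num_replicas=2,
)

# Route queries for the application to this model version.
clipper_conn.link_model_to_app(
    app_name="blaze-clipper-model",
    model_name="blaze-clipper-model",
)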

Our Jenkins configuration is dynamically generated based on GitHub repository names and labels (or, as GitHub calls them, topics). With ML repositories there is no difference. So when the repository name contains a specific suffix (like -model), Jenkins will automatically create the appropriate build job that will use the Clipper driver. The result is a clean way of deploying, just like with regular services.

Deploying blaze-clipper-model:56-1621f4d
Application blaze-clipper-model is not registered with Clipper
Deploying new model, creating application blaze-clipper-model
Application blaze-clipper-model was successfully registered
Model blaze-clipper-model:56-1621f4d is not registered with Clipper.
Checking local EKS proxy...
Deploying 2 instances of blaze-clipper-model:56-1621f4d to EKS
Successfully registered model blaze-clipper-model:56-1621f4d
Done deploying model blaze-clipper-model:56-1621f4d.
Linking model blaze-clipper-model:56-1621f4d to app...
Model blaze-clipper-model is now linked to application blaze-clipper-model

And this is what you would get when using blaze-cli to list all models:

MODEL                          VERSION     STATE  INSTANCES  INPUT
blaze-visual-similarity-model  34-b325e65  CA     2          integers

To quickly integrate the model containers with the rest of our stack, we created a simple gateway that basically just connects Mesos with EKS. This gateway is built around OpenResty, which is our main solution when it comes to creating gateways: it gives us the power of nginx for pushing traffic, combined with the scriptability of Lua for adding our own customizations. The result is that services can now query any model using the regular inter-service communication patterns our stack already provides.

import json
import requests

x = requests.post(
    "http://ml-gateway.blaze/blaze-clipper-model/predict",
    data=json.dumps({"input": [1.1, 5.1]})
).json()
print(x)
# {u'default': False, u'output': 6.199999999999999, u'query_id': 42}

We realize our ML platform is still in its infancy, but that’s just normal. Instead of wasting time on something completely bullet-proof and made of gold, we went the usual Wehkamp way: build an MVP and extend it based on what is actually needed and/or required.

As an example of this, the metrics provided by Clipper’s own Prometheus aren’t connected with our main Prometheus instance just yet. That is something we would likely want to change (soon!), as it means we currently don’t have the same automation capabilities available when it comes to dashboarding and alerting around model metrics. A possible solution could be configuring Prometheus federation to connect both instances. Furthermore, having yet another gateway is fine for now, but given that Clipper’s query-frontend is already kind of a gateway, we may want to figure out something else there too. Another thing on our list is making more use of the standardized MLproject metadata, fully connecting the training phase with deployments, and even looking at using MLflow to also serve the models instead of Clipper :) And what about experimenting with different model versions?

In the end, developers, data scientists and data engineers who need machine learning models for their services can now get them into production in a simple way. Everything they need to do or implement relies on the same patterns and tools they already came to know and love for building their Java, Scala, .NET or NodeJS services 🎉
