Analysis of Billions of Videos: Reproducible ML Experiments and Automated Deployments of ML Models at Scale

Jellysmack Labs
Nov 29, 2022

Émeric Dynomant & Kenza Boussaoud

Quick and dirty, regrets incoming

Jellysmack’s goal has been, from the very beginning, to support growing video creators. The need for machine learning (ML) models arose very early in the company, as it was built on strong IT engineering foundations closely linked to video intelligence (VI) and natural language processing (NLP) challenges. For example, transcribing a video, embedding a YouTube channel in a multi-dimensional space, or automatically blurring parts of naked bodies were some of the first projects Jellysmack had to work on. Nowadays, many different ideas are under development, and already-trained models run inference daily (for batch predictions) or on-the-fly (for real-time ones).

As a quick start, Jellysmack decided to orchestrate machine learning training and inferences with Apache Airflow.

Even though this scheduling tool is geared more toward Extract-Transform-Load (ETL) operations organized as Directed Acyclic Graphs (DAGs), it allowed us to rapidly orchestrate data flows and model training through custom Docker images running in a Kubernetes cluster (Figure 1).

Figure 1: Old ML model management system based on a managed Apache Airflow solution. Training DAGs produce models that are loaded by inference DAGs to predict on-the-fly on posted data or on a schedule against databases.

Pilot, an internal Python framework, allows us to create a Python job with different modes. In training mode, the Docker image resulting from the job runs on an EKS Kubernetes cluster to train a model, then stores it in binary form in an S3 bucket. In inference mode, for real-time predictions, the Airflow pipeline is triggered through the Airflow API, and the target job loads the model from S3, infers on the posted data, then sends the result to an API. For batch data processing, the Airflow inference pipeline is scheduled every X hours to follow the same workflow: load the model and predict, but write the result to a database (DB) instead of posting it to a REST API.
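To give an idea of the pattern (this is a rough sketch, not Pilot’s actual interface, which is internal; the bucket and key names are hypothetical), such a dual-mode job could look like this:

import argparse
import pickle

import boto3
from sklearn.linear_model import LogisticRegression

BUCKET = "jellysmack-ml-models"  # hypothetical bucket name
MODEL_KEY = "demo/model.pkl"     # hypothetical object key
s3 = boto3.client("s3")


def train(features, labels):
    # Training mode: fit the model, then store its binary form on S3.
    model = LogisticRegression().fit(features, labels)
    s3.put_object(Bucket=BUCKET, Key=MODEL_KEY, Body=pickle.dumps(model))


def infer(features):
    # Inference mode: load the binary back from S3 and predict on the sent data.
    body = s3.get_object(Bucket=BUCKET, Key=MODEL_KEY)["Body"].read()
    return pickle.loads(body).predict(features).tolist()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=["training", "inference"], required=True)
    args = parser.parse_args()
    if args.mode == "training":
        train([[0.0], [1.0]], [0, 1])  # dummy data, for the sketch only
    else:
        print(infer([[0.5]]))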

This training/prediction flow is the result of quick-and-dirty solutions. It allowed us to quickly start working with ML models with relative scalability (thanks to the managed version of Airflow on AWS and a Kubernetes cluster). But adding more and more layers to this workflow also led to chaotic solutions, such as the flow that triggers our Airflow prediction pipeline from a frontend interface (Figure 2).

Figure 2: Simplified scheme describing the usage of a trained model to infer on data coming from a graphical user interface (GUI). Once the end-user changes a value, a message is posted on a message broker queue (Simple Queue Service, SQS); that event fires a Lambda function that calls the corresponding DAG on Airflow. This DAG then updates the service database used to fill the user interface.

While this workflow remains correct regarding business logic and the newer principles of cloud-based architecture (serverless, event-based, etc.), it needs some modifications to improve security, cost, scalability, and resilience. This particular workflow takes almost one minute to run, while loading the model and computing the prediction only takes about 7 seconds; the rest is due to the cold start of the different AWS components. As such, almost 90% of the time elapsed between the moment the user changes the score and the newly computed value reaching the front interface is lost in AWS component calls, HTTP requests, and the time needed to provision the machines running the serverless components.

ML platforms, a mess of promises

To start upgrading our flow, we began by benchmarking some ML platform-as-a-service (PaaS) solutions to identify the one that meets our needs. But the first real question was: “How are all of these off-the-shelf ML solutions different?”. They all show excellent examples in tutorials or published articles, yet those are built on simple, synthetic datasets. Most of them also rely on lower-level components from big tech companies (IaaS, i.e., cloud providers) and offer little more than a wrapper around those components, letting users set up and parameterize what gets launched.

We also considered our ML needs (automatic hyperparameter tuning, continuous training, and federated learning). For most of the existing solutions, at least one of our future needs was missing. At this point, we knew that many custom developments would be needed in the future, and integrating custom and off-the-shelf solutions is not easy. We decided to remain in our current cloud provider’s ecosystem and started digging into AWS SageMaker.

SageMaker is presented as an easy-to-set-up platform. Still, we quickly figured out that although that is true for built-in algorithms, anything customized is slightly more challenging to set up. However, it contains built-in features for experimentation, model training and delivery, hyperparameter tuning, and many other interesting functionalities we were excited to try.

Experimentation versus industrialization

From our point of view, an ML model goes through two distinct phases: experimentation and deployment.

  • Experimentation is the day-to-day work of Applied Scientists (AS, formerly called data scientists at Jellysmack). They start from a business need and try to figure out, by iterating, how to answer that question by designing an ML model that learns rules from a defined data structure.
  • Once the experiment is finished, two things should go into production: a data flow defining an ETL (to get fresh data every day, shaped the way the model needs it as input) and the Python jobs to train and use the model.

We chose to create a clear separation between those two phases. We located the experimentation and the production parts in two different Git repositories. Instead of throwing the model resulting from the experiment into production, the idea is to create:

  • An ETL pipeline on Airflow that gets the data needed to train the model.
  • A training pipeline on SageMaker that trains the model.

We’ll now explore those two phases of an ML model at Jellysmack.

Experiments

As previously defined, experiments cover the day-to-day work of our applied scientist teams. The idea is to iterate in a Jupyter notebook to answer business needs with ML solutions. The input data goes through a lot of processing until a clean and meaningful structure is found and relevant features have been selected; many different models can be tested with varying parameters, and so on. The operations needed on the data for feature engineering also have to be defined.

During these experiments, there is a high risk of knowledge loss. Indeed, the higher the number of iterations during an experiment, the higher the risk of getting stuck in a “dataframe_1”, “dataframe_2” loop and other meaningless file and variable names. The same applies to model binaries, which can be lost (if not saved between iterations). At the end of an experiment, nothing links a given model to its set of associated parameters. Hence, it is very complicated to re-launch an experiment in the same environment a few months after it ended. Indeed, when the applied scientists were working on an on-premise machine, the hard drive often had to be cleaned to save disk space and avoid polluting the operating system, since virtual environments were not used. This leaves us with a list of reproducibility issues to tackle.

Another issue with this setup was the cost. Some of our GPU machines were located in our office in Paris, and this hardware requires someone to secure, update, and maintain it. Debugging issues demanded a lot of time and effort, as some projects were critical and AS could not be left without a development environment for long. For some specific needs (RAM- or CPU-intensive projects), we also started a few EC2 instances on AWS. Those instances were sometimes never turned off (or paused) for months, and the bill was high for machines dedicated only to experiments and barely used a few hours a day.

With all of this in mind, we worked on providing an environment that allows for the reproducibility of experiments, with automated deployment through Infrastructure as Code (IaC). AWS SageMaker offers a large number of features for working on ML models. For the experiment phase, we focused on SageMaker Studio, a JupyterLab instance integrated into the AWS ecosystem. It provides a way to change the hardware specs of the EC2 machine backing the runtime, allowing one to request more RAM directly from the notebooks. It also provides a persistent home folder for all users, making it easier to save files during the experiment. In addition, we defined what had to be versioned to avoid knowledge loss between trials of an experiment and to be able to easily relaunch a project, even months after it ended (Figure 3).

Figure 3: The different components of an experiment that are versioned to improve the reproducibility of this work at Jellysmack. The S3 buckets allow us to keep track of the various objects created (dataframes, JSON, binaries, etc.). The ECR fully versions the execution environments (dependencies, environment variables, etc.). Finally, the Git repositories are used to keep track of the produced code (notebooks).

Code versioning

For projects involving code, a Git repository is the usual answer when discussing versioning. AWS SageMaker provides an out-of-the-box way to link a Git repository with a SageMaker Studio environment, along with a graphical interface.

We now handle experiments as Git repositories, providing a way to version the code and keep track of modifications. We use the same logic as in software development: feature branches have to be merged into the develop branch, while pulling a release branch from develop creates a new tag and allows a merge into master.

Environment versioning

The environment of an experiment can be seen as a mix of:

  • Environment variables and their values at a given time.
  • Versions of all used executables (e.g., the Python version).
  • Installed dependencies and their versions (e.g., the pandas version).

Many pre-built Python environments are provided in SageMaker (TensorFlow, PyTorch, etc.), allowing the user to start working immediately. However, Jellysmack’s needs are more specific: we wanted some of our internal libraries to be available in the notebook environment at startup, rather than having each applied scientist install them.

In Jupyter, the Python runtime that executes the code is called a kernel. It contains everything used to run “python -m …” commands. In SageMaker Studio, AWS allows you to bring your own kernel as a Docker image. This was a great way for us to version the environment at the different steps of the experiment: whenever a Python dependency is added or removed from within the Jupyter notebook, those changes are pushed to our GitLab, a GitLab CI pipeline is fired, and the CICD builds a Docker image containing those modifications and pushes it to the ECR (our Docker container registry on AWS).

To help reduce the build time of those containers, we first created a base image. It contains everything shared by all experiments: basic ML libraries (pandas, sklearn, numpy, matplotlib, etc.), all Jellysmack internal libraries (CICD steps for the experiment repositories, object versioning, SageMaker utilities, etc.), and shared environment variables (AWS account IDs, AWS region, etc.). Of course, when scientists want to upgrade (or downgrade) any requirement already contained in this base image, they can easily manage their dependencies with the Python package manager we use (Poetry) directly from SageMaker Studio.

From then on, for every experiment repository created, the CICD builds a Docker image, layered on the base one, that adds the specific environment variables and installs the additional dependencies (compared to the base image) into the kernel.

Object versioning

Object (data, models, etc.) versioning was much easier to handle. As we are working on AWS, the Simple Storage Service (S3) already includes a built-in object versioning capability. The different versions of a given file (regardless of its format) are listed in the interface, and you can easily add metadata during the object upload.

The idea of developing an internal library tasked with loading and unloading metadata-enriched objects from S3 naturally emerged. A search feature allows you to look in the bucket for a specific file matching a given metadata search pattern.

# Send data to the S3 with metadata
response = saver.put(
    filename="data_youtube.csv", dataframe=dataframe, metadata={"origin": "youtube"}
)
# Delete the data frame
del dataframe
# Get it back by using the metadata
dataframe = saver.get(filename="data_youtube.csv", search_pattern={"origin": "youtube"})

A typical use case is when applied scientists create a Pandas dataframe and modify it numerous times; it is really easy to lose track of these modifications. Now, they only have to write a single line of code, as shown above, to send the data to a versioned S3 bucket, so Jellysmack keeps track of how this dataframe was built.
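Under the hood, a helper like this can be built on plain boto3 calls against a versioned bucket. Here is a minimal sketch of the idea (the bucket name is hypothetical, and this is not Jellysmack’s actual implementation):

import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "jellysmack-experiments"  # hypothetical bucket with S3 versioning enabled

# Upload the dataframe with user-defined metadata; S3 keeps every version.
buffer = io.BytesIO()
dataframe.to_csv(buffer, index=False)
s3.put_object(
    Bucket=BUCKET,
    Key="data_youtube.csv",
    Body=buffer.getvalue(),
    Metadata={"origin": "youtube"},
)

# List every stored version of the object, then fetch the latest one back
# and check its metadata before rebuilding the dataframe.
versions = s3.list_object_versions(Bucket=BUCKET, Prefix="data_youtube.csv")
response = s3.get_object(Bucket=BUCKET, Key="data_youtube.csv")
assert response["Metadata"]["origin"] == "youtube"
dataframe = pd.read_csv(io.BytesIO(response["Body"].read()))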

We are aware of the limitations of such a search system based on exact-word pattern matching. To overcome them, we are considering inserting the user-entered metadata into a string-similarity service (such as the AWS-managed version of ElasticSearch). A user’s search would then be handled by the Elastic engine, matching the request against all of the objects’ metadata and allowing both exact and extended searches with actual search engine capabilities.

Experiment tracking

An ML experiment is an iterative process that can result in multiple training runs.

The SageMaker Experiments feature allows us to track these experiments by automatically saving the inputs, parameters, and results.

import boto3
from smexperiments.experiment import Experiment  # sagemaker-experiments package

sagemaker_client = boto3.client("sagemaker")
experiment = Experiment.create(
    experiment_name=experiment_name,
    description="This experiment is the shiniest ever, and will lose nothing due to versioning.",
    sagemaker_boto_client=sagemaker_client,
)

Each experiment is a set of trials that can be accessed directly on SageMaker Studio to be tracked and to save metadata or metrics from it.

from time import time

from smexperiments.trial import Trial

for k in [2, 5]:
    trial = Trial.create(
        trial_name=f"sklearn-k-{k}-{int(time())}",
        experiment_name=experiment.experiment_name,
        sagemaker_boto_client=sagemaker_client,
    )

For example, the screenshot below shows the interface SageMaker Studio offers. Trials from a given experiment can be compared together in a table, or graphs can be drawn automatically. This plot shows the standard deviation over two features of the Iris dataset (the blue, orange, and red lines are the deviation over sepal length, the others over sepal width) for six different runs (Figure 4).

Figure 4: Screenshot of the SageMaker GUI plotting the length and width of some flower sepals (Iris dataset). This dummy plot is simply intended to show the built-in capacities of SageMaker to create graphs from defined trials.

Industrialization

Now that the applied scientist has found the data structure, engineered the interesting features, and settled on the model parameters, a trained model has most likely already been produced during the experiments. We could have chosen to simply send this version into production. Instead, to make the training more reliable, reproducible, and free of human intervention (if trained from a Jupyter notebook, a human action will always be required to train a new version), we chose to move from this notebook to a proper pipeline in SageMaker.

We wanted the training to be fully scheduled and to run autonomously on new data from time to time. One of our requirements was that the deployment of the service that will use this trained model (a REST API, a Lambda function, or a binary dropped on S3 to be used by another SaaS) be fully automatic. After training, we only wanted human validation of the newly trained model version when no automated tests were available. Once it is considered valid (by human approval or by passing the automatic quality tests), the model and its service should be deployed directly.

Old model training and deployments

Before the Machine Learning Engineer (MLE) position was introduced at Jellysmack, for convenience, only Data Engineers (DE) were responsible for deploying training jobs and models in production. As a result, those were managed by Airflow, as it has been the primary job orchestration tool since Jellysmack’s creation. The issue with this tool is that it was not developed for data science tasks: it has no facilities to compare trainings, output metrics, or automatically deploy models as endpoints on REST APIs. The idea was to create an Airflow DAG, with one of its nodes being a Kubernetes operator launching the data science job on an EKS cluster, using the Docker image built by the job’s CICD.

Using an internal framework to organize tasks into pipelines added a lot of complexity. It was a nice addition at first, but it also hid complexity, as it acted like a DAG within a DAG: the framework organizes tasks into a DAG inside a Docker container, and that container is itself a single step of another Airflow DAG.

Models were dropped as binary files on S3 by training and loaded back by a Docker container, also running as a Kubernetes pod during the execution of prediction DAGs. This inference flow was fine for scheduled forecasts but not convenient for real-time ones (due to the cloud components’ cold start, as mentioned in the introduction).

To achieve triggered predictions, a message was posted to an AWS Simple Queue Service (SQS) queue, and a Lambda function reacted to this message, triggering the DAG through the Airflow API. The pipeline then launched a pod on Kubernetes to compute the result. Starting all these AWS components takes three times longer than executing the model’s transform method.

As we needed to change the backend of one of our products to an event-based architecture, we decided to change the way ML model training and deployments were handled, following the same event-based triggering logic.

Training

Our goal was to industrialize the deployment stage (i.e., putting a model in production) in the most efficient and automated way possible. Naturally, we turned towards SageMaker Pipelines, a component that allows us to orchestrate ML workflows. A pipeline is no more than a set of steps defined as a DAG, which we build programmatically in the form of a JSON file. The idea is to create this JSON, upload it to S3, and define a Lambda function that handles creating or updating the pipeline in SageMaker (Figure 5).

Figure 5: Deployment phase as seen in Jellysmack. The result of an experiment is a model type (associated with some training parameters) allowing one to answer a business question once trained on some data. The structure of this data and the feature engineering phase are converted into an ETL pipeline on Airflow. A SageMaker pipeline (created by a Lambda function) will then train the defined model using this freshly computed dataset.

Splitting responsibilities

This new way of managing training jobs made us adopt a fresh and smooth workflow process to define the steps of the pipeline:

  • Data preparation and transfer: Data engineers remain in charge of all ETL processes related to the experiment. This allows us to benefit from their expertise and to manage costs: running expensive and time-consuming data preparation tasks is more cost-effective outside the SageMaker ecosystem. Data is stored in a dedicated database, available and ready to be used by the training scripts. To do so, we mainly use our managed AWS Airflow. This step also applies the feature engineering methods defined during the experiment.
  • The applied scientists then take over to create a training script based on the experiments they ran. The idea is to get the data from the previously filled data source, then train the ML model on it.
  • Finally, a machine learning engineer handles the orchestration and deployment of this model through SageMaker Pipelines.

Pipeline steps

It is fashionable to create Python factories that take JSON, YAML, or some other kind of structured text as input and output Python code from this declaration (so-called configuration as code). The Software Development Kit (SDK) provided by SageMaker does the opposite: it provides Python classes and objects you can manipulate to get a JSON definition of the pipeline as output.

Briefly, a pipeline runs each defined step in a Docker container called a processor. Processors can be different for each step of the pipeline, allowing for fine-grained control over dependencies: data extraction/transformation dependencies for the first step of the pipeline, model training ones for the second, etc. Those Docker images are pre-built by different CICD steps for different scenarios: does the input data come from a Postgres or a MySQL database? Is the model TensorFlow- or PyTorch-based? Each processor also gets its own security group, allowing for precise control over the permissions granted to each processor.
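To illustrate, a stripped-down pipeline script could look like the following (the image URI, role, and script path are placeholders); calling pipeline.definition() yields the JSON document that the CICD then pushes to S3:

from sagemaker.processing import ScriptProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

# One processor per step, each with its own pre-built Docker image.
training_processor = ScriptProcessor(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/training-processor:latest",
    command=["python3"],
    role="arn:aws:iam::<account>:role/SagemakerTrainingPipelinesRole",
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

train_step = ProcessingStep(
    name="TrainModel",
    processor=training_processor,
    code="steps/train.py",  # the script written by the applied scientist
)

pipeline = Pipeline(name="my-model-training", steps=[train_step])
definition_json = pipeline.definition()  # JSON string describing the DAG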

Each pipeline is hosted in a Git repository containing the code for each step, the script defining the pipeline, and a CICD that will:

  • Create the JSON definition of the pipeline.
  • Store it on an S3 bucket for versioning.
  • Create a Parameter Store for the pipeline configuration.

Pipeline configuration

To manage our pipelines’ configuration, we rely on the AWS Parameter Store component to centralize everything from job arguments to orchestration parameters. This feature brings many practical benefits, including the possibility of controlling, with ABAC (attribute-based access control), which IAM users can access and modify which Parameter Store entries.

This store is created empty by the CICD. We defined some mandatory keys that people involved in the model deployment (AS and MLE) must fill in before the first production startup:

  • Generalities like the timezone of the job or the date format.
  • The desired orchestration for the training pipeline (managed by a CRON expression or by a fixed rate, e.g., every 12 hours).
  • Data about the project itself: which squad owns the project, module, job name, etc.
  • AWS account-related data: account ID, the profile name for this pipeline, region, etc.
  • The ML framework used to create the model (used to deploy it after training; see the “Models deployments” section below) and the desired kind of deployment (API or batch).
  • And, of course, every parameter requested by the training job itself (number of epochs, batch size, learning rate, etc.).

Once all parameters have been filled by all of the engineers (DE/AS/MLE) working on the project, the JSON definition can be generated using those values, and the pipeline can be defined in SageMaker.
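As an illustration, here is the kind of configuration such a store could hold, written through boto3 (the parameter name and keys below are indicative, not Jellysmack’s exact schema):

import json

import boto3

ssm = boto3.client("ssm")

configuration = {
    "timezone": "Europe/Paris",
    "schedule": "rate(12 hours)",  # or a CRON expression
    "squad": "video-intelligence",
    "deployment_type": "api",      # "api" or "batch"
    "ml_framework": "pytorch",
    "training": {"epochs": 20, "batch_size": 64, "learning_rate": 1e-3},
}

ssm.put_parameter(
    Name="/sagemaker/pipelines/my-model/config",  # hypothetical parameter name
    Value=json.dumps(configuration),
    Type="String",
    Overwrite=True,
)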

Pipeline run

Now that all the steps have been defined and configured, the CICD creates the JSON pipeline definition and pushes it to an S3 Bucket. From there on, the JSON is used to update or create a pipeline in SageMaker.

The target S3 bucket will invoke a Lambda function that fulfills two needs:

  • Creating (or updating) the pipeline on SageMaker.
  • Updating the EventBridge rule (used to orchestrate the pipeline runs).

When triggered, the Lambda function uses the Python client for AWS (boto3) to first download the JSON definition and use it to create or update the pipeline. Next, the pipeline needs to be orchestrated. To do so, we chose to use an EventBridge rule, allowing us to schedule:

  • Either with a CRON expression, allowing us to define precise start times (first Monday of the month, odd days, etc.).
  • Or with a fixed rate: every 12 hours, every 5 minutes, etc.

To define this rule, the Lambda function uses CDKTF, a Python library, to interact with our IaC tool, Terraform. This software is used to provision, in a declarative way, all of the serverless components we use on AWS. It updates or creates the EventBridge rule whose pattern will trigger the corresponding SageMaker pipeline. Once done, the training pipeline is defined and orchestrated (Figure 6).

Figure 6: Screenshot of an example pipeline defined in SageMaker. Each step is a Docker container, containing all dependencies for a specific script to be run inside. All logs and parameters for each step are directly accessible through this interface.
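Putting this together, a minimal sketch of what such a Lambda handler could look like (bucket, key, and role names are hypothetical; the EventBridge scheduling is handled separately through CDKTF, as described above):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
sagemaker = boto3.client("sagemaker")

ROLE_ARN = "arn:aws:iam::<account>:role/SagemakerTrainingPipelinesRole"  # hypothetical


def handler(event, context):
    # The S3 notification tells us which pipeline definition was just pushed.
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]
    definition = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    pipeline_name = key.split("/")[-1].removesuffix(".json")

    try:
        sagemaker.create_pipeline(
            PipelineName=pipeline_name,
            PipelineDefinition=definition,
            RoleArn=ROLE_ARN,
        )
    except ClientError:
        # The pipeline already exists: update it with the new definition.
        sagemaker.update_pipeline(
            PipelineName=pipeline_name,
            PipelineDefinition=definition,
        )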

The training pipeline will then run periodically, each run creating a new version of the corresponding model in the S3 bucket and in the SageMaker Model Registry.

Model registry and quality gate

The end results of these pipelines are models. They will be saved on S3 for versioning and registered in the SageMaker Model Registry.

As Jellysmack leans toward custom-made models and architectures, we prefer to deploy our models ourselves (instead of using the built-in facilities to deploy them directly from the registry). It allows us to be more scalable and cost-effective by controlling our infrastructure. But one of the Model Registry features seemed interesting: the possibility to change a model’s status (draft, accepted, refused). Changing such a status generates an event on AWS, and the EventBridge component can catch this event. This allows us to trigger the deployment of a given model as soon as its status changes from draft to accepted.

Either the data quality team (already in charge of testing web interfaces and ensuring data quality at Jellysmack) or someone from the business team who asked for this model handles this quality gate. Once the model is tested and validated, it is ready for deployment, either as an API or as a batch predictor.
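In the registry itself, these statuses map to SageMaker’s model package approval statuses, so passing the quality gate boils down to a single boto3 call (the ARN below is hypothetical):

import boto3

sagemaker = boto3.client("sagemaker")

# Flip the model package from its pending status to Approved; this status
# change emits the event that triggers the deployment flow described below.
sagemaker.update_model_package(
    ModelPackageArn="arn:aws:sagemaker:<region>:<account>:model-package/my-model/2",
    ModelApprovalStatus="Approved",
)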

Models deployments

Once a model has passed this quality gate, it is ready for deployment. An EventBridge rule catches the event generated by this status change. It is a mapper that allows expressing rules like: “If the event indicates that a model changed from draft to accepted, propagate it to this Lambda function.”
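The rule’s event pattern can express exactly that mapping. A sketch of what such a pattern could look like is shown below (the detail fields should be checked against the actual SageMaker event payload):

# Candidate EventBridge pattern: react only when a model package in the
# registry switches to the Approved status.
model_approved_pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Package State Change"],
    "detail": {"ModelApprovalStatus": ["Approved"]},
}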

The deployment-management Lambda reacts to any event of this kind. It first contacts the Parameter Store associated with the model that has just been trained to get information about the desired deployment type. As seen in the “Pipeline configuration” section, this parameter is mandatory (Figure 7).

Figure 7: When the newly trained model version is accepted, a Lambda function handles the deployment. It first contacts the Parameter Store containing the desired deployment type. From this, it can get the corresponding deployment scripts from our GitLab.

One of the exciting features we thought of was a unified deployment flow, regardless of the framework used to train the model (TensorFlow, PyTorch, scikit-learn, or a custom model).

To achieve this, we created an abstract Python class defining load() and transform() methods that are re-implemented for each scenario. So even with very different types of models, we only need to call the wrapper.transform() method to get a prediction. This unified interface allows the deployment of models trained with different frameworks instead of forcing everyone at Jellysmack to use a single global ML framework.
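A simplified sketch of the idea behind such a wrapper (not the actual internal implementation) could look like this:

from abc import ABC, abstractmethod


class ModelWrapper(ABC):
    """Unified interface hiding the underlying ML framework."""

    @abstractmethod
    def load(self, model_path: str) -> None:
        """Load the trained model binary from local disk (or S3)."""

    @abstractmethod
    def transform(self, payload: list) -> list:
        """Run a prediction on the given payload."""


class SklearnWrapper(ModelWrapper):
    def load(self, model_path: str) -> None:
        import joblib

        self.model = joblib.load(model_path)

    def transform(self, payload: list) -> list:
        return self.model.predict(payload).tolist()


# The deployment code never needs to know which framework is behind:
# wrapper = SklearnWrapper(); wrapper.load("model.joblib"); wrapper.transform(data)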

We chose to support two ways of deploying models. The first is REST APIs, for easy communication through a well-known protocol, allowing any programming language to use the model. The second, to reduce network bandwidth and cost, is deploying the model as a SageMaker Batch Transformer. Those components can be scheduled and run automatically on a complete batch of data instead of being called for each new data point, as we would do with an API. For example, this allows us to process, every morning, all the data produced the day before.

Real-time predictions

As seen in the “Model registry and quality gate” section, once the training pipeline finishes, the trained model is registered in the SageMaker Model Registry with a draft status. When the status switches to accepted, an event is fired that triggers the execution of a Lambda function. From this event, the function extracts the name of the accepted model and checks for a corresponding set of parameters in the AWS Parameter Store.

One of those parameters contains the deployment type for this model (a deployment is a set of Terraform files that describe the service the model should be deployed in). Currently, we can deploy a model as a REST API on the Elastic Container Service (ECS) or as an asynchronous Lambda function associated with an SQS queue. These parameters point to the matching scripts in a dedicated GitLab repository hosting all the deployment Terraform stacks.

The framework creates a pipeline using those stacks (assuming we chose to deploy the model as a REST service on ECS). This pipeline first builds the Docker image containing (Figure 8):

  1. an API designed with the FastAPI Python framework;
  2. the model object loaded using our model wrapper.
Figure 8: Simplified scheme of deploying a trained model as a REST service on AWS ECS. When the second version of a given model is trained, a Docker image containing the REST API and the trained model layers is built (the service). The model wrapper re-implements each model’s transform() method, so we are agnostic about the Python ML framework (TensorFlow, PyTorch, etc.) used to train the model. The API calls this wrapper when a request arrives through the load balancer, which has a new route for this specific model version.
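Inside that image, the API itself can stay very thin. A minimal sketch with FastAPI, assuming the hypothetical wrapper shown earlier (the route and model path are illustrative):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the trained model once at startup through the framework-agnostic wrapper.
wrapper = SklearnWrapper()             # hypothetical wrapper from the sketch above
wrapper.load("/opt/ml/model.joblib")   # model file baked into the Docker image


class Payload(BaseModel):
    features: list


@app.post("/predictions/my-model/2/predict")  # route exposed behind the load balancer
def predict(payload: Payload):
    return {"prediction": wrapper.transform(payload.features)}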

Once the Docker image of the API is created, Terraform defines a module that represents this component: a service on ECS running behind a load balancer. ECS is an AWS component used to run highly scalable services such as Docker containers. The load balancer allows us to add a route for each deployed model, constructed as follows:

https://<BASE_JSK_URL>/predictions/<MODEL_NAME>/<MODEL_VERSION>/predict

By posting data to this endpoint, the API uses the specified version of the deployed model to compute and return the output. Thanks to this flow, we can have several models, and several versions of each model, in production and interact with any of them through this load balancer.
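Calling such an endpoint from any service then looks like a plain HTTP request (the URL and payload below are illustrative):

import requests

response = requests.post(
    "https://<BASE_JSK_URL>/predictions/my-model/2/predict",
    json={"features": [[0.1, 0.5, 0.4]]},
    timeout=10,
)
print(response.json())  # e.g. {"prediction": [...]}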

Batch predictions

Batch predictors are scheduled processes meant to run every day to process the newly acquired data points (basically, our acquisition can run every day at midnight, and that data should be processed before 7 AM).

To do so, SageMaker provides a simple way to deploy components called batch transforms. Our models are packaged in a specific Docker image, and AWS handles the entire process of scheduling the service from this image. It allows us to quickly process everything stored in the data lake by the acquisition processes and fill our service database with new data daily. This deployment is triggered through the SageMaker SDK (Figure 9).

Figure 9: Screenshot of the SageMaker GUI during a model endpoint creation. We can create a batch predictor directly through the SageMaker SDK from a notebook instance.
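A minimal sketch of launching such a batch transform through the SageMaker Python SDK (the image, model artifact, role, and S3 paths are hypothetical):

from sagemaker.model import Model

model = Model(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/batch-predictor:latest",
    model_data="s3://jellysmack-ml-models/my-model/2/model.tar.gz",
    role="arn:aws:iam::<account>:role/SagemakerDeploymentPipelinesRole",
)

transformer = model.transformer(
    instance_count=2,
    instance_type="ml.m5.xlarge",
    output_path="s3://jellysmack-ml-predictions/my-model/2/",
)

# Process yesterday's batch of data in one go instead of point-by-point calls.
transformer.transform(
    data="s3://jellysmack-data-lake/acquisition/2022-11-28/",
    content_type="text/csv",
    split_type="Line",
)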

While this feature was sufficient for us to start working quickly, we are considering using our managed Airflow setup for these batch tasks. In our view, using an ML model to apply a transform method to a complete batch of data is nothing more than a fancy ETL step. As Jellysmack already has a highly skilled data engineering crew and all the tooling already set up to work with Airflow, using it would not require additional training. The idea would be to start many Kubernetes pods in parallel, each of them loading the model from its binary form stored on S3 and processing a chunk of the total amount of data.

It would then be scalable (since we can start as many pods as we want), easily scheduled (Airflow is exactly that, a task scheduler), and cost-effective. Indeed, data preparation can sometimes take a long time, and running it in pods on our EKS cluster would be significantly cheaper than on expensive EC2 machines through SageMaker.

Infrastructure as code and IAM

To keep things clean, we chose to deploy everything with Terraform. This IaC solution offers a declarative way to provision infrastructure components. We roughly divided our Terraform code as follows:

  • A Terraform stack handling everything global to the entire SageMaker platform (some ECRs, Lambda functions, the load balancer of the deployed model APIs, etc.).
  • A Terraform module that can be instantiated into a stack specific to a Jellysmack squad. It allows for the definition of the experiments of a given project and handles the creation of the associated Git repository (code versioning), ECR (environment), and S3 bucket (data).

Managing the cloud components with such a tool brought us reproducibility and knowledge of the actual state of the infrastructure just by looking at the Git repository hosting those infrastructure declarations.

IAM definition

AWS roles had to be created to allow AWS components to interact with the infrastructure. They define the operations permitted to the component (or human IAM user) that assumes the role. For example, if a Lambda function inherits a role allowed to perform ecr:CreateRepository, it will be able to create a new repository on the ECR. We call execution roles those that are defined primarily for AWS components and cannot be used by human users. For example, each pipeline step can have different grants on AWS, start or delete various components, etc.

We decided to split those execution roles into three different ones:

  • SagemakerExperimentsRole
  • SagemakerTrainingPipelinesRole
  • SagemakerDeploymentPipelinesRole

The first authorizes SageMaker Studio to perform operations such as putting files on S3 buckets (to version Python objects from Jupyter) or launching an EC2 machine (to start a training session during the experiment). All of the training pipelines use the second one: it can store trained models on S3 or in the Model Registry, start SageMaker pipelines, and push images to the ECR. The last one allows the deployment pipelines to create the IaC stacks that need to be deployed to build, for example, APIs (services on ECS with routes on a load balancer) to get predictions from.

User groups have also been designed to authorize human users, and not only components, to interact with AWS. First, a group of SageMaker users has been defined: everyone who belongs to this group can list the defined experiments, see training and deployment pipelines, and see what is in the Model Registry. A second group, for SageMaker developers, has also been created. It provides the same grants as the users group but adds some rights, like reading values from the Parameter Store, that are needed to work properly on the platform.

Take home, and what’s next?

This two-stage process (experimentation, then deployment) allows a smooth flow for the ML teams at Jellysmack. The squads of AS are now autonomous and can quickly create everything needed to start a reproducible experiment.

Once the results of an experiment are satisfying, putting it into production consists of writing simple Python scripts (one for each desired training pipeline step) and configuring everything through a JSON in the AWS Parameter Store. The designed event-based flow then creates and schedules the training pipeline.

Once a new version is trained and validated, the JSON configuration also provides the desired deployment for this model. A Lambda function handles the Terraform files defining what the expected service should look like.

This fully automatic flow allows our scientists to stay focused on their experiments by abstracting away the management of their experiments’ reproducibility and the entire flow of training and deploying the models.

In the future, we aim to explore more deployment possibilities and to automate what we can in the platform: building a collection of custom processors to be used in pipelines as needed, creating a continuous training loop, exploring Bayesian hyperparameter optimization, or using an automatic model evaluation system.
