MLOps leveraging AWS SageMaker, Terraform and GitLab

Ruter provides public transportation in the Oslo and Viken regions of Norway. It operates around half a million departures daily, covering about 300,000 km and transporting roughly a million passengers per day. Ruter follows the “Transport-as-a-Service” (TaaS) model to enable plug-and-play IT systems for public transportation (ITxPT). This model allows Ruter to collect and own raw data from IoT sensors on board its buses, trams and metros. These sensors continuously report different kinds of data, including the number of boarding and alighting passengers, inside and outside temperatures, and the speed and position of vehicles in real time. All this streaming data from Ruter’s entire fleet is ingested, processed and stored in real time by backend systems developed in-house.

Layout of IoT sensors in one of Ruter’s electric buses

This vast amount of collected data tells only half the story, i.e., what has happened in the past. To achieve our goal of providing personalised and contextualised services and a smooth travel experience, we must also be able to tell our customers what is likely to happen in the future. We leverage big data, machine learning and AI to forecast and predict these future events.

Ruter uses AI in multiple areas, for example, to predict capacity, passenger counts and travel patterns. AI is also used to handle customer inquiries and to optimise traffic routes for our on-demand service. To maximise the gain from using AI, we must be able to move beyond PowerPoint presentations and feasibility reports and turn ideas into value-generating applications quickly.

AWS — Our Application and Machine Learning Platform

We use AWS as our public cloud provider and application and machine learning platform. We leverage Kafka and EKS (Amazon Elastic Kubernetes Service) for our real-time systems and S3, Glue, Athena and SageMaker for our historical, analytical, and AI use cases. We are going to dive into the latter.

According to VentureBeat, around 87% of machine learning models never make it to production. Many of those that do make it do not stay in production for long, because they cannot adapt to the changing environment fast enough. One of the reasons for this limitation is the lack of a robust machine learning (ML) platform. At Ruter, we have invested in a modern, cloud-based (AWS) machine learning platform which enables us to deliver our ML models with speed and quality. It is built around MLOps principles and enables us to produce robust, scalable, reproducible and monitored ML pipelines.

In this article, we are going to discuss what our typical machine learning pipeline looks like.

Machine Learning Pipeline using AWS Step Functions

The machine learning team in Ruter is responsible for the end-to-end lifecycle of its machine learning models, all the way from data ingestion to model deployment and inference, and for making these insights and predictions available to other teams via Kafka. We follow the “we build it — we operate it” mindset. Since we are the ones who develop these machine learning models, we know how they should be monitored and fixed, and how their performance should be measured. Hence, it is best that we operate and manage the machine learning pipelines ourselves instead of handing them over to an external IT team for deployment. This way we avoid adding superfluous dependencies.

As mentioned earlier, we use AWS as our machine learning platform. We use additional tools to implement our ML pipelines: GitLab, Terraform, Python and GreatExpectations, to name a few. We use GitLab for version control, development and CI/CD, Terraform for infrastructure-as-code, Python to define our pipelines and GreatExpectations for data quality monitoring.
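To illustrate the data quality part, below is a minimal sketch of the kind of check GreatExpectations makes possible, using its classic Pandas dataset API. The column names, values and bounds are made up for the example and are not our real expectation suite.

```python
# A minimal GreatExpectations sketch using the classic Pandas dataset API.
# Column names and bounds are illustrative only.
import great_expectations as ge
import pandas as pd

df = pd.DataFrame(
    {
        "boarding_passengers": [12, 3, 0, 47],
        "temperature_outside": [4.5, 5.1, 3.9, 6.0],
    }
)
dataset = ge.from_pandas(df)

# Passenger counts must be present and within a plausible range.
dataset.expect_column_values_to_not_be_null("boarding_passengers")
dataset.expect_column_values_to_be_between("boarding_passengers", min_value=0, max_value=500)

# Validate the whole expectation suite and check the overall result.
results = dataset.validate()
print(results.success)
```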

We started with our first machine learning pipeline in the first quarter of 2021. The pipeline would spin up AWS EC2 instances (AWS virtual machines) for model training, batch predictions and publishing these predictions on Kafka. As you might have noticed, this pipeline was the bare minimum. It was not reproducible, scalable, robust or monitored. We needed something better to increase speed, add robustness and enhance the developer experience. After going back to the drawing board and going through a few iterations, we came up with the pipeline shown below.

Machine learning pipeline using AWS Step Functions

The pipeline consisted of the following parts.

EventBridge: Amazon EventBridge is a serverless event bus that makes it easier to build event-driven applications at scale using events generated from applications. We used it to trigger our ML pipelines at specific times. At the scheduled time, it invokes a Lambda function.

AWS Lambda: AWS Lambda is a serverless, event-driven compute service, which we used to trigger AWS Step Functions with custom input parameters. A minimal sketch of such a trigger follows below.
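The sketch below shows what such a Lambda handler can look like; the state machine ARN environment variable and the input parameters are illustrative assumptions, not our actual setup.

```python
# Sketch: a Lambda handler that starts a Step Functions execution with custom input.
# STATE_MACHINE_ARN and the input parameters are hypothetical.
import json
import os

import boto3

sfn = boto3.client("stepfunctions")


def handler(event, context):
    """Invoked by an EventBridge schedule; starts the ML pipeline state machine."""
    pipeline_input = {
        "trigger_time": event.get("time", ""),               # set by the scheduled event
        "environment": os.environ.get("ENVIRONMENT", "dev"),
    }
    response = sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],
        input=json.dumps(pipeline_input),
    )
    return {"executionArn": response["executionArn"]}
```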

AWS Step Functions: AWS Step Functions is a low-code workflow service which we used to orchestrate steps in our machine learning pipelines.

SageMaker Processing and Training Jobs: We used SageMaker processing jobs for various data processing tasks, for example, loading data from S3 and our data warehouse, data cleaning and feature extraction, data quality control, batch inference and publishing results on Kafka. We used SageMaker training jobs for model training.

S3: We used S3 for artefact storage. There we stored snapshots of raw input data, extracted features, data quality results and reports, and models.

Migration to AWS SageMaker Pipelines

While AWS Step Functions is a good tool for orchestrating general workflows and state machines, we found it less than ideal for machine learning pipelines because:

  1. AWS Step Functions are implemented using the Amazon States Language, a JSON-based structured language used to define a state machine. This meant we had to add one more language to our toolset to implement and maintain pipelines using AWS Step Functions. And while the Amazon States Language lowers complexity for its users, it also reduces flexibility and control.
  2. AWS Step Functions’ handling of pipeline failures is suboptimal. If a step in the pipeline fails, you cannot just rerun the failed step and pick up where you left off; you have to restart the whole pipeline.
  3. AWS Step Functions does not provide machine learning-specific features, such as experiment tracking, a model registry and endpoints, out of the box.

We migrated our pipelines to SageMaker Pipelines, which provides the missing features and addresses the limitations of Step Functions:

  1. You can define an entire pipeline in Python, which feels like home for data scientists and ML engineers (see the sketch after this list).
  2. If a step fails, you can retry the pipeline, and it will resume from where it left off on failure.
  3. SageMaker Pipelines provides features like experiment tracking, a model registry and endpoints out of the box.
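To make the first point concrete, here is a minimal, hypothetical two-step pipeline (feature extraction followed by training) defined with the SageMaker Python SDK. The role ARN, bucket, script, image and instance types are placeholders, not our production configuration.

```python
# Sketch of a two-step SageMaker pipeline defined in Python.
# Role, bucket, script name and instance types are placeholder assumptions.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder
session = sagemaker.Session()

# Pipeline parameter: location of the raw input data.
input_data = ParameterString(name="InputData", default_value="s3://my-bucket/raw/")

# Step 1: feature extraction with a processing job.
processor = SKLearnProcessor(
    framework_version="1.0-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)
process_step = ProcessingStep(
    name="ExtractFeatures",
    processor=processor,
    code="extract_features.py",  # hypothetical processing script
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(output_name="features", source="/opt/ml/processing/output")],
)

# Step 2: train a model on the extracted features.
estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1"),
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    output_path="s3://my-bucket/models/",
    hyperparameters={"objective": "reg:squarederror", "num_round": "100"},
)
train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={
        "train": TrainingInput(
            s3_data=process_step.properties.ProcessingOutputConfig.Outputs["features"].S3Output.S3Uri,
            content_type="text/csv",
        )
    },
)

pipeline = Pipeline(
    name="demand-forecast-pipeline",  # hypothetical name
    parameters=[input_data],
    steps=[process_step, train_step],
    sagemaker_session=session,
)

if __name__ == "__main__":
    pipeline.upsert(role_arn=role)  # create or update the pipeline
    pipeline.start()
```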

Our current typical pipeline with batch inference looks like this.

Machine learning pipeline leveraging SageMaker Pipelines, integrated with Slack, Kafka and S3.

Each step in these pipelines emits logs and metrics to AWS CloudWatch, AWS’s default monitoring and observability service. From these logs and metrics, alerts are generated and forwarded to project-specific Slack channels. This ensures that we are the first to know about any outage or failure. We either fix the issue before our downstream users notice it, or we notify them in advance if the fix takes time. The metrics are also displayed in dashboards (we use both CloudWatch and Datadog), which helps us monitor the pipelines and debug issues effectively. Furthermore, each step is configured with retries and exponential backoff. If a step fails, it will retry a configurable number of times before it gives up and fails the whole pipeline.
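As an illustration, the SageMaker SDK lets you attach retry policies with exponential backoff to individual steps; the exception types, intervals and attempt counts below are examples, not our production values.

```python
# Sketch: exponential-backoff retry policies for a pipeline step.
# Exception types, intervals and attempt counts are illustrative.
from sagemaker.workflow.retry import (
    SageMakerJobExceptionTypeEnum,
    SageMakerJobStepRetryPolicy,
    StepExceptionTypeEnum,
    StepRetryPolicy,
)

retry_policies = [
    # Retry the step on throttling or transient service faults.
    StepRetryPolicy(
        exception_types=[StepExceptionTypeEnum.THROTTLING, StepExceptionTypeEnum.SERVICE_FAULT],
        backoff_rate=2.0,     # exponential backoff factor
        interval_seconds=30,  # wait before the first retry
        max_attempts=3,
    ),
    # Retry the underlying SageMaker job if capacity is temporarily unavailable.
    SageMakerJobStepRetryPolicy(
        exception_types=[SageMakerJobExceptionTypeEnum.CAPACITY_ERROR],
        backoff_rate=2.0,
        interval_seconds=60,
        max_attempts=2,
    ),
]

# The policies are then passed to a step, e.g.
# ProcessingStep(name="ExtractFeatures", processor=processor, ..., retry_policies=retry_policies)
```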

Another benefit of using SageMaker Pipelines is SageMaker Studio. It provides a single, web-based visual interface where you can perform all ML development steps, improving data science team productivity. Using Studio, we get an overview of all our pipelines: the status of the current and all previous executions, along with their start and run times, as well as the pipeline parameters, container versions etc. for each execution. We can also retry failed pipeline executions, start new executions or investigate a previous run. Each pipeline execution is locked to a snapshot of the data used for training and inference. Leveraging SageMaker and data snapshots, we can rerun a previously executed pipeline with the exact same data, models and parameters.

Snippet from SageMaker Studio
Detailed overview of a pipeline
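The same overview is also available outside the Studio UI; as a sketch, the executions of a hypothetical pipeline can be listed with boto3:

```python
# Sketch: listing executions of a (hypothetical) pipeline with boto3.
import boto3

sm = boto3.client("sagemaker")

response = sm.list_pipeline_executions(PipelineName="demand-forecast-pipeline")
for execution in response["PipelineExecutionSummaries"]:
    print(
        execution["PipelineExecutionStatus"],
        execution["StartTime"],
        execution["PipelineExecutionArn"],
    )
```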

It is worth mentioning that even though SageMaker provides lots of features, we use SageMaker Pipelines judiciously, adopting only the features we believe add value to our services and use cases. There is still a lot to explore and try from the complete set of features it provides.

Integration with Terraform and GitLab

We use Terraform, an open-source infrastructure-as-code (IaC) tool, for all our cloud infrastructure and pipeline implementations. We follow standard DevOps practices, and Terraform enables us to have an exact copy of our infrastructure in our dev, test, stage and prod environments.

Unfortunately, the AWS Terraform provider does not (as of August 2022) have a SageMaker pipeline resource. To circumvent this issue, we use the AWSCC Terraform provider, which does have a SageMaker pipeline resource, to deploy our pipelines.

As mentioned above, we define our pipelines using Python, i.e. the SageMaker Pipelines Python SDK. We create a Python script that, when run, returns the pipeline definition as JSON. This script is executed by Terraform through an external data source, and the returned pipeline definition is passed to the AWSCC SageMaker pipeline resource. This way, we keep our SageMaker pipelines controlled by Terraform while still writing the pipeline definition in Python.
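A sketch of what such a glue script can look like is shown below. Terraform’s external data source sends a JSON query on stdin and expects a flat JSON object of strings on stdout; the get_pipeline() factory and its arguments are hypothetical.

```python
# Sketch: a script run by Terraform's "external" data source.
# It reads the query from stdin, builds the pipeline and prints its definition as JSON.
# The get_pipeline() factory and its arguments are hypothetical.
import json
import sys

from my_project.pipeline import get_pipeline  # hypothetical module returning a sagemaker Pipeline


def main() -> None:
    query = json.load(sys.stdin)  # e.g. {"environment": "dev", "role_arn": "..."}
    pipeline = get_pipeline(
        environment=query.get("environment", "dev"),
        role_arn=query["role_arn"],
    )
    # pipeline.definition() returns the pipeline as a JSON string, which Terraform
    # forwards to the AWSCC SageMaker pipeline resource.
    print(json.dumps({"pipeline_definition": pipeline.definition()}))


if __name__ == "__main__":
    main()
```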

We use GitLab CI/CD with Terraform to build, change and manage our infrastructure and pipelines. We have designed the setup in such a way that when a developer creates a new development branch, GitLab CI/CD automatically creates an entirely isolated and sandboxed development infrastructure for that project. Developers can then use this dev infrastructure for development and experimentation without worrying about resources or affecting any other services. Once development is completed and the dev branch is merged into the test branch (we use test, stage and main branches for infrastructure), the dev infrastructure, along with all of its resources and data, is destroyed by GitLab CI/CD. This gives our developers the flexibility and speed they need to work in a modular fashion and lowers the threshold for experimenting and trying out new ideas.

Creation and deletion of dev infrastructure using GitLab’s CI/CD and Terraform

Model Performance Monitoring

So far, so good. We have complete control of our pipelines and workflows. If something goes wrong, whether it is an external API failure, a data quality issue or a failing job, we have systems that notify us immediately via Slack channels and email. But once a model is deployed and out in the wild, how do we know whether it produces good-quality predictions and valuable information?

To answer this question, we use yet another AWS service: Amazon QuickSight, which powers our model performance monitoring dashboards. These dashboards display the predictions produced by our ML models and the ground truth (as it arrives), along with custom-generated metrics, which helps us evaluate the performance of the ML models live in action. Using these dashboards, we can, for example, drill down to individual departures and check whether the model predicted the correct number of onboard passengers. We also monitor model drift, and QuickSight triggers an alarm if the model drifts beyond a defined threshold. Since all the backend processing for these dashboards is implemented in Terraform, we can perform A/B testing by having development dashboards auto-generated for our development branches. This gives a lot of power to our developers, who can work on multiple experiments, each with its associated dashboard to monitor the performance of their ML models in production.

Model performance monitoring dashboard

Future Plans and Improvements

We are quite content with the current state of our platform and pipelines, but there is room for improvement. We are currently not using SageMaker to its full potential, and there are still some features which we believe could add value to our existing pipelines and workflows. For example, we wish to use SageMaker more actively for experiment tracking. Another feature we would like to try out is Amazon SageMaker Feature Store, a fully managed, purpose-built repository to store, share and manage features for machine learning (ML) models. Enhancing the developer experience (adding speed and reducing complexity) is a prime focus for our future developments and enhancements.

I joined Ruter in June 2020 and have since managed the machine learning and data engineering team within the Data Science department. We leverage advanced statistics and ML in combination with a modern data stack to improve public transportation and the customer experience in the greater Oslo area.

Thanks to David Skålid Amundsen, Daniel Haugstvedt, Erlend Fauchald and Simen W Tofteberg for making this achievable.
