Transforming data science with Vertex AI: Telepass' journey into MLOps

Gioia Sarti
Google Cloud - Community
7 min read · Mar 13, 2024

In the fast-paced world of toll and mobility services, Telepass stands out as a leader across Italy and several European countries. In recent years, Telepass has strategically migrated all its operations to the cloud and established a robust data platform on Google Cloud. With Google Cloud, Telepass efficiently handles and derives value from the vast volume of data generated daily through the use of its toll and mobility services. Recognizing the immense potential within this data, Telepass made a decisive move to expedite machine learning development by embracing MLOps.

This article is about Telepass' journey in implementing MLOps: from the initial challenges to the resulting architecture. You will learn about the benefits of embracing MLOps, some lessons learnt, and future evolutions. By the end of this article, you will have a better understanding of how Telepass moved towards a modern ML platform with Google Cloud and Go Reply, enabling data scientists to scale the impact of data on the business. Whether you are aiming to deepen your grasp of MLOps or looking for inspiration for implementing similar frameworks in your own projects, this article provides valuable insights and practical takeaways to assist you along your MLOps path.

Why MLOps? The Catalyst for Change

Before adopting the MLOps framework on Google Cloud, each Telepass data scientist had a personal Compute Engine instance, with predefined, fixed resources (CPU and memory), for performing Exploratory Data Analysis (EDA) and developing training, prediction, and evaluation scripts for ML models using RStudio. Once completed, the scripts were manually transferred to and scheduled on larger VMs to scale them, as you can see in the figure below.

Figure 1. Cloud porting

In this initial configuration, the limited use of ML-oriented managed services restricted the benefits of cloud adoption. The ML process lacked automated workflows, continuous integration/deployment, and version control (for both code and models), complicating development. Monitoring was insufficient, making deployments challenging, and security was compromised by the absence of a dedicated production environment.

Despite these drawbacks, the initial architecture laid the groundwork for a more robust ML infrastructure for Telepass, which translated into an incremental adoption of the MLOps framework.

Architecting MLOps with Vertex AI

Prediction automation

The adoption of MLOps at Telepass started with automating prediction serving. The following figure shows the framework that automates this process.

Figure 2. Prediction Automation

The architecture marked the introduction of Vertex AI, Google Cloud's platform dedicated to ML, through the following services:

  • Vertex AI Workbench: a user-managed notebook for each data scientist, with resources dynamically adjustable by the data scientists themselves.
  • Vertex AI Model Registry: a managed tool for versioning and documenting models.
  • Vertex AI Prediction: a service for running batch predictions, which takes as input a model uploaded to the Model Registry and a dataset of instances.

In addition to introducing model governance with Vertex AI Model Registry, the architecture also included GitHub for code versioning. Together with Vertex AI Prediction, these services ensured a level of transparency, reproducibility, and traceability of predictions that was previously unattainable.

Prediction serving was automated by scheduling an “inference” script that generates the prediction dataset, with Cloud Functions connecting and coordinating the various components involved.
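As a rough sketch of what this automation can look like with the Vertex AI SDK (this is illustrative, not Telepass' actual code; the project, location, model, and table names are placeholders), a batch prediction job can be launched against a model in the Model Registry like so:

```python
def run_batch_prediction(model_id: str, input_table: str, output_prefix: str,
                         project: str = "example-project",
                         location: str = "europe-west1"):
    """Launch a Vertex AI batch prediction job on a registered model.

    All resource names here are hypothetical placeholders.
    """
    # Imported lazily so this sketch can be loaded without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    model = aiplatform.Model(model_name=model_id)  # model from the Model Registry
    job = model.batch_predict(
        job_display_name="batch-prediction-job",
        bigquery_source=input_table,            # e.g. "bq://project.dataset.instances"
        bigquery_destination_prefix=output_prefix,
        sync=True,                              # block until the job completes
    )
    return job
```

In a setup like the one described, a Cloud Function would invoke this kind of logic once the scheduled inference script has materialized the prediction dataset.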

At this level of automation, some key functionalities were still lacking:

  • CI/CD
  • Continuous training
  • Management and segregation of environments

The absence of CI/CD, continuous training, and environment management has several implications. Without CI/CD, the deployment process lacks automation and standardization, increasing the potential for human error and for inconsistencies between development and production environments. Continuous training is vital for keeping ML models accurate and relevant over time by incorporating new data and insights; without it, models grow outdated and fail to adapt to changing conditions, ultimately diminishing their utility and value. Furthermore, inadequate management and segregation of environments can lead to conflicts and complexity in deploying and maintaining models across different stages of development, impeding scalability and hindering collaboration between data scientists and engineers.

MLOps

Framework

To address these challenges, Telepass decided to extend MLOps to the entire model lifecycle and build a complete MLOps platform on Vertex AI. The following figure gives a high-level overview of the Telepass MLOps platform on Vertex AI.

Figure 3. MLOps — Framework

The heart of the platform lies in Vertex AI Pipelines, which orchestrate all the steps of a model’s life cycle. For each model, the data scientist develops three fundamental pipelines: training, prediction, and evaluation. The lineage of the artifacts generated by the pipelines (models, datasets, metrics, …) is automatically saved within Vertex AI Metadata, ensuring complete transparency and reproducibility in every phase.
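As an illustration of what one of these pipelines can look like (a minimal sketch, not Telepass' actual code), a training pipeline can be defined with the KFP SDK, which Vertex AI Pipelines executes. The component logic and names below are hypothetical, and the import is kept inside the builder so the sketch loads even where `kfp` is not installed:

```python
def build_training_pipeline():
    """Return a minimal KFP v2 training pipeline definition (sketch only).

    Component logic and names are illustrative placeholders.
    """
    from kfp import dsl  # lazy import: the sketch loads without kfp installed

    @dsl.component(base_image="python:3.10")
    def train(dataset_uri: str) -> str:
        # Placeholder for the real training step; it would fit a model
        # and return the URI of the resulting model artifact.
        return dataset_uri.replace("/datasets/", "/models/")

    @dsl.pipeline(name="training-pipeline")
    def training_pipeline(dataset_uri: str):
        train(dataset_uri=dataset_uri)

    return training_pipeline
```

Compiled with `kfp.compiler.Compiler`, such a definition produces a JSON spec that can be submitted to Vertex AI Pipelines as an `aiplatform.PipelineJob`; the artifacts its components produce are then tracked automatically in Vertex AI Metadata.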

The CI/CD process, implemented with Cloud Build, is triggered by specific actions on the GitHub repository. It compiles the pipelines, uploads them to Cloud Storage, and creates or updates their schedules on Cloud Scheduler.
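A Cloud Build configuration for such a flow might look roughly like the sketch below. The step images are standard Cloud Build builders, but the script, bucket, scheduler job, and substitution names are assumptions for illustration only:

```yaml
# Illustrative cloudbuild.yaml sketch; file, bucket, and job names are placeholders.
steps:
  # 1. Compile the KFP pipelines into JSON specs
  - name: 'python:3.10'
    entrypoint: 'bash'
    args: ['-c', 'pip install kfp && python compile_pipelines.py --out compiled/']
  # 2. Upload the compiled specs to Cloud Storage
  - name: 'gcr.io/cloud-builders/gsutil'
    args: ['cp', 'compiled/*.json', 'gs://example-pipelines-bucket/specs/']
  # 3. Create or update the Cloud Scheduler job that triggers the pipeline runs
  - name: 'gcr.io/cloud-builders/gcloud'
    entrypoint: 'bash'
    args:
      - '-c'
      - >-
        gcloud scheduler jobs update http run-prediction-pipeline
        --schedule='0 6 * * *' --uri="$_TRIGGER_URI" ||
        gcloud scheduler jobs create http run-prediction-pipeline
        --schedule='0 6 * * *' --uri="$_TRIGGER_URI"
```

Wiring this configuration to a GitHub trigger on the relevant branches gives the push-to-deploy behavior described above.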

Lastly, there is a monitoring and alerting system built on Cloud Monitoring, which notifies data scientists via Slack in case of errors during pipeline execution.

Operational workflow

When a new ML use case is identified, the Telepass data science team follows the MLOps workflow in the figure below.

Figure 4. MLOps — Workflow

The process starts with the data scientist creating a feature branch from the development branch. In this branch, the data scientist implements the specific logic of the use case and tests it in a prototyping environment using the three pipeline templates (training, prediction, and evaluation). Once development and testing are complete, the CI/CD process performs a pre-release of the pipelines in the same prototyping environment. If this execution succeeds, the development branch is merged into the main branch and, through a Cloud Build pipeline, the pipelines are deployed into the production environment.

Notice that Telepass opted for an architecture split into two environments. Segregation between the data scientists' development processes and the pre-releases in the prototyping environment is achieved by structurally separating the resources (BigQuery datasets and Cloud Storage buckets) dedicated to each.

Realizing the benefits of MLOps

With MLOps, the Telepass data team scaled and accelerated ML development, realizing the business value of ML.

After only a few months of onboarding, in 2023 the team started to work on new use cases using the MLOps framework on Vertex AI, as shown in the chart of daily GitHub commits below.

Figure 5. Daily GitHub commits in the MLOps code base

The result has been an increase in the number of running pipelines as you can see in the figure below.

Figure 6. Total number of Vertex AI Pipelines runs in 2023

But more importantly, the number of use cases that were rapidly developed and deployed in production in the second half of the year significantly increased, confirming the impact of MLOps on the productivity of the Telepass data science team.

Figure 7. Total number of use cases in production in 2023

As of today, the team has leveraged the MLOps framework to develop, test, and seamlessly deploy more than 80 pipelines running every month, covering more than 10 different use cases.

These include churn prediction to forecast customer attrition and enable targeted retention strategies, propensity modeling for tailored customer interactions, and data-driven customer clustering.

Through streamlined deployment processes, the team has ensured the accuracy and efficiency of these models, which positively impact the Telepass customer experience and, consequently, the Telepass business.

Conclusions

The MLOps adoption journey at Telepass has been thrilling, pushing its teams to improve efficiency, collaboration, and innovation, and cultivating a culture of continuous improvement and excellence within the organization.

However, this success is also a testament to the many challenges Telepass encountered and overcame along the way; the path has not been without obstacles. For example, due to limited resources, Telepass could not establish separate environments for development, testing, and production as desired. Instead, it compensated by implementing strict rules for code versioning, automated testing, and deployment. Keeping the development/testing setups closely aligned with production became crucial for early problem detection.

Another challenge faced by the data science team lay in adapting to the novelty of the new MLOps framework. To address this, the Telepass team prioritized comprehensive onboarding to ensure a deep understanding of the ecosystem's nuances. Additionally, the team implemented a clear separation between infrastructure and operations, simplifying tasks for data scientists. Through open communication channels between ML engineers and data scientists, the team continually refines its approach, ensuring alignment with real-world needs.

In conclusion, the adoption of MLOps on Google Cloud has revolutionized how Telepass generates value from its data. But the journey is not over yet: striving for continuous improvement, Telepass eagerly embraces and integrates the latest features from Vertex AI, enhancing both the solidity and the agility of its MLOps framework.

Special thanks to Vincenzo Genna and Alessio Palladino from Telepass, both personally involved in the technical journey and the drafting of this blog. Also thanks to Go Reply for spearheading and supporting Telepass throughout this adoption process, leveraging its core expertise to assist businesses in adopting Google Cloud services. And lastly, thanks to Ivan Nardini from Google Cloud for his collaboration, support and contribution along this MLOps adoption journey.

