MAGI: ViaHub’s Machine Learning Platform

Iago Modesto Brandão
6 min read · Jul 21, 2023


Written by: Team MLOps

Introduction

In a world where the number of Machine Learning models grows rapidly, a platform that accompanies their entire lifecycle becomes increasingly necessary, especially in companies with several Machine Learning models in production.

Modeling and deployment are even more challenging, since each model has unique requirements for data preparation, training, deployment, monitoring and maintenance. Managing all of these processes manually can be extremely difficult and error-prone, compromising the tracking and reproducibility of models and causing significant losses to the business as a whole, ranging from poor time management and wasted resources to negative impacts on decisions based on the data served by these models.

But after all, what is an ML Platform?

A Machine Learning Platform aims to solve the complex pain points mentioned above by managing the entire lifecycle of a model: from the development stage, with tools for model management, telemetry and resource reuse, to the production stage, where it becomes especially valuable by providing a safe deployment process and systematic monitoring of models, data and KPIs.

MAGI

In this post we present MAGI, ViaHub’s ML Platform, designed to cover the end-to-end needs of our data scientists with products built specifically for the context of Machine Learning.

Among these needs is the peace of mind of knowing that their models are running in a safe, scalable and observable way, allowing them to invest their time in developing new models or improving existing ones. Reliability is equally important: knowing that a model keeps delivering data as expected.

To meet all these needs, the platform has three integrated modules: Casper, Balthasar and Melchior, each with its own products. Next, we go into more detail about the composition of the MAGI modules and products.

Note: MAGI has a visual identity that was inspired by the supercomputers of the anime Neon Genesis Evangelion.

Casper

Whenever we talk about MLOps, we mean deploying models in production, right? 🤭

The Casper module addresses the pain points related to deploying models. It was developed to be robust, scalable, reproducible and agile, using the best practices and technologies available in the market.

Casper’s products are the deployment process itself (Deploy) and the CI/CD Pipelines associated with it.

CI/CD Pipelines (Continuous Integration/Continuous Deployment)

Deploying manually is an attitude of the past, right? 🤭

Our CI/CD Pipelines were designed to bring speed, scalability and reproducibility to deployment, as well as to eliminate wasted time and resources through code standardization and software engineering best practices.

We primarily use GitHub Actions to run our CI/CD workflows. To learn in more detail how to operate our pipelines, see this post (available in Portuguese).

Deploy

Does keeping model code running on someone’s laptop seem safe for your company? 🫣

MAGI’s deployment process was designed to be reliable, fast and auditable, reducing the time needed to put models into production and minimizing the risk of errors and failures.

“After we standardized our process, the time to deploy dropped from 2 weeks to just 3 days” (ML Engineer)

To learn more about how our deployment pipelines work, see this article.

Melchior

Even a rocket capable of taking people to the Moon is of little use if it cannot be monitored and have its health checked. With models it is exactly the same 🌕

The Melchior module is focused on increasing the quality of production models through observability. Melchior has products for telemetry, table monitoring, model drift and data drift.

Drift Monitoring (Models, Data and KPIs)

Fortunately (or not), reality changes, and with it the data also changes, so why should the models remain the same? 👀

Drift is the phenomenon that occurs whenever the distribution of current data changes with respect to the historical data.

Drift Monitoring is a product that seeks to monitor the health of models, data and business metrics in production, providing the necessary statistical tools to support this monitoring.

In essence, Drift Monitoring provides visibility into when the contents of a dataset may have changed over time, whether due to a change in behavior or a failure in the data calculation process, and into when an ML model is losing predictive power and needs to be reassessed.

“Drift monitoring can not only monitor models, but also understand when business KPIs change significantly” (Data Manager)

Within Drift Monitoring we provide several statistical functions implemented in PySpark, such as KS (Kolmogorov-Smirnov), PSI (Population Stability Index), a Performance Tracker and a Domain Classifier.
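
To give an idea of what one of these checks does, below is a minimal, self-contained sketch of the PSI calculation in plain NumPy; our production implementation runs in PySpark, and the bin count and alert threshold here are illustrative only.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Minimal PSI sketch: compares the binned distribution of a
    reference (historical) sample against a current sample."""
    # Build bin edges from the reference distribution; values outside
    # this range are simply ignored in this simple sketch.
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_counts, _ = np.histogram(expected, bins=edges)
    actual_counts, _ = np.histogram(actual, bins=edges)

    # Convert counts to proportions, clipping to avoid log(0)
    expected_pct = np.clip(expected_counts / len(expected), 1e-6, None)
    actual_pct = np.clip(actual_counts / len(actual), 1e-6, None)

    # PSI = sum((actual% - expected%) * ln(actual% / expected%))
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Illustrative usage: alert when PSI crosses a frequently cited threshold
rng = np.random.default_rng(42)
historical = rng.normal(0, 1, 10_000)
current = rng.normal(0.3, 1.2, 10_000)   # shifted distribution
psi = population_stability_index(historical, current)
if psi > 0.2:   # 0.2 is a commonly used "significant drift" cutoff
    print(f"Drift alert: PSI = {psi:.3f}")
```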

Telemetry

You know when something happens in production, but you don’t know what happened? 👻 For models, that is what a world without Telemetry looks like.

Telemetry generates information in a simplified way, giving visibility into the ML flows that exist at ViaHub. The telemetry process was designed to be as friendly, easy and agile as possible for users, being especially practical for Python users.

“Using telemetry is very practical, you can monitor your application easily” (Data Scientist)

To obtain health information from ML applications, Melchior Telemetry can be used in a generic way: by simply adding the project’s decorators and metadata, the ML code functions start to be tracked and analyzed.
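
As a rough illustration of this decorator-based approach (the names below are hypothetical and do not reflect Melchior’s actual API), a telemetry decorator can wrap any function and record its duration and status:

```python
import functools
import time

# Hypothetical telemetry decorator; names and fields are illustrative only.
def telemetry(step: str, project: str):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            status = "success"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                event = {
                    "project": project,
                    "step": step,
                    "function": func.__name__,
                    "duration_s": round(time.time() - start, 3),
                    "status": status,
                }
                print(event)  # in practice, the event would go to a metrics store
        return wrapper
    return decorator

@telemetry(step="training", project="churn-model")
def train(df):
    ...  # model training code
```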

Data Quality

You know that silent error that only the consuming business area would be able to catch 🫣? It can be monitored with our Data Quality process.

Table Monitoring seeks to monitor silent failures generated by anomalies and deviations in the output data of ML Models, thus preventing faulty executions from going unnoticed.

As an example, consider an anomalous case that would execute successfully yet fail silently: a table that always receives around 1.5 million new records daily. If that table started receiving only 200 thousand new records per day, or worse, 2.1 million new records in a single run, an alert needs to be generated so the cause can be investigated. Table Monitoring fulfills this role.
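
A minimal sketch of this kind of volume check, with illustrative names and thresholds (not Melchior’s actual API), could look like this:

```python
# Hypothetical row-count check for a daily table load.
def daily_volume_is_normal(new_rows: int,
                           expected: int = 1_500_000,
                           tolerance: float = 0.3) -> bool:
    """Return True when the daily volume is within the expected band."""
    lower = expected * (1 - tolerance)
    upper = expected * (1 + tolerance)
    return lower <= new_rows <= upper

# Illustrative usage over the scenarios described above
for volume in (1_480_000, 200_000, 2_100_000):
    if not daily_volume_is_normal(volume):
        print(f"Data Quality alert: {volume:,} new records is outside the expected range")
```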

“We were able to make our Data Engineering deliveries more robust using Data Quality by simply applying Melchior Table Monitoring” (Data Engineer)

With these features, you can quickly detect and correct problems, ensuring that your data is always operating with high quality and accuracy.

Balthasar

For any good sailor, knowing the sea is essential 🛳

The Balthasar module brings together the best tools to support the model development cycle; currently, its product is the Feature Store.

Feature Store

Have you ever imagined a library with the best information available, but completely disorganized 📚? The Feature Store organizes and centralizes it.

The Feature Store is a solution for managing and sharing the features used by machine learning models. It helps reduce the complexity and time required to develop models, increases feature reuse, and improves the consistency and quality of the data the models consume.

“We were able to track the entire lineage of our feature, now whenever a change needs to be made we know which model can be impacted” (Data Analyst)
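
As a rough illustration of the concept (not MAGI’s actual API), the sketch below registers features as a named view with an owner and a join key, which is what makes the kind of lineage and impact analysis quoted above possible:

```python
# Hypothetical feature-store sketch; all names below are illustrative.
from dataclasses import dataclass, field

@dataclass
class FeatureView:
    name: str                                   # logical group of features
    entity_key: str                             # join key, e.g. the customer id
    features: list = field(default_factory=list)
    owner: str = ""                             # enables lineage / impact analysis

customer_behavior = FeatureView(
    name="customer_behavior",
    entity_key="customer_id",
    features=["avg_ticket_30d", "purchases_90d", "days_since_last_order"],
    owner="team-recommendation",
)

def get_features(view: FeatureView, entity_ids):
    """Illustrative lookup: a real feature store would read these values from
    an offline store (for training) or an online store (for serving)."""
    return {eid: dict.fromkeys(view.features) for eid in entity_ids}

training_rows = get_features(customer_behavior, ["c-001", "c-002"])
```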

Final Thoughts

MAGI, ViaHub’s Machine Learning Platform, represents a significant step forward in maturing the lifecycle management of machine learning models in production, offering innovative features and a complete approach to deploying models, from the development phase to the operation phase. This helps reduce the time required to get models into production, improves model quality and reliability, and reduces the risk of failures and errors.

With MAGI, the data team has a powerful tool to accelerate the development and operation of machine learning models, helping to transform data into valuable insights and improve business efficiency in several areas.


