Data is revolutionising the automotive industry, and many competitive advantages reside in the platforms that support this data. Enterprises must align their procedures and methodologies to transform this data into valuable information. This is where machine learning comes into play: teams combining analysts, data scientists, and data engineers work to achieve complete, automated development, integration, testing, and deployment of machine learning models. This blog introduces a complete machine learning framework that breaks down traditional monolithic end-to-end machine learning development into a distributed, modular, and scalable approach.
Stratio AI Framework is a custom-made solution that leverages multiple open-source technologies and Stratio's own insights to support data ingestion from multiple sources, and uses this data for both real-time (streaming) and offline (batch) predictions in the vehicle fault detection and predictive maintenance fields. This solution allows Stratio's teams to easily design, build, test, and deploy machine learning models at scale.
Before Stratio AI Framework, we at Stratio faced a number of challenges in building, scaling, and deploying machine learning models, either because data scientists were using a wide variety of tools and procedures to create them, or simply because there was no engineering process that met the requirements for taking these models to production easily. At a stage where multiple models are already producing insights for our clients, the amount of technical debt tends to increase exponentially.
Machine Learning: The High-Interest Credit Card of Technical Debt — Google
This approach impacted delivery times and the effort needed to build reliable, scalable machine learning models, and a small team of data scientists and data engineers could not keep pace with the high demands and opportunities that a disruptive product brings.
Taking a step back, we could identify some key concepts that were causing this approach to fail:
- Monolithic Design
- Hard and Long Debugging
- Data & Code Inconsistency
- Nonexistent Abstraction Boundaries
- High Data & Code Dependencies
- Poor Configuration
- Lack of Reproducibility
To start designing an approach and properly identify the main stages of developing a machine learning model, we needed to consider a set of major concerns:
Data acquisition from multiple data sources is the starting point for both empirical and machine learning analysis. We need relevant, comprehensive sets of quantitative and qualitative data.
A pre-processing phase where the data is transformed and starts to become information, ready to be used in machine learning models or deep analysis.
The data is usually split into subsets and used for algorithm and model training. The model then learns from this data and acquires generalisation capabilities.
Model predictions are generated based on a set of observations; this can also be used to classify an already existing dataset.
Real-time & Streaming prediction
For real-time and streaming approaches, the incoming data is analysed quickly and the model reacts to these events, which can occur at any time.
Machine learning algorithms have never been better promised and also challenged by big data in gaining new insights into various business applications and human behaviors — Paper
Taking into consideration the issues previously addressed, we wanted to achieve a solution compliant with the best industry standards that also met our requirements for availability, reliability, modularity, scalability, and performance. The following concerns were taken into account:
- Divide and Conquer: The challenges should always be broken down into smaller pieces when the complexity increases;
- Modular and Interchangeable: The solution should be modular and each component interchangeable;
- Aiming for Code Maintainability: The code should be simple and easy to understand and maintain;
- Don’t reinvent the wheel: There are many high-quality open-source tools out there;
- Can It Scale?: It is not a solution if it cannot scale properly;
- Data & Schema Evolution: Data changes often, so it should be easily updated in the future;
- CI/CD: Use continuous integration and delivery to automate the delivery processes;
- Improving Development Speed: Every component added should simplify and speed up development;
- Going to Production: The transition from development to production should be straightforward;
The Stratio AI Framework combines different open-source technologies with in-house custom-made components. The main open-source technologies used are Apache Parquet, Apache Kafka, Elasticsearch, Apache Airflow, Docker, and Kubernetes. These technologies were chosen based on their maturity, community support, industry insights, and suitability for Stratio's use cases. The integration with these use cases, and among the components that ensure our business rules are followed, is supported by built-in tools based on Python 3 and .NET Core. These components have been built using software design patterns such as separation of concerns and dependency injection.
We built our own Data Lake to store the millions of records collected daily from our client vehicles, saved in their raw and unprocessed form. The base technology used is Apache Parquet in a distributed NFS NAS, leveraging columnar format and high compression factors. The partitioning is also customised to ensure the best performance and achieve a high level of predicate pushdown. We also have relational databases to store entities and domains to be used across the entire ecosystem.
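To make the partitioning idea concrete, here is a minimal sketch of how hive-style partition paths enable partition pruning (one form of predicate pushdown). The `vehicle_id`/`date` partition columns and the `/lake/telemetry` base path are hypothetical examples, not the actual scheme used by Stratio:

```python
from datetime import date


def partition_path(base: str, vehicle_id: str, day: date) -> str:
    """Hive-style partition path: one Parquet directory per vehicle and day.
    Readers filtering on these columns can skip whole directories instead
    of scanning every file."""
    return f"{base}/vehicle_id={vehicle_id}/date={day.isoformat()}"


def prune(paths, vehicle_id=None, day=None):
    """Keep only the partitions that can satisfy the predicate."""
    def keep(path):
        # Parse "key=value" segments from the path into a dict.
        parts = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
        if vehicle_id is not None and parts.get("vehicle_id") != vehicle_id:
            return False
        if day is not None and parts.get("date") != day.isoformat():
            return False
        return True
    return [p for p in paths if keep(p)]
```

A Parquet reader such as Spark applies the same idea automatically, then pushes remaining predicates down to the column statistics inside each file.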
The data is continuously ingested through streaming queues using Apache Kafka and .NET Core jobs that perform a first schema validation and forward the data to the data lake in compliance with the current partitioning. Apache Spark can also be used to provide a large distributed data processing pipeline.
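The first-pass schema validation can be sketched as a simple required-fields-and-types check applied to each message before it is forwarded. The field names below are hypothetical vehicle-telemetry examples, and the real jobs are .NET Core rather than Python:

```python
# Hypothetical message schema: required field -> expected type.
SCHEMA = {"vehicle_id": str, "timestamp": int, "signal": str, "value": float}


def validate(record: dict) -> bool:
    """First-pass schema check before a message is forwarded to the data
    lake; records that fail would typically go to a dead-letter queue."""
    if set(record) != set(SCHEMA):
        return False  # missing or unexpected fields
    return all(isinstance(record[key], typ) for key, typ in SCHEMA.items())
```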
The Feature Engineering is supported by two different Feature Stores depending on the latency and throughput required for each use case served:
- Online Features — Features addressed in real-time and hot-path time window scenarios, often with millisecond-level latency requirements. This data is stored using fast cache technologies such as in-memory or key-value databases, in this case Redis. These features are usually used for real-time models or insights that only need the latest values for computation.
- Offline Features — Features in batch and cold-path time window scenarios. These features include large time-series analyses from several days to years of data. They are mainly used by heavier offline processes like model training, wide empiric analysis, etc. To store these features, NoSQL databases and systems such as Cassandra and Apache Hive get the job done.
These Feature Stores enable features to be registered, added, searched, and used for model inference and improve the solution, adding the following capabilities:
- Scalability: features can be scaled, the schemas updated and versioned;
- Reusability: by sharing them among analysts, scientists, tests and models;
- Consistency: ensuring consistency between the online and offline storages, with both sets of features available to developers and models;
- Reproducibility: anyone (at any time) can achieve the same outcome by using the same data from a specific point in time.
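The online side of this can be illustrated with a minimal latest-value store. This is a dict-backed stand-in for the Redis-based store described above (Redis would give the same keep-only-the-latest-value semantics via key expiry); the feature names are hypothetical:

```python
import time
from typing import Optional


class OnlineFeatureStore:
    """Minimal stand-in for a Redis-backed online feature store: keeps only
    the latest feature values per entity, with a TTL so stale entries expire."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._data = {}  # entity_id -> (expiry_time, features)

    def put(self, entity_id: str, features: dict) -> None:
        # Overwrite: online features only need the latest values.
        self._data[entity_id] = (time.monotonic() + self.ttl, features)

    def get(self, entity_id: str) -> Optional[dict]:
        entry = self._data.get(entity_id)
        if entry is None or entry[0] < time.monotonic():
            self._data.pop(entity_id, None)  # drop expired entry
            return None
        return entry[1]
```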
The models can be developed and tested using training jobs that can be scheduled to run training and validation continuously. When a model is ready to be used, a new version is released into a model repository where the model library and related metadata (such as results, hyperparameter values, date, and execution time) are stored. The prediction services can then target a specific version, ensuring a clear and visible versioning policy.
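A model-repository entry can be sketched as an append-only, versioned record of the artifact plus its metadata. The dict-backed registry, the `fault-detector` name, and the `name:vN` tag format are illustrative assumptions; in practice the artifact would live in object storage with the metadata in a database:

```python
import hashlib
import time


def register_model(registry: dict, name: str, artifact: bytes, metadata: dict) -> str:
    """Append an immutable, versioned entry to a model repository and
    return the tag a prediction service would target."""
    version = len(registry.get(name, [])) + 1
    entry = {
        "version": version,
        "sha256": hashlib.sha256(artifact).hexdigest(),  # ties metadata to the exact artifact
        "registered_at": time.time(),
        **metadata,  # e.g. validation results, hyperparameter values, execution time
    }
    registry.setdefault(name, []).append(entry)
    return f"{name}:v{version}"
```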
Prediction & Persistence
The models are then deployed and loaded by the batch or real-time prediction services. These services load the existing feature sets from the online or offline Feature Store and are responsible for generating predictions based on the trained model. The score, categorisation, and additional metadata are then persisted into another Apache Kafka queue ready to be used by multiple services and tools. These predictions can then be used as KPIs and stored in a more client-facing and view-oriented engine like Elasticsearch.
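The score-and-persist step can be sketched as follows. The `predict` method shape, the message fields, and the `predictions` topic name are hypothetical; `publish` stands in for a Kafka producer call:

```python
import json
import time


def predict_and_publish(model, features: dict, publish) -> dict:
    """Run one inference and persist the result. `model` is any object with
    a predict(features) -> (score, label) method; `publish(topic, payload)`
    stands in for a Kafka producer."""
    score, label = model.predict(features)
    message = {
        "score": score,
        "category": label,
        "features_version": features.get("_version"),  # for reproducibility
        "predicted_at": time.time(),
    }
    # Downstream consumers (KPIs, Elasticsearch indexers) read this queue.
    publish("predictions", json.dumps(message))
    return message
```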
At every phase of this machine learning workflow, the artefacts and components log insights which are used for troubleshooting and continuous monitoring. By ensuring high coverage criteria of every component and functionality we can identify:
- When data is missing from a stage in the pipeline;
- When the input/output data changes, and how often;
- Whether a model is overfitting or underfitting;
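Two of these checks can be sketched as simple metric functions: a missing-data rate per pipeline stage, and a coarse over/underfitting signal from the gap between training and validation scores. The thresholds are illustrative defaults, not Stratio's actual alerting rules:

```python
def missing_rate(batch, field):
    """Fraction of records at a pipeline stage missing a required field;
    an alert fires when this exceeds a configured threshold."""
    if not batch:
        return 1.0  # an empty stage is itself an alert condition
    return sum(1 for record in batch if record.get(field) is None) / len(batch)


def fit_status(train_score, validation_score, max_gap=0.10, min_score=0.60):
    """Coarse over/underfitting signal from training vs. validation metrics."""
    if train_score - validation_score > max_gap:
        return "overfitting"   # memorising training data, not generalising
    if train_score < min_score:
        return "underfitting"  # the model cannot even fit the training set
    return "ok"
```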
Every model and component is deployed through packaging, containerisation, and orchestration techniques using CI/CD pipelines.
- Each component and functionality present in the Data Collection, Feature Engineering, Training, or Prediction phases includes code distributed as internal Python and .NET Core packages;
- The packaging release is done by Jenkins pipelines that run all unit and integration tests in the process, and are responsible for releasing and incrementing the package version;
- The models are packed into a single Docker image with all the required dependencies;
- The model deployment consists of a Kubernetes YAML deployment file that gathers the model Docker image and ensures its availability by using replicas in a distributed cluster. The deployment can also be scaled by adding more instances of each model Pod;
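A deployment file of the kind described above might look like the following sketch. All names, the image path, and the port are hypothetical placeholders, not Stratio's actual manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fault-detection-model        # hypothetical model name
spec:
  replicas: 3                        # availability via replicated Pods in the cluster
  selector:
    matchLabels:
      app: fault-detection-model
  template:
    metadata:
      labels:
        app: fault-detection-model
    spec:
      containers:
        - name: model
          # Single Docker image containing the model and all its dependencies
          image: registry.example.com/models/fault-detection:v1
          ports:
            - containerPort: 8080
```

Scaling a model then amounts to raising `replicas` or letting a horizontal autoscaler do so.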
At Stratio we are constantly looking for new ways to improve our processes, components and infrastructure, especially in this field where many new tools and approaches keep emerging in the industry. In fact, we have already started to research new initiatives including:
- Model orchestration using Airflow or Kubeflow, to enable the creation of DAGs for each model phase;
- Kafka Streams or Apache Spark Streaming for deeper stream processing;
- Integration of Feature Stores like Feast and Hopsworks;
Last but not least, KUDOS to the entire Research and Software team for supporting this initiative and for their commitment to this project!