Machine Learning from P.O.C to Production
Part 1: ML product global architecture and definitions
ML-based products have been multiplying over the years, but they all need to begin with a Proof of Concept (P.O.C) phase to demonstrate their feasibility, their impact on customers' lives, or their potential to increase revenue. Once the project has proven its usefulness, you enter the tiresome journey of industrialisation. Many companies stop their ML projects at this step, either because they underestimate the effort required to obtain a stable product, or because they are disappointed by the drop in performance (compared to the theoretical performance promised by the P.O.C).
Throughout this series of articles, you will find an explanation of the causes of this decline, along with the solutions and preventive measures you can apply, whether through concepts, methods or roles in your project.
This first article is an introduction to the concepts and vocabulary needed before moving on to the next parts, which will detail the problems and solutions across the production life cycle.
- Features: raw data transformed into more meaningful information to feed the prediction model. A feature can be an aggregation of one or more fields from one or more tables in a database. Example: the mean price of products over the last 3 months, corrected for inflation
- Label: the information we want to predict (used for supervised training; we will not go into detail on other fields such as unsupervised or semi-supervised training). Example: the risk associated with a bank loan
You can represent this information as a table, with one row per observation, one column per feature, and a column for the label.
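As a minimal illustration (the column names and values are hypothetical), such a table can be built with pandas, one row per bank-loan applicant:

```python
import pandas as pd

# Feature columns feed the model; the label column is what we want to predict.
dataset = pd.DataFrame({
    "mean_price_3m": [120.5, 87.0, 240.0],   # feature: inflation-corrected mean price
    "n_purchases_3m": [14, 3, 31],           # feature: purchase count over 3 months
    "label_risky_loan": [0, 1, 0],           # label: 1 = risky loan, 0 = safe
})
print(dataset)
```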
- Feature importance: an ML model associates one or more weights with each feature (that's the training part). Depending on its weight, each feature has more or less influence on the model's output. This can usually be measured, and it is used for the explainability of the model ("why did it make this decision?")
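A quick sketch of how feature importance can be read from a trained model, here using scikit-learn's RandomForestClassifier on synthetic data (the feature names are made up, and the label is built to depend only on the first feature, so it should come out with the highest importance):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["mean_price_3m", "n_purchases_3m", "account_age_days"]

# Synthetic training set: the label depends only on the first feature.
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Importances sum to 1; a higher weight means more influence on the decision.
for name, weight in zip(feature_names, model.feature_importances_):
    print(f"{name}: {weight:.2f}")
```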
- Product: the name of the app/project or service you expose to your customer(s)
- ETL / ELT: "Extract Transform Load" or "Extract Load Transform" describe the order of steps in a data pipeline that transfers data from one source to another, applying transformations along the way
- Inference: the result of an ML model. Based on the features, the model returns one or more results (probabilities, quantities, etc.)
When you create a new ML product, you can split it into four parts:
- Data Pipeline: aggregates data from multiple sources (batch, streaming, internal or external) and exposes it in a data warehouse.
- Feature Store: the entity that handles everything about the features (feature pipeline, storage and serving). The feature pipeline takes the output of the data pipeline, i.e. data from the data warehouse, computes the features and stores them in a storage system (realtime or not)
- Model Service: the entity used to (re)train, deploy and serve your model, and to monitor its performance
- Your product: an API (possibly with a frontend) used to call your model, either for real-time predictions or to trigger batch predictions
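To make the feature-pipeline step concrete, here is a small sketch (the table and column names are illustrative) that turns raw transactions, as they might land in the warehouse after the data pipeline, into an aggregated feature ready to be written to the feature store:

```python
import pandas as pd

# Raw data as delivered by the data pipeline into the warehouse (hypothetical).
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "price": [10.0, 20.0, 5.0, 15.0],
})

# Feature pipeline: aggregate raw fields into a meaningful feature; in a real
# system this table would then be stored in the feature store (realtime or not).
features = (
    transactions.groupby("customer_id", as_index=False)
    .agg(mean_price=("price", "mean"))
)
print(features)
```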
Basically, when you create a product, you expose a service (API or frontend) to your users. Your product may do more than expose ML services, but for the ML services themselves you will encounter three types:
- On-demand prediction / real-time inference: the user sends information that triggers the API to return a prediction in real time. For example, the estimated delivery date on an e-commerce website.
- Batch prediction: the user triggers a job, which may take minutes or hours, to run inference on many cases and save the results in a data warehouse. This type of use case is more common in a company's internal products, such as marketing projects where you estimate a customer's score. Your marketing team triggers the job and reads the results from a frontend or dashboard connected to the data warehouse containing the job's output.
- A combination of batch and on-demand: sometimes you can run batch inference on a scheduled date and publish the results in a real-time database. The end users then only send a piece of information (such as a customer id) to retrieve, in real time, the specific result they need. This is helpful when you want real-time results with fast response times, but all the data must be available beforehand.
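The combined batch / on-demand pattern can be sketched in a few lines: a scheduled batch job writes predictions to a real-time store (a plain dict here as a stand-in for Redis or similar), and the user-facing call is just a fast lookup by customer id. The scoring function is hypothetical, only there to make the sketch runnable:

```python
# Stand-in for a realtime database; in production this would be
# a key-value store such as Redis or DynamoDB.
realtime_store = {}

def fake_model_score(customer_id):
    # Hypothetical scoring logic, purely for illustration.
    return (customer_id * 37) % 100 / 100

def batch_job(customer_ids):
    """Scheduled job: run inference for every customer and publish results."""
    for cid in customer_ids:
        realtime_store[cid] = fake_model_score(cid)

def on_demand_lookup(customer_id):
    """User-facing call: no inference at request time, just a fast lookup."""
    return realtime_store[customer_id]

batch_job([1, 2, 3])
print(on_demand_lookup(2))
```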
Roles in the project (with their impact on the ML parts):
- Data Scientist: designs the P.O.C by evaluating feasibility, studying the data useful to the project and its quality, and proposing one or more algorithms (or statistical methods) to answer the problem. One of their main tasks is to investigate and converse with the business side to understand the subtleties behind the data, in order to obtain a proper dataset to train, validate and test a model.
- Identify data sources and study their quality
- Define one or more algorithms to use
- Train the model and agree on performance thresholds with business people
- Study shift/drift of the features to decide when to re-train or change the model
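For the drift-study task, a common quick check is to compare a feature's distribution at training time with its recent production distribution, for example with a two-sample Kolmogorov-Smirnov test. A sketch using scipy on synthetic data (the 0.05 threshold is a conventional choice, not a rule):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Feature values seen at training time vs. in recent production traffic
# (the production distribution is deliberately shifted to simulate drift).
train_feature = rng.normal(loc=0.0, scale=1.0, size=2000)
prod_feature = rng.normal(loc=0.8, scale=1.0, size=2000)

result = ks_2samp(train_feature, prod_feature)
drifted = result.pvalue < 0.05  # low p-value: distributions likely differ
print(f"KS statistic={result.statistic:.3f}, drift detected: {drifted}")
```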
- Data Engineer: orchestrates the retrieval of raw data from one or more data sources (databases, messaging systems, etc.) to build a reliable working base for the ML engineer, and tests the quality of the raw data.
- Collect data identified by the data scientist and apply filters
- Orchestrate the data pipelines (most of the time to a data warehouse)
- Maintain and document the structures (schemas, data lineage, etc.)
- Responsible for raw data quality metrics
- Manage CI/CD of the data pipelines
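The "raw data quality metrics" responsibility can start as simple automated checks on each batch, for example null rates and expected columns. A pandas sketch with hypothetical table, columns and thresholds:

```python
import pandas as pd

# A raw batch as it might arrive from a source system (hypothetical data).
raw = pd.DataFrame({
    "customer_id": [1, 2, 3, None],
    "price": [10.0, None, 5.0, 7.5],
})

EXPECTED_COLUMNS = {"customer_id", "price"}
MAX_NULL_RATE = 0.3  # hypothetical tolerance per column

def quality_report(df):
    """Return per-column null rates and whether the batch passes the checks."""
    null_rates = df.isna().mean()
    ok = set(df.columns) == EXPECTED_COLUMNS and bool((null_rates <= MAX_NULL_RATE).all())
    return null_rates, ok

null_rates, ok = quality_report(raw)
print(null_rates)
print("batch accepted:", ok)
```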
- ML Engineer: also called MLOps engineer, this role is the wildcard of ML industrialisation.
- Create and orchestrate the ML pipelines as defined by the data scientist.
- Ensure scalability, reusability, composability and portability of the solution.
- Manage CI/CD/CT of the ML pipelines and the monitoring of the solution
- Responsible for the quality of the features and the monitoring of the model
- Ensure the maintenance and documentation of those tasks
The next part will explain the responsibilities of a Data Engineer in ML projects. Why the Data Engineer, you may ask? Well, an ML product is more than just a model in production. Data is the main driver of performance stability, and a Data Engineer's work has a direct impact on it.