A framework for evaluating ML automation
End-to-end ML platforms are the new kids on the block at many enterprises. They are early in maturity, but they already deliver impactful results through ML automation. So how do we measure a platform's effectiveness in driving ML automation?
I have found the levels used to describe self-driving autonomy to be a simple framework to understand and communicate, so let us follow a similar framework for ML automation. Note that teams can define these levels as they see fit; the following is an example only.
Level 0: No automation. Developers are heavily involved in babysitting the model lifecycle and every subsequent retraining run.
Level 1: Automated training/retraining. ML developers can configure runs of their pipeline triggered by specific events: a dataset change, or a schedule (monthly, weekly, daily, etc.). This is the equivalent of continuous integration.
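A Level 1 trigger configuration can be pictured as a small matching rule over incoming platform events. The following is a minimal sketch; the `Trigger`, `PipelineConfig`, and `should_run` names are illustrative, not any real platform's API.

```python
from dataclasses import dataclass

@dataclass
class Trigger:
    kind: str   # "schedule" or "dataset_change" (illustrative trigger kinds)
    spec: str   # cadence name, or the dataset being watched

@dataclass
class PipelineConfig:
    name: str
    triggers: list

def should_run(config: PipelineConfig, event: dict) -> bool:
    """Return True if an incoming event matches any configured trigger."""
    for t in config.triggers:
        if t.kind == "dataset_change" and event.get("dataset") == t.spec:
            return True
        if t.kind == "schedule" and event.get("schedule") == t.spec:
            return True
    return False

# A pipeline that retrains weekly, and also whenever its input dataset changes.
churn = PipelineConfig(
    name="churn-model",
    triggers=[Trigger("schedule", "weekly"),
              Trigger("dataset_change", "user_events")],
)

assert should_run(churn, {"dataset": "user_events"})   # new data arrived
assert should_run(churn, {"schedule": "weekly"})       # periodic retrain
assert not should_run(churn, {"schedule": "hourly"})   # unconfigured cadence
```

The point of the sketch is that at Level 1, the developer declares *when* to run; the platform owns the actual execution.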
Level 2: Automated model deployment. ML developers can trigger automated pipelines to evaluate, experiment, and deploy a trained model to serve production traffic. This is equivalent to continuous deployment.
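The core of a Level 2 pipeline is a promotion gate: a trained candidate serves production traffic only if it passes evaluation against the incumbent. This sketch reduces the gate to a single metric comparison; a real platform would run a full evaluation suite and likely a shadow or canary experiment first.

```python
def promote_if_better(candidate_auc: float, production_auc: float,
                      min_gain: float = 0.0) -> str:
    """Decide whether a freshly trained model should replace the one in
    production. `min_gain` guards against churn from noise-level wins."""
    if candidate_auc >= production_auc + min_gain:
        return "deploy"
    return "hold"

assert promote_if_better(0.91, 0.89) == "deploy"
assert promote_if_better(0.88, 0.89) == "hold"
assert promote_if_better(0.90, 0.89, min_gain=0.02) == "hold"
```

Chaining this gate after a Level 1 retraining trigger is what makes the pipeline "continuous deployment" rather than just continuous training.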
Level 3: Automated post-deployment lifecycle. Developers can add automation to the post-deployment lifecycle: monitoring, traffic sampling, labeling, and dataset generation.
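One common post-deployment building block is traffic sampling: deterministically select a small fraction of production requests to send for labeling, which then feeds the next dataset generation run. A minimal sketch, assuming hash-based sampling on a request ID (the function name and 1% rate are illustrative):

```python
import hashlib

def sample_for_labeling(request_id: str, rate: float = 0.01) -> bool:
    """Deterministically sample a fraction of production traffic.
    Hashing the request ID makes the decision reproducible: reruns and
    downstream jobs select exactly the same requests."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

sampled = [r for r in (f"req-{i}" for i in range(100_000))
           if sample_for_labeling(r, rate=0.01)]
assert 800 < len(sampled) < 1200          # roughly 1% of requests selected
assert sample_for_labeling("req-42") == sample_for_labeling("req-42")
```

Determinism matters here: a labeled-example pipeline that re-samples randomly on every run cannot be audited or reproduced.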
Level 4: Generic ML automation. The platform provides a generic automation system: developers can write custom automation for parts of the ML pipeline or for the whole of it, and CI/CD coverage of ML pipelines steadily increases. Note that a generic automation system can cover all the levels above (0–4).
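A generic automation system is essentially an event bus over lifecycle stages: one registration mechanism covers training, deployment, and post-deployment hooks alike. The sketch below is a toy version under that assumption; the stage names and decorator are hypothetical.

```python
from collections import defaultdict

# Registry mapping lifecycle stage -> custom automation callbacks.
_hooks = defaultdict(list)

def on(stage: str):
    """Decorator: register a developer-written hook for a lifecycle stage."""
    def register(fn):
        _hooks[stage].append(fn)
        return fn
    return register

def emit(stage: str, payload: dict) -> list:
    """Fire all hooks registered for a stage; return their results."""
    return [fn(payload) for fn in _hooks[stage]]

@on("dataset_updated")          # a Level 1 concern, expressed generically
def retrain(payload):
    return f"retraining on {payload['dataset']}"

@on("model_deployed")           # a Level 3 concern, same mechanism
def start_monitoring(payload):
    return f"monitoring {payload['model']}"

assert emit("dataset_updated", {"dataset": "user_events"}) == \
    ["retraining on user_events"]
assert emit("model_deployed", {"model": "churn-v2"}) == ["monitoring churn-v2"]
```

The usage above shows why this level subsumes the others: retraining triggers and post-deployment monitoring are just two hooks on the same bus.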
Now, coming back to the original question:
How do we measure the effectiveness of the platform?
- Maturity of ML pipelines: Start with the distribution of ML pipelines grouped by level of automation. Usually, levels 0 and 1 hold the bulk of pipelines at the beginning. The ML platform team can then set goals to shift the bulk of these pipelines to levels 1, 2, and 3.
- Model freshness: How old are your production models? ML automation should considerably reduce human effort and shorten the time to release models. This can be measured quite well.
- Track human-induced ML production errors: This is a lagging indicator that should go down over time.
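The first two metrics fall out of a simple pipeline inventory. A minimal sketch, assuming each pipeline records its automation level and its last model release date (the pipeline names and dates are invented for illustration):

```python
from collections import Counter
from datetime import date

# Hypothetical inventory of pipelines on the platform.
pipelines = {
    "churn":   {"level": 0, "released": date(2023, 1, 10)},
    "ranking": {"level": 1, "released": date(2023, 6, 1)},
    "spam":    {"level": 1, "released": date(2023, 7, 15)},
    "fraud":   {"level": 2, "released": date(2023, 8, 20)},
}

def maturity_distribution(pipelines: dict) -> Counter:
    """Count pipelines per automation level."""
    return Counter(p["level"] for p in pipelines.values())

def model_freshness_days(pipelines: dict, today: date) -> dict:
    """Age of each production model in days."""
    return {name: (today - p["released"]).days
            for name, p in pipelines.items()}

dist = maturity_distribution(pipelines)
assert dist[0] == 1 and dist[1] == 2      # bulk sits at levels 0-1 early on
fresh = model_freshness_days(pipelines, today=date(2023, 9, 1))
assert fresh["fraud"] == 12               # most automated, freshest model
assert fresh["churn"] > 200               # no automation, stale model
```

Tracking `dist` quarter over quarter, and `fresh` as a percentile, turns the framework into concrete goals for the platform team.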
If you are building ML automation, start with an evaluation framework with clear metrics.