Continuous Learning for safer and better ML models
Software developers are avid users of Continuous Integration. Every time they push changes to their code repository, a testing pipeline gets triggered and runs a suite of checks: unit tests, functional tests, integration tests, asset builds, etc. Some teams even automate production deployments through these pipelines (Continuous Deployment).
CI/CD is the safety net that enables software teams to move fast in a safe way. If they accidentally break something, the CI pipeline will let them know in minutes and will not authorize shipping the unsafe changes to production.
ML models are increasingly used in safety-critical scenarios: robotaxis, industrial automation, fintech, healthcare, etc. They can provide tremendous value to businesses, consumers, and society at large, but they can also become dangerous liabilities when they fail. Imagine a robotaxi’s object detection model that suddenly stops recognizing pets, a healthcare vision model that mistakes healthy organs for cancerous tumors, or a facial recognition model that suddenly shows bias against certain facial features. That would be devastating for all parties involved.
How can safety mechanisms from software development be applied to Machine Learning development?
At Sematic – and in our past work at Cruise – we noticed two main dimensions along which ML training pipelines should be automated.
As any scientist knows, rigorous experimental results require changing only one system parameter at a time. In ML, those parameters are the data (the training dataset) and the code (data processing code, model training code, evaluation code, etc.).
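To make this concrete, here is a hypothetical sketch of two experiment configurations; the field names are purely illustrative and not tied to any specific tool. Each experiment varies exactly one of the two parameters against a pinned baseline:

```python
# Hypothetical experiment configs illustrating "change one variable at a time".
# The keys (dataset_version, code_ref) are illustrative assumptions.

baseline = {"dataset_version": "golden-v12", "code_ref": "main@abc123"}

# Code experiment: new code against the frozen golden dataset.
# Any metric delta is attributable to the code change alone.
code_experiment = {**baseline, "code_ref": "feature/new-augmentations@def456"}

# Data experiment: unchanged code against an updated dataset.
# Any metric delta is attributable to the new data alone.
data_experiment = {**baseline, "dataset_version": "golden-v12+mined-failures"}
```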
Regression Testing for safety and performance
Regression Testing in ML is very similar to functional testing in software. Developers establish a dataset of must-pass scenarios (the more, the safer) and continuously evaluate each newly trained model against it to ensure that no regressions were introduced.
Every time a code change is pushed (feature code, data processing code, training code, sampling, evaluation, etc.), the end-to-end training pipeline is triggered against a fixed reference dataset (sometimes called a “golden dataset”), and inferences are generated for each must-pass scenario. If all scenarios pass, no apparent regressions were introduced.
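In code, such a must-pass check can be as simple as the following sketch. The model interface and scenario schema are assumptions made for illustration, not a specific framework’s API:

```python
# Minimal must-pass regression check. `predict` and the scenario schema
# are illustrative assumptions.

def run_regression_tests(predict, golden_scenarios):
    """Return the must-pass scenarios the candidate model gets wrong."""
    return [s for s in golden_scenarios if predict(s["input"]) != s["expected"]]

# Toy stand-ins so the sketch runs end to end.
def toy_predict(x):
    return "pet" if "dog" in x else "other"

golden_scenarios = [
    {"input": "a dog crossing the street", "expected": "pet"},
    {"input": "a traffic cone", "expected": "other"},
]

failures = run_regression_tests(toy_predict, golden_scenarios)
if failures:
    # Fail the pipeline: the candidate model must not ship.
    raise RuntimeError(f"{len(failures)} must-pass scenario(s) failed")
```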
In practice, training pipelines can be quite expensive to run (e.g. GPU, networking), so it may be more practical to run Regression Testing pipelines at a fixed cadence (e.g. nightly) instead of upon each code push like a traditional CI pipeline.
Continuous performance improvement
Making sure that no regressions are introduced is not sufficient. We also must ensure the model improves over time by learning from its mistakes.
To that end, it is necessary to establish a feedback loop from production inferences back to the training dataset.
As models generate predictions (inferences), hopefully the majority are correct, but some likely are not. In an ideal world, these failures could easily be identified, labeled, and added to the training dataset so the model can learn from them. In reality, identifying incorrect inferences is challenging because there is no ground truth to compare them against. For example, if a translation model mistranslates a word, who or what will signal the error? Some systems enable human feedback (e.g. Google Translate has a Feedback link to report errors), while others leverage auto-labeling or error mining based on heuristics (e.g. the user did not click on any of the recommendations, or the robotaxi had to swerve around an obstacle at the last minute).
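As a minimal sketch of heuristic error mining, the snippet below flags recommendation sessions with zero clicks as candidates for human review; the session schema is invented for the example:

```python
# Heuristic error mining: sessions where the user clicked none of the
# recommendations are treated as suspect inferences. The schema is illustrative.

def mine_suspect_sessions(sessions):
    """Return sessions whose recommendations received no clicks at all."""
    return [s for s in sessions if s["recommendations"] and not s["clicks"]]

sessions = [
    {"user": "u1", "recommendations": ["a", "b"], "clicks": ["a"]},
    {"user": "u2", "recommendations": ["c", "d"], "clicks": []},  # suspect
]

# Suspect sessions are queued for labeling, not blindly trusted: a zero-click
# session is only a hint that the model may have failed.
for session in mine_suspect_sessions(sessions):
    print("send to labeling:", session["user"])
```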
These failed examples must be identified, labeled (to establish what the model SHOULD have predicted), and added to the canonical training dataset so that the model can learn the long tail of infrequent events that were not sufficiently represented in the original training data.
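Once labeled, the failures can be folded back into the canonical dataset, for example as sketched below. The names are illustrative, and a real system would produce a new versioned dataset snapshot rather than mutate data in place:

```python
# Merge labeled failure cases into the canonical training dataset, keyed by
# input so that a relabeled example replaces any stale copy.

def extend_dataset(canonical, labeled_failures):
    by_input = {row["input"]: row for row in canonical}
    for row in labeled_failures:
        by_input[row["input"]] = row  # fresh labels win over stale rows
    return list(by_input.values())

canonical = [{"input": "a traffic cone", "label": "other"}]
labeled_failures = [{"input": "a dog at dusk", "label": "pet"}]  # mined + labeled

new_dataset = extend_dataset(canonical, labeled_failures)
# The next training run now covers the long-tail case the model previously missed.
```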
Continuous Learning with Sematic
Sematic is an open-source Continuous Learning platform developed by ex-Cruise engineers. Sematic lets ML engineers develop end-to-end pipelines to implement their Regression Testing and Performance Improvement strategies.
Sematic pipelines automate all the steps required to go from labeled data sitting in a Data Warehouse to a fully trained and tested model ready for deployment. They can be used to retrain models on code changes and check them against must-pass scenarios (Regression Testing), or to automate error mining, labeling, and retraining on the expanded dataset (Continuous Performance Improvement).
Sematic provides a smooth transition from prototype code developed in, for example, Jupyter Notebooks, to end-to-end production pipelines, thanks to its Python-centric SDK and local-to-cloud development experience.
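For a flavor of what this looks like, here is a sketch in the spirit of Sematic’s SDK. The @sematic.func decorator is Sematic’s documented entry point; the step bodies are placeholders and the launch call may differ across Sematic versions, so treat this as a sketch rather than copy-paste code:

```python
import sematic

@sematic.func
def load_golden_dataset() -> list:
    # Placeholder: a real step would query the data warehouse.
    return [{"input": "a dog crossing the street", "expected": "pet"}]

@sematic.func
def train(dataset: list) -> dict:
    # Placeholder: a real step would launch training (possibly on GPUs).
    return {"weights": "..."}

@sematic.func
def regression_test(model: dict, dataset: list) -> bool:
    # Placeholder: evaluate against must-pass scenarios, as described above.
    return True

@sematic.func
def pipeline() -> bool:
    dataset = load_golden_dataset()
    model = train(dataset)
    return regression_test(model, dataset)

# Resolve locally during development; the same graph can be submitted to a
# cloud cluster (see Sematic's docs for the launch API in your version).
pipeline().resolve()
```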
Check us out at sematic.dev, star us on GitHub, and join us on Discord to discuss Continuous Learning.