Why your ML model is broken

Philippa Baliman
Datasparq Technology
4 min read · Mar 18, 2020

Where machine learning falls down… and a simple way to fix it.

ML and AI still have room for improvement

You’ve just deployed your brand new ML model to production; congratulations! Enjoy your success while it lasts, because it might be short-lived.

ML model performance degradation is an often overlooked but critical part of productionising data science solutions. We wishfully assume that, once deployed, we can wash our hands of our diligently constructed model and head to the pub for a few pints. But if you fail to plan for drops in performance, the hard work you put into building the model is called into question as it slowly stops delivering the desired results. With a little design forethought, these risks can be mitigated.

Model drift is most commonly encountered in predictive analytics, including machine learning models. In short, a model's performance will degrade as the data it makes predictions on diverges from the distribution of its training data. As your target data experiences natural fluctuation, or a more dramatic shift due to external factors, your model will no longer perform with its previous accuracy.

Training data is often more limited than we would like, and no matter how extensive your training set, live data will likely diverge from previous behaviour and display new patterns and trends not captured during training. Moreover, as your training data ages, your model simply will not have been trained to give the most accurate representation of future data. While not a catch-all solution, retraining your model on a wider range of more up-to-date data will help to alleviate these effects and hopefully maintain a model that makes accurate predictions given the most recent trends. Knowing when to retrain is, of course, a little more subjective; we use Kolmogorov–Smirnov (KS) tests, among a few other metrics, to determine whether model-generated features and results have drifted beyond some threshold from their expected range of values.
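
As a rough sketch of what that check can look like (the placeholder data and the 0.1 threshold here are purely illustrative, not our production settings), a two-sample KS test from scipy compares a feature's training distribution against recent live data:

```python
# Minimal drift check using a two-sample Kolmogorov–Smirnov test.
# The placeholder arrays stand in for values we would really pull from BigQuery.
import numpy as np
from scipy.stats import ks_2samp

def has_drifted(train_values, live_values, threshold=0.1):
    """Return True if the live distribution of a feature (or model output)
    has moved beyond `threshold` from its training distribution."""
    statistic, _p_value = ks_2samp(train_values, live_values)
    return statistic > threshold

rng = np.random.default_rng(42)
train_scores = rng.normal(0.0, 1.0, 5000)   # scores seen at training time
live_scores = rng.normal(0.3, 1.0, 5000)    # scores from the last month

if has_drifted(train_scores, live_scores):
    print("Feature has drifted — trigger a retrain")
```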

When one of our old models, in need of a few extra features and some TLC, fell into my hands, I was less than enthusiastic about the prospect of digging out archived scripts and starting from square one again. So to save myself the hassle in future, and with the help of the new Houston serverless orchestration API that’s fresh on the market, I built an easy-to-use model-retrain pipeline.

Orchestrating my pipeline with Houston required only minimal code and worked easily with GCP thanks to its platform-agnostic approach. At the push of a button, this pipeline will spin up VMs, run containerised retraining scripts, perform hyperparameter tuning and cross-validation, and even evaluate the expected performance of new models against the productionised version. Using custom VMs rather than a managed setup gave me maximum flexibility over parameters like memory, CPU and GPU capacity, and meant that I could instruct each VM to destroy itself upon job completion, avoiding cleanup work afterwards. Monitoring the success of each pipeline stage via the Houston dashboard gave us a greater overview of each run. For a little development overhead, this pipeline became an invaluable tool, allowing us to retrain models in parallel, quickly retrain on new features, and easily access model configuration parameters written automatically into storage.
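
To illustrate the self-destroying VM idea (this is a sketch of the pattern rather than the pipeline's actual code, and the training step is a placeholder), a retraining container can ask the GCE metadata server for its own name and zone and delete the instance once the job finishes:

```python
# Sketch of a retrain job that deletes its own VM on completion, so there is
# no cleanup afterwards. The metadata endpoints and gcloud command are
# standard GCE features; the training step is a placeholder.
import subprocess
import requests

METADATA = "http://metadata.google.internal/computeMetadata/v1/instance"
HEADERS = {"Metadata-Flavor": "Google"}

def run_retraining_job():
    """Placeholder for the containerised retrain script: train, tune,
    cross-validate and write results out to BigQuery / Cloud Storage."""
    ...

def self_destruct():
    """Delete the VM this code is running on (needs compute scope + gcloud)."""
    name = requests.get(f"{METADATA}/name", headers=HEADERS).text
    zone = requests.get(f"{METADATA}/zone", headers=HEADERS).text.rsplit("/", 1)[-1]
    subprocess.run(
        ["gcloud", "compute", "instances", "delete", name,
         f"--zone={zone}", "--quiet"],
        check=True,
    )

if __name__ == "__main__":
    run_retraining_job()
    self_destruct()
```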

The one tool that really helped to make this pipeline easily configurable was Houston. DataSparQ's latest market release is a lightweight, serverless orchestration tool which works across platforms to sequence and manage the dependencies between your pipeline stages. Using my retrain "plan", a simple JSON file outlining the stages in my pipeline, I could add or remove processes quickly, allowing me to train any number of models in parallel as needed. Houston made it easy to connect autonomous components in separate environments without the need for heavy-duty orchestration frameworks.
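
I won't reproduce the real plan schema here, but to give a feel for the idea, a stage-dependency plan for two models trained in parallel might be laid out something like this (the structure and field names are invented for illustration — see the Houston docs for the actual format):

```python
# Toy illustration of a stage/dependency plan written out as JSON.
# The field names are made up for this example and are not the real
# Houston plan schema.
import json

plan = {
    "name": "model-retrain",
    "stages": [
        {"name": "load-config"},
        {"name": "train-model-a", "after": ["load-config"]},
        {"name": "train-model-b", "after": ["load-config"]},
        {"name": "evaluate", "after": ["train-model-a", "train-model-b"]},
    ],
}

with open("retrain_plan.json", "w") as f:
    json.dump(plan, f, indent=2)
```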

Once built, the pipeline reads configuration details from storage, spins up VMs in GCE according to the Houston mission specification, and funnels results into BigQuery and Cloud Storage

The architecture I use to run retrains consists of:

  • a Cloud Build job to build and deploy my pipeline
  • a Cloud Pub/Sub trigger to kick off proceedings
  • model configuration JSON files in storage, specifying which features I want each model version trained on (a rough sketch of this config round-trip follows the list)
  • training VMs that are created and destroyed in Compute Engine, Google’s IaaS offering, writing hyperparameter and threshold choices back to the configuration files and train- and test-set predictions to BigQuery
  • evaluation VMs that compare each new model’s performance against any other models named in the specification, with the results written to both Cloud Storage and BigQuery for ease of access
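
Here's what that configuration round-trip might look like in a retrain VM; the bucket name, object path and config fields are hypothetical placeholders, and the training step itself is elided:

```python
# Sketch of a retrain VM reading its model-configuration JSON from Cloud
# Storage and writing hyperparameter/threshold choices back. Bucket name,
# object path and config fields are hypothetical.
import json
from google.cloud import storage

BUCKET = "my-retrain-configs"               # hypothetical bucket
CONFIG_PATH = "models/demand/config.json"   # hypothetical object path

def load_config():
    """Read the config that tells this VM which features to train on."""
    blob = storage.Client().bucket(BUCKET).blob(CONFIG_PATH)
    return json.loads(blob.download_as_text())

def save_config(config):
    """Write hyperparameter and threshold choices back to the same file."""
    blob = storage.Client().bucket(BUCKET).blob(CONFIG_PATH)
    blob.upload_from_string(json.dumps(config, indent=2),
                            content_type="application/json")

config = load_config()
features = config["features"]               # which features this version uses
# ... training, tuning and cross-validation happen here ...
config["hyperparameters"] = {"max_depth": 6, "learning_rate": 0.1}
config["threshold"] = 0.42
save_config(config)
```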

If you’re anything like me, spending time on retroactive development can feel incredibly frustrating. Using a pre-built pipeline and a little advance planning to automate model retraining will help to drastically reduce the time you would normally spend sifting through results in your editor and copy-pasting results into Excel sheets (not that I would ever be guilty of this… 👀). Investing a little extra time into a clean model-retrain solution is key to ensuring the long-term viability of your ML models, and will go a long way towards freeing up your time for other tasks… like all that documentation sitting in the backlog.

