The Future of Machine Learning Depends on Auditing Our Models

Rob Delaney
Machine Learning in Practice
5 min read · Aug 17, 2020

The best tools are born out of user necessity. AI Auditing was no different.

End-to-end machine learning follows this general workflow (albeit not in a purely linear fashion): data preparation → model building → deployment.

At Infinia ML, we help our clients orchestrate data pipelines, build models, and deploy them into production — specializing in text and document-related use cases.

But all the steps in that process are just a means to an end. The goal of any data science project should be to improve the business in some way — via predictive analytics, process automation, insights that power better decision making, etc.

As I’ve written before, a machine learning model is not like traditional software. With an ML model, you cannot just set it and forget it. Deployment might feel like an accomplishment, but when viewed through the lens of “improving the business,” deployment is really just the beginning! Once deployed, the business must use the model to generate value. More importantly, the owners of the model must maintain the validity and reliability of that model. Not everyone appreciates this until they’ve gone through the pain of putting models into production.

Once your models are live, you might start to face questions like:

  • How is my model performing in the aggregate? Is it running the way it was designed?
  • Are we seeing data or concept drift?
  • Is my model absorbing any unwanted bias?
  • Are my data sets reliable and robust?
  • Why is my model spitting out this prediction or recommendation?

These questions can be summarized with an overarching query: Is this thing working?
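
To make the drift question concrete, here is a minimal sketch of one common way to quantify input drift: the population stability index (PSI), which compares a live feature's distribution against its training baseline. This is an illustration, not our production tooling, and every number and threshold below is an assumption.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Compare a live feature distribution to its training-time baseline.

    PSI = sum((a% - e%) * ln(a% / e%)) over shared bins; values above
    roughly 0.2 are often treated as a sign of meaningful drift.
    """
    edges = np.histogram_bin_edges(expected, bins=bins)
    # Clamp live values into the baseline's range so every point is counted.
    actual = np.clip(actual, edges[0], edges[-1])
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor empty bins to avoid log(0).
    e_pct, a_pct = np.clip(e_pct, 1e-6, None), np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Illustrative only: a baseline feature vs. a subtly shifted live version.
rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)
live = rng.normal(0.3, 1.1, 10_000)
psi = population_stability_index(baseline, live)
print(f"PSI = {psi:.3f}" + ("  <-- investigate drift" if psi > 0.2 else ""))
```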

So our workflow now has a fourth major category: Business Impact.

It sounds obvious, but with 80% of ML/AI stuck in the science-project phase, most businesses have not yet reached the point where they need to (or even think to) ask the questions that add up to business impact. Today, most data science coding is done in a present-focused mindset (trial and error, solve the problem now), with little thought to what things will look like in production. In addition, the way ML/AI is typically described (e.g. “the algorithm reads the contract”) does not reflect what actually happens in production, making the true risks and complexities easy to miss. But as the industry matures and more use cases reach the real world, questions about performance, reliability, and bias grow more important every day.

The issue of model “explainability” is well documented. Some AI systems are hard to understand. Advanced algorithms, particularly deep neural networks, are essentially black boxes, and it will be a long time before they are truly interpretable. However, there is a lot one can discern about performance and reliability (of both data and models) today, no matter how complex the underlying algorithms may be. That’s because all models have inputs, outputs, and mechanics that can be scrutinized.

In truth, many data science models are built on some stakeholder’s human intuition about which variables are predictive or important. Those human choices are understandable and auditable. Likewise, a model’s underlying performance and mechanics can be validated against the stakeholders’ intuitions and assumptions.
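
As a sketch of what that validation can look like, here is one way to compare a model's learned behavior against stakeholder intuition using scikit-learn's permutation importance. The model, feature names, and list of "expected drivers" are all hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Stand-in data; in practice these are the real features stakeholders chose.
X, y = make_classification(n_samples=2_000, n_features=5, n_informative=2,
                           random_state=0)
feature_names = ["tenure", "spend", "region", "age", "channel"]  # hypothetical
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)

# Suppose stakeholders believe these two variables drive the outcome.
expected_important = {"tenure", "spend"}
ranked = sorted(zip(feature_names, result.importances_mean),
                key=lambda pair: -pair[1])
top_two = {name for name, _ in ranked[:2]}
print("Model's top features:", top_two)
if not expected_important <= top_two:
    print("Mismatch with stakeholder intuition -- worth a closer audit.")
```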

Auditing AI systems requires understanding of the following attributes:

1. Model KPIs

Statistics on how the model’s results are directly impacting the business. For example: number of emails read → emails flagged → time and money saved on review. In the heat of ML complexity, it can be easy to forget the goals behind deploying the model in the first place. Model KPIs demonstrate performance against those goals.
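
Here is a toy version of that email funnel; every figure below is an assumption for illustration, not a real client number:

```python
# Hypothetical monthly funnel for an email-triage model.
emails_read = 12_000              # emails the model processed
emails_flagged = 900              # emails surfaced for human review
minutes_saved_per_email = 4       # assumed manual triage time avoided
hourly_review_cost = 45.0         # assumed fully loaded reviewer cost, $/hr

auto_cleared = emails_read - emails_flagged
hours_saved = auto_cleared * minutes_saved_per_email / 60
dollars_saved = hours_saved * hourly_review_cost

print(f"read={emails_read}  flagged={emails_flagged}  cleared={auto_cleared}")
print(f"hours saved={hours_saved:,.0f}  value=${dollars_saved:,.0f}")
```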

2. Model Diagnostics

Continuous assessment of model health based on prediction metrics, and alerts to model owners about any unexpected changes.

Are the models constructed appropriately for the task at hand? Are they functioning as they were intended?

Note that we rate this the hardest attribute to audit; model diagnostics requires nuanced effort from skilled data scientists.
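
As a minimal sketch of one such recurring diagnostic, the check below compares the latest window of live prediction scores against a training-time baseline and alerts on a significant shift. The two-sample Kolmogorov-Smirnov test is just one reasonable choice, and all scores here are synthetic:

```python
import numpy as np
from scipy.stats import ks_2samp

def diagnostics_check(baseline_scores, live_scores, alpha=0.01):
    """Flag an unexpected shift in the model's prediction distribution."""
    stat, p_value = ks_2samp(baseline_scores, live_scores)
    return {"ks_statistic": float(stat), "p_value": float(p_value),
            "alert": p_value < alpha}

# Synthetic scores: training-time baseline vs. the last 24 hours of traffic.
rng = np.random.default_rng(1)
baseline = rng.beta(2, 5, 5_000)
live = rng.beta(2, 3, 2_000)  # the live distribution has shifted

report = diagnostics_check(baseline, live)
if report["alert"]:
    # In production this might page the model owner or open a ticket.
    print(f"ALERT: prediction shift suspected (p = {report['p_value']:.2e})")
```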

3. Model Governance

Usage statistics, compute consumption, user logs, container security, and information security anomalies.
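
One common way to build that governance trail is a structured audit record per prediction request. A sketch, with purely illustrative field names and values:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
audit_log = logging.getLogger("model_audit")

def log_inference(user_id, model_version, latency_ms, cpu_ms):
    """Emit one structured audit record per prediction request."""
    audit_log.info(json.dumps({
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user": user_id,                 # who called the model
        "model_version": model_version,  # which artifact served the request
        "latency_ms": latency_ms,        # service health
        "cpu_ms": cpu_ms,                # compute consumption
    }))

# A hypothetical request.
log_inference(user_id="analyst-42", model_version="contract-clf-1.3.0",
              latency_ms=87, cpu_ms=41)
```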

4. Data Pipeline(s) Auditing

Identification of data sources and schema, as well as how each piece of data is pre-processed before reaching the model.

Tracking of all changes made to the data pipeline to pinpoint the sources of errors.
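
Two primitives that help here, sketched below with hypothetical column names: validating incoming data against the schema the model expects, and fingerprinting each stage's output so an unexpected change can be traced to the stage that introduced it.

```python
import hashlib
import pandas as pd

# Hypothetical schema the model was trained against.
EXPECTED_SCHEMA = {"doc_id": "object", "text": "object", "page_count": "int64"}

def validate_schema(df):
    """Fail fast if the source data no longer matches what the model expects."""
    actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
    if actual != EXPECTED_SCHEMA:
        raise ValueError(f"Schema drift: expected {EXPECTED_SCHEMA}, got {actual}")

def fingerprint(df):
    """Stable hash of a stage's output; a changed hash pinpoints the stage
    where the pipeline's behavior changed."""
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()[:12]

# Hypothetical ingest stage.
df = pd.DataFrame({"doc_id": ["a1"], "text": ["hello world"], "page_count": [3]})
validate_schema(df)
print("stage=ingest fingerprint:", fingerprint(df))
```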

At Infinia ML, we tackle these four categories via a combination of software and managed services. For example, here are screenshots of Model KPIs and Model Diagnostics from the Infinia ML Auditor Dashboard:

Sample Performance KPIs
Sample Model Diagnostics Output: drift of samples based on out-of-vocabulary fractions, using the Anderson-Darling test.
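
To unpack that caption: the idea is to compute each document's out-of-vocabulary (OOV) fraction, i.e. the share of its tokens the training vocabulary has never seen, and then test whether production documents draw from the same OOV distribution as the baseline. Below is a rough reconstruction of that check with a toy vocabulary and synthetic documents, not our actual pipeline:

```python
import numpy as np
from scipy.stats import anderson_ksamp

VOCAB = {"contract", "party", "term", "agreement", "clause"}  # toy vocabulary

def oov_fraction(doc):
    """Share of a document's tokens that the training vocabulary never saw."""
    tokens = doc.lower().split()
    return sum(t not in VOCAB for t in tokens) / max(len(tokens), 1)

def make_docs(rng, n, novel_rate):
    """Generate toy documents where ~novel_rate of tokens fall outside VOCAB."""
    known, novel = list(VOCAB), ["crypto", "token", "wallet", "airdrop", "nft"]
    return [" ".join(rng.choice(novel) if rng.random() < novel_rate
                     else rng.choice(known) for _ in range(20))
            for _ in range(n)]

rng = np.random.default_rng(0)
baseline = np.array([oov_fraction(d) for d in make_docs(rng, 200, 0.05)])
production = np.array([oov_fraction(d) for d in make_docs(rng, 200, 0.20)])

# k-sample Anderson-Darling test: do both OOV samples share one distribution?
result = anderson_ksamp([baseline, production])
print(f"A-D statistic: {result.statistic:.2f}, "
      f"approx p: {result.significance_level:.3f}")
if result.significance_level < 0.05:
    print("Vocabulary drift suspected: the model is seeing unfamiliar language.")
```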

Every model and data pipeline has nuances, so there is no automated, 100% software-defined method for doing this. However, software can do a lot of the work. And when software is combined with data science expertise to fill in the gaps, we can make major progress in evaluating the efficacy and objectivity of our AI/ML systems.

As an industry, we have made incredible progress along the AI journey — from data prep to deployment — and we are starting to see impact everywhere in the economy. But in the long run, nothing will be more important than the tools and methodologies we use to evaluate and assess our models. Model auditing is an existential question for the industry. At Infinia ML, we are doing our part to help answer it.
