Pillars of Trust in AI: Measure & Quantify Your Machine Learning Practices

Alec Lazarescu
Published in Bots + AI
8 min read · May 30, 2019


Join thousands of other forward-thinking leaders learning together at Bots & AI events at https://www.meetup.com/Bots-and-Artificial-Intelligence/

David Piorkowski, a researcher at IBM, began by noting that AI is exploding and being incorporated into every domain.

Knowledge workers spend about 80% of their time searching for and preparing data, not innovating with data science and AI.

At IBM, teams are tackling a number of problems in the AI space. Examples include scanning regulatory documents to ensure that businesses are compliant, identifying strategic initiatives by scouring articles, news, and publications in the media world, developing chatbots that help people determine the right insurance policy for their particular needs, and helping doctors in Nigeria identify breast cancer more accurately than they could by hand. It’s a long list.

At the end of the day, AI is being used in industry essentially for three purposes:

* Increasing the value of the business

* Decreasing the cost of doing business

* Enabling you to do something new and different that hasn’t been done before

What causes AI outcomes to come into question?

AI techniques have been around a long time, but lately a lot of attention is being paid to the stellar performance of deep learning, a subset of ML, which in turn is a subset of AI. David used the example of teaching a computer to identify a horse using neural networks. The machine learns by running the process thousands and thousands of times and correcting errors along the way. At the end of this, the machine is able to pick out clusters of pixels in its hidden layers that identify the animal. The machine does not inherently know the difference between a monkey and a horse; it learns to make the distinction through supervised or reinforcement learning algorithms. The problem with neural networks, and deep learning in particular, is that this mastery is not similar to the way humans understand or process information. Neural networks don’t capture abstract reasoning, so there is a gap between what people think about when classifying images and what machines “think about” when doing the same.
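
To make that “run the process and correct errors” loop concrete, here is a minimal sketch of supervised image classification in PyTorch. It is not IBM’s code; the data, labels, and architecture are hypothetical placeholders.

```python
# Minimal sketch of supervised image classification (illustrative only).
# The data, labels, and network architecture are hypothetical placeholders.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Pretend dataset: 64x64 RGB images labeled 0 = "horse", 1 = "not horse"
images = torch.randn(512, 3, 64, 64)
labels = torch.randint(0, 2, (512,))
loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)

# A small convolutional network; the hidden layers learn clusters of pixel features
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 2),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Repeatedly run the process and correct errors along the way
for epoch in range(5):
    for batch_images, batch_labels in loader:
        optimizer.zero_grad()
        logits = model(batch_images)
        loss = loss_fn(logits, batch_labels)  # how wrong was the guess?
        loss.backward()                       # propagate the error backwards
        optimizer.step()                      # nudge the weights to do better
```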

The explosion of deep learning can be attributed to two major factors: the development of faster GPUs that are easy to program against, and easy access to the huge volumes of data needed to train these algorithms. Research shows that as of 2015, machines do better than humans at classifying images in aggregate. Humans can lose focus and make mistakes, choosing incorrect tags or leaving tags out, and in aggregate that outweighs the more sensational “chihuahua or muffin” style errors that machines make. This is both frightening and exciting, depending on your perspective.

As machines get better at processing information and as algorithms improve, they are embraced by various domains to make important decisions that affect human lives: determining a person’s creditworthiness, setting the limits in their life insurance quotes, or informing how a judge sentences them. It becomes important to explain what the neural network does so that decision makers using such an instrument can do so confidently. This is why trust in AI is important.

In recognition of this, the researchers at IBM have been working on the four pillars of trust: Fairness, Explainability, Robustness and Assurance.

Fairness / Bias

Fairness is often discussed in terms of bias. This “statistical” bias is not the same as human bias, but it is influenced by human activity. For example, when you ingest data into a model, it is highly likely that the model becomes biased in some way depending on that data: it might skew a little too much towards one category versus another. David went through a few examples from popular media, such as Microsoft’s Tay bot, to show where concerns about fairness come from. Bias is essentially a form of statistical discrimination that we as a society make an effort to discourage.

IBM developed the open source AI Fairness 360 toolkit, which gives AI/ML developers over 30 fairness metrics and 9+ algorithms that help identify bias in data, and it also includes some cleanup tools. More information about these tools and metrics is available here: http://aif360.mybluemix.net.
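
As a rough illustration of how the toolkit is typically used, here is a sketch based on the AIF360 tutorial material: compute a fairness metric on a labeled dataset, then apply a mitigation algorithm. The German credit dataset, the ‘sex’ attribute, and the group encodings are just the standard example, not a recommendation; check the current API and note that the dataset file must be downloaded separately.

```python
# Sketch of checking and mitigating bias with AI Fairness 360 (verify against
# the current AIF360 docs; the German credit dataset and 'sex' attribute are
# the standard tutorial example, and the data file must be downloaded first).
from aif360.datasets import GermanDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing

dataset = GermanDataset()          # 'sex' is a default protected attribute
privileged = [{'sex': 1}]          # male is encoded as the privileged group
unprivileged = [{'sex': 0}]

# One of the 30+ metrics: difference in favorable-outcome rates between groups
metric = BinaryLabelDatasetMetric(dataset,
                                  unprivileged_groups=unprivileged,
                                  privileged_groups=privileged)
print("Mean difference before mitigation:", metric.mean_difference())

# One of the mitigation ("cleanup") algorithms: reweigh examples so outcomes
# become independent of the protected attribute before training a model
rw = Reweighing(unprivileged_groups=unprivileged, privileged_groups=privileged)
transformed = rw.fit_transform(dataset)

metric_after = BinaryLabelDatasetMetric(transformed,
                                        unprivileged_groups=unprivileged,
                                        privileged_groups=privileged)
print("Mean difference after mitigation:", metric_after.mean_difference())
```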

Explainability

Explainability is about how you help a person reason about an algorithm, and what you should be thinking about when presenting it. The form an explanation takes depends on its consumer; that is, it must match the consumer’s capacity for complexity and their domain knowledge. There are many ways to explain things, and explanations can be roughly bucketed into 3 categories (a small sketch of a local, post hoc explanation follows the list):

  1. directly interpretable versus post hoc interpretation (with the help of a companion model to the black box model)
  2. global (at the model level) versus local (at the level of a specific instance of prediction)
  3. static versus interactive with visual analytics (the user can interact with the model)
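
As a small illustration of a post hoc, local explanation, the sketch below fits a simple linear “companion” model around a single prediction of a black-box model. This is a LIME-style idea added here for concreteness, not one of IBM’s methods, and it uses synthetic data with made-up feature names.

```python
# Sketch of a post hoc, local explanation: fit a simple linear companion model
# around one prediction of a black-box model (LIME-style; illustrative only,
# with synthetic data and hypothetical feature names).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
feature_names = ["income", "debt", "age", "tenure"]  # hypothetical
X = rng.normal(size=(1000, 4))
y = (X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=1000) > 0).astype(int)

black_box = RandomForestClassifier(n_estimators=100).fit(X, y)

# Pick one instance to explain (local, not global)
x0 = X[0]

# Perturb the instance, query the black box, and weight samples by proximity
perturbed = x0 + rng.normal(scale=0.5, size=(500, 4))
probs = black_box.predict_proba(perturbed)[:, 1]
weights = np.exp(-np.linalg.norm(perturbed - x0, axis=1) ** 2)

# The companion model's coefficients act as the local explanation
surrogate = Ridge(alpha=1.0).fit(perturbed, probs, sample_weight=weights)
for name, coef in zip(feature_names, surrogate.coef_):
    print(f"{name}: {coef:+.3f}")
```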

An example of the ongoing research at IBM in this space is the digital visualization lab, which helps visualize the decisions a neural network makes, with clickable details at every node. Such visualizations bring the model a little closer to how a human might understand it.

Robustness

In the interest of time, the speaker was unable to address this pillar, but further detail on robustness concerns can be found in this article.

Assurance

Assurance is about establishing an industry standard for models to follow, to assure people that things are being done well, correctly, or in alignment with some agreed norm. (David’s team works primarily on this at IBM.) David compared this to nutritional facts labels for food; IBM’s version of nutritional labels for AI models is called FactSheets. FactSheets depend on the service, application and/or user, but they try to cover the following types of questions (a rough sketch of how such answers might be captured as data follows the list):

  • What is the intended use of the service output?
  • What algorithms or techniques does the service implement?
  • Which datasets was the service trained/tested on?
  • Describe the testing methodology and results.
  • How was the model trained, and were any steps taken to protect the privacy or confidentiality of the training data?
  • Are you aware of possible examples of bias, ethical issues, or safety risks as a result of using the service?
  • Does the service implement and perform any fairness checks, bias detection, or bias mitigation?
  • What is the expected performance on data with different distributions?
  • Was the service checked for robustness against adversarial attacks?
  • When was the service last updated?
  • What are the recommended and not-recommended uses of the service?
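
As a rough sketch of how answers to these questions might be captured as structured data, consider something like the following. This is my own illustration, not IBM’s FactSheets schema; the field names and example values are hypothetical.

```python
# Illustrative sketch of recording FactSheet-style answers as structured data.
# Field names and example values are hypothetical, not IBM's schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FactSheet:
    intended_use: str
    algorithms: List[str]
    training_datasets: List[str]
    test_methodology: str
    privacy_measures: str
    known_bias_or_safety_risks: str
    fairness_checks: str
    robustness_checks: str
    last_updated: str
    recommended_uses: List[str] = field(default_factory=list)
    not_recommended_uses: List[str] = field(default_factory=list)

example = FactSheet(
    intended_use="Flag loan applications for manual review",
    algorithms=["gradient-boosted trees"],
    training_datasets=["internal_applications_2017_2018"],
    test_methodology="Stratified holdout; AUC and recall reported per group",
    privacy_measures="Personally identifying fields removed before training",
    known_bias_or_safety_risks="Under-represents applicants under 25",
    fairness_checks="Disparate impact monitored per release",
    robustness_checks="Not yet evaluated against adversarial inputs",
    last_updated="2019-05-01",
    recommended_uses=["Decision support with human review"],
    not_recommended_uses=["Fully automated rejection of applicants"],
)
```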

These FactSheets will evolve differently in different domains, but the hope is that they will lead to an ecosystem of third party testing, verification labs, services and tools — similar to the Internet Certificate Model which became a “trust anchor” for the internet economy.

Check out Watson Studio (https://www.ibm.com/cloud/watson-studio) and Watson OpenScale (https://www.ibm.com/cloud/watson-openscale/) for more on this topic.

Traditional software lifecycle vs AI model updates

David then discussed the parallels and differences between traditional software systems and the AI lifecycle. Here are some of the highlights:

  1. The model component adds a layer of complexity to the traditional software lifecycle — it involves data preparation, modeling and if you’re really lucky you have a chance to iterate and update that model.
  2. Unit testing is challenging in the AI world — with each update, the model could spit out completely different predictions, and it could get better or worse (see the sketch of a threshold-based model test after this list).
  3. You have to contend with different and competing metrics and constraints than in traditional software engineering — accuracy, precision, recall, coverage, fairness, explainability, robustness, etc., to name a few.
  4. Testing is completely data driven (a model’s performance is as good as the data it is trained on) — and it is constantly evolving as the data is extended.
  5. There are two very different teams involved, and they often use competing processes — the data scientists are focused on improving and deploying the models, not necessarily integrating them into an application, while the software engineers have their own lifecycle. But there is a very strong dependency between the two that has to be managed to mitigate risk.
  6. Business guardrails for using models are an important consideration — for example, don’t use this model to predict credit worthiness if the individual is under 18.
  7. Deployment of models differs from deployment of traditional software — the definition of “better” in a statistical system is based on multiple metrics, and an update can go beyond simple version replacement to involve different ensemble models.
  8. Models are very sensitive to concept drift… as the audience, the environment and the data changes, the model must take those changes into account.
  9. Trust is a moving target as models continuously change — each change has the potential to have a widespread impact.
  10. Similar to DevOps in software, model deployments require automation, artifact tracking and rigor.
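
As a small sketch of what a threshold-based “unit test” for a model update and a business guardrail (points 2 and 6 above) might look like, consider the following. This is my own illustration; the function names, metrics, and thresholds are hypothetical.

```python
# Sketch of threshold-based checks for a model update plus a business guardrail
# (hypothetical names, metrics, and thresholds; illustrative only).
from sklearn.metrics import accuracy_score, recall_score

ACCURACY_FLOOR = 0.90   # the updated model must stay above this on held-out data
RECALL_FLOOR = 0.85     # a competing metric tracked alongside accuracy

def test_model_update(new_model, X_holdout, y_holdout):
    """Exact predictions may change with every retrain, so assert on
    aggregate metrics rather than on individual outputs."""
    preds = new_model.predict(X_holdout)
    assert accuracy_score(y_holdout, preds) >= ACCURACY_FLOOR
    assert recall_score(y_holdout, preds) >= RECALL_FLOOR

def predict_creditworthiness(model, applicant):
    """Business guardrail: refuse to score applicants under 18,
    regardless of what the model would say."""
    if applicant["age"] < 18:
        raise ValueError("Model is not approved for applicants under 18")
    return model.predict([applicant["features"]])[0]
```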

It was yet another interesting talk with a very engaged audience asking a lot of questions. Thank you to David for being a great speaker and providing lots of material to read after the talk. Articles included in David’s presentation have been compiled below for your easy reference.

We are also able to share his presentation, and you can access it here: David Piorkowski’s in-depth slides

Join thousands of other forward-thinking leaders learning together at Bots & AI events at https://www.meetup.com/Bots-and-Artificial-Intelligence/

Huge thanks to Bots and AI’s team member for her work on this summary! Want to join our team? E-mail humans@botsandai.com

Reference links:

Examples of bias in AI in popular media

Sentiment analysis: https://motherboard.vice.com/en_us/article/j5jmj8/google-artificial-intelligence-bias

Image recognition: https://www.cbsnews.com/news/google-photos-labeled-pics-of-african-americans-as-gorillas/

Recruiting: https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G

Chatbots: https://www.npr.org/2016/03/27/472067221/internet-trolls-turn-a-computer-into-a-nazi

Recidivism assessment: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Word embeddings: https://arxiv.org/abs/1607.06520

Fairness

21 definitions of fairness and more (the need for an AI toolbox): https://www.youtube.com/watch?v=jIXIuYdnyyk

IBM AI Fairness 360

Web experience: http://aif360.mybluemix.net/

Code: https://github.com/IBM/AIF360

Paper: https://arxiv.org/abs/1810.01943

IBM ongoing work: Fairness In AI

Analyze, Detect and Remove Gender Stereotyping from Bollywood Movies: http://proceedings.mlr.press/v81/madaan18a/madaan18a.pdf

FairnessGAN: https://arxiv.org/abs/1805.09910

Interpretable Multi-Objective Reinforcement Learning through Policy Orchestration: https://arxiv.org/pdf/1809.08343.pdf

Examples of 2018 Explainability Innovations by IBM Research

Improving simple models with confidence profiles (NIPS 2018): https://papers.nips.cc/paper/8231-improving-simple-models-with-confidence-profiles

TED: Teaching AI to explain its decisions (AIES 2019): http://www.aies-conference.com/wp-content/papers/main/AIES-19_paper_128.pdf

Boolean decision rules via column generation (NIPS 2018): https://papers.nips.cc/paper/7716-boolean-decision-rules-via-column-generation

Variational inference of disentangled latent concepts from unlabeled observations (ICLR 2018): https://openreview.net/forum?id=H1kG7GZAW

Explanations based on the missing: towards contrastive explanations with pertinent negatives (NIPS 2018): https://arxiv.org/abs/1802.07623

Seq2Seq-Vis: A Visual Debugging Tool for Sequence-to-Sequence Models (IEEE VAST 2018): https://arxiv.org/abs/1804.09299

IBM researchers propose FactSheets for AI transparency: https://arxiv.org/abs/1808.07261
