Fighting off ML Bias

Robert Sibo
Slalom Data & AI
14 min read · Jan 27, 2021

Artificial Intelligence, or more accurately Machine Learning, has spread across industries and business functions, granted with vast differences in maturity and success. What we’re seeing is that returns come from organizations that have built innovative models, built up an ML and Data Ops foundation to support continuous improvement, and shaped a business model that can leverage it appropriately. With more and more companies doing this, we are also seeing that not all models are equal, and in some cases there is a risk of legal or brand damage if models show signs of bias or unfairness.

“My bet for the next 10 years is that the most competitive businesses will be Responsible AI-driven companies.”

— Lofred Madzou, AI project lead at the World Economic Forum

This paper assumes you have the data scientists, ML engineers, MLOps, and everything else in place to productionize your models. You’ve implemented a few critical models, or a system of many super-focused models working in conjunction, but you’re up at night fearing that not much is in place to ensure ethical and fair results from the model, that you could be managing a time bomb if some governance isn’t put in place, and wondering, “what can be done?”

Image 1: Typical Sources of Bias in ML Systems

Bias in machine learning, and by loose extension unethical usage of it, typically stems from five areas:

  • Bias in the Organization — Is the business, industry, or culture already full of bias or discrimination in the way things operate? One may not even know if this is the case, but in areas such as policing or college admissions there have been enough studies to show inherent biases exist. These should be dealt with before an ML system is built up around them; otherwise they will be replicated, and typically amplified, by the model.
  • Bias in the Problem — Is the problem defined in a way that will consciously or unconsciously discriminate? For example, excluding mortgage applicants because they’re not using the website and didn’t approve sharing PII. A model might prioritize “safer” digital candidates, where it has a complete view of the applicant’s personal data and past transactions, over ones that submit a paper application in a branch.
  • Bias in the Data — Is there bias in the training data stemming from sampling or collection issues? Profile it and interrogate it along potential risk dimensions (e.g. gender or race) to look for imbalances or outliers that could lead to bias in the trained ML system (a minimal profiling sketch follows this list).
  • Bias in the Model — Is there trained bias caused by the model’s design or tuning, beyond what was inherited from the areas above? Some algorithms are easier to peek into than others, but in general ML models are treated as black boxes, making a review of the inner workings impractical.
  • Model Misuse & Incorrect Generalization — Was the model extended to use cases or data sets it was not intended for during training? The data describing how the world works changes over time, so a model that once operated as intended can start producing incorrect and biased results. These issues cause problems because most ML models are narrow by design and don’t generalize well as the underlying data and use cases change.
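
To make the “Bias in the Data” check concrete, here is a minimal profiling sketch in Python. The DataFrame and the gender/approved columns are hypothetical placeholders; the point is simply to look at group representation and label balance before any model is trained.

```python
import pandas as pd

# Hypothetical training data; the column names are placeholders.
df = pd.DataFrame({
    "gender":   ["F", "M", "M", "F", "M", "M", "F", "M"],
    "income":   [52, 61, 48, 39, 75, 58, 44, 66],
    "approved": [1, 1, 0, 0, 1, 1, 0, 1],
})

# 1. Representation: is any group under-sampled relative to expectations?
print(df["gender"].value_counts(normalize=True))

# 2. Label balance per group: large gaps here often foreshadow a biased model.
approval_by_group = df.groupby("gender")["approved"].mean()
print(approval_by_group)

# 3. A crude ratio of the worst-off group's positive-label rate to the best-off
#    group's; values well below 1.0 are worth investigating before training.
print(f"approval-rate ratio: {approval_by_group.min() / approval_by_group.max():.2f}")
```

In practice you would run this across the full feature set and every risk dimension you care about, not just one column.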

The Public Relations Nightmare

Consider a hypothetical university with an online learning platform. You might see the following set of ML models working together to improve student and university outcomes. Rather than one monolithic model, the head of analytics here has a complex set of models working together as part of a student experience plan. This is becoming more and more common, as it’s often easier to maintain many narrow models than one big one.

A typical long list of applied models might look like this:

  • Planning & optimization models to create a personalized syllabus for students
  • Classification & regression models to monitor and forecast students based on progress
  • Conversation agents for forums and student support
  • Sentiment-analysis to detect student emotions and determination
  • Recommender systems to suggest additional courses and further readings
  • Classifiers and NLP techniques for automatic e-assessment of assignments
  • Etc….

The risk of something going wrong, such as offering up certain courses based on gender or grading with an unconscious preference towards particular student segments or response styles, is entirely plausible in today’s world.

While the example above is hypothetical, during COVID-19 in the UK the Office of Qualifications and Examinations Regulation (Ofqual) trained a model to grade students entering university. A poorly formed problem and objective definition, plus poor data, led to roughly 40% of students receiving grades lower than their teachers predicted. The resulting chaos around the “F#$% the Algorithm” protests is a textbook example of not factoring in bias risks from the start [14]. Upon inspection, the model disproportionately hurt working-class and disadvantaged communities and inflated scores for students from private schools!

To add complexity, imagine two or more highly correlated features, such as ‘time spent answering questions’ and ‘primary language spoken’. A model should infer topic mastery from ‘time spent answering questions’ rather than ‘primary language spoken’, even though the two may be heavily correlated.

Showing that a model places substantially more importance on defensible features and less on controversial ones, such as economic status, nationality, or gender, would be very useful for the university when backing up a model’s robustness and impartiality.
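
One way to back such a claim is to compare the aggregate influence of “defensible” features against “controversial” ones. Below is a minimal sketch using scikit-learn’s permutation importance on synthetic data; the feature names, the sensitive/defensible split, and the synthetic labels are assumptions for illustration, not the university’s actual model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 1000
# Synthetic student data; feature names are illustrative only.
X = pd.DataFrame({
    "time_on_questions": rng.normal(30, 10, n),
    "assignments_done":  rng.integers(0, 20, n),
    "primary_language":  rng.integers(0, 2, n),  # treated as sensitive here
    "gender":            rng.integers(0, 2, n),  # treated as sensitive here
})
# Outcome driven only by the defensible features in this synthetic setup.
score = X["time_on_questions"] + 2 * X["assignments_done"]
y = (score > score.median()).astype(int)

model = GradientBoostingClassifier().fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

importances = pd.Series(result.importances_mean, index=X.columns)
sensitive = ["primary_language", "gender"]
print(f"defensible influence: {importances.drop(sensitive).sum():.3f}")
print(f"sensitive influence:  {importances[sensitive].sum():.3f}")
```

A report like this, refreshed on production data, gives the university a defensible artifact showing that controversial attributes contribute little to the model’s decisions.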

In general, our poor head of analytics needs to guard against the following scenarios [5], where AI:

  • Unfairly allocates opportunities, resources, or information
  • Fails to provide the same quality of service
  • Reinforces existing societal stereotypes
  • Denigrates people by being actively offensive
  • Over- or under-represents groups

Model interpretability and ethical AI

Much like in law, where the burden of proof falls on the disruptor, a new set of burdens and obligations is arising for data scientists looking to innovate. This is an area that all the major players, including DARPA and Big Tech, are investing in heavily.

We explored the risks associated with the data and analytics used within an analytics-driven organization in a previous post. With GDPR and a resurgence of government regulation and focus on consumer rights, the risk is real for organizations that used to hoard data and build intrusive ML models, a common practice for more than a decade. Now organizations need to put in place safeguards against:

  • compliance/regulator fines
  • impact on brand/reputational strength
  • public perception of discrimination or unethical behavior

A recent Wing VC survey of data scientists [11] found model explainability was respondents’ top ML challenge, cited by more than 45 percent of respondents, with data labeling a distant second (29 percent) and model deployment and data quality checks rounding out the top four.

At Slalom we approach this proactively, so that ML systems are designed to be as transparent as possible, with an awareness of the risks present, leading to a system that can operate in a more sustainable and fair fashion.

Image 2: Slalom’s Sustainable ML Framework

A Turing Test for ML Ethics and Fairness

The more complex a model, or an ensemble of models, becomes, the harder it is to look under the hood to question the fairness or even the logic used. It’s similar to the workings of the human mind: it’s far too difficult to guess at what’s going on in all of the tiny chemical reactions, so we base judgments of intelligence and ethics on the actions we observe from the black box (i.e. the brain). The original Turing Test sought to define a general test for intelligence leveraging only the observed signals coming out of the black box.

Following this line of thinking, there are roughly two main, related techniques for ML model transparency: SHapley Additive exPlanations (SHAP) [12] and Local Interpretable Model-agnostic Explanations (LIME). In general, they tweak the inputs a bit, one at a time, and measure the impact on the model’s output to build, per feature, an indication of which features have the greatest influence on the model’s prediction or classification. This should sound a lot like feature selection when building an ML model.
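
The core mechanism, perturbing one input at a time and watching the output move, can be sketched in a few lines without any library. This is not SHAP or LIME themselves (both are far more careful about sampling, local weighting, and attribution guarantees); it is only meant to show the intuition. The fitted binary classifier `model` and feature DataFrame `X` in the usage note are assumed to exist.

```python
import numpy as np

def perturbation_influence(predict_proba, X, n_repeats=20, seed=0):
    """Crude, model-agnostic influence score: shuffle one column at a time
    and measure the average absolute change in the predicted probability."""
    rng = np.random.default_rng(seed)
    baseline = predict_proba(X)[:, 1]  # probability of the positive class
    scores = {}
    for col in X.columns:
        deltas = []
        for _ in range(n_repeats):
            X_perturbed = X.copy()
            X_perturbed[col] = rng.permutation(X_perturbed[col].values)
            deltas.append(np.abs(predict_proba(X_perturbed)[:, 1] - baseline).mean())
        scores[col] = float(np.mean(deltas))
    return scores

# Usage, assuming a fitted binary classifier `model` and a DataFrame `X`:
# influence = perturbation_influence(model.predict_proba, X)
# print(sorted(influence.items(), key=lambda kv: -kv[1]))
```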

SHAP builds on Shapley values, developed by game theorist Lloyd Shapley, who explored how important each player is to the overall cooperative outcome and what payoff he or she can reasonably expect.

“Traditionally, influence measures have been studied for feature selection, i.e. informing the choice of which variables to include in the model [8]. Recently, influence measures have been used as explainability mechanisms [1, 7, 9] for complex models. Influence measures explain the behaviour of models by indicating the relative importance of inputs and their direction.” [7]

Image 3: Shapley values illustrated overview

For example, for a binary classifier, a high positive Shapley value (the red or blue values above) implies the feature is pushing the prediction towards “1”, while a negative Shapley value implies the feature is contributing to an outcome of “0”.
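
As a minimal sketch, the open-source shap package [12] makes this concrete for a tree-based binary classifier. The toy data and feature names below are placeholders; for this model family, TreeExplainer returns one Shapley value per feature per row in log-odds space, so the sign can be read as pushing towards class 1 or class 0. Output shapes can differ across shap versions and model types, so treat this as illustrative.

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder data: any fitted binary classifier and feature frame would do.
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "time_on_questions": rng.normal(30, 10, 500),
    "assignments_done":  rng.integers(0, 20, 500),
})
y = (X["assignments_done"] > 10).astype(int)
model = GradientBoostingClassifier().fit(X, y)

# One Shapley value per feature per row (log-odds space for this model):
# positive pushes the prediction towards "1", negative towards "0".
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

row = 0
for feature, value in zip(X.columns, shap_values[row]):
    direction = "towards 1" if value > 0 else "towards 0"
    print(f"{feature:>20}: {value:+.3f} ({direction})")
```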

A group from Carnegie Mellon University introduced a family of Quantitative Input Influence (QII) measures in 2016 that capture the degree of influence of inputs on the outputs of systems [3]. Leveraging Shapley values and other techniques, they calculate the marginal influence a feature has on the outcome, both by itself and jointly with other features.

Coming up with a set of generalizable QIIs and reports is a step towards systematically monitoring models to understand whether bias is creeping in, or at least whether the “right” inputs are the most influential ones based on human judgment.

Modern tools provide predefined metrics (aka QIIs) to cover data and model biases.

Note how the first two sets of metrics don’t even look at the inner workings of the model itself; these are black-box measurements. Some of the latest tools coming out of the big tech companies also include ways of looking inside neural networks to see how the weights and connections work together. For example, Google’s XRAI builds a heatmap of sorts based on the importance of an input node, or of any layer in a neural network.

Bringing this out of academia into the real world

We are now seeing various software tools become available, starting in 2020. One such tool is Truera, which raised $12 million in VC funding for its AI explainability platform. While useful for deep learning, it is equally relevant for classification and regression models in general.

“We study the problem of explaining a rich class of behavioural properties of deep neural networks. Distinctively, our influence-directed explanations approach this problem by peering inside the network to identify neurons with high influence on a quantity and distribution of interest” [4]

Truera has some great case studies that can help provide real-world references [6][7]. Their platform will likely be a strong tool for assessing bias in ML.

Image 4: Truera report showing model disparity for gender and influence analysis for various features such as income and department

In addition to Truera and other niche tools in the market, Google Cloud and Amazon Web Services have announced their own solutions: Google Cloud’s Explainable AI and Amazon SageMaker Clarify, the latter launched during re:Invent 2020. They implement many of the concepts above, and more, to let you explore the influence of selected features on an ML model and analyze data sets for potential biases.

Image 5: Google Cloud’s XRAI is a new way of displaying attributions that highlights which salient features of an image most impacted the model, instead of just the individual pixels.
Image 6: AWS Clarify report showing feature influence for a trained binary classification model

Solutions from Google Cloud, AWS, and Microsoft Azure will provide a compelling set of tools for organizations already working within their respective platforms.

Ethical and fair ML

A model is tuned to maximize or prioritize a narrow set of objective metrics. Unfairness can be seen as a form of unconscious bias stemming from data sampling, feature selection, or model design being prioritized incorrectly (or unethically) to optimize a valid but perhaps myopic objective. The problem is that in modern times this objective is usually to make the most money for the company, not necessarily to improve the life or disposition of the consumer or citizen.

In one study, Anna Jobin and colleagues [9] loosely defined AI ethics in terms of:

  • Fairness
  • Accountability
  • Transparency
  • Privacy
  • Ability to rectify or challenge *

*We added the last item; it stems from a “right to reasonable inference” [10], which attempts to establish rights similar to what GDPR outlined for data. Decisions of an ML system should be open to challenge, just as you’d expect to be able to discuss the outcome of a mortgage application with a banker.

Ethics and fairness describe what’s right or wrong, or what’s truly in some person’s (actor’s) best interest. They are defined by social norms and cannot be framed as a problem solved by metrics or even by an ML model. Humans in the loop will remain part of this for the near term. But that doesn’t necessarily mean ML systems cannot be automated or self-sufficient, in a way.

John Hooker and Tae Wan Kim [1] begin to define what ethical means for ML in their paper “Truly Autonomous Machines are Ethical.” With SHAP/LIME testing it’s still dangerous to make assumptions about the ethical and causal explanation of the model, since these tests are essentially hypotheses themselves, based on the model’s behavior. However, after influence analysis of the model a person can weigh the results against business, government, ethical, and other drivers to gauge ethical risk.

As guidance for building autonomous, ethical ML systems, they provide a number of principles and rules. The ML system interacts with other “actors” or “agents” and follows an “action plan” that its models generate.

  • Reasonable Proof — They propose that if a model, in at least one scenario, could have made its decision using fair and ethical logic, this may be adequate, even if other, less ethical scenarios are possible. An ambulance (the actor) running a red light (the action plan) could be doing it appropriately to save a life, or it could be acting unfairly to bypass traffic and speed up the drive back to the garage. As a society we generally assume the best intentions in this scenario.
  • Principle of Respecting Autonomy — It is unethical for me [the ML system] to select an action plan that I believe interferes with an action plan of another agent.
  • Rule of Generalisation — If you extended the rule the ML system follows across the broader population, would it still generalize well for what it was designed for? Would any assumptions or learned feature weights become unethical if applied to a population in a different country or age group, or simply at the global level? A mortgage model built on US data should only be expected to work well in Japan, for example, if the model doesn’t rely heavily on race or geography (see the covariate-shift sketch after this list).
  • Principle of Informed Consent — An ML system is operating as designed, and is potentially autonomous, even if humans temporarily interfere with its action plan, IF the system is designed to handle such interruptions gracefully. This human-in-the-loop design for ethical judgment calls acts as a proxy for ‘informed consent’ from the humans while maintaining the model’s robustness.
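
The Rule of Generalisation can be partially operationalized as a covariate-shift check: before reusing a model on a new country, age group, or segment, compare each feature’s distribution in the new data against the training data. A minimal sketch using SciPy’s two-sample Kolmogorov-Smirnov test follows; the DataFrames and the significance threshold are assumptions.

```python
import pandas as pd
from scipy.stats import ks_2samp

def covariate_shift_report(train_df: pd.DataFrame, new_df: pd.DataFrame,
                           alpha: float = 0.01) -> pd.DataFrame:
    """Flag numeric features whose distribution in the new population differs
    markedly from the training data, a hint the model may not generalize there."""
    rows = []
    for col in train_df.select_dtypes("number").columns:
        stat, p_value = ks_2samp(train_df[col], new_df[col])
        rows.append({"feature": col, "ks_stat": stat,
                     "p_value": p_value, "shifted": p_value < alpha})
    return pd.DataFrame(rows).sort_values("ks_stat", ascending=False)

# Usage, assuming `train_df` holds the original training data and `new_df`
# holds data from the population the model is about to be applied to:
# print(covariate_shift_report(train_df, new_df))
```

A shifted feature doesn’t prove the model is now unethical, but it is exactly the signal that should trigger the kind of human review these principles call for.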

Putting this into Action

Let’s talk about practical ways of putting in place a foundational level of safeguards for ML systems in your organization.

  1. Examine the objective of the ML system. Is the problem statement itself, or the business it will operate within, going to be a source of bias or unethical behaviour for even the best-designed ML system? Even if that can’t be changed, the bias can be identified and potentially worked around through smart data sampling or model design.
  2. Examine the data used for training and monitor production data on an ongoing basis. Does the data typify what is expected in the general population? Are there inherent biases already in the data sets that should be documented and perhaps planned for during the model build? In an effort to remove the chance of discriminating on sensitive PII, many companies have removed the very attributes needed to track and identify systematic racial, gender, sexual-orientation, or age-based biases, which makes it harder to gauge whether bias is present.
  3. Examine the model and calculate metrics such as the QIIs defined above. Like a grid search, set up a series of tests to demonstrate whether the model appears to be acting fairly on a sample of data that typifies the expected demographics or scenarios in the population. Stratify the data along ethical dimensions such as gender or race and see if the model holds up (a minimal stratified check is sketched after this list). You’ll need a simplified and trusted set of tools to implement a SHAP/LIME-based approach to monitor for bias and unethical use.
  4. Build sensitivity reporting and alerts. Once deployed, you’ll need a way to continuously monitor the data and model to see if drift occurs, reducing performance and, potentially even worse, introducing biases and unethical behaviours as the underlying population shifts. This is often done by a group that did not build the models themselves. Explore automating this with tools such as AWS Lambda functions, Amazon Lookout for Metrics, or similar services that can monitor for deviations from expected results and alert you for analysis.
  5. Identify an operating model. Who will be responsible for monitoring and mitigating the risks these complex ML systems pose to the business? In financial services, for example in equity or currency trading, risk analysts are employed to quantify and monitor, in near real time, the risks associated with traders and wealth managers. For ML systems, the risk analyst would monitor more than the simplified “mortgage risk model” people talk about; they’d need to monitor individual models as well as the macro system for risk, and for the impact of events like COVID on the system and the company.
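
As a minimal sketch of the stratified check and alert described in steps 3 and 4, the snippet below computes positive-outcome rates per group from a hypothetical log of production decisions and raises a flag when the worst-off group falls below 80% of the best-off group’s rate. The column names and the 80% threshold (an echo of the common “four-fifths” rule of thumb) are assumptions, not a universal standard.

```python
import pandas as pd

def group_fairness_report(df: pd.DataFrame, group_col: str,
                          outcome_col: str, min_ratio: float = 0.8):
    """Per-group positive-outcome rates plus a simple disparate-impact style
    alert: flag if any group's rate falls below `min_ratio` times the
    best-off group's rate."""
    rates = df.groupby(group_col)[outcome_col].mean()
    ratio = rates.min() / rates.max()
    return rates, ratio, ratio < min_ratio

# Hypothetical log of production model decisions.
log = pd.DataFrame({
    "gender":  ["F", "M", "F", "M", "M", "F", "M", "F"],
    "offered": [0, 1, 0, 1, 1, 1, 1, 0],
})
rates, ratio, alert = group_fairness_report(log, "gender", "offered")
print(rates)
print(f"min/max rate ratio: {ratio:.2f}, raise alert: {alert}")
```

Wired into a scheduled job or a serverless function, a report like this becomes the kind of continuous alerting that step 4 asks for.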

Gartner predicts that, by 2023, 75% of large organizations will hire AI behaviour forensic, privacy, and customer trust specialists to reduce brand and reputation risk.

This article has merely touched on the great research and tools coming out of academia and large tech companies like Amazon and Google. Further reading is included below, but ultimately: get out there, start the conversation, and run some pilots. These checks, tools, and stage gates will become a common and critical part of your MLOps.

Slalom is a modern “digital and cloud native” consulting company with a deep appreciation for all that data and analytics can bring to a company. Across our offices globally, we help our clients instill a modern culture of data and learn to respect the role they play as owners and stewards of it.

Rob Sibo is a senior director in our Slalom Sydney office, formerly Silicon Valley, and leads Data & Analytics consulting for Australia.

rob.sibo@slalom.com

References

[1] “Truly Autonomous Machines are Ethical” John Hooker, Tae Wan Kim: https://public.tepper.cmu.edu/jnh/autonomyAImagazine.pdf

[2] AWS Clarify Overview & Tutorial: https://aws.amazon.com/blogs/aws/new-amazon-sagemaker-clarify-detects-bias-and-increases-the-transparency-of-machine-learning-models/?sc_icampaign=launch_amazon-sagemaker-clarify_20201208&sc_ichannel=ha&sc_icontent=awssm-6483_reinvent20&sc_iplace=banner&trk=ha_awssm-6483_reinvent20

[3] “Algorithmic Transparency via Quantitative Input Influence” Anupam Datta, Shayak Sen, Yair Zick: https://truera.com/wp-content/uploads/2020/08/Quantitative-Input-Influence-2016.pdf

[4] “Influence-Directed Explanations for Deep Convolutional Networks” Anupam Datta, Shayak Sen, Klas Leino, Matt Fredrikson, Linyi Li: https://truera.com/wp-content/uploads/2020/08/Influence-Directed-Explanations-2018.pdf

[5] Microsoft’s Aether Working Group on Bias and Fairness Engineering Checklist — http://www.jennwv.com/papers/checklists.pdf

[6] Truera Case Study — Standard Chartered — https://truera.com/wp-content/uploads/2020/08/Truera-Case-Study-Standard-Chartered.pdf

[7] Truera Case Study — ML Explainability in Finance — https://truera.com/wp-content/uploads/2020/08/machine-learning-explainability-in-finance-an-application-to-default-risk-analysis.pdf

[8] DARPA Explainable AI Program — https://www.darpa.mil/work-with-us/ai-next-campaign

[9] “The global landscape of AI ethics guidelines” Anna Jobin, Marcello Ienca, Effy Vayena, Nature Machine Intelligence: https://www.nature.com/articles/s42256-019-0088-2

[10] “A Right to Reasonable Inferences: Re-Thinking Data Protection Law in the Age of Big Data and AI” Sandra Wachter, Brent Mittelstadt: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3248829

[11] “Research Note: Chief Data Scientist Survey”: https://www.wing.vc/content/chief-data-scientist-survey

[12] Welcome to the SHAP documentation: https://shap.readthedocs.io/en/latest/

[13] People + AI Research Portal: https://pair.withgoogle.com/

[14] MIT Tech Review “The UK exam debacle reminds us that algorithms can’t fix broken systems”: https://www.technologyreview.com/2020/08/20/1007502/uk-exam-algorithm-cant-fix-broken-system/
