How to Build a Trustworthy AI System

From business specifications to model deployment

Almira Pillay
Sogeti Data | Netherlands
Jun 30, 2022

--

To accelerate AI adoption among the public, corporations must learn to identify and mitigate potential ethical risks effectively. That requires a set of best practices to guide the development of machine learning solutions and to instill trust and integrity among relevant stakeholders and end users.

The original text was published as part of Sogeti’s State of AI Applied to Quality Engineering 2021–22 Report and authored by Almira Pillay & Tijana Nikolic.

Artificial intelligence (AI) continues to advance and become an industry standard for performing and automating everyday tasks. The barrier to widespread AI development and adoption is no longer the technology itself or the skillset required; it is the ‘human’ aspect of it all: ethics, morality, transparency, and governance. Corporations must now learn to identify and mitigate potential risks (such as a discriminatory model) effectively to avoid biased, unfair machine learning (ML) models being put into production. Thus, it is crucial to have a quality framework in place to guide the development of ML solutions and instill trust and integrity among relevant stakeholders.

In the State of AI Applied to Quality Engineering Report, Sogeti introduces a Quality AI Framework (QAIF) to test ML and AI algorithms in all phases of the AI development cycle. The framework provides a practical, standardized way of working that produces trustworthy AI.

Testing AI is non-traditional

When we look at the combination of AI and quality assurance (QA), there are generally two categories:

  • AI applied to QA — leveraging ML and analytical solutions to accelerate the testing & development pipeline.
  • QA applied to AI — validating the data, algorithm, outcome, and ethical considerations.

The latter is what we will focus on in this text.

In 2019, the EU released a set of guidelines for building trustworthy AI that include ethical pillars such as fairness, transparency, and robustness, in addition to accountability, privacy, traceability, and lawfulness. However, a practical implementation of these principles was not included with the guidelines. To understand how to create practical ethics tests and quality control checks to test AI, we must first understand how AI is developed and implemented.

From idea to operations: understanding the AI lifecycle

The process of developing and implementing an AI solution always starts with the business case. A business problem is defined and scoped, then turned into design requirements for data scientists; this is called the Business Understanding phase. Next, training and test data are collected in the Data Understanding phase. In the Data Preparation phase, the data is analyzed, sampled, and pre-processed for the chosen AI model. Then the Model Development phase can begin: the AI model is trained, tuned, and evaluated. Once the model is optimized and meets the business requirements in the Model Evaluation phase, it can be tested and deployed. In the Deployment phase, the AI system goes into production and its performance is monitored. These are the iterative phases of the Cross-Industry Standard Process for Data Mining (CRISP-DM).

CRISP-DM method

Testing at each phase of the AI lifecycle

Inspired by the CRISP-DM method, designed with MLOps principles in mind, and governed by the EU ethics guidelines, the QAIF is a cohesive, generic framework that can be tailored to any AI solution. It is designed to help product managers and business owners identify and mitigate potential risks at each stage of the AI lifecycle, and to guide data scientists by providing a practical checklist. Just like in traditional software testing, we add a gate to each phase to ensure certain quality control checks are completed. The following paragraphs describe the quality control tests that should be conducted at each phase of the AI lifecycle to ensure quality and ethical adherence.

QAIF gates overview

The following gates describe some techniques and best practices for building trustworthy models.

Business understanding (Gate 1)

In this phase, stakeholders are identified and the product requirement specifications, technical design specifications, performance metrics, and ethical/legal compliance requirements are defined and understood so that the development process can be initiated. To comply with ethics standards:

  • Under the traceability principle, lay the foundation for an MLOps-based AI project, ensuring model and data version control in subsequent phases. MLFlow is a useful package to consider for this principle, as it provides support for “packaging data science code in a reusable and reproducible way” (see the sketch after this list).
  • Under the fairness principle, define fairness metrics and detection methods if the business problem applies to sensitive groups. The fairlearn package provides support for bias mitigation methods and metrics for a model fairness assessment which can be used in the Model Development phase.
  • Under the privacy principle, consider the possible ethical breaches before starting the project. Focus on data sources and privacy concerns and enable mitigation methods in the Data Preparation phase.
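
As an illustration of the traceability bullet above, here is a minimal sketch of experiment tracking with MLFlow, assuming a scikit-learn model; the experiment name, model, and parameters are hypothetical, not prescribed by the QAIF.

```python
# Minimal MLFlow tracking sketch: log parameters, metrics, and a versioned
# model artifact so the run can be reproduced later. Names are illustrative.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=42)

mlflow.set_experiment("qaif-gate-1")  # hypothetical experiment name
with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")  # reusable, reproducible artifact
```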

After these quality control checks are agreed upon, review best practices and next steps with all the relevant stakeholders.

Gate 1 overview & outcome

Data Understanding (Gate 2)

This phase brings specifications from the first gate together with domain knowledge and experience, to understand inherent biases, assumptions, and privacy concerns in the collected data.

  • Under the traceability principle, data version control is set up as a documented specification, with information on where the data is located, what kind of source it comes from, who is responsible for it, and whether there are any privacy or quality concerns. To accomplish this, consider using the DVC python package, which is built on git version control (see the sketch after this list).
  • Under the fairness principle, data collection methods are reviewed and checked for data adequacy (e.g. missing data for certain groups) and bias to prepare for deploying specific mitigation techniques in the Data Preparation phase.
  • Under the privacy principle, focus on assessing the data sources for personally identifiable information (PII) and consider using synthetic data to mitigate GDPR risk. For this, the Sogeti Artificial Data Amplifier (ADA) can be used.
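
To make the data version control bullet concrete, here is a minimal sketch using DVC’s Python API to read one pinned version of a dataset; the file path, repository URL, and tag are hypothetical placeholders.

```python
# Minimal DVC sketch: open a dataset exactly as it existed at a given git tag,
# so every experiment can point at a documented, reproducible data version.
import dvc.api

with dvc.api.open(
    "data/training.csv",                    # hypothetical DVC-tracked file
    repo="https://github.com/org/project",  # hypothetical repository
    rev="v1.0",                             # git tag pinning the data version
) as f:
    header = f.readline()
    print(header)
```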

End this phase by completing a data audit on available data sources and corresponding responsibilities of the team.

Gate 2 overview & outcome

Data Preparation (Gate 3)

The data engineering team, domain experts, and model developers play crucial roles in the Data Preparation phase. Tasks like data mining, data quality assessment with exploratory data analysis (EDA), and training data construction define this phase.

  • Under the traceability principle, data version control is once again executed, together with setting up a data pipeline to ensure complete transparency into the data preparation steps. As an example, the TensorFlow input pipeline API can be used to “build complex input pipelines from simple, reusable pieces; handle large amounts of data, read from different data formats, and perform complex transformations”.
  • Under the fairness principle, we deploy bias mitigation methods to ensure our training data is “judgement free” so our model will be as well. We can mitigate bias by re-weighting the features of the minority group, oversampling the minority group, or under-sampling the majority group (an oversampling example is sketched after this list). Imputation methods can also be used to reconstruct missing data to ensure the dataset is representative.
  • Under the privacy principle, we generate synthetic data to mitigate privacy risks, or to augment the training data if insufficient data was flagged as a limitation in the previous phase.
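
As one concrete bias mitigation from the list above, here is a minimal sketch of oversampling a minority group with scikit-learn’s resample; the file name and group column are hypothetical.

```python
# Minimal oversampling sketch: grow the minority group to the majority group's
# size so the model does not learn to favor the better-represented group.
import pandas as pd
from sklearn.utils import resample

df = pd.read_csv("training_data.csv")      # hypothetical prepared dataset
majority = df[df["group"] == "majority"]   # hypothetical sensitive attribute
minority = df[df["group"] == "minority"]

minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)
```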

Provide an EDA report on the training data to a technical auditor for approval before moving to the next phase.

Gate 3 overview & outcome

Model Development (Gate 4)

The AI Model Development phase starts with high quality training data. Model developers have the main responsibility in this phase — ensuring that the AI model they are developing is suited for the application and works with the data prepared in phases 2 and 3. To be sure of this, performance metrics are drawn from the model and presented to the stakeholders. Furthermore, we test the model performance and functionality on the most granular level.

  • Under the traceability and auditability principle, use MLFlow for model versioning and log output. Split repositories and pipelines into development, acceptance, and production. Use a git version control system to push a new version of the code to different environments. Add required reviewers in each step to ensure traceability.
  • Under the fairness principle, assess model adequacy and model bias through adversarial debiasing with generative adversarial networks (GANs), ensuring equal outcomes across all groups. IBM’s AI Fairness 360, part of its Trusted AI toolkits, is a python package that can be used here.
  • Under the robustness principle, test the model performance on the most granular level and provide the accuracy scores, area under the curve, F1 score, confusion matrix, and mean squared and absolute errors. Extend code coverage by unit testing your code with the python unittest framework.
  • For business owners and regulators, interpretability is especially important for understanding what is behind a ‘black-box’ model and its predictions. Implementing explainable AI (XAI) techniques like LIME and SHAP adds transparency and interpretability to the model. XAI aims to mimic model behavior at a global and/or local level: global explainability describes how the model makes predictions over all the data, while local explainability gives insight into one specific prediction. SHAP (SHapley Additive exPlanations) is based on a game theory approach to explain individual predictions. LIME (Local Interpretable Model-agnostic Explanations) is model-agnostic and provides local interpretability: it modifies a single data sample by tweaking the feature values and observes the resulting impact on the output. The output of both packages is a list of explanations reflecting the contribution of each feature to the prediction of a data sample, which also shows which feature changes would have the most impact on the prediction (see the sketch after this list).
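
To illustrate the XAI bullet above, here is a minimal SHAP sketch on a tree-based model; the dataset and model are illustrative stand-ins, not part of the QAIF.

```python
# Minimal SHAP sketch: compute per-feature contributions, then inspect them
# globally (across the dataset) and locally (for a single prediction).
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=42)
model = RandomForestRegressor(random_state=42).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape: (samples, features)

shap.summary_plot(shap_values, X)       # global: average impact per feature
print(shap_values[0])                   # local: contributions for one sample
```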

Complete this gate by providing a model quality report to a technical reviewer for approval.

Gate 4 overview & outcome

Model Evaluation (Gate 5)

As the model has already been validated on the most granular level in the previous phase, this gate’s tasks focus on ensuring that the model is transparent and works according to the ethical considerations set in the Business Understanding phase. This is the most important phase in the QAIF, as it ensures that the AI model is fair and understandable. Stakeholders include testers, developers, and the legal team.

Under the fairness principle, assess whether the model is biased using metrics like the following (a fairlearn sketch follows the list):

  • Statistical Parity Difference: The difference in the rate of favorable outcomes received by the minority group compared to the majority group.
  • Equal Opportunity Difference: The difference of true positive rates between minority and majority groups.
  • Average Odds Difference: The average difference of the false positive rate and true positive rate between minority and majority groups.
  • Disparate impact: The ratio of the rate of favorable outcomes for minority groups compared to majority groups.
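
A minimal sketch of these checks with the fairlearn package (introduced in Gate 1) follows; the labels, predictions, and group memberships are illustrative.

```python
# Minimal fairness-metric sketch with fairlearn: statistical parity difference
# (called demographic parity there), equalized odds, and disparate impact.
import numpy as np
from fairlearn.metrics import (
    demographic_parity_difference,  # statistical parity difference
    demographic_parity_ratio,       # disparate impact
    equalized_odds_difference,
)

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])                 # illustrative labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])                 # illustrative predictions
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])  # sensitive feature

print(demographic_parity_difference(y_true, y_pred, sensitive_features=group))
print(demographic_parity_ratio(y_true, y_pred, sensitive_features=group))
print(equalized_odds_difference(y_true, y_pred, sensitive_features=group))
```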

Under the robustness principle, execute user acceptance tests using the XAI outputs implemented in the previous phase. Furthermore, execute metamorphic and adversarial tests to ensure the model is robust enough to be deployed. Metamorphic tests assess model stability by transforming the model’s inputs and checking that the predictions on the augmented inputs change, or stay the same, as expected; a minimal example is sketched below. Adversarial testing generates adversarial attacks to stress-test the model.
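
As a minimal sketch of a metamorphic test, the check below assumes a fitted classifier whose predictions should be invariant to small input perturbations; the noise scale and tolerance are illustrative choices, not QAIF requirements.

```python
# Minimal metamorphic test sketch: tiny input perturbations should rarely
# flip a robust model's predictions.
import numpy as np

def metamorphic_invariance_test(model, X, noise_scale=1e-3, tolerance=0.01):
    rng = np.random.default_rng(42)
    baseline = model.predict(X)
    perturbed = model.predict(X + rng.normal(0.0, noise_scale, X.shape))
    flip_rate = np.mean(baseline != perturbed)
    assert flip_rate <= tolerance, f"{flip_rate:.1%} of predictions flipped"
```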

Provide a test report to a technical reviewer and an auditor for approval before moving to the last gate.

Gate 5 overview & outcome

Model Deployment (Gate 6)

Once we have a transparent and understandable model, we can enter the final phase with a focus on monitoring, real-world model performance and maintenance.

To ensure robustness, fairness and transparency, a monitoring dashboard should be set up to track model performance in production. The production performance metrics include:

  • Model performance metrics stated in the Model Development section.
  • Bias metrics stated in the Model Evaluation section.
  • Drift detection metrics for concept drift and data drift (a minimal drift check is sketched after this list). According to IBM, “Model drift refers to the degradation of model performance due to changes in data and relationships between input and output variables”. Drift can seriously degrade the model’s performance in production, so it is crucial to use these metrics as indicators of when the model requires attention, such as re-training, or when the training data should be re-weighted.
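
As a minimal sketch of data drift detection, the check below compares a feature’s training and production distributions with a two-sample Kolmogorov-Smirnov test from scipy; the significance threshold and data are illustrative.

```python
# Minimal data drift sketch: a low KS-test p-value flags that a feature's
# production distribution has shifted away from the training distribution.
import numpy as np
from scipy.stats import ks_2samp

def has_drifted(train_feature, prod_feature, alpha=0.05):
    _, p_value = ks_2samp(train_feature, prod_feature)
    return p_value < alpha  # True suggests drift: consider re-training

rng = np.random.default_rng(0)
print(has_drifted(rng.normal(0, 1, 1000), rng.normal(0.5, 1, 1000)))  # True
```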

In this final phase, we can confidently check all the quality control boxes and pass through all the gates of development, but that doesn’t necessarily mean we are done with the AI lifecycle. When the model needs to be retrained or adjusted, we can always turn back and revisit any of the phases, as the AI project life cycle is an iterative process.

Gate 6 overview & outcome

Putting the QAIF into practice

The QAIF does not serve only as a theoretical guide to ensuring a trustworthy and high-performing AI solution. Sogeti, together with its partners from the ITEA IVVES (Industrial Grade Verification & Validation of Evolving Systems) consortium, has developed practical tools to be used by data scientists along the QAIF gates.

The ITEA IVVES project focuses specifically on the development of AI approaches for robust, comprehensive, industrial-grade validation of embedded AI: complex, evolving (self-adapting and self-learning) systems in the major industrial domains of Europe. The objective of the IVVES project is to develop cross-domain quality assurance solutions and methodologies that are dedicated to and based on AI. The expected outputs are methods, techniques, and tools for validating and verifying evolving systems, as well as a platform for experimentation, training, and knowledge transfer.

The state-of-the-art tools developed within the project allow quality control checks to be automated at every stage of the QAIF and ML development cycle. The tools enable efficient testing and operationalizing of ML, and many of these tools can be deployed as micro-services within the CI pipeline. The consortium partners have provided business cases and data to test the developed tools.

One of these tools is the Sogeti-built Data Quality Wrapper (DQW). The DQW serves as a data quality control check, to be used during the Data Understanding phase. It automates the EDA process by assessing tabular, image, audio, and text data with various statistical methods and NLP techniques. Sogeti has also developed CodeAssist, a code assessment tool that predicts the quality of python code, helping to speed up peer reviews and focus unit testing. This can be used during the Model Development phase.

Another practical tool is the RoCoNas adversarial generator from RISE (Research Institutes of Sweden). This tool evaluates the robustness of a neural network by attacking it with generated adversarial examples, and can be used during the Model Evaluation phase to find test scenarios where the model fails. Helsinki University, Techila Technologies, and F-Secure collaborated on an Inference Scalability tool that can simulate deployment configurations for ML models by evaluating and optimizing inference setup parameters. This can be used in the Model Deployment phase.

These tools, among many others developed as part of the ITEA IVVES project, aid data scientists in automatically embedding quality, fairness, and transparency into any AI model. They also aid product owners in operationalizing the development of AI. For corporations adopting AI solutions, this should be a requirement and not a luxury.

The QAIF offers a structured and comprehensive way of working to develop and implement high-performing, ethical, and quality-assured solutions, helping AI teams to design, develop, and operate AI systems that the public can trust.
