GMLP — Good Machine Learning Practice — what the Principles mean when developing medical AI

Lucille Valentine
gliff.ai
13 min read · Apr 15, 2022

A response to: Good Machine Learning Practice for Medical Device Development: Guiding Principles | gov.uk

co-written with Chas Nelson, Chief Technology Officer at gliff.ai

clinician in a white coat holding a tablet — the screen has a human form and graphs and tags to represent algorithmic decision assistance

Good machine learning practice: guiding principles

The MHRA (UK), FDA (US) and Health Canada have jointly published their ten guiding principles for developing machine learning-based medical devices (GMLP). Their intention is that the principles will help promote safe, effective, and high-quality medical devices that use artificial intelligence and machine learning (AI/ML).

At gliff.ai we think there’s a lot to love about these principles and they align closely with what we are trying to do. However, we think there could be a little more clarity in language and also in how these principles can actually be implemented. This article is our interpretation of these principles. We hope this article aids implementation by those who are developing machine learning medical devices!

Principle One: Multi-Disciplinary Expertise is Leveraged throughout the Total Product Life Cycle

context

Essentially, we need to include clinical (and other appropriate) experts from the very beginning, from the initial idea, through the machine learning (ML) model development process and also throughout the lifecycle of the device, including post-deployment.

So what is the total product lifecycle? Well, whilst this article isn’t the place for a detailed discussion of all the components of the lifecycle, it is perhaps worthwhile to clarify that the total product lifecycle of an ML medical device includes two distinct phases: development and deployment.

Flowchart of the ML lifecycle. Borrowed from the very useful: https://towardsdatascience.com/the-machine-learning-lifecycle-in-2021-473717c633bc

Let’s take an overly simplistic scenario — a data scientist coming up with an idea on their own vs a data scientist and a clinician working as a team for the total product lifecycle. Perhaps our scenarios might look something like this:

Bad Practice, Good Practice

We can see how this process has gone wrong right at the start. The idea that came from the data scientist working alone, which may be excellent ML, has not become a useful clinical solution, perhaps because:

  • there is no unmet clinical need for this solution (i.e. the idea has not been initiated by clinician or patient needs);
  • the data used for training the ML is not representative of the patient group (see guiding Principle Three) — a clinician or healthcare data expert could have guided against this problem as they know the patient population of interest;
  • the “expert” annotations on the data, e.g. diagnostic or prognostic information, were not properly collected, perhaps the annotations were even done by the data scientist themselves — how can clinical users and patients trust the trained ML without trusting the data annotators?
  • the trained ML does not provide an overall patient benefit (for example, by not being sensitive or specific enough) or increases patient risk (as an example, by requiring a specific surgical procedure for data collection);
  • the final product doesn’t work in a clinical setting because the development didn’t consider user experiences, the format of the application’s inputs or outputs aren’t suitable, or perhaps the application assumes the user is the patient, when the user is actually the clinician;
  • the final product is not able to be monitored after deployment.

End result: no benefit to the clinician, no benefit to the patient.

implementation

To implement Principle One, clinical experts must be explicitly involved, and it must be easy to capture and retain their input throughout the ML development and product deployment processes (and beyond).

During development, clinical experts should work alongside data experts on data curation and annotation and be able to guide the development of the ML model for its intended use. This could be through no-code / low-code tools that are built with the unique challenges of medical data in mind. These tools should empower clinicians to lead in the ML product lifecycle and allow those on the frontline to build the case for expected patient benefits and risks, grounded in how and where existing care is delivered.

Principle Two: Good Software Engineering and Security Practices are Implemented

context

Like the total product lifecycle, this principle can perhaps be considered from two different points of view: ML development and ML deployment.

During development, the fundamentals of data quality assurance and robust cybersecurity practices must be in place from the start. Often the data is real patient data, anonymised or not, and must be treated with privacy at its core. So ensuring your dataset meets security requirements is the first factor you need to consider. And different data sources have different implications for quality and security:

  • Patient-controlled or ‘edge’-based data, where patients control their healthcare information and choose exactly to whom, why and when data can be released. For ultimate security in this scenario data will never leave the patient’s own data store.
  • ‘Institutionally’-based data, where large datasets have been built up through day-to-day running of healthcare institutions, e.g. individual NHS trusts or the whole of the NHS in the UK. Currently these types of data are controlled through Trusted Research Environments (TREs) or Safe Havens and may be accessed through end-to-end encryption or federated approaches for increased security.
  • Public datasets, often created by pooling datasets from institutions or crowd-sourced from patients. Here the ‘security’ of the data comes through anonymisation but, although such datasets are seemingly powerful, a recent Nature Machine Intelligence paper highlighted that public datasets often have a variety of problems that prevent them being used for productionised ML.

And then, as the model itself is developed, good software engineering practices (including methodical risk management and design processes) should be in place to appropriately capture and communicate design, implementation, and risk management rationales and decisions. In plain English — why was each decision made?

implementation

Implementing good software engineering and security practices is a specialised activity and, to some extent, this mirrors Principle One: as well as having clinicians involved at every step, data engineers and ML scientists need to be engaged to ensure best practice throughout.

Principle Three: Clinical Study Participants and Data Sets are Representative of the Intended Patient Population

context

On two occasions I have been asked, “Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?” … I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

— Charles Babbage, Passages from the Life of a Philosopher

Or in modern English — garbage in, garbage out!

So that ML model results can be appropriately generalised to the intended patient population, the dataset has to be representative of that population. There’s no point developing and testing an ML model purely on female data if its planned use is purely for males — even if the features that you think your ML will use are shared across both sexes.

The model developers must ensure that the relevant characteristics of the intended patient population and factors related to the patient, site and data acquisition are sufficiently represented in an adequately sized database. Characteristics include, for example, age, gender, sex, race and ethnicity, but also what the data is used for and other clinical inputs, like blood test results or the complaint the patient is being treated for.

Without considering all of these potentially impacting factors, datasets and the ML models trained on them can be biased, imbalanced or unrepresentative, leading to unusable or, worse, unsafe outcomes.

implementation

Ensuring that data characteristics are included in the datasets (i.e. metadata) allows model developers to assess (and perhaps mitigate) bias and imbalance, assess usability, and identify circumstances where the model may underperform (edge cases).
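As a rough illustration of how this works in practice, the sketch below (with hypothetical column names and toy data) uses per-item metadata both to check how balanced a dataset is and to find subgroups where a trained model underperforms:

```python
# A minimal sketch, assuming a pandas DataFrame with one row per item,
# a demographic metadata column ("age_band"), true labels and the model's
# predictions. Column names and values are illustrative only.
import pandas as pd

df = pd.DataFrame({
    "age_band":   ["18-40", "18-40", "41-65", "65+", "65+", "65+"],
    "label":      [1, 0, 1, 1, 0, 1],
    "prediction": [1, 0, 0, 1, 1, 1],
})

# How balanced is the dataset across this metadata field?
print(df["age_band"].value_counts(normalize=True))

# Where might the model underperform? Accuracy per subgroup.
per_group_accuracy = (
    df.assign(correct=df["label"] == df["prediction"])
      .groupby("age_band")["correct"]
      .mean()
)
print(per_group_accuracy)
```

The same pattern extends to any metadata field (sex, ethnicity, acquisition site and so on), which is exactly why that metadata needs to be captured alongside the data.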

The more metadata in a dataset, the more factors which can be thoroughly and robustly investigated. Contrariwise, the more metadata in a dataset, the more personally identifiable patient information is included too, leading to a potential data privacy and security risk.

One solution is to use techniques like end-to-end encryption of data, one of the highest standards for data security (and one which gliff.ai uses for customer data), to ensure that data breaches or leaks are incredibly unlikely and thus reduce that risk.

Principle Four: Training Data Sets are Independent of Test Sets

context

Let’s start this principle with a little clarification on terminology as used by the wider ML community.

  • Raw dataset — all the data used for training and testing an ML model, which could be either annotated (by clinicians; for supervised ML) or unannotated (for unsupervised ML). The complete dataset should be large (potentially 100,000s of items) and must satisfy Principle Three (above).
  • Training data — a subset of the raw dataset used only for training the ML. This will likely be 80–90% of the raw dataset. Part of the training data may be set aside for ‘validation’, a step used within ML training to get the best results, but validation data is not used for testing the performance of the trained ML.
  • Testing data — the remainder of the raw dataset is used only for testing the trained ML. None of the testing data is included in the training data to ensure that the performance of the trained ML is being tested on new or “blind” data.

Principle Four emphasises that the training and test data should be independent of each other. In the wider ML community it is usually sufficient to ensure that rows/items of the dataset are not repeated across the training and testing data. However, in the medical space two items could be two different medical scans of the same patient, so independence has to be enforced at the patient level, not just the item level.
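One common way to enforce this, shown as a minimal sketch below, is to split by a patient identifier rather than by individual items; the DataFrame, column names and scikit-learn GroupShuffleSplit usage here are illustrative assumptions rather than anything mandated by the guidance:

```python
# A minimal sketch of a patient-level train/test split, assuming a pandas
# DataFrame with one row per scan and a "patient_id" column. Grouping by
# patient ID ensures no patient appears in both splits.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "scan_path":  ["p1_a.png", "p1_b.png", "p2_a.png", "p3_a.png"],
    "patient_id": ["p1", "p1", "p2", "p3"],
    "label":      [1, 1, 0, 0],
})

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]

# Sanity check: no patient should appear in both sets.
assert set(train_df["patient_id"]).isdisjoint(set(test_df["patient_id"]))
```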

implementation

All potential sources of dependence should be considered, including factors related to the patient, site and data acquisition. To implement this principle, the ML model developers need to be able to analyse dependencies in datasets and therefore need data that includes potentially identifiable metadata.

Further, whilst training and test data should be independent, model developers should be able to demonstrate that they have been through the same quality assurance and data processing. To achieve this, model developers should use software tools that provide a combination of dataset versioning and a thorough audit trail to ensure there is clarity over all the data preparation and the dataset splitting undertaken.
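For illustration only, such an audit record might capture a fingerprint of each split so that any trained model can be traced back to the exact data it saw; the field names and hashing choice below are assumptions (reusing train_df and test_df from the earlier sketch), not a required format:

```python
# A minimal sketch of a versioned, auditable record of a dataset split.
# The fingerprint changes whenever the underlying items change, so the
# exact split behind a trained model can be identified later.
import datetime
import hashlib
import json

def split_fingerprint(frame):
    # Hash the sorted item identifiers so ordering does not affect the result.
    ids = sorted(frame["scan_path"].tolist())
    return hashlib.sha256(json.dumps(ids).encode()).hexdigest()

audit_record = {
    "created": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "train_items": len(train_df),
    "test_items": len(test_df),
    "train_fingerprint": split_fingerprint(train_df),
    "test_fingerprint": split_fingerprint(test_df),
}
print(json.dumps(audit_record, indent=2))
```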

Principle Five: Selected Reference Datasets are Based Upon Best Available Methods

context

One more definition following on from Principle Four:

  • Reference dataset — a completely separate dataset with human-constructed “gold standard” and clinically-relevant annotations. This dataset is not used during the development of the ML at all but rather as a distinct performance evaluation step between development and deployment of an ML model. Like the raw dataset, this data must satisfy Principle Three (above). Unlike the raw dataset, a reference dataset may be held by an independent party who can evaluate the model without vested interest.

If available, reference datasets should be used in model evaluation to promote and demonstrate model robustness and generalizability across the intended patient population.

implementation

Model developers should be able to access reference datasets through the same processes as training and test datasets and be confident that the reference dataset matches a) the characteristics of the appropriate patient population and b) the characteristics of the training/test datasets.
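One simple, illustrative way to check both points is to compare subgroup proportions between datasets using shared metadata; the column name, toy numbers and 10 percentage point threshold below are assumptions for the sketch, not thresholds from the guidance:

```python
# A minimal sketch comparing the make-up of an assumed reference dataset
# against the training dataset on a shared metadata column ("sex").
# Large gaps suggest the reference data may not match the training data
# or the intended patient population.
import pandas as pd

train_meta = pd.DataFrame({"sex": ["F"] * 70 + ["M"] * 30})
reference_meta = pd.DataFrame({"sex": ["F"] * 45 + ["M"] * 55})

comparison = pd.concat(
    [
        train_meta["sex"].value_counts(normalize=True).rename("train"),
        reference_meta["sex"].value_counts(normalize=True).rename("reference"),
    ],
    axis=1,
).fillna(0)
comparison["gap"] = (comparison["train"] - comparison["reference"]).abs()

# Flag subgroups whose share differs by more than an assumed 10 percentage points.
print(comparison[comparison["gap"] > 0.10])
```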

Principle Six: Model Design is Tailored to the Available Data and Reflects the Intended Use of the Device

context

Unfortunately, ML models are not yet “general”: each model is designed to carry out a specific task under assumptions about the input data. As such, any ML medical device needs to be accompanied by a clear set of specifications covering the intended use of the model (the task and intended population) and the real data that end-users will feed it in the clinic.

Once the use and data are well understood, the ML model design can be tailored to the available data, and data scientists can support the active mitigation of known risks, such as overfitting, performance degradation, and security risks in the model.

implementation

To implement this, ML model developers should have a repeatable set of tests for known dataset and ML model related risks. Since the goal of understanding risks is to mitigate their effects upon patients, these risks should be assessed against clinical use conditions, and it should be straightforward to show that all identified risks have been assessed and/or mitigated.
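As one illustrative example of such a repeatable test, the sketch below flags possible overfitting by comparing training and test performance; the toy model, synthetic data and tolerated gap are stand-ins, and in practice the acceptable gap should come out of the clinical risk assessment:

```python
# A minimal sketch of a repeatable overfitting check: a large gap between
# training and test accuracy suggests the model will not generalise.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data; a real check would use the curated datasets.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

MAX_GAP = 0.05  # assumed tolerance; set from the clinical risk assessment
if train_acc - test_acc > MAX_GAP:
    print(f"Overfitting risk: train accuracy {train_acc:.3f} vs test accuracy {test_acc:.3f}")
```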

Risks that are seen to affect patients can be mitigated by either adding more data to datasets to improve representation (Principle Three) or changing the user interfaces to improve the human-AI team (Principle Seven).

Principle Seven: Focus is Placed on the Performance of the Human-AI Team

context

There are two human-AI teams in ML medical devices: the developing team and the end-use team.

Developing teams should focus on using the human-in-the-loop principle, where an expert human guides the cyclical development of the ML, to ensure that Principles One to Six are implemented.

implementation

Crucially, consider how the human users respond to the AI output and how this may change care. The performance of the end-use human-AI team should be predictable and ensured by good software engineering and user experience practices (Principle Two).

Here gliff.ai believes that the human-in-the-loop should be flipped and the aim should be to have the ‘AI-in-the-loop’. In this case the human (e.g. clinician or patient) is in control and leads the process. The ML medical device supports that human by providing interpretable outputs that can be used for the task at hand (Principle Nine).

Principle Eight: Testing Demonstrates Device Performance During Clinically Relevant Conditions

context

Medical devices are not used in isolation or solely in the lab but rather with humans and in the clinic or field. An MRI machine would be useless if it required more electricity than the national grid could provide; all the other hospital equipment would fail. Similarly, MRI images would be less useful if you needed a PhD in physics to be able to understand them. Likewise, ML medical devices must be practical and usable in relevant clinical conditions and interpretable by clinicians in those conditions.

implementation

To implement Principle Eight, tests designed to evaluate the ML medical device should be carried out in exactly the same conditions that the device is intended to be used in. And the limitations of the evaluation should be clear so that ML medical devices are not improperly used in scenarios for which they have not been adequately tested.

Such evaluations should be on-going for ML medical devices, e.g. when training datasets are improved or the clinical setting changes, or a new make/model of imaging device generates the data.

As with Principle Three, these tests should take into consideration the population represented by the data and consider how those factors may also confound evaluation and/or future clinical use.

Principle Nine: Users are Provided Clear, Essential Information

context

Ask a data scientist to answer a question and you may be hit with a barrage of technical information, statistics and caveats. But is that always what a clinician or patient using an ML medical device needs? Probably not.

So what does the end-user need from an ML medical device? Well, that’s obviously going to be contextually dependent. However, certain key features need to be considered when providing the end-user with the outputs of any ML medical device.

For gliff.ai, this principle seems reminiscent of medicines — whether that be prescription or over-the-counter:

  • Take a packet of painkillers off the shelf and you will notice that the front of the pack tells you what the painkiller is, its intended use (pain relief, anti-inflammatory, headache relief, etc.) and dose — the first group of important information, displayed very clearly to prevent accidental confusion of two drugs.
  • Next, look at the back of the packet — you’ll now see the correct usage (two tablets twice a day, no less than four hours between doses, do not take more than four tablets a day, etc.) and also key warnings about the product’s use (overdose amounts, dangers for children or other subgroups, allergens) — the second group of important information, again displayed very clearly to prevent accidental misuse of the drug.
  • Finally, you’ll see “Please read the enclosed leaflet before use” on the packet and, on doing so, you will find all the caveats relating to side effects, frequency of side effects, subgroups that the drug has not been tested on (e.g. not tested on pregnant women) and more — the third group of important information, which ensures that users have a complete idea of the intended use of the medicine and also the “performance” of that medicine and its evaluation.

implementation

The multidisciplinary team put together in Principle One should include user interface/user experience and data visualisation experts who can repeat this approach when designing the interfaces for ML medical devices. Before using the device the user should know what the device is and what it should be used for (the front of the pack); how to use the device and any potential dangers (the back of the pack); and what the limitations and implications of the device being used are (the leaflet).

Unlike painkillers, though, ML medical devices also produce a result, like a pregnancy test, and again the user interface/user experience and data visualisation experts developing ML medical devices should consider how best to ensure the result is delivered to the end-user as clearly and succinctly as possible — with the knowledge that the “leaflet” information makes users aware of such things as ML medical device updates or modifications.

Principle Ten: Deployed Models are Monitored for Performance and Re-Training Risks are Managed

context

There are many reasons why the performance of an ML medical device might change after product release. Changes in the patient population might mean that the considerations for Principle Three are no longer valid; likewise, changes in the clinical environment (Principle Eight) may impact the performance of the human-AI team (Principle Seven). As such, models should be continuously monitored to ensure no degradation of performance is occurring that could potentially lead to patient harm.

A common solution to this is to regularly retrain the ML medical device on updated datasets that are also subject to the described principles above. Retraining introduces potential risks, e.g. the ML medical device might provide different diagnostic/prognostic information for a patient before and after retraining. Retraining of models should, therefore, undergo the same rigorous controls as the original model development, i.e. comparison to high quality reference datasets (Principle Five); focussing on the human-AI team (Principle Seven) and on clinically-relevant testing (Principle Eight) and so on.

implementation

This principle highlights the need to consider performance measurements and risk assessment and mitigation continuously throughout the development lifecycle and the product lifetime. These processes may be the same for both development and post-deployment monitoring and are covered by Principles One to Nine.
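As an illustration of what continuous post-deployment monitoring could look like, the sketch below tracks rolling accuracy over recently confirmed cases and raises an alert when it drops below an agreed threshold; the window size, threshold and function name are assumed values for the sketch:

```python
# A minimal sketch of post-deployment performance monitoring, assuming that
# each prediction is eventually paired with a clinically confirmed outcome.
from collections import deque

WINDOW = 200            # assumed number of recent confirmed cases to track
ALERT_THRESHOLD = 0.90  # assumed minimum acceptable rolling accuracy

recent_results = deque(maxlen=WINDOW)

def record_case(prediction, confirmed_outcome):
    """Record one confirmed case and alert if rolling performance degrades."""
    recent_results.append(prediction == confirmed_outcome)
    if len(recent_results) == WINDOW:
        rolling_accuracy = sum(recent_results) / WINDOW
        if rolling_accuracy < ALERT_THRESHOLD:
            print(f"Performance alert: rolling accuracy {rolling_accuracy:.2f}")
```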

One key practical impact is that training/test datasets and ML models should be version controlled, with the ability to easily roll back the device to an earlier, better-performing version and/or to compare the performance of different versions of the model trained on different datasets. This brings ML medical devices in line with industry-standard software engineering practices (see Principle Two).
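To illustrate, a minimal registry entry might tie each model version to the dataset version it was trained on, so an earlier, better-performing version can be identified and redeployed; the field names and values below are hypothetical and not a reference to any particular registry product:

```python
# A minimal sketch of a model registry that links model versions to dataset
# fingerprints and recorded test performance, enabling comparison and rollback.
registry = [
    {"model_version": "1.2.0", "dataset_fingerprint": "a3f9", "test_accuracy": 0.94},
    {"model_version": "1.3.0", "dataset_fingerprint": "b71c", "test_accuracy": 0.91},
]

# Pick the best-performing recorded version, e.g. for a rollback decision.
best = max(registry, key=lambda entry: entry["test_accuracy"])
print(f"Candidate for deployment: model {best['model_version']} "
      f"(trained on dataset {best['dataset_fingerprint']})")
```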

Lucille Valentine

Innovation strategist. Passionate about developing good and fair AI that will benefit patients. Head of Regulation and Compliance at gliff.ai.