Feedback on the FDA’s Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning (AI/ML)-Based Software

Ahmed Hosny
16 min read · Aug 21, 2019


This post is in response to the call for feedback regarding the “Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD)” document put forward by the FDA on Apr 2, 2019. Contents of this post were uploaded to the regulations.gov comments section on May 3rd, 2019.

We applaud the FDA’s initiative to address the life-long learning aspects of some AI systems. What follows is an excerpt from a review our lab published in Nature Reviews Cancer in May 2018. The excerpt outlines regulatory aspects of governing AI-based system submissions seeking approval. We also specifically identify AI systems’ continuous learning as requiring special regulatory oversight to ensure improvements and mitigate risks.

“From a regulatory perspective, discussions are underway regarding the legal right of regulatory entities to interrogate AI frameworks on the mathematical reasoning for an outcome. While such questioning is possible with explicitly programmed mathematical models, new AI methods such as deep learning have opaque inner workings, as mentioned above. Sifting through hundreds of thousands of nodes in a neural network, and their respective associated connections, to make sense of their stimulation sequence is unattainable. An increased network depth and node count brings more complex decision making together with a much more challenging system to take apart and explore. On the other hand, we find that many safe and effective US Food and Drug Administration (FDA)-approved drugs have unknown mechanisms of action. From that perspective and despite the degree of uncertainty surrounding many AI algorithms, the FDA has already approved high-performance software solutions, though they are known to have somewhat obscure working mechanisms. Regulatory bodies, such as the FDA, have been regulating CADe and CADx systems that rely on machine learning and pattern-recognition techniques since the earliest days of computing. However, it is the shift to deep learning that now poses new regulatory challenges and requires new guidance for submissions seeking approval. Even after going to market, deep learning methods evolve over time as more data are processed and learned from. Thus, it is crucial to understand the implications of such lifelong learning in these adaptive systems. Periodic testing over specific time intervals could potentially ensure that learning and its associated prediction performance are following forecasted projections. Additionally, such benchmarking tests need to adapt to AI specifics such as the sensitivity of prediction probabilities in CNNs.” (Hosny et al. 2018)

Our comments are summarized as follows:

  • Definition of AI/ML: The definition of AI/ML should be established more broadly to better situate this discussion within the context of how the FDA has, for many years, been regulating medical software with “intelligence” components. A deeper explanation of why continuous learning is a crucial component of some SaMD today would also be helpful.
  • Definition of “locked”: The definition of a “locked” algorithm requires further clarification. Additionally, a clear distinction must be made between the “locked” state of the AI/ML model and that of the software environment running it, as software updates are inevitable.
  • SaMD modification types: A clear distinction should be made between the motivation behind a modification and the specifics of its implementation. We recommend that the categories of AI/ML-SaMD modifications be tied to how (and where) these modifications are implemented rather than to the motivation behind them. In addition to categorizing modifications by motivation, we propose the following six attributes to further characterize the implementation of a modification: Criteria, Site, Frequency, Context, Impact, and Enforcement. These are further explained in this document.
  • SPS vs ACP: A clearer distinction should be made between the SPS and the ACP. While the SPS’s scope can be limited to a high-level description of the modification envelope, the ACP can be purely technical.
  • ACP structure: The ACP could include required and optional components to better generalize to the wide range of modifications described in the SPS.
  • Scope of modifications: We propose that the SPS+ACP combination be limited to modifications that do not lead to a new intended use as, in many cases, it is very difficult to anticipate the potential risks involved when a SaMD is to be used in a different healthcare setting. Additionally, a “change in intended use” can vary widely in scope and must be further defined.

The aforementioned comments are expanded within the following sections:

  • Section 1: Comments on “I. Introduction” and “II. Background: AI/ML-Based Software as a Medical Device”
  • Section 2: Comments on “III. Types of AI/ML-based SaMD Modifications”
  • Section 3: Comments on “IV. A Total Product Lifecycle Regulatory Approach for AI/ML-Based SaMD”

Section 1: Comments on “I. Introduction” and “II. Background: AI/ML-Based Software as a Medical Device”

“The ability for AI/ML software to learn from real-world feedback (training) and improve its performance (adaptation) makes these technologies uniquely situated among software as a medical device (SaMD) and a rapidly expanding area of research and development.”

The introduction section implies that the introduction of AI/ML-based technologies in medical devices is relatively novel. In fact, the FDA has been regulating computer-aided detection (CADe) and computer-aided diagnosis (CADx) systems, which inherently contain AI/ML components, for many years. These systems, however, mainly relied on rule-based methods or a priori handcrafted features. Since around 2012, deep learning has become the de facto method for building machine learning models and has contributed to many of the breakthroughs witnessed recently. While both kinds of systems have the ability to “learn from real-world feedback (training)”, deep learning methods can automatically learn feature representations from data without the need for prior definition by human experts. This is believed to be the main reason behind deep learning methods’ superior performance over their traditional rule-based counterparts. Figure 1 below outlines the technical differences between these two methods, using radiographic image analysis as an example.

Figure 1 Artificial intelligence methods in medical imaging. This schematic outlines two artificial intelligence (AI) methods for a representative classification task, such as the diagnosis of a suspicious object as either benign or malignant. (a) The first method relies on engineered features extracted from regions of interest on the basis of expert knowledge. Examples of these features in cancer characterization include tumour volume, shape, texture, intensity and location. The most robust features are selected and fed into machine learning classifiers. (b) The second method uses deep learning and does not require region annotation — rather, localization is usually sufficient. It comprises several layers where feature extraction, selection and ultimate classification are performed simultaneously during training. As layers learn increasingly higher-level features, earlier layers might learn abstract shapes such as lines and shadows, while other deeper layers might learn entire organs or objects. Both methods fall under radiomics, the data-centric, radiology-based research field. Taken from (Hosny et al. 2018)

“the critical question of when a continuously learning AI/ML SaMD may require a premarket submission for an algorithm change”, “The traditional paradigm of medical device regulation was not designed for adaptive AI/ML technologies”

It might be beneficial to provide some context here as to why this continuous learning aspect is crucial in AI/ML systems today and has historically not been an issue. Research shows that the performance of previously utilized rule-based systems did not generally improve with retraining, i.e., training with more data did not necessarily translate to performance gains. Hence, there was little benefit in introducing a continuously learning rule-based system. On the other hand, research also shows that the performance of today’s deep learning methods scales with data, which explains why continuously evolving AI/ML systems have only surfaced recently and hence require new regulations. Figure 2 below illustrates how performance scales with the amount of data for traditional machine learning methods (in red) and more recent deep learning methods (yellow, blue, and green, in increasing model capacity). To capitalize on these potential performance gains, many propose to continuously retrain models with new data. As such, it is crucial to introduce regulations that govern these lifelong learning systems.

Figure 2 Scale drives deep learning performance. Deep learning course, Coursera, Andrew Ng.

“We define a “locked” algorithm as an algorithm that provides the same result each time the same input is applied to it and does not change with use. Examples of locked algorithms are static look-up tables, decision trees, and complex classifiers.” “Locked algorithms are those that provide the same result each time the same input is provided.”

We believe the definition of a locked model is ambiguous and requires further clarification. A locked model, in this context, refers to a model with fixed or locked parameters, i.e., the model has already been trained on a given dataset (a process in which the model parameters are adjusted to achieve the best performance possible) and is not trained further. On the other hand, the requirement that a model “provide the same result each time the same input is applied to it” applies to any model, whether locked or not. The stability of a model when predicting on a given input, technically termed inference, is an absolute minimum requirement for ensuring repeatability and robustness, and should not be confused with the model’s parameter state: fixed or continuously updated. Additionally, “static look-up tables, decision trees, and complex classifiers” are general examples of different types of machine learning models that could be either locked or unlocked. A decision tree could be locked without further training, or continuously retrained on new data points with the ultimate goal of improving performance and generalizability (the ability of a given model to generalize to previously unseen data, especially data that might differ in provenance, collection protocols, structure, etc.).
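
To make the distinction concrete, the following minimal sketch (in Python with scikit-learn, purely illustrative and unrelated to any particular SaMD) contrasts the parameter state of a locked model with that of a continuously updated one, while showing that deterministic inference is a property of both:

```python
# Minimal sketch (illustrative only): "locked" describes the parameter state,
# not determinism at inference time.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 4)), rng.integers(0, 2, size=200)

# A "locked" decision tree: trained once, parameters then frozen.
locked_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# An "adaptive" linear model: parameters keep changing as new data arrive.
adaptive = SGDClassifier(random_state=0)
adaptive.partial_fit(X, y, classes=np.array([0, 1]))

x_new = rng.normal(size=(1, 4))

# Determinism at inference holds for BOTH models at any fixed point in time.
assert (locked_tree.predict(x_new) == locked_tree.predict(x_new)).all()
assert (adaptive.predict(x_new) == adaptive.predict(x_new)).all()

# The adaptive model's parameters (and possibly its predictions) change after
# it is updated on new data; the locked tree's parameters never do.
adaptive.partial_fit(rng.normal(size=(50, 4)), rng.integers(0, 2, size=50))
```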

“However, not all AI/ML-based SaMD are locked; some algorithms can adapt over time.”

It is crucial to clarify here that the state of being locked is not a function of the model type. As such, any given machine learning model, regardless of its underlying methods, can be presented in a fixed “locked” state or, alternatively, be continuously updated. Additionally, a “locked” model can live in a continuously updated software environment (all software is updated periodically). Therefore, identifying an entire SaMD as being “locked” may not be entirely accurate. A clear distinction must be made between the “locked” state of the machine learning model itself and the software environment running it.

“Although AI/ML-based SaMD exists on a spectrum from locked to continuously adaptive algorithms, …”

In our opinion, the parameter state of a given machine learning model is binary, i.e., either fixed (locked) or being updated (whether through a one-time or periodic update). We do not see this state as lying on a continuous spectrum.

Section 2: Comments on “III. Types of AI/ML-based SaMD Modifications”

A clear distinction needs to be made between the motivation behind a modification and details of its implementation.

The motivation behind a given modification may generally include (but is not limited to):

  1. Proactively improving performance or avoiding performance degradation over time
  2. Tailoring the SaMD to a new intended use
  3. Improving the presentation and visualization of outputs to end users
  4. Improving the explainability of SaMD outputs
  5. Improving the detection of failure modes
  6. Responding to new medical hardware (e.g. imaging equipment) and the associated (potentially new) data types and formats
  7. Responding to inevitable changes in the practice of medicine over time
  8. Responding to new therapeutic discoveries
  9. Responding to newly discovered errors in data previously used in model training
  10. Introducing new features and/or fixing bugs (commonplace to any software product)

A modification can be described in terms of its:

  1. Criteria
  2. Site (where in the AI/ML pipeline it is to be introduced)
  3. Frequency
  4. Context
  5. Impact
  6. Enforcement

1. Criteria

A criterion for implementing a modification must be predetermined. For instance, if the motivation is to improve the performance of a SaMD, which performance metrics and which thresholds will be used to decide whether a modification is worth implementing?
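
As a purely hypothetical illustration (the metric, threshold values, and function names below are our own assumptions, not anything prescribed by the FDA), such a criterion could be encoded as an explicit gate that a retrained model must pass before the modification is implemented:

```python
# Hypothetical sketch: a predetermined criterion gating whether a retrained
# model is promoted. The metric and thresholds are illustrative assumptions.
from sklearn.metrics import roc_auc_score

MIN_IMPROVEMENT = 0.02        # minimum AUC gain over the deployed model
NON_INFERIORITY_FLOOR = 0.85  # absolute AUC floor the candidate must also clear

def modification_warranted(y_true, current_scores, candidate_scores):
    """Return True only if the candidate model meets the predetermined criteria."""
    current_auc = roc_auc_score(y_true, current_scores)
    candidate_auc = roc_auc_score(y_true, candidate_scores)
    return (candidate_auc - current_auc >= MIN_IMPROVEMENT
            and candidate_auc >= NON_INFERIORITY_FLOOR)
```

The key point is that the metric, the comparison against the currently deployed model, and the thresholds are all fixed in advance rather than chosen after the results are known.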

2. Site

“The types of modifications generally fall into three broad categories: …”

The three categories of AI/ML-SaMD modifications proposed here capture a large portion of potential updates to algorithms. However, we propose linking these modifications to the different components of the AI/ML-SaMD so as to provide further clarity. The figure below outlines a high-level schematic of an AI/ML pipeline deployed in a clinical setting. Each component can be tied to a specific type of modification. Hence, modifications can be identified based on the intervention site within the pipeline.

Figure 3 High-level schematic of AI/ML SaMD

We propose the following categories of modifications based on site (i.e., the point of intervention in the pipeline). These categories are not necessarily mutually exclusive:

1. Modifications to the inputs: This maps directly to the proposed “ii. Modifications related to inputs, with no change to the intended use”, i.e., allowing the AI/ML SaMD to work with new types of input data. Most of the current breakthroughs in AI research fall under the narrow AI category: AI that is able to perform one task and one task only. As such, it is unlikely in the near future that these “new types of input data” will be radically different from the original; they are more likely to be correlated with it. If the dimension (or size) of the new input data differs from the original, then this also requires modifications to the model itself, and hence the modification spills into category 2 below.

2. Modifications to the AI/ML model itself: This maps directly to the proposed “i. Modifications related to performance, with no change to the intended use or new input type”. The purpose could be enhancing performance (accuracy, sensitivity, specificity, etc.) by retraining on new data, accepting input data of different dimensions as mentioned above, or optimizing or pruning the model (removing redundant nodes and hence speeding up inference time, the time needed to obtain a prediction on a given input). These modifications can come in the form of (see the sketch after this list):

  • updating the model parameters (weights that are automatically learned from training data),
  • a change in the model hyperparameters (“tuning” knobs that allow developers to control how the model learns),
  • a change in the model architecture or structure,
  • a change in model cost/error/loss function,
  • or a combination of the above.
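
The following minimal PyTorch sketch (purely illustrative; the toy architecture and data are assumptions) shows where each of these modification targets lives in a typical training setup:

```python
# Minimal sketch (illustrative only): where each type of model modification lives.
import torch
from torch import nn

# model architecture/structure: the layers and their arrangement
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1))

# cost/error/loss function
loss_fn = nn.BCEWithLogitsLoss()

# hyperparameters: "tuning" knobs such as the learning rate and epoch count
learning_rate, epochs = 1e-3, 5
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# toy data standing in for curated training data
x, y = torch.randn(64, 16), torch.randint(0, 2, (64, 1)).float()

for _ in range(epochs):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()  # model parameters (weights) are updated here

# A "locked" release freezes model.state_dict(); retraining on new data changes
# the parameters, while edits to the layers or to loss_fn change the
# architecture and the loss function, respectively.
```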

3. Modifications to the outputs: This would be a new category that encompasses part of the proposed “iii. Modifications related to the SaMD’s intended use”. The healthcare setting or intended use might remain unchanged, but the AI/ML SaMD can provide further information through the model outputs. The example provided “a change in the significance of information provided by the SaMD (e.g., from a confidence score that is ‘an aid in diagnosis’ (drive clinical management) to a ‘definitive diagnosis’ (diagnose)).” fits well within this category of modifications.

4. Modifications to the software environment: How will modifications be handled when they target software components other than the input data, the model, or the output data themselves? These could include pre-processing carried out on the input data or post-processing carried out on the output data. Examples include normalization, standardization, interpolation, cropping, etc. For simplicity, these modifications could also be folded into modifications #1 and #3 proposed here, respectively. Ultimately, software upgrades are inevitable.

5. Modifications to the healthcare setting: This category deals with intended use and encompasses a part of the proposed “iii. Modifications related to the SaMD’s intended use”.

3. Frequency

Modifications should also be identified in terms of their frequency:

  1. One-time: A one-time intervention with no further modifications planned. An example could include a one-time pruning of the model to speed up inference time.
  2. Periodic: A periodic modification could include retraining with new input data at regular intervals. During each interval, the data could be collected and curated, and the retraining validated, before the AI/ML SaMD is re-deployed.
  3. Online: Modifications where retraining occurs on the fly every time inference is run on input data. For instance, a tumor contouring model could output a segmentation map and present it to the human operator. The operator can then approve it or otherwise adjust it. These adjustments can then be used to retrain the model on that single data point, known as online learning in the ML literature (a minimal sketch follows this list).
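
For the online case, the sketch below is purely illustrative: it uses a generic scikit-learn classifier rather than a contouring model, and the operator’s correction is simply passed in. It only shows the basic mechanism of updating a model on a single corrected example at inference time:

```python
# Minimal sketch (illustrative only) of "online" modification frequency:
# the model is updated on a single operator-corrected example at inference time.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(random_state=0)

# initial training on curated data
X0, y0 = rng.normal(size=(500, 10)), rng.integers(0, 2, size=500)
model.partial_fit(X0, y0, classes=np.array([0, 1]))

def infer_and_learn(x_case, operator_label):
    """Predict on one case, then learn from the operator's (corrected) label."""
    prediction = model.predict(x_case.reshape(1, -1))[0]
    # In practice operator_label would come from the clinician accepting or
    # adjusting the output; here it is supplied directly for illustration.
    model.partial_fit(x_case.reshape(1, -1), [operator_label])
    return prediction

# simulated usage: the operator "confirms" a label of 1 for a new case
print(infer_and_learn(rng.normal(size=10), operator_label=1))
```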

4. Context

Modifications should also be identified in terms of their context (referred to in the proposal as global vs. local):

  1. General: These modifications are applied across all deployed instances of the AI/ML SaMD regardless of the setting and healthcare centers.
  2. Setting-specific: These modifications are specific to a particular setting. For instance, an algorithm could be modified to better align with existing patient care protocols in a given hospital, cater to the preferences of specific users, tailor the presentation of outputs to users’ skill levels, or better serve a certain demographic by adjusting its sensitivity/specificity. This also raises another important issue around the regulation of AI/ML SaMD when multiple versions of it exist with slight differences among them.

5. Impact

Operational impact: Will it require human operators to be retrained? Will it affect the functionality of other AI/ML systems further down the clinical workflow? Will it alter or otherwise negate outputs from the previous versions? Do AI/ML model outputs have an “expiration date” or “validity period” that should be considered in light of this modification?

Patient impact: Will it require patient consent to be updated? Should legacy patients be informed of this modification and what impact does it have on them?

6. Enforcement

Can end users choose to opt out of receiving a model update? Can users rollback to a previous version?

Section 3: Comments on “IV. A Total Product Lifecycle Regulatory Approach for AI/ML-Based SaMD”

“1. Quality Systems and Good Machine Learning Practices (GMLP):”

Other examples of GMLP considerations as applied to SaMD may include:

  • Ensuring medical data sourcing adheres to established ethical and fairness guidelines (Vayena, Blasimme, and Cohen 2018). In addition to being clinically relevant, the acquired data should also reflect the intended target patient pool, including mitigation of bias across demographics such as sex, age, and race.
  • Ensuring reproducibility, repeatability, and portability of the computational AI/ML environments through controlling runtime environments and third party library versions. The use of containerization schemes (e.g. Docker) to achieve such control has recently grown in popularity.
  • Implementing a robust model versioning protocol to track changes and updates in model architecture and parameters during development as well as deployment. The lifecycle management system for deep learning models proposed by Miao et al. (2016) is an example of such a protocol (a hypothetical sketch follows this list).
  • In addition to avoiding data leakage between train/tune/test sets as mentioned, seeking external and independent test datasets to ensure generalizability and robustness.
  • Conducting rigorous unit testing to ensure the AI/ML pipeline is working as intended. This is a common software engineering best practice. In this case, however, it is approached in a slightly different manner given the existing AI/ML components, and it is not to be confused with testing the model against test data.
  • Identifying edge conditions and potential failure modes that may not exist in the collected data. For instance, a model trained to identify cancer nodules in chest radiographs may fail when presented with a patient with a collapsed left lung.
  • Adherence to relevant ISO software standards, agile development practices, and the like.
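
As one hypothetical illustration of the versioning point above (all field names, paths, and values are assumptions, not a reference to any specific FDA-endorsed or published scheme), a release record could fingerprint the model weights together with the training data identifier and pinned library versions:

```python
# Hypothetical sketch of a model versioning record for traceability.
import hashlib
import platform
from datetime import datetime, timezone

def model_version_record(weights_path, training_data_id, library_versions):
    """Fingerprint a released model together with its data and environment."""
    with open(weights_path, "rb") as f:
        weights_hash = hashlib.sha256(f.read()).hexdigest()
    return {
        "weights_sha256": weights_hash,
        "training_data_id": training_data_id,   # e.g. a curated dataset version
        "library_versions": library_versions,   # pinned third-party dependencies
        "python_version": platform.python_version(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }

# Hypothetical usage:
# record = model_version_record("model_v3.pt", "lung-ct-2019-04", {"torch": "1.1.0"})
```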

“2. Initial Premarket Assurance of Safety and Effectiveness:”

“SaMD Pre-Specifications (SPS)”

As outlined, the SPS should focus on a high-level view of “what the manufacturer intends the algorithm to become as it learns”. As such, it should include:

The type of each modification, described in terms of:

  • the motivation or objective behind it
  • its criteria, site, frequency, context, impact, and enforcement (as proposed in Section 2)

“Algorithm Change Protocol (ACP)”

We believe this component should be renamed the SaMD Change Protocol to better encompass changes that may affect components of the SaMD other than the algorithm (or model) itself, as outlined in Section 2. We also believe this component should focus on the technical details pertaining to implementing the modifications outlined in the SPS. This will create a clear distinction between the two documents without overlap. As such, the ACP can be highly technical and will therefore vary depending on the modification described in the corresponding SPS. For instance, “re-training objectives” (Figure 4) is a high-level concept and should thus be included in the SPS (see above) rather than the ACP.

Additionally, we believe the algorithm change protocol components in Figure 4 (data management, re-training, performance evaluation, and update procedures) do not generalize very well to the wide range of modifications possible. For instance, changes in the output of a model and how that output is communicated to users might not require descriptions of data management or re-training. We propose presenting some of these components as optional in the ACP, so that developers may pick and choose the most appropriate components to adequately describe the modifications outlined in the SPS (a hypothetical sketch of such a structure follows the lists below). These components may include:

Required components for any kind of proposed modification:

  • Validation type (analytical, clinical, both)
  • Validation benchmark (reference standard)
  • Validation metrics, statistics, etc.
  • Testing (testing protocol including testing the entire software environment)
  • Deployment (deployment plan in production environments)
  • Communication to end users
  • Effects on legacy patients/users
  • Effects on patients under treatment during modification rollout
  • Protocols for potential recalls and rollback to previous versions

Optional components to pick and choose from depending on the nature of the SPS:

  • Data management (collection, curation, QA, licensing, etc.)
  • Input modification (changes to input size, changes in pre-processing)
  • Model modification (description of changes and how they are implemented)
  • Output modification (changes in post-processing or communication of results to users)
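
As a purely hypothetical sketch of how such a structure could be expressed in code (the field names simply mirror the lists above and carry no regulatory weight):

```python
# Hypothetical sketch: an ACP with required fields and optional,
# modification-dependent fields; field names mirror the lists above.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SaMDChangeProtocol:
    # required for any proposed modification
    validation_type: str                   # "analytical", "clinical", or "both"
    validation_benchmark: str              # reference standard
    validation_metrics: List[str]
    testing_protocol: str
    deployment_plan: str
    user_communication: str
    legacy_patient_effects: str
    in_treatment_patient_effects: str
    recall_and_rollback_protocol: str
    # optional, depending on the nature of the SPS
    data_management: Optional[str] = None
    input_modification: Optional[str] = None
    model_modification: Optional[str] = None
    output_modification: Optional[str] = None
```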

“3. Approach for modifications after initial review with an established SPS and ACP”

The comment hereafter is also in response to “In these cases, it may not be appropriate for a proposed SPS and ACP to manage the risks to patients or align with the initial authorized intended use.”

If the FDA identifies modifications leading to a new intended use as “potentially” requiring a premarket review (as outlined in Figure 5), then it might be optimal to establish from the outset that the SPS+ACP, by definition, should not include modifications leading to a new intended use. This will:

A. limit and focus both the scope of, and expectations from, the SPS+ACP, and

B. further simplify the approach. In such a case, if the modifications fall outside the agreed SPS+ACP, then they automatically require a premarket review for one or both of the following reasons:

  • Either the modifications lead to a new intended use (a premarket review is inevitable here)
  • or the modifications do not lead to a new intended use but have not been initially included in the SPS+ACP by the developer (For such cases, the focused FDA review can take place. This may allow the developer to propose a modification to the SPS+ACP on the condition that this modification does not lead to a new intended use — prior to submitting the SaMD modification itself)

“4. Transparency and real-world performance monitoring of AI/ML-based SaMD”

While “modification notices” to patients, clinicians, and general users are essential for communication and transparency, we fear they may never be consulted. This is true for any “updated terms of use” document or “new version release” statement for any software. These often include highly specialized legal and technical jargon respectively. They are often agreed to without being read in their entirety. How do we avoid such a situation for newly modified SaMD? How can developers present the information in a concise and simple manner without undermining transparency and completeness?

While real-world performance monitoring is crucial, in some cases it might be unattainable, as ground truth data (to compare against the SaMD outputs while it is in production) may either be unavailable or only obtainable at a future time point. An example of this could be patient outcome and survival endpoints that may take years to collect. The regulatory requirement for this kind of monitoring may hopefully push developers to devise means of collecting ground truth data during SaMD deployment where possible. For instance, a lung nodule detection SaMD may present its findings and ask the operator to either accept or reject them. As such, the SaMD can be deployed while performance data are collected simultaneously, with human-AI interaction proving to be a main source of real-world performance data.
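
A hypothetical sketch of such a feedback loop (the log format, fields, and file path are assumptions) could be as simple as recording each accept/reject decision alongside the model version:

```python
# Hypothetical sketch: logging operator accept/reject decisions at the point of
# care as a proxy ground truth for real-world performance monitoring.
import csv
from datetime import datetime, timezone

LOG_PATH = "real_world_feedback.csv"  # assumed log destination

def log_operator_feedback(case_id, model_version, model_output, operator_accepted):
    """Append one human-AI interaction to the monitoring log."""
    with open(LOG_PATH, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.now(timezone.utc).isoformat(),
            case_id, model_version, model_output, int(operator_accepted),
        ])

def running_acceptance_rate():
    """Fraction of outputs accepted by operators: a crude real-world signal."""
    with open(LOG_PATH) as f:
        rows = list(csv.reader(f))
    return sum(int(r[-1]) for r in rows) / max(len(rows), 1)
```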

In cases where real-world “ground truth” is unattainable, performance monitoring can also be conducted offline. This can be through regular testing on pre-approved benchmarking datasets that are known to be representative of the patient population served by the SaMD. However, these datasets might not remain representative over time as diseases, demographics, and medicine are constantly evolving.

Another means of reporting performance can come in the form of declaring the deviation in outputs for the same input data across multiple “modified” versions of the SaMD.
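
A minimal, hypothetical sketch of such a report (where model_old and model_new stand in for two released versions exposing a simple callable interface) might summarize the deviation as follows:

```python
# Hypothetical sketch: summarizing output deviation between two "modified"
# versions of a model on the same fixed benchmark inputs.
import numpy as np

def output_deviation(model_old, model_new, benchmark_inputs):
    """Compare predictions of two model versions on identical inputs."""
    old = np.asarray([model_old(x) for x in benchmark_inputs], dtype=float)
    new = np.asarray([model_new(x) for x in benchmark_inputs], dtype=float)
    return {
        "mean_abs_deviation": float(np.mean(np.abs(new - old))),
        "max_abs_deviation": float(np.max(np.abs(new - old))),
        "fraction_changed": float(np.mean(new != old)),
    }
```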

For further questions or clarifications, please do not hesitate to contact us at ahmed_hosny@dfci.harvard.edu or ahmed@ahmedhosny.com.

References

Hosny, Ahmed, Chintan Parmar, John Quackenbush, Lawrence H. Schwartz, and Hugo J. W. L. Aerts. 2018. “Artificial Intelligence in Radiology.” Nature Reviews Cancer, May. https://doi.org/10.1038/s41568-018-0016-5.

Miao, Hui, Ang Li, Larry S. Davis, and Amol Deshpande. 2016. “ModelHub: Towards Unified Data and Lifecycle Management for Deep Learning.” arXiv [cs.DB]. arXiv. http://arxiv.org/abs/1611.06224.

Vayena, Effy, Alessandro Blasimme, and I. Glenn Cohen. 2018. “Machine Learning in Medicine: Addressing Ethical Challenges.” PLoS Medicine 15 (11): e1002689.
