AutoTrial: A Leap Towards Automated Clinical Trial Design — Unveiled at EMNLP’23

Jimeng
Oct 8, 2023


Clinical trials are the cornerstone of drug development, ensuring the safety and efficacy of new medical interventions. However, designing a robust clinical trial is complex and demands precise eligibility criteria for participant recruitment. Devising these criteria is a significant challenge: nearly 57% of trial protocols undergo at least one substantial amendment, leading to significant financial losses and time delays.

In our new paper accepted at EMNLP’23, we introduce “AutoTrial: Prompting Language Models for Clinical Trial Design,” a method that automates the design of clinical eligibility criteria by leveraging large language models (LLMs). The method builds on the ability of LLMs to generate coherent, human-like text, which we harness to automate the critical planning stage of clinical trials. The paper focuses on generating eligibility criteria for clinical trial protocols, but the approach can be extended to other sections of the protocol.

Key Features of AutoTrial:

1. Comprehending Instructions:
— AutoTrial is designed to comprehend key trial information and additional instructions to generate precise eligibility criteria tailored to the specified objectives of a trial.

2. Referring to Prior Studies:
— Mirroring the human experts’ practice of referencing prior successful trials, AutoTrial leverages context information to generate enhanced trial designs.

3. Rationalizing the Generation:
— Offering a rationale behind the generated criteria is a fundamental feature of AutoTrial, enabling clinical experts to understand and adopt the generation results in practice.

4. Technical Enrichments:
— Instruction prompting for granular control, scalable knowledge expansion via a hybrid of external and internal memory, and explicit supervision for generating grounding rationales are among the technical features that power AutoTrial.

AutoTrial Method

AutoTrial uses a decoder-based architecture to generate a target criterion based on input trial synopsis and manual instructions. The training process has two stages: pretraining and finetuning.

In the pretraining stage, the model is trained on a large corpus of trial documents. This helps the model to learn to reason through multiple steps and mimic the retrieved input criteria exemplars.

In the finetuning stage, the model is trained to generate the target criterion according to the input instructions. For instance, given an instruction <age>, the model populates a criterion describing the participants’ age requirement.

It is worth noting that the model can be extended to new instructions and trial exemplars without retraining. You can see the flowchart of the process in Figure 1.

Next, we will discuss the details of the training and inference procedures of AutoTrial.

Figure 1: The workflow of the proposed AutoTrial. Step I: pre-train on unlabeled trial documents with prompts to mimic the multi-step reasoning. Step II: finetune the model to generate criteria under instructions. Step III: generate diverse target criteria by instructions with large-scale sampling plus clustering and ranking.

Problem Setup
The generation model, represented by a function f, creates a target criterion yc from the input x = {xs, xe, xr}. Here, xs denotes the trial setup, a combination of the trial title, condition, and treatment as depicted in Figure 1. xr is the discrete prompt specifying the target criterion; for instance, “bmi” prompts the model to formulate a criterion for participants’ body mass index. xe denotes exemplars retrieved from pertinent trials that aid the in-context learning of LLMs; it is defined as xe = {xe^t, xe^r, xe^c}, comprising the reasoning steps xe^t, the targeting instruction xe^r, and the target criterion xe^c, which states the requirement per the instruction.

Additionally, a continuous prompt hp, tailored to each instruction type (e.g., the target entity the criterion must cover), steers the model. The model is trained to generate criteria y through multi-step reasoning that culminates in the target criterion, expressed as y = f(xs, xe, xr, hp) in Eq. (1). With reference to the exemplar xe, the model outputs y = yt ⊕ yc, where yt contains the reasoning steps and yc is the target criterion.
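
To make the notation concrete, here is a minimal Python sketch of how one input example could be organized. The class and field names are illustrative assumptions, not the paper’s implementation; the example values are taken from the prompt examples later in this post:

```python
from dataclasses import dataclass

@dataclass
class Exemplar:
    """An exemplar x_e retrieved from a pertinent prior trial."""
    reasoning_steps: list  # x_e^t: criteria forming the reasoning chain
    instruction: str       # x_e^r: the targeting instruction, e.g. "age"
    target_criterion: str  # x_e^c: the criterion answering the instruction

@dataclass
class TrialInput:
    """One input x = {x_s, x_e, x_r} to the generation model f."""
    setup: dict            # x_s: trial title, condition, and treatment
    exemplars: list        # x_e: exemplars retrieved from related trials
    instruction: str       # x_r: discrete prompt naming the target criterion

example = TrialInput(
    setup={"title": "The Ultimate Drug Trial",
           "condition": "Diabetes",
           "treatment": "MagicPill 2.0"},
    exemplars=[Exemplar(
        reasoning_steps=["Must have had Diabetes for over 5 years",
                         "No history of heart disease"],
        instruction="age",
        target_criterion="age is above 18 yrs old")],
    instruction="bmi",  # ask the model for a body-mass-index criterion
)
```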

Hybrid Prompting Strategy

AutoTrial introduces two prompting strategies, discrete and neural prompting, to guide the LLM in generating criteria that follow specific instructions.

1. Discrete Prompting

The discrete prompt is motivated by in-context learning: the reasoning ability of LLMs can be enhanced via input-output exemplars, e.g., the concatenation of a series of criteria xe^t, the target instruction xe^r, and the target criterion xe^c. We formulate the discrete prompts with specialized tokens:

Trial Setup: This lays the foundation by introducing the main components of the study. Tags like <title>, <disease>, and <treatment> give an overview of what the trial is about.

For example:

<title> The Ultimate Drug Trial
<disease> Diabetes
<treatment> MagicPill 2.0

In-context Exemplar: This part sets the specific conditions for participants, determining who can join the trial and who cannot. Tags like <inc> and <exc> clearly mark the inclusion and exclusion criteria.

For instance:

<inc> Must have had Diabetes for over 5 years
<exc> No history of heart disease

To help guide the model in generating the criteria, there’s an instruction wrapped with the <statement> tag. This tag essentially points the model in the right direction.

For example:

<statement> age

leading the model to generate:

<target> age is above 18 yrs old

Textual Instruction: This is a direct command that emphasizes what the model should focus on next. It’s like providing a spotlight on a specific topic. The <statement> tag is used to specify this topic or area of interest.

For instance:

<statement> gender

These structured prompts and guidelines let the model generate outputs that align with what’s needed, streamlining the communication between the user and the model.
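
Putting the pieces together, the sketch below shows one way such a tagged prompt string could be assembled. The helper function and its serialization details are illustrative assumptions; only the tag vocabulary comes from the paper:

```python
def build_discrete_prompt(setup, exemplar_criteria, instruction):
    """Serialize the trial setup, in-context exemplar criteria, and the
    textual instruction into a single tagged prompt string."""
    parts = [
        f"<title> {setup['title']}",
        f"<disease> {setup['disease']}",
        f"<treatment> {setup['treatment']}",
    ]
    # In-context exemplar: inclusion ("inc") / exclusion ("exc") criteria.
    for kind, text in exemplar_criteria:
        parts.append(f"<{kind}> {text}")
    # Textual instruction: the topic the next criterion must cover.
    parts.append(f"<statement> {instruction}")
    return "\n".join(parts)

prompt = build_discrete_prompt(
    {"title": "The Ultimate Drug Trial",
     "disease": "Diabetes",
     "treatment": "MagicPill 2.0"},
    [("inc", "Must have had Diabetes for over 5 years"),
     ("exc", "No history of heart disease")],
    "age",
)
# The model is then expected to continue with:
# "<target> age is above 18 yrs old"
```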


2. Neural Prompting

In addition to discrete prompting, AutoTrial uses a second strategy: neural prompting. This method operates at the embedding level and is described in Section 3.2.2 of the paper. While the mathematical notation may look complex, the high-level concept is simple: the embedding matrix H<l of a text input x<l is supplemented with another embedding hp = MLP(Er[i, :]) corresponding to the i-th instruction in the instruction set I, where Er is a trainable embedding matrix. The final augmented embedding is H̃<l = hp ⊕ H<l.

Neural prompting is modular, making it easy to incorporate additional instructions I′ by simply expanding the index set to I = {I, I′} and the embedding matrix to Er = {Er, E′r}. When finetuning the model on new data, we can update only the new instruction embeddings E′r while keeping the rest of the model frozen. This allows the model to learn to generate under a wider range of instructions while minimizing the risk of catastrophic forgetting, i.e., performance degradation on previously learned data.
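
A minimal PyTorch sketch of this idea. The module name, MLP shape, and prompt length are illustrative assumptions; the exact module is described in Section 3.2.2 of the paper:

```python
import torch
import torch.nn as nn

class NeuralPrompt(nn.Module):
    """Maps an instruction index i to a prefix embedding hp = MLP(Er[i, :])
    that is prepended to the token embedding matrix H<l."""
    def __init__(self, num_instructions, embed_dim, prompt_len=1):
        super().__init__()
        self.prompt_len = prompt_len
        self.Er = nn.Embedding(num_instructions, embed_dim)  # trainable Er
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.Tanh(),
            nn.Linear(embed_dim, prompt_len * embed_dim),
        )

    def forward(self, instruction_ids, token_embeds):
        # hp: (batch, prompt_len, embed_dim)
        hp = self.mlp(self.Er(instruction_ids))
        hp = hp.view(token_embeds.size(0), self.prompt_len, -1)
        # H̃<l = hp ⊕ H<l: concatenate along the sequence dimension.
        return torch.cat([hp, token_embeds], dim=1)
```

Extending to new instructions I′ then amounts to enlarging Er and training only the new rows while the rest of the model stays frozen.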

Multi-stage Training

AutoTrial builds a dataset of pairs of input instructions (xr) and their respective criteria (yc). We derive clinical relationships from the raw criteria to formulate the training and testing data; for example, we extract the relation "NYHA ∈ {III, IV}" from the criterion "NYHA class is above II." However, the parser cannot always extract all relevant instructions from all available trial documents. We therefore train our method in two stages: first, pretraining on a large set of unlabeled trial documents, and then finetuning on the processed dataset of instruction-criteria pairs. This approach makes the most of the available data and improves the model's performance.
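
As a toy illustration of the relation-extraction step, the rule below maps the NYHA example above onto its ordinal scale. AutoTrial's actual parser is far more general; this single hand-written rule is an assumption for demonstration only:

```python
# Ordinal scale for NYHA heart-failure classes.
NYHA_SCALE = ["I", "II", "III", "IV"]

def parse_nyha_criterion(text):
    """Turn 'NYHA class is above II' into the relation ('NYHA', {'III', 'IV'})."""
    tokens = text.replace(".", "").split()
    if "NYHA" in tokens and "above" in tokens:
        threshold = tokens[tokens.index("above") + 1]
        idx = NYHA_SCALE.index(threshold)
        return "NYHA", set(NYHA_SCALE[idx + 1:])
    raise ValueError("unsupported criterion pattern")

assert parse_nyha_criterion("NYHA class is above II") == ("NYHA", {"III", "IV"})
```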

Pretraining. We create a pretraining dataset Cpre = {(xs, xe, yt, yc)i}, where the model f learns to generate y = yt ⊕ yc as in Eq. (1). The inputs comprise the trial setup xs and the exemplar xe, which is itself composed of multiple criteria. We include prompts and special tokens in the pretraining stage: specifically, we emphasize the step-by-step reasoning task by inserting the separator tokens <inc> and <exc> into xe and yt, and the model is supervised to generate the intermediate rationales before yielding the target criterion.
Our method is built on a decoder-based causal language model (e.g., GPT-2 (Radford et al., 2019)), where the decoder predicts y autoregressively. Denoting the learned decoding distribution by pθ(·), the objective is maximum likelihood estimation, i.e., minimizing the negative log-likelihood

L_MLE = − Σ_{l=1…L} log pθ(yl | y<l, x, hp),

where y<l are the tokens in y before the l-th token and L is the total number of tokens in the target y.

Finetuning. After pretraining, the model is finetuned on the dataset C and taught to follow the instruction when generating criteria. The inputs and outputs are as described in Eq. (1). In addition to the MLE loss above, we apply a contrastive loss L_CL to enhance the model's representation learning:

L_CL = (1 / (L(L − 1))) Σ_{l=1…L} Σ_{l′≠l} max{0, ρ − s(hyl, hyl) + s(hyl, hyl′)},

where hyl is the embedding of token yl, ρ is a pre-defined margin, and s(·,·) is the cosine similarity function.

The finetuning loss combines the objectives into L_FT = L_MLE + L_CL.
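
The sketch below implements this combined objective for a single target sequence. The contrastive term follows the standard token-level margin formulation implied by the description above (note s(hyl, hyl) = 1 for cosine similarity); shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

def finetune_loss(logits, targets, hidden, rho=0.5):
    """L_FT = L_MLE + L_CL for one target sequence y.
    logits:  (L, |V|) decoder outputs
    targets: (L,)     token ids of y
    hidden:  (L, d)   token embeddings h_{y_l}
    rho:     pre-defined margin
    """
    # L_MLE: negative log-likelihood of the target tokens.
    l_mle = F.cross_entropy(logits, targets)

    # L_CL: push the embeddings of distinct tokens apart by margin rho,
    # measured with cosine similarity s(., .).
    h = F.normalize(hidden, dim=-1)
    sim = h @ h.t()                              # s(h_{y_l}, h_{y_l'})
    L = sim.size(0)
    margins = torch.clamp(rho - sim.diag().unsqueeze(1) + sim, min=0.0)
    off_diag = ~torch.eye(L, dtype=torch.bool)   # drop the l' = l terms
    l_cl = margins[off_diag].mean()

    return l_mle + l_cl
```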

Generation
Denote the vocabulary by V. We conduct top-k sampling repeatedly to acquire diverse candidates:

ŷl ∼ pθ(y | ŷ<l, x, hp), with y restricted to V(ks),

where V(ks) is the subset of V that maximizes Σ_{y∈V(ks)} pθ(y | y<l, x, hp), and |V(ks)| = ks.
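
In code, one step of this sampling procedure could look like the following minimal sketch (function name and tensor shapes are assumptions):

```python
import torch

def top_k_sample_step(logits, k_s):
    """Sample the next token from V(ks), the k_s highest-probability
    tokens, after renormalizing their probabilities."""
    probs = torch.softmax(logits, dim=-1)    # p_theta(y | y<l, x, hp)
    top_probs, top_ids = probs.topk(k_s)     # the subset V(ks)
    top_probs = top_probs / top_probs.sum()  # renormalize over V(ks)
    return top_ids[torch.multinomial(top_probs, 1)].item()
```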

We further adopt clustering and ranking to select samples from the generated candidates. We first encode each candidate ŷ with Trial2Vec into a dense embedding hŷ and apply k-means clustering with kq clusters. We then compute the perplexity (ppl) of each output and pick the sample with the minimum ppl in each cluster, forming the final candidate set of kq samples.
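
A minimal sketch of this clustering-and-ranking step. Here embed_fn stands in for the Trial2Vec encoder and ppl_fn for the model's perplexity scorer; both interfaces are assumptions, not the exact APIs:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_candidates(candidates, embed_fn, ppl_fn, k_q=5):
    """Cluster the sampled criteria and keep the lowest-perplexity
    sample from each cluster, yielding k_q final candidates."""
    embeddings = np.stack([embed_fn(c) for c in candidates])
    labels = KMeans(n_clusters=k_q, n_init=10).fit_predict(embeddings)
    picked = []
    for q in range(k_q):
        members = [i for i, lab in enumerate(labels) if lab == q]
        best = min(members, key=lambda i: ppl_fn(candidates[i]))
        picked.append(candidates[best])
    return picked
```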

Experimental Performance:

AutoTrial is highly effective, achieving a precision of 0.91, a recall of 0.92, and an F1 score of 0.91 in clinical-accuracy evaluations, significantly better than the baselines, which scored below 0.5 on these metrics. In human evaluation, AutoTrial also achieved a winning rate of around 60% against GPT-3.5 on trial design tasks. For detailed experimental results, including tables and figures, please refer to the paper.

Implications and Future Directions:

AutoTrial marks a significant stride towards utilizing AI to facilitate clinical trial design, ensuring a more efficient, accurate, and streamlined process. The success of AutoTrial opens up exciting avenues for further research, potentially revolutionizing the landscape of clinical trial design and speeding up the drug development pipeline.

The presentation at EMNLP’23 will delve into the proposed method in detail, showcasing the experimental results and discussing the future trajectory of this initiative. AutoTrial is not just a novel method; it is a promising step towards harnessing the power of AI to address real-world challenges in healthcare, enabling faster and more reliable clinical trial designs. We are actively deploying an AI-based trial design solution called TrialMind, built on AutoTrial. If you are interested, please check out Keiji.ai.
