Photo by CDC on Unsplash

How AI Will Revolutionize the Process of Clinical Trials

Ali Usman
24 min read · Dec 31, 2023

In this article, I build three machine learning models for clinical trials: one that predicts whether a clinical trial will succeed or fail, one that assesses patient mortality to improve trial enrollment by mitigating patient risk, and one that improves trial design through trial similarity searching!

One of the most promising drugs ever tested was Entresto, a treatment for heart failure. Heart failure accounted for roughly 700,000 deaths and was among the leading causes of death in the United States. Entresto had the potential to save hundreds of thousands of lives and was projected to reach peak sales of $5 billion. However, in a Phase III trial with 4,822 patients, Entresto failed to significantly reduce cardiovascular death. The trial ran for five years (2014–2019) and cost $200 million. The outcome of the Entresto trial highlighted major flaws in the clinical trial process.

What if we could predict the outcome of the Entresto clinical trial before wasting $200 million and 5 years of testing? What if we could optimize clinical trials using AI?

Table of Contents:

  1. What are Clinical Trials?
  2. PyTrial — Machine Learning for Clinical Trial Applications
  3. Clinical Trial Outcome Prediction
  4. Patient Outcome Prediction
  5. Trial Similarity Searching
  6. The Future of Clinical Trials is AI

What are Clinical Trials?

Four different phases of clinical trials. Photo from Lustgarten Foundation.

In drug development, clinical trials test a drug candidate or treatment on patients. Clinical trials help to determine a drug’s safety and efficacy through four phases. A new drug must pass through phases I, II, and III before receiving FDA approval. Phase IV trials are conducted after FDA approval to monitor the drug’s safety and effectiveness.

In a nutshell, the four phases of a clinical trial are:

Phase I: Phase I involves around 20 to 100 patients and can take several months. It provides valuable information about a drug’s safety and appropriate dosing. Around 70% of drugs make it to the next phase.

Phase II: Phase II can take from several months to 2 years to test a drug’s efficacy and side effects. Phase II can involve several hundred patients, and only 33% of drugs make it to the next phase.

Phase III: Phase III tests for a drug’s efficacy and monitors any adverse reactions. 300 to 3,000 patients test out the drug candidate over the course of 1 to 4 years, with only 25–33% of drugs moving to the next phase.

Phase IV: Phase IV comes after FDA approval and is used to monitor the drug’s safety and efficacy.

The estimated cost of a pivotal clinical trial comes out to $48 million, with an interquartile range (IQR) of $20 million to $102 million. In a 2014 study by Tufts, researchers estimated the average cost of a Phase III trial to be $225 million. With $225 million you could buy 450 houses or 32 Super Bowl ads!

Why does it matter how much money we spend when we are creating these life-saving drugs?

While creating life-saving drugs is a great use of resources, the hard truth is that only 10% of drugs pass clinical trials, meaning 90% of drugs fail at the clinical trial stage. Massive amounts of money, time, and resources are funneled into clinical trials that produce no significant results. There has to be a better way to optimize clinical trials. To figure out how to change the process, we first need to understand which major problems lead to high costs and unsuccessful results in clinical trials.

Major Problems and Solutions for Clinical Trials

Clinical Trial Outcomes: As mentioned before, 90% of clinical trials fail. This is due to many reasons, such as a lack of drug efficacy, poor drug-like properties, and poor trial design. Imagine how much money and time could be saved if we simulated the results of a clinical trial before it even happened. We would know whether a clinical trial would succeed or fail. With AI, this is no longer a dream. An AI model can look at the target disease, drug candidate, eligibility criteria for patient recruitment, and other trial design considerations to output whether the clinical trial will fail or succeed. By using AI to predict clinical trial outcomes, we could save up to $48 million per pivotal clinical trial by avoiding trials destined to fail!

Patient Outcome Prediction: Clinical trials involve thousands of patients recruited to test the drug candidate. It costs $41,100 just to work with one patient in a clinical trial, and with 30% of patients dropping out, the costs can become astronomical. Using AI to predict a patient’s reaction to treatment would be extremely beneficial for clinical trials. Patient outcome prediction would improve trial enrollment, support personalized treatment plans that reduce the risk of giving harmful treatments to patients, and help with patient selection. For example, if a patient is predicted to have an adverse reaction to a treatment or drug, it is safer not to recruit that patient at all.

Trial Similarity Search: In many cases, a trial’s design and planning phase are crucial to its success. AI can be used to output a similarity score between a current trial and trials that have already been completed. Referring back to prior trials has many benefits and can provide reference points for trial design, such as eligibility criteria, sample size, and controls. Scientists can improve the likelihood of success in a current trial by studying the designs of previous trials. This process is like learning valuable insights from the errors and successes of others to ensure the success of a clinical trial.

PyTrial — Machine Learning for Clinical Trial Applications

PyTrial is a Python library with great documentation on implementing machine learning (ML) in clinical trials. I’m grateful to have used their library and documentation to create my project. Go check out their research paper for more information!

Here is my process for building machine learning models to tackle three major problems in clinical trials: clinical trial outcome prediction, patient outcome prediction, and trial similarity searching!

1. Clinical Trial Outcome Prediction

Clinical Trial Outcome Prediction. Photo from PyTrial Research Paper.

As mentioned before, 90% of clinical trials fail, and by predicting whether a trial will succeed, millions of dollars could be saved. Here’s how I made the ML model that predicts the outcome of clinical trials.

Cell 1: Importing Clinical Trial Outcome Data

#Load Demo Data
from pytrial.data.demo_data import load_trial_outcome_data
from pytrial.data.trial_data import TrialOutcomeDatasetBase

From the PyTrial library we import load_trial_outcome_data, which loads all the trial outcome prediction (TOP) data. The TOP data covers the three distinct phases of clinical trials and contains vast amounts of information on different trials, treatments/drug candidates, conditions/diseases, and features such as the target pathways of each drug.

TOP Datasets for each phase. Photo from Pytrial research paper.

Next, the PyTrial library imports TrialOutcomeDatasetBase, which uses Pandas, a data analysis library, to handle all the data. TrialOutcomeDatasetBase prepares the input data for our machine learning model, trimming the raw data down to the parts the ML model actually needs.

Cell 2: Splitting the Data

df_train = load_trial_outcome_data(phase='I', split='train')['data']
df_val = load_trial_outcome_data(phase='I', split='valid')['data']
df_test = load_trial_outcome_data(phase='I', split='test')['data']

train_data = TrialOutcomeDatasetBase(df_train)
valid_data = TrialOutcomeDatasetBase(df_val)
test_data = TrialOutcomeDatasetBase(df_test)

The first line calls load_trial_outcome_data to collect all the data for Phase I and stores the training split in the df_train variable. A similar process is used to load the data for validation and testing: df_val stores the validation data and df_test holds the testing data.

Next, a new variable train_data is created to store the input data for the ML model. df_train contains all the Phase I training data, and TrialOutcomeDatasetBase wraps it into the format the model expects. The same process is used to create the valid_data and test_data variables, which hold the validation and testing data for the ML model.

So what does the data used to make the model contain?

Input Data used for Phase I of Clinical Trial Outcome Prediction.

Here are all the different features used in the input data for training, testing, and validation in the ML model:

NCT ID: An NCT ID is the identification number of a registered clinical trial and can be used to look up basic information about the trial on ClinicalTrials.gov. The NCT ID is used later in the model’s output to identify the specific trial being predicted.

Status: The status of each clinical trial is given; each trial is labelled as completed, active and not recruiting, withdrawn, or terminated.

Why_stop: This section of the data provides a brief description (if applicable) of why a trial was either withdrawn or terminated.

Label: A label of 1 is given to active and completed trials to indicate the trial’s success or continuation. A label of 0 is given to terminated and withdrawn clinical trials that have failed.

Phase: The phase of each clinical trial is given. In this case, all the phases are I, as the model will be predicting outcomes for Phase I trials.

Diseases: The diseases each clinical trial focuses on are given. The disease is relevant because success rates can be correlated with the specific disease the clinical trial is trying to treat.

ICD Codes: ICD codes or International Classification of Disease codes are alphanumeric codes that give us information on diseases, symptoms, and severity of a disease or injury. An example of an ICD code is “E10.9” which stands for type 1 diabetes.

Drugs: The name of the drug candidate can help to provide more context into the treatment the clinical trial is testing which can help predict the outcome of a clinical trial.

Smiles: Simplified Molecular Input Line Entry System (SMILES) is a notation system that represents a chemical structure in text. For example, a simple “C” represents CH4 or methane. SMILES represent the chemical composition of a drug candidate and can provide us information on a drug’s efficacy, structure, and properties of the compound. All this information about a drug candidate can help to predict the outcome of a clinical trial.

Criteria: If applicable, information on the eligibility criteria of a clinical trial is provided. The criteria can provide a more holistic view of whether a clinical trial achieved its intended results.

Title: The title of the clinical trial is provided, which can help verify the model’s output of whether the clinical trial failed or succeeded. The title can also provide information on what the clinical trial is testing.

Study_first_submitted_date: This field tells us when the study was first submitted to ClinicalTrials.gov. Overall, all these pieces of data provide crucial information on clinical trials and help determine whether a clinical trial succeeded or not, because the definition of success is different for each clinical trial. A purely hypothetical example row with these fields is sketched below.
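To make the structure concrete, here is a small, entirely made-up example of what one row of this kind of data could look like as a Pandas DataFrame. Every value below is invented for illustration; the real TOP dataset’s column names and formats may differ slightly.

import pandas as pd

# Hypothetical single-row example of the trial outcome (TOP) data described above.
# All values are invented; the real dataset's columns/formats may differ.
example_row = pd.DataFrame([{
    "nctid": "NCT00000000",                       # registered trial identifier (made up)
    "status": "completed",                        # completed / active / withdrawn / terminated
    "why_stop": "",                               # reason for termination, if any
    "label": 1,                                   # 1 = success/continuation, 0 = failure
    "phase": "phase 1",
    "diseases": "['type 1 diabetes']",
    "icdcodes": "['E10.9']",
    "drugs": "['example-drug']",
    "smiless": "['C']",                           # SMILES string; 'C' is methane
    "criteria": "Adults aged 18-65 ...",          # eligibility criteria text
    "title": "A Phase 1 Study of Example-Drug",
    "study_first_submitted_date": "2020-01-01",
}])
print(example_row.T)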

Cell 3: Preparing the XGBoost Model

Now that all the data is loaded, the model must be built and then trained.

from pytrial.tasks.trial_outcome import XGBoost
model = XGBoost()

The PyTrial library imports the XGBoost (eXtreme Gradient Boosting) model for training. XGBoost is a tree ensemble method that uses gradient boosting. Now what does this actually mean?

The XGBoost model combines multiple individual models, typically decision trees, as base learners and uses gradient boosting to build a more accurate model.

Structure of decision trees. Photo from Nvidia.

A decision tree is a hierarchical structure composed of nodes that represent the input and branches that denote the outcomes of different decisions. Each node in a decision tree analyzes a particular subsection of the data, with each branch representing possible answers and outcomes. The XGBoost model uses this sequential decision making to produce an outcome. Next, XGBoost uses gradient boosting, which iteratively corrects the errors made by previous decision trees to create a more effective model. A loss function is calculated for each decision tree so the model can learn from previous mistakes and refine itself with each iteration. XGBoost also uses ensemble learning, which combines the predictions made by all the individual decision trees to provide a more accurate output. An XGBoost model was chosen because all these features improve accuracy in clinical trial outcome prediction.

model = XGBoost() this line instantiates an XGBoost model and assigns it to the variable model.
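To make the gradient-boosting idea above concrete, here is a minimal, self-contained sketch using the plain xgboost library on synthetic data. This is not PyTrial’s wrapper, and the features, labels, and numbers are all made up; it only illustrates the general technique of boosting an ensemble of shallow trees and reading out success probabilities.

import numpy as np
from xgboost import XGBClassifier  # plain xgboost library, not PyTrial's wrapper

# Toy binary-classification data standing in for trial features and outcomes;
# everything here is synthetic and only illustrates the general technique.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # 200 "trials", 5 numeric features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # 1 = "success", 0 = "failure"

# An ensemble of shallow decision trees fit with gradient boosting
clf = XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1, eval_metric="auc")
clf.fit(X[:150], y[:150], eval_set=[(X[150:], y[150:])], verbose=False)

# Predicted success probabilities for held-out "trials"
print(clf.predict_proba(X[150:])[:5, 1])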

Cell 4: Training the Model


model.fit(train_data, valid_data)

#Example of validation AUC scores: the first boosting rounds out of 100 are shown
[0] validation_0-auc:0.61418
[1] validation_0-auc:0.64493
[2] validation_0-auc:0.63045
[3] validation_0-auc:0.61119
[4] validation_0-auc:0.59806
[5] validation_0-auc:0.58075
[6] validation_0-auc:0.58821
[7] validation_0-auc:0.58269
[8] validation_0-auc:0.56955
[9] validation_0-auc:0.58418
[10] validation_0-auc:0.57433

Now, we use the model.fit function to train the model on the training data and validate its predictions. During training, the model analyzes the training data and learns from it to make predictions. The model.fit function optimizes the model’s parameters during training, which leads to higher accuracy: the model goes through the entire dataset, makes predictions, calculates losses, and updates its parameters. Lastly, the model validates its predictions on valid_data and reports a validation AUC score after each round.

Cell 5: Predicting Clinical Trial Outcomes

Once the model is trained, we can test it on clinical trials and see whether it is accurate.

prediction = model.predict(test_data)
print(prediction[:50])

We create a new variable called prediction that holds the model’s output on the testing data. We use the model.predict function to obtain predictions from the ML model: model.predict takes the input data we want predictions for, applies the trained XGBoost model, and returns the predictions.

print(prediction[:50]) this line prints out the first 50 clinical trials and their predicted values, which indicate success or failure.

50 Clinical Trial Outcome Predictions

Each prediction has an NCT ID that identifies the clinical trial. Next, each clinical trial has a predicted value between 0 and 1: values close to 0 indicate a predicted failure, and values close to 1 indicate a predicted success.

Success Rate of Two Clinical Trials

In particular, I wanted to focus on clinical trials NCT02000700 and NCT02050815. The model predicted clinical trial NCT02000700 had close to a 100% success rate and so would be successful. In contrast, NCT02050815 had the lowest success rate at only 9% meaning the trial had a 91% chance of failing. We can also verify our findings on ClinicalTrials.gov and check if the model was right or not in its prediction.

Completion of Clinical Trial NCT02000700. Photo from Clinical Trials.gov.

The model predicted that clinical trial NCT02000700 had a nearly 100% success rate, and the trial was indeed completed, proving the model correct!

Termination of Clinical Trial NCT02050815. Photo from ClinicalTrials.gov.

The model only predicted a 9% success rate for NCT02050815 meaning there was a 91% chance of failure. The model accurately predicted this low chance of success as the clinical trial was terminated!
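As a small illustration of how such success probabilities could be turned into simple go/no-go labels, here is a hypothetical post-processing step. It is not part of PyTrial’s API; the (NCT ID, probability) pairs below just restate the two trials discussed above, and a real pipeline would read them from the prediction output.

# Hypothetical post-processing: convert predicted success probabilities into labels.
# The pairs below are written out by hand to mirror the two trials discussed above.
example_preds = [("NCT02000700", 0.99), ("NCT02050815", 0.09)]
threshold = 0.5
for nct_id, prob in example_preds:
    label = "likely success" if prob >= threshold else "likely failure"
    print(f"{nct_id}: {prob:.2f} -> {label}")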

2. Patient Outcome Prediction

Patient Outcome Prediction. Photo from PyTrial Research Paper.

In clinical trials, 30% of patients drop out, creating a massive financial burden. By using AI and ML to assess each patient’s mortality risk, clinical trials can save time and money and select the right patients to conduct a successful trial. Here’s how ML can be used to predict patient outcomes!

Cell 1: Inputting Patient Outcome Data

from pytrial.tasks.indiv_outcome.data import SequencePatient

We use the PyTrial library to import SequencePatient, the dataset class that wraps patient-level sequence data. The demo data used here comes from seven different oncology clinical trials with thousands of patients with varying conditions.

Dataset Imported for Patient Outcome Prediction. Photo from PyTrial Research Paper.

A new variable, seqdata, is created to store the training data, built by wrapping the loaded patient data in SequencePatient.

# Build the train data
# `data` is assumed to have been loaded earlier (the demo patient-sequence data);
# that loading cell is not shown here. The last 200 patients are held out below.
seqdata = SequencePatient(
    data={'v': data['visit'][:-200], 'y': data['y'][:-200], 'x': data['feature'][:-200]},
    metadata={
        'visit': {'mode': 'dense', 'order': data['order']},
        'label': {'mode': 'tensor'},
        'voc': data['voc'],
        'max_visit': 20,
    }
)

In Cell 1, seqdata, the training data, contains information on patients’ visits (visit), demographics (feature), the order of events within visits (order), patient-level outcomes (y), and the vocabulary of medical events (voc).

visit: visit represents the visits each patient had. For example, a visit could be a medical event or check-up a patient undergoes during the clinical trial.

feature: feature contains information about the patients in a tabular form. This could include a patient's demographic data, medical history through electronic records, and gene expression.

order: order is a variable representing the order of events in a patient’s visit. For instance, a visit could be a diagnosis, medication, treatments or checkups. order looks at the sequence of events in a visit.

y: y is the target label, the output the model is trying to predict. For example, in our patient-level outcome prediction model, y contains the outcome for each patient, such as the likelihood of developing a certain condition or experiencing a specific event during the clinical trial.

voc: voc stands for vocabulary and specifies events. It is a term used in ML for the mapping between index numbers and the exact names of events, so encoded event data (numerical or categorical values) can be interpreted. For example, in clinical trials, the voc variable might contain the specific names of medical procedures, diagnoses, reports on a patient’s well-being, or reactions to treatments. This allows for interpretation of the data and predictive analysis.

All these variables let us look deeper into a patient’s history and response to treatment to help with patient outcome prediction. Interestingly, all this data is considered sequential because it is arranged in sequences where the order of data points matters. In sequential data, each data point depends on other data points in the sequence; our data is sequential because the number and order of visits affect other data points. A purely illustrative sketch of this data structure follows.
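The following is a toy, entirely made-up dictionary showing the kind of structure the cells in this section slice from (data['visit'], data['y'], and so on). It is only meant to make the field names concrete; the real demo data’s exact nesting and types may differ.

# Toy, hypothetical stand-in for the `data` dict used in this section.
# All values are invented; the real demo data's exact nesting/types may differ.
toy_data = {
    'order': ['diagnosis', 'medication'],    # event types recorded at each visit
    'visit': [                               # one entry per patient
        [[[0], [1]], [[2], [0]]],            # patient 0: two visits, each [diagnosis ids, medication ids]
        [[[3], [1]]],                        # patient 1: one visit
    ],
    'y': [0, 1],                             # patient-level outcome labels
    'feature': [[54, 1, 0], [61, 0, 1]],     # tabular features, e.g. age and demographics
    'voc': {                                 # index -> event-name vocabularies
        'diagnosis': ['anemia', 'fever', 'rash', 'fatigue'],
        'medication': ['drug A', 'drug B'],
    },
}
# Slicing like data['y'][:-200] in the cells above splits patients into train/test.
print(toy_data['voc']['diagnosis'][toy_data['visit'][0][0][0][0]])  # 'anemia'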

#Build the test data
val_seqdata = SequencePatient(
    data={'v': data['visit'][-200:], 'y': data['y'][-200:], 'x': data['feature'][-200:]},
    metadata={
        'visit': {'mode': 'dense', 'order': data['order']},
        'label': {'mode': 'tensor'},
        'voc': data['voc'],
        'max_visit': 20,
    }
)

val_seqdata is the testing data for the ML model, created the same way seqdata was created. First, the loaded patient data is wrapped with SequencePatient and reformatted into testing data for the ML model. The new testing data val_seqdata holds information on patients’ visits (visit), demographics such as race or gender (feature), the order and sequence of events within visits (order), patient-level outcomes (y), and the event vocabulary (voc) to support predictive analysis.

Cell 2: Importing the RNN Model

from pytrial.data.patient_data import SeqPatientCollator # function to process the input SequencePatient dataset
from pytrial.tasks.indiv_outcome.sequence import RNN

Given the large amount of input data, we need a function to process it all for the model, so we use the PyTrial library to import SeqPatientCollator. Next, we import the RNN (recurrent neural network) model as the model type. An RNN is well suited to handling sequential data, which is why it was chosen for the patient outcome prediction model: RNNs can remember and use information from previous time steps.

RNN’s vs. Feed-Forward Neural Networks. Photo from GeeksforGeeks

RNNs differ from traditional feed-forward networks in how they handle data. Feed-forward neural networks process input data in one pass, while RNNs are recurrent and can loop over a sequence.

Layers in an RNN model with cyclic connections. Photo from ResearchGate.

RNNs also have cyclic connections, meaning the model can retain and propagate information across time steps, which is useful for sequential data. Now that we know what RNNs are, let’s see how one is trained for patient outcome prediction!
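As a rough illustration of the idea (generic PyTorch, not PyTrial’s RNN class), here is a minimal bidirectional recurrent model that reads a padded sequence of visits per patient and outputs one probability, mirroring the mode='binary' setting used below. All dimensions and data here are made up.

import torch
import torch.nn as nn

# Minimal sketch: a bidirectional GRU reads a patient's visit sequence and a
# linear head outputs a single probability (e.g., mortality risk).
class TinyPatientRNN(nn.Module):
    def __init__(self, input_dim=8, hidden_dim=16):
        super().__init__()
        self.rnn = nn.GRU(input_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)  # 2x hidden size for both directions

    def forward(self, x):                   # x: (batch, n_visits, input_dim)
        _, h = self.rnn(x)                  # h: (2, batch, hidden_dim), one state per direction
        h = torch.cat([h[0], h[1]], dim=-1)
        return torch.sigmoid(self.head(h))  # one probability per patient

model_sketch = TinyPatientRNN()
visits = torch.randn(4, 20, 8)              # 4 patients, up to 20 visits, 8 features per visit
print(model_sketch(visits).shape)           # torch.Size([4, 1])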

Cell 3: Setting the Model's Parameters

model = RNN(
    vocab_size=[len(data['voc'][o]) for o in data['order']], # get the vocab size for each type of event to build the event embedding layer
    orders=data['order'], # similarly, we need an order
    mode='binary',
    max_visit=20,
    bidirectional=True,
    epochs=20,
    batch_size=16,
    device='cpu',
)
model.fit(train_data=seqdata, valid_data=val_seqdata)

We first specify that the model is an RNN with model = RNN(...), then set the model’s parameters. The voc vocabularies describe the encoded event data, which can specify medical procedures, diagnoses, adverse reactions to drugs, and visits for the patients in a clinical trial. The model also uses other variables in the data, such as gene expression, electronic health records (EHRs), and patient demographics.

vocab_size=[len(data['voc'][o]) for o in data['order']] this line gathers the vocabulary size for each type of event in a clinical trial. Events are specific occurrences or instances in the data, and each event type gets its own embedding layer in the model.

orders=data['order'] this line creates another embedding layer for order, which is the sequence of events in a visit.

mode='binary' the RNN model is set to binary classification because the model’s output will be a value between 0 and 1, the predicted patient mortality rate.

max_visit=20 this line specifies that the model will look at the sequential data with a maximum of 20 visits.

bidirectional=True this means that the model can process input sequences in both forward and backward directions, allowing the RNN model to capture information from past and future steps simultaneously.

epochs=20 specifies the number of epochs to train for. An epoch is one pass of the model over the entire training dataset, after which the algorithm is updated to further minimize the loss function.

batch_size=16 the batch size is the number of training examples used in one iteration of the training process. It is a hyperparameter that defines how many samples are propagated through the neural network at a time.

device='cpu' this specifies that the model will be trained on a central processing unit (CPU).

model.fit(train_data=seqdata, valid_data=val_seqdata) trains the model, with model.fit initiating training. This line specifies that train_data is seqdata, the sequential data built for training, and likewise valid_data=val_seqdata is used for validation. The RNN trains iteratively and keeps the checkpoint with the highest validation score across the epochs.

RNN model training through 20 epochs.

Cell 4: Predicting Patient Outcomes

predict = model.predict(seqdata)

print(predict[:20])

#Output:
1.[[0.03759647]
2.[0.3407742 ]
3.[0.08466253]
4.[0.04212925]
5.[0.07062326]
6.[0.61914057]
7.[0.11943546]
8.[0.28911486]
9.[0.03431963]
10.[0.1817149 ]
11.[0.10272896]
12.[0.4975288 ]
13.[0.23787639]
14.[0.53194153]
15.[0.5683119 ]
16.[0.42574254]
17.[0.05658188]
18.[0.10878944]
19.[0.01620503]
20.[0.54372704]]

predict = model.predict(seqdata) this line saves all the predictions from seqdata (training data) into the predict variable. print(predict[:20]) prints out 20 predictions for patient mortality rates.

So what does the output tell us and why is patient outcome predictions important?

The output is a probability between 0 and 1 that tells us how a person is likely to react to the drug candidate or treatment given in the clinical trial. For example, Patient 1 has a mortality rate of 0.03 or 3%, meaning the patient would be an ideal candidate for the clinical trial and would be safe to test the drug candidate on. On the other hand, Patient 6 has a mortality rate of 0.62 or 62%, indicating that the patient would most likely have an adverse reaction to the drug candidate or treatment and so shouldn’t be enrolled in the clinical trial. It costs $41,100 to recruit a patient to a clinical trial, and 30% of patients drop out; by implementing AI, clinical trials could screen patients more effectively and save massive amounts of money. As demonstrated, an AI algorithm predicting mortality rates also helps ensure the safety of patients in clinical trials.
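As a small illustration of this screening logic (a hypothetical step, not part of PyTrial), the predicted mortality risks above could be filtered against a chosen threshold:

# Hypothetical screening step: keep candidates whose predicted mortality risk
# is below a chosen threshold. `predict` is assumed to be the (n, 1) array of
# probabilities produced above.
risk_threshold = 0.5
eligible = [i for i, p in enumerate(predict[:20], start=1) if p[0] < risk_threshold]
print(f"Patients suggested for enrollment: {eligible}")
# With the outputs shown above, patients 6, 14, 15, and 20 would be excluded.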

3. Trial Similarity Searching

Trial Similarity Search. Photo from PyTrial Research Paper

Over 90% of clinical trials fail, and these failures can still teach other scientists about trial design. We can learn from past clinical trial successes and failures to improve planning so that successes happen much more often. Using AI to provide a similarity score can help scientists plan their own trial by looking at other relevant trials. Similarity scores can provide reference points on trial design, patient eligibility for recruitment, outcome measures, sample size, and endpoints. A similarity score can also act as an indicator of whether a clinical trial is on the path to success or whether major changes in the planning need to be made. So here’s how I made an ML model that outputs similarity scores!

Cell 1: Downloading Clinical Trial Data

from pytrial.data.demo_data import load_trial_document_data
data = load_trial_document_data()
df = data['x'].sample(10000)

We first use the PyTrial library to import load_trial_document_data, which loads data on 447,709 clinical trials. Of these, 311,485 were used for self-supervised training. Self-supervised learning uses data without explicit labels or instructions; instead, the model learns from the inherent structure and patterns within the input data to draw its conclusions.

data = load_trial_document_data() this line downloads the 447,709 clinical trials from ClinicalTrials.gov, a dataset of more than 1.4 GB. df = data['x'].sample(10000) creates the variable df, which samples 10,000 of the 447,709 clinical trials.

Cell 2: Doc2Vec Model

from pytrial.tasks.trial_search import Doc2Vec

Now that we have the input data, we need to create the model. We use the PyTrial library to import Doc2Vec, a model that finds similarities between documents and outputs a numerical similarity score. This computed similarity score can be applied to clinical trial documents for trial similarity searching.

How does the Doc2Vec Model work and how can it be used for trial similarity searching?

Doc2Vec is an unsupervised machine learning algorithm that converts documents to vectors. In a Doc2Vec model, a vector is a mathematical representation of a document: not a single number, but a set of numbers that captures the meaning and context of the document. Doc2Vec was first presented in this article by Mikolov et al. Doc2Vec builds heavily on the word2vec model, so understanding word2vec is crucial for context.

Word2Vec

The word2vec model produces vectors for words. These vectors preserve the relationships between words.

A vector space with relationships between words. Photo from Medium article.

The word2vec representation uses two main algorithms: Skip-Gram and Continuous Bag-of-Words (CBOW).

Comparison of CBOW and SkipGram. Photo from Kavita Ganesan.

The skip-gram model tries to predict a word given the surrounding context. The following figure shows that when the target word “sat” is given to the model, it tries to predict the surrounding words with context and outputs “The cat sat on the mat”.

Visual Representation of Skip-Gram

A CBOW model does the opposite of the skip-gram model: it tries to predict a target word from its surrounding context. In the following figure, the context words “the cat sat” are sent to the model, and it tries to predict the next word based on that context. The model predicts the next word to be “on”.

Visual Representation of Continuous Bag-of-Words (CBOW)

Doc2Vec

The Doc2Vec model is an extension of word2vec, so Doc2Vec employs analogues of the skip-gram and CBOW models. Doc2Vec creates vectors that represent documents, which allows similarity between documents to be measured numerically. Doc2Vec is available through Gensim (short for “Generate Similar”), a widely used Python library for natural language processing (NLP). Gensim helps with creating document vectors and calculating similarities between them, and it allows adjusting the dimension of the vector representation, faster training, and easy tuning of parameters, making Doc2Vec straightforward to work with.

Doc2Vec was used for trial similarity searching because clinical trial documents can be represented as vectors, allowing similarity analysis between trial documents. Doc2Vec gives each clinical trial document a numerical vector that can then be compared with other trials’ vectors to find similar trials.
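As a tiny, generic illustration of this idea (using Gensim directly rather than PyTrial’s wrapper), here a few invented trial titles are embedded and compared; the titles, tags, and parameters below are all made up.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny Gensim Doc2Vec sketch on made-up trial titles; real training would use
# the full clinical trial documents loaded above.
docs = [
    TaggedDocument(words="hyaluronic acid pulpotomy primary molars".split(), tags=["trial_A"]),
    TaggedDocument(words="pulpotomy agents in primary teeth".split(), tags=["trial_B"]),
    TaggedDocument(words="statin therapy for heart failure".split(), tags=["trial_C"]),
]
d2v = Doc2Vec(docs, vector_size=32, min_count=1, epochs=50)

# Each document is now a 32-dimensional vector; similar trials get similar vectors.
print(d2v.dv.most_similar("trial_A", topn=2))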

Cell 3: Preparing the Doc2Vec Model

model = Doc2Vec(
    emb_dim=128,
    epochs=50,
    num_workers=8,
)

model.fit(
    {'x': df,
     'fields': ['title', 'disease', 'intervention_name'],
     'tag': 'nct_id'}
)

We first specify the model as Doc2Vec. Next, emb_dim=128 is the embedding dimension, the size of the vector space in which documents are embedded. The value of 128 means each clinical trial document is represented by a numerical vector with 128 real-valued elements. This lets the model capture and process the contextual meaning of the clinical trial documents in a 128-dimensional space, leading to more effective document representations and a better model.

epochs=50 specifies that the model will train for 50 epochs.

num_workers=8 specifies the number of worker processes used by the model. A worker process is responsible for loading the data and updating the model’s weights during training. This line specifies 8 worker processes, which can improve the efficiency of training.

model.fit is used to train the model on df, the 10,000 clinical trial documents. The model also uses the fields list, which points to the title of each clinical trial (title), the disease or condition the trial focuses on (disease), and the intervention the trial is testing (intervention_name).
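To show what comparing two such document vectors can look like, here is a small sketch of cosine similarity, a common way to score how close two embedding vectors are. The vectors below are random stand-ins, and PyTrial’s exact similarity measure may differ.

import numpy as np

# Cosine similarity between two hypothetical 128-dimensional document vectors.
# Random stand-ins; real vectors would come from the trained Doc2Vec model.
rng = np.random.default_rng(0)
v1, v2 = rng.normal(size=128), rng.normal(size=128)
cosine = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(f"similarity score: {cosine:.2f}")  # ranges from -1 (opposite) to 1 (identical)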

Cell 4: Printing the Results

preds = model.predict(
    {'x': df.iloc[:20],  # check the top-k similar trials for 20 trials
     'fields': ['title']},
    top_k=5)

We create the preds variable, which stores all the predictions from model.predict. 'x': df.iloc[:20] specifies that predictions will be made for the first 20 trials. 'fields': ['title'] specifies that the titles of the 20 clinical trials will be used. top_k=5 specifies that the top 5 most similar trials will be returned. So the output will consist of 20 clinical trials with their study titles and, for each of the 20 clinical trials, the top 5 most similar trials.

for i, pred in enumerate(preds):
    test_nct = df.iloc[i]['nct_id']
    test_title = df.iloc[i]['title']
    print(f'test nct_id: {test_nct}, title: {test_title}, rank_list {pred} \n')

for i, pred in enumerate(preds): creates a loop over all the predictions (preds). enumerate is a function that tracks the index of each prediction, which is assigned to i.

test_nct = df.iloc[i]['nct_id']: This line extracts the NCT ID ('nct_id') at index i (the current iteration of the loop) and assigns it to the variable test_nct.

test_title = df.iloc[i]['title']: The title of each clinical trial is extracted and assigned to the variable test_title.

print(f'test nct_id: {test_nct}, title: {test_title}, rank_list {pred} \n'): prints the NCT ID, the title of the study, and the numerical trial similarity values for the current iteration of the loop.

So what are the results and what can they tell us?

Trial Similarity Search Output

The output shows clinical trials with their IDs and study titles. For each clinical trial, the top 5 most similar trials are listed as well. For example, the model shows that clinical trial NCT04115358, the study on “Evaluation of Hyaluronic Acid Pulpotomies in Primary Molars,” has a 0.76 or 76% trial similarity with clinical trial NCT00702377.

Trial similarity search can help scientists learn from relevant trials to look at endpoints, patient recruiting, trial sites, previous treatments, and much more information. Trial similarity searching can prevent failures in clinical trials by allowing scientists to look at references for their own trial.

The Future of Clinical Trials is AI

Clinical trials are widely inefficient, with a success rate of just 10%. A pivotal clinical trial costs around $48 million, and a Phase III trial can cost over $200 million. It is clear that most clinical trials waste massive amounts of money, time, and resources. However, by implementing machine learning models, this doesn’t have to be the case.

  1. As shown with the case of Entresto, millions of dollars could be saved if the outcomes of clinical trials could be predicted before they even happened. Scientists would know from the start whether a clinical trial was likely to fail or succeed, giving them the chance to plan accordingly. For example, if a trial was predicted to fail, scientists would have the foresight to adjust the trial’s design before spending millions of dollars. The machine learning model shown in this article demonstrates that ML models for clinical trial outcome prediction can accurately predict outcomes and thus help save millions in clinical trials.
  2. It costs a whopping $41,100 to work with just one patient in a clinical trial. Costs become astronomical as hundreds to thousands of patients are needed, and with a 30% dropout rate the cost of replacing patients skyrockets further. The ML model shown for predicting patient outcomes and mortality rates can help assess a patient’s suitability for a clinical trial. It looks at a patient’s health records, gene expression, reaction to treatment, demographics, and many other factors to provide a holistic view of whether the patient should receive the treatment. Predicting mortality rates also supports patient safety, personalized treatment plans, and patient selection: if a patient has a high predicted mortality rate, that patient shouldn’t be selected for the clinical trial. For clinical trials, patient outcome prediction can also improve enrollment, avoid administering harmful treatments to patients, and save money on selection by reducing patient dropouts.
  3. Planning and trial design can be crucial to whether a clinical trial succeeds. Over 90% of clinical trials fail, and by looking at these failures, new clinical trials can learn and adapt to improve their own designs. The ML model using Doc2Vec outputs a similarity score for trials and provides reference trials for scientists. A reference trial can provide information on sample size, controls, and patient eligibility; it’s like learning from other people’s work to succeed in your own clinical trial. AI can also provide input on trial design, sample size, treatments, endpoints, safety measures, trial sites, and outcome measures. A similarity score can also indicate whether a clinical trial is likely to succeed or whether major changes in the planning need to be made to avoid failure.

Implementing AI with clinical trials can go beyond clinical trial outcome prediction, patient outcome prediction, and trial similarity searching. AI has so many use cases for clinical trials and AI could also be used for patient matching, trial site selection, and patient data generation. Utilizing AI can revolutionize the process of clinical trials and streamline the approach to save massive amounts of time, money, and resources because AI is the future of clinical trials.

Thank you so much for reading my article! I hope you learnt more about how AI will revolutionize clinical trials! If you want you can reach out to me here:

Email: ali.muhammad.usman08@gmail.com

LinkedIn: https://www.linkedin.com/in/ali-usman-40091925b/

Substack: https://substack.com/profile/173343618-ali-usman
