Making Data Meaningful at TrialSpark

Chris Ryan
6 min read · Dec 7, 2018


TrialSpark’s mission is to bring new treatments to patients faster and more efficiently, and that presents a unique set of challenges that differentiates us from other technology companies. One such challenge concerns the size of the data we’re analyzing.

While there’s a lot of buzz around big data in healthcare, many of the critical, complex challenges inherent in improving clinical trials are distinctively small data problems. Two prime examples of this are patient recruitment and the variability of protocol inclusion/exclusion criteria. Let’s take a closer look at each case as well as our larger approach to small data at TrialSpark.

Patient recruitment challenges lead to smaller data sets

Patient recruitment is one of the most important aspects of a successful clinical trial. If you can’t recruit enough patients, you can’t run a trial with enough statistical power to be conclusive. It’s as simple as that.

Though a typical Phase 2 trial may cost upwards of $20 million, it may only require 40–70 patients. Seems doable, right? As a company that recruits patients via digital advertising, electronic health records (EHRs) from in-network doctors and other methods, we know that this is no small task.

It takes several months and a considerable amount of effort to enroll the number of patients required for a study. Some studies never even achieve their target enrollment and must be canceled. Though our outreach platform touches millions of people across multiple digital channels, each patient candidate must still pass through a complex recruitment funnel tailored to that particular study before they can be accepted into a trial and assigned to a treatment group.

Our recruitment funnel consists of landing page steps, questionnaires, phone screens and in-person screens. It can take weeks to complete, and there’s considerable drop-off at each stage for a variety of reasons — some within our control and some not.
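To make the compounding drop-off concrete, here is a toy sketch with entirely hypothetical stage conversion rates. The point is only that the overall conversion rate is the product of the per-stage rates, which is why an audience of millions can shrink to a few thousand candidates:

```python
# Toy illustration: every stage rate below is a made-up assumption,
# not a real TrialSpark conversion figure.
stages = {
    "landing page": 0.05,      # assumed fraction who respond to outreach
    "questionnaire": 0.40,     # assumed fraction who complete it
    "phone screen": 0.50,      # assumed fraction who pass
    "in-person screen": 0.30,  # assumed fraction who qualify
}

reached = 1_000_000  # hypothetical patients touched by outreach
for stage, rate in stages.items():
    reached = int(reached * rate)
    print(f"{stage:>18}: {reached:,} remaining")

# Overall conversion is the product of the stage rates: here
# 0.05 * 0.40 * 0.50 * 0.30 = 0.3%, or ~3,000 candidates, and that is
# before any protocol inclusion/exclusion criteria are applied.
```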

A great deal of human effort is also involved at each stage of the funnel, from call centers to appointment schedulers to project managers and beyond. Even leading patients to the top of the funnel is a costly endeavor once digital advertising and partnership spend are factored in. That’s why we believe optimizing the patient recruitment funnel is a key component of overall cost reduction.

While there’s a lot we can do to enable that optimization, some difficulties in patient recruitment are out of our control.

  • There needs to be near-perfect alignment between a patient’s need for a clinical trial and our initiation of a trial for that individual’s specific condition.
  • A patient’s medical history may not be available to us, causing inefficiencies in the recruitment flow.
  • Our onboarding window is ephemeral — enrollment periods typically last just a few months.
  • Many studies have onerous onboarding requirements and potentially burdensome protocols, such as requiring the patient to stay overnight at a trial site, travel long distances on a weekly basis or stop their current medication.

The patient has to really want to enroll in a study in order to look past these inconveniences. Now imagine doing this dozens of times each year, for each study we’re running. As we continue to segment our patients in this way, our resulting sample size decreases rapidly.

Protocol eligibility criteria exclude much of the population

Over time, we aim to make clinical trials cheaper and faster — but the analytical challenges resulting from small data sets are likely to remain the same. Our studies range widely in terms of health issue (also known as indication), operational complexity, potential patient population, geographic location and many other dimensions. This means that generalizations and aggregations across data sets must be handled with care.

For example, every protocol we receive has a distinct set of criteria that a patient must satisfy to be included in the trial, known as the inclusion/exclusion criteria. These criteria are extremely specific and can dramatically reduce our pool of potential patients. With limited data, it’s often difficult to identify from the outset which patients satisfy all the criteria; oftentimes, patients make it to the in-person screening stage before being removed from the recruitment funnel.

Because this study-to-study heterogeneity is inherent to running conclusive trials, we can’t utilize the generalized “big data” technologies that have transformed the way many people buy products, consume entertainment and share information. To make important decisions informed by our limited data sets, we instead need to perform thoughtful, bespoke modeling work using both modern approaches in machine learning and statistical best practices that are often decades old.

Small data sets require thoughtful analysis

To work with the small data sets generated by complex recruitment funnels and idiosyncratic protocol inclusion/exclusion criteria, we need to get creative with our modeling. The basic methodology for designing clinical trials was set in place nearly a century ago by Ronald Fisher, one of the founding fathers of modern statistics. Though his principles of randomization, replication and other design aspects were first applied to agricultural experiments, by the 1930s they were widely used in medical treatment studies.

Since then, advances in statistical methods, computational power and our understanding of human biology have led to a number of exciting applications of machine learning and statistical inference in the clinical trial space. Methods for understanding and quantifying uncertainty are particularly important for our work; we employ them when making data-driven decisions and building new product features.

Relying on frequentist techniques alone can be insufficient with small data sets, where models are limited in their descriptiveness and signal-to-noise ratios are low. We therefore use Bayesian methods to develop more expressive, hierarchical models, and to reduce uncertainty in our predictions by incorporating prior knowledge.
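As a minimal sketch of that contrast, compare a maximum-likelihood estimate on ten observations with a posterior that blends those observations with prior knowledge. Both the data and the Beta(4, 16) prior here are hypothetical, chosen purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical early-funnel data: 3 of 10 patients passed a phone screen.
successes, trials = 3, 10

# Frequentist point estimate (MLE) with a normal-approximation interval;
# with n = 10 the interval is wide and the approximation itself is shaky.
p_hat = successes / trials
se = np.sqrt(p_hat * (1 - p_hat) / trials)
print(f"MLE: {p_hat:.2f} +/- {1.96 * se:.2f}")

# Bayesian alternative: an assumed Beta(4, 16) prior (encoding an expert
# belief that the rate sits near 20%) conjugately updates with the data
# to a Beta(4 + 3, 16 + 7) posterior, giving a full distribution.
posterior = stats.beta(4 + successes, 16 + (trials - successes))
lo, hi = posterior.interval(0.95)
print(f"posterior mean: {posterior.mean():.2f}, "
      f"95% interval: ({lo:.2f}, {hi:.2f})")
```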

For example, in the patient recruitment funnels described above, we use ‘subjective’ priors (informed by domain experts) at each step and update them as data comes in. The output of one step then becomes the input to the next, in a hierarchical fashion. This gives us distributions instead of point estimates for our conversion rates. As more patients progress down the funnel, our posterior distributions naturally update, and our projections become more data-informed and actionable as a result.
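Here is a simplified sketch of that idea, using conjugate Beta-Binomial updates rather than our production models. Both the ‘subjective’ priors and the observed stage counts below are made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# One (prior_alpha, prior_beta, successes, trials) tuple per funnel stage.
# Priors stand in for expert-elicited values; counts are hypothetical data.
stages = [
    (2, 8, 45, 200),   # questionnaire completed
    (5, 5, 30, 45),    # phone screen passed
    (3, 7, 9, 30),     # in-person screen passed
]

# Conjugate update: a Beta(a, b) prior plus s successes in n trials yields
# a Beta(a + s, b + n - s) posterior. Sampling each stage's posterior and
# multiplying down the funnel gives a full distribution over end-to-end
# conversion instead of a single point estimate.
draws = np.ones(10_000)
for a, b, s, n in stages:
    draws *= stats.beta(a + s, b + n - s).rvs(10_000, random_state=rng)

lo, mid, hi = np.percentile(draws, [2.5, 50, 97.5])
print(f"end-to-end conversion: median {mid:.1%}, "
      f"95% interval ({lo:.1%}, {hi:.1%})")
```

As more patients move through a stage, its observed counts dominate the prior, so the end-to-end interval tightens on its own, exactly the behavior described above.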

In general, training predictors on small data sets can be particularly prone to overfitting, so we employ regularization methods, model averaging and bootstrapping techniques to combat this. Since our current data sets are neither of the type nor scale that would benefit from off-the-shelf deep learning techniques, we have a battery of methods at hand to make data-informed decisions across the org.
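To sketch what that looks like in practice, here is a toy example on synthetic data, with ridge regression standing in as the regularizer and bootstrap resampling used to gauge how stable the shrunken estimates are:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.utils import resample

rng = np.random.default_rng(1)

# Synthetic stand-in for a small study data set: 40 patients, 5 features.
X = rng.normal(size=(40, 5))
true_coefs = np.array([0.5, -0.3, 0.0, 0.2, 0.0])
y = X @ true_coefs + rng.normal(scale=0.5, size=40)

# Ridge (L2) regularization shrinks coefficients toward zero to limit
# overfitting; refitting on bootstrap resamples shows how much each
# estimate wobbles given only 40 observations.
boot_coefs = np.array([
    Ridge(alpha=1.0).fit(*resample(X, y, random_state=seed)).coef_
    for seed in range(500)
])

means, sds = boot_coefs.mean(axis=0), boot_coefs.std(axis=0)
for i, (mean, sd) in enumerate(zip(means, sds)):
    print(f"feature {i}: coefficient {mean:+.2f} +/- {sd:.2f}")
```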

Python is our language of choice for research and development. It provides a powerful set of libraries for numerical work with small data sets (PyMC, scikit-learn, scipy, Gensim, spaCy and PuLP to name a few) as well as Jupyter notebooks and various plotting libraries for prototyping work and visualizing results.

Finally, we have a team of clinicians with decades upon decades of combined medical experience. Bayesian methods let us operationalize their expertise for predictive modeling by skillful specification of prior distributions.
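As one hypothetical example of such elicitation, a clinician’s statement like “around 20%, and almost certainly below 40%” can be translated into a Beta prior by fixing the prior mean and solving for the concentration that matches the stated tail. This is a sketch of one simple approach, not our exact procedure:

```python
from scipy import stats
from scipy.optimize import brentq

# Encode a hypothetical expert belief: mean pass rate 20%, with 97.5% of
# the prior mass below 40%. Parameterize Beta(a, b) with a = 0.2k and
# b = 0.8k, then solve for the concentration k that hits the tail target.
def tail_gap(k):
    return stats.beta(0.20 * k, 0.80 * k).cdf(0.40) - 0.975

k = brentq(tail_gap, 2.0, 500.0)
a, b = 0.20 * k, 0.80 * k
print(f"elicited prior: Beta({a:.1f}, {b:.1f})")
print(f"prior 95% interval: {stats.beta(a, b).interval(0.95)}")
```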

Simply put, improving clinical trials will save lives

Clinical trials are the gold standard for verifying new medical advances, and improvements in this space will have a transformative impact on countless lives in the future.

We all know someone whose life was saved due to a clinical trial or as a result of a treatment being brought to market. Over time, our technology aims to empower people to learn about and participate in the latest medical advancements, thus bringing them one step closer to a potentially life-saving treatment. We hold our work to the highest standard, not only because it’s essential to our business, but because each data point represents the health and wellbeing of a person.

Changing the status quo

TrialSpark is revolutionizing the way clinical trials are run. We’re reimagining the process by applying first-principles thinking to technology and data. That requires us to design models creatively, challenge existing assumptions and methodologies, and be rigorous in our approach. Our patients demand no less, and we hold ourselves to that standard every step of the way.

Right now, we’re building a diverse, empathetic and world-class team to achieve these goals and many more. If this sounds like a mission and organization that you want to be a part of, drop by our careers page and get in touch. We can’t wait to meet you.
