Visualizing ClinicalTrials.gov Linked Data in Tableau using RDF2X

David Příhoda
Oct 15, 2019 · 6 min read

Linked Data sources store a growing volume of valuable information represented in RDF format. However, relational databases still provide advantages in terms of performance, familiarity and the number of supported tools. We present RDF2X, a tool for automatic distributed conversion of huge RDF datasets to the relational model. In this tutorial, we provide a high-level demonstration by converting the 45GB ClinicalTrials.gov RDF dataset into a relational database and visualizing it using Tableau.


About ClinicalTrials.gov

ClinicalTrials.gov is a database of publicly and privately supported clinical studies of human participants conducted around the world. In this case study, we use its RDF representation created by the Bio2RDF initiative from publicly provided XML files. In this case study, we will focus on the following topics, as described by the ClinicalTrials.gov Glossary:

  • Clinical study A research study using human subjects to evaluate biomedical or health-related outcomes.
  • Eligibility criteria The key standards that people who want to participate in a clinical study must meet or the characteristics they must have.
  • Condition The disease, disorder, syndrome, illness, injury or other health-related issue that is being studied.
  • Intervention A process or action that is the focus of a clinical study. Interventions include drugs, medical devices, procedures, vaccines, and other products that are either investigational or already available.
  • Term An entity that unifies names of drugs, conditions and other concepts related to a clinical study.
  • Country Country where a clinical study takes place.
  • Location Site where a clinical study takes place.

Data preparation

The ClinicalTrials.gov RDF dataset can be downloaded from the official Bio2RDF repository, we will use the Release 4 provided as a single gzipped N-Quads file of 159 million quads (45.1 GB when uncompressed).


Conversion from RDF to a relational database

We will use Maven to run RDF2X locally from source. You can install RDF2X by following the instructions on https://github.com/Merck/rdf2x.

To get a schema supported by visualization tools, we will use the Type predicates relation schema strategy, which will produce a separate many-to-many relation table for each pair of related entity tables and the type of relationship (RDF predicate):

mvn exec:java -Dexec.args="convert \
--input.file /path/to/input/ \
--output.target DB \
--instances.repartitionByType true \
--db.url jdbc:postgresql://localhost:5432/db \
--db.user user \
--db.password password \
--relations.schema TypePredicates"

The conversion process took 9 hours and 10 minutes on a laptop with a 2GHz 4-core processor and 8GB RAM. The data was loaded to a PostgreSQL database running on the same machine.


Relational database generated by RDF2X

To get an overview of the converted data, we can first look at the schema of our created database. In total, 63 entity tables and 491 relation tables were created:

Graph visualization of ClinicalTrials.gov entity relationships. Nodes represent entities, edges represent many-to-many relationships, node sizes and edge weights correspond approximately to numbers of rows. Entities and relationships that we will focus on in this tutorial are highlighted in blue.

Visualization using Tableau

To visualize the data, we will use Tableau, a platform that will allow us to connect directly to the relational database and create plots simply by dragging and dropping table fields.

First, we will look at basic information about clinical studies. We create a Tableau Data Source by providing our database credentials and selecting relevant tables that we will be focusing on. Thanks to the foreign keys added by RDF2X, Tableau will join our tables automatically:

Definition of our first Data Source in Tableau with three entities: Clinical Study, Country and Eligibility.

Each study has a Start Date field, we can therefore plot the number of studies per year. We will also include the Is FDA regulated field which determines whether the study is regulated by the U.S. Food and Drug Administration:

Number of clinical studies per start year. We can see that most recorded studies started in the past two decades and that roughly one third is FDA regulated.

Most clinical studies also define a Completion Date, we can therefore look at their duration:

Distribution of clinical study duration in months. The distinct peaks are located at 12-month intervals, since it is apparently common that a study is scheduled in terms of years.

Next, we can look at countries where the clinical studies were conducted. We can see that almost half of all studies were conducted in the United States:

Number of clinical studies by country.

Next, we will focus on four eligibility criteria for participation — Minimum Age, Maximum Age, Gender and Healthy Volunteers. The last field indicates whether the study allows people who do not have the condition or related conditions or symptoms to participate in that study.

The minimum and maximum age criteria can be visualized together using a highlight table:

Percentage of studies by minimum and maximum required age. Age limits are grouped to bins of ten years. The most frequent combination is minimum age between 10 and 20 years and no limit on maximum age. By further inspection using a custom SQL query, we see that 34.2% of all eligibility criteria define a minimum of 18 years and no maximum age.

Next, we can look at which gender is requested and whether healthy volunteers are accepted:

Percentage of studies by two eligibility criteria — gender and whether healthy volunteers are accepted. In most studies, only volunteers with related conditions or symptoms are accepted. Most studies accept any gender.

Next, we examine conditions, the focus of each clinical study. The conditions of each clinical study are defined in two ways — by specific free-text labels and by one or more unified terms. The free-text labels are stored in the Condition entity table, most labels are rather specific and referenced only by one clinical study. The unified terms are stored in the Term entity table.

Here you can see an overview of the most used condition labels and terms:

Number of studies for the combination of top 10 most frequent condition labels and condition terms.

We see that there is a significant overlap between the two entities — some condition labels such as HIV Infections or Hypertension have corresponding term identifiers. However, some common condition labels such as Healthy, Obese, Breast Cancer or Prostate Cancer do not have corresponding terms. Therefore, we decided to focus on the free-text label from the Condition entity in the next visualizations to determine the most studied conditions.

We can examine the development of the most frequent conditions by plotting number of their studies by year:

Number of studies per year for the most frequent condition labels. We can see that the number of clinical studies of HIV Infections dropped substantially in the last decade.

Finally, we will focus on clinical study interventions. Just like conditions, interventions are defined in two ways — by specific free-text labels and by one or more unified terms. The free-text labels are saved in multiple entities based on intervention type, for example Drug, Procedure and Biological. The unified terms are stored in the Term entity table. To see the distribution of the number of times a label or a term is referenced, we defined a custom SQL datasource using the following SQL queries.

Distribution of the number of times a drug label is used:

SELECT drug.title_en, COUNT(*)
FROM clinical_study_intervention_drug
INNER JOIN drug ON drug_id=drug.id
GROUP BY drug.title_en
Number of studies referencing our drug labels. Vast majority of drug labels are only used once. This is due to the fact that drug labels often contain the dosage and other specific information. Only 183 distinct drug labels are referenced by more than 100 studies.

Distribution of the number of times an intervention term is used:

SELECT term.title_en, COUNT(*)
FROM clinical_study_intervention_browse_term
INNER JOIN term ON term_id=term.id
GROUP BY term.id, term.title_en
Number of studies referencing our intervention terms. Most intervention terms are reused, 436 of them by more than 100 studies.

We can also plot the number of studies per year for the top five most frequent intervention terms:

Number of clinical studies per year for the five most frequent intervention terms. All five of the interventions are chemotherapy drugs approved for medical use by the FDA. The most recent drug is Bevacizumab, approved in 2004, explaining the visible increase in the following years.

Finally, we can look at the most frequent interventions for each one of the most studied conditions by looking at the number of times they are referenced together by the same study:

Top 3 most frequent interventions for the top most frequently studied conditions

Conclusion

In this tutorial, we provided a high-level demonstration of our tool by converting the ClinicalTrials.gov 45GB RDF dataset provided by the Bio2RDF project. We visualized the converted relational database using Tableau, demonstrating one of the possible benefits of automatic conversion of RDF data to the relational model.

Check out https://github.com/Merck/rdf2x for RDF2X usage instructions.

In our next article, we explore Wikidata using SQL by converting a music-related subset to a relational database.

David Příhoda

Written by

Software engineer with focus on machine learning and bioinformatics

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade