Visualizing ClinicalTrials.gov Linked Data in Tableau using RDF2X
Linked Data sources store a growing volume of valuable information represented in RDF format. However, relational databases still provide advantages in terms of performance, familiarity and the number of supported tools. We present RDF2X, a tool for automatic distributed conversion of huge RDF datasets to the relational model. In this tutorial, we provide a high-level demonstration by converting the 45GB ClinicalTrials.gov RDF dataset into a relational database and visualizing it using Tableau.
ClinicalTrials.gov is a database of publicly and privately supported clinical studies of human participants conducted around the world. In this case study, we use its RDF representation created by the Bio2RDF initiative from publicly provided XML files. In this case study, we will focus on the following topics, as described by the ClinicalTrials.gov Glossary:
- Clinical study A research study using human subjects to evaluate biomedical or health-related outcomes.
- Eligibility criteria The key standards that people who want to participate in a clinical study must meet or the characteristics they must have.
- Condition The disease, disorder, syndrome, illness, injury or other health-related issue that is being studied.
- Intervention A process or action that is the focus of a clinical study. Interventions include drugs, medical devices, procedures, vaccines, and other products that are either investigational or already available.
- Term An entity that unifies names of drugs, conditions and other concepts related to a clinical study.
- Country Country where a clinical study takes place.
- Location Site where a clinical study takes place.
The ClinicalTrials.gov RDF dataset can be downloaded from the official Bio2RDF repository, we will use the Release 4 provided as a single gzipped N-Quads file of 159 million quads (45.1 GB when uncompressed).
Conversion from RDF to a relational database
We will use Maven to run RDF2X locally from source. You can install RDF2X by following the instructions on https://github.com/Merck/rdf2x.
To get a schema supported by visualization tools, we will use the Type predicates relation schema strategy, which will produce a separate many-to-many relation table for each pair of related entity tables and the type of relationship (RDF predicate):
mvn exec:java -Dexec.args="convert \
--input.file /path/to/input/ \
--output.target DB \
--instances.repartitionByType true \
--db.url jdbc:postgresql://localhost:5432/db \
--db.user user \
--db.password password \
The conversion process took 9 hours and 10 minutes on a laptop with a 2GHz 4-core processor and 8GB RAM. The data was loaded to a PostgreSQL database running on the same machine.
Relational database generated by RDF2X
To get an overview of the converted data, we can first look at the schema of our created database. In total, 63 entity tables and 491 relation tables were created:
Visualization using Tableau
To visualize the data, we will use Tableau, a platform that will allow us to connect directly to the relational database and create plots simply by dragging and dropping table fields.
First, we will look at basic information about clinical studies. We create a Tableau Data Source by providing our database credentials and selecting relevant tables that we will be focusing on. Thanks to the foreign keys added by RDF2X, Tableau will join our tables automatically:
Each study has a Start Date field, we can therefore plot the number of studies per year. We will also include the Is FDA regulated field which determines whether the study is regulated by the U.S. Food and Drug Administration:
Most clinical studies also define a Completion Date, we can therefore look at their duration:
Next, we can look at countries where the clinical studies were conducted. We can see that almost half of all studies were conducted in the United States:
Next, we will focus on four eligibility criteria for participation — Minimum Age, Maximum Age, Gender and Healthy Volunteers. The last field indicates whether the study allows people who do not have the condition or related conditions or symptoms to participate in that study.
The minimum and maximum age criteria can be visualized together using a highlight table:
Next, we can look at which gender is requested and whether healthy volunteers are accepted:
Next, we examine conditions, the focus of each clinical study. The conditions of each clinical study are defined in two ways — by specific free-text labels and by one or more unified terms. The free-text labels are stored in the Condition entity table, most labels are rather specific and referenced only by one clinical study. The unified terms are stored in the Term entity table.
Here you can see an overview of the most used condition labels and terms:
We see that there is a significant overlap between the two entities — some condition labels such as HIV Infections or Hypertension have corresponding term identifiers. However, some common condition labels such as Healthy, Obese, Breast Cancer or Prostate Cancer do not have corresponding terms. Therefore, we decided to focus on the free-text label from the Condition entity in the next visualizations to determine the most studied conditions.
We can examine the development of the most frequent conditions by plotting number of their studies by year:
Finally, we will focus on clinical study interventions. Just like conditions, interventions are defined in two ways — by specific free-text labels and by one or more unified terms. The free-text labels are saved in multiple entities based on intervention type, for example Drug, Procedure and Biological. The unified terms are stored in the Term entity table. To see the distribution of the number of times a label or a term is referenced, we defined a custom SQL datasource using the following SQL queries.
Distribution of the number of times a drug label is used:
SELECT drug.title_en, COUNT(*)
INNER JOIN drug ON drug_id=drug.id
GROUP BY drug.title_en
Distribution of the number of times an intervention term is used:
SELECT term.title_en, COUNT(*)
INNER JOIN term ON term_id=term.id
GROUP BY term.id, term.title_en
We can also plot the number of studies per year for the top five most frequent intervention terms:
Finally, we can look at the most frequent interventions for each one of the most studied conditions by looking at the number of times they are referenced together by the same study:
In this tutorial, we provided a high-level demonstration of our tool by converting the ClinicalTrials.gov 45GB RDF dataset provided by the Bio2RDF project. We visualized the converted relational database using Tableau, demonstrating one of the possible benefits of automatic conversion of RDF data to the relational model.
Check out https://github.com/Merck/rdf2x for RDF2X usage instructions.
In our next article, we explore Wikidata using SQL by converting a music-related subset to a relational database.