How Bioinformatics Benefits from a Semantic Web of Linked Data

Caroline Idehen
OpenLink Virtuoso Weblog
6 min read · Oct 30, 2017

Recent years have seen an explosion of biological data produced and stored in disparate databases. Naturally, access to, integration of, and conceptual virtualization of this data are of huge importance to practitioners in bioinformatics, biological and drug research, genetics, and clinical trials.

Custom reasoning and inference are less well known, but no less critical to these communities. It is therefore no coincidence that these communities have operated at the bleeding edge of adopting and exploiting Semantic Web of Linked Data technologies.

In this post, I showcase how EMBL-EBI (the European Bioinformatics Institute of the European Molecular Biology Laboratory) and UniProt (the Universal Protein Resource) collectively exploit the power of a Semantic Web of Linked Data, courtesy of Virtuoso’s ability to provide high-performance and scalable data access via SPARQL Query Services.

About EMBL-EBI and UniProt

The EMBL-EBI and UniProt projects provide massive, high-quality databases made up of RDF Language sentences/statements that describe entities (individuals), entity types (classes), and entity relationship types (relations) associated with bioinformatics and computational biology. Both projects deploy data in line with Linked Data principles while also providing SPARQL Query Service endpoints for ad-hoc query access.
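To make the idea of an RDF statement concrete, here is a minimal Python sketch that writes one statement as a subject, predicate, object triple in N-Triples form. The entity URI is the one that appears later in this post; describing it as an instance of up:Protein (from the UniProt core vocabulary) is an assumption made for illustration, and the helper name `to_ntriples` is hypothetical.

```python
# Each RDF statement is a (subject, predicate, object) triple of identifiers.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Assumption for illustration: the entity below is an instance of up:Protein.
ENTITY = "http://purl.uniprot.org/uniprot/A0A024R752"
PROTEIN_CLASS = "http://purl.uniprot.org/core/Protein"


def to_ntriples(subject: str, predicate: str, obj: str) -> str:
    """Serialize one triple of URIs as a single N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


# "Entity is of type Class": the relation here is rdf:type.
statement = to_ntriples(ENTITY, RDF_TYPE, PROTEIN_CLASS)
print(statement)
```

Here the subject is the entity, the object is its class, and the predicate is the relation that links the two.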

EMBL-EBI

The ability to query and explore a database published to the Web as a Semantic Web of Linked Data is a powerful feature of the SPARQL Query Language. By design, you don’t have to be a query language master to take advantage of SPARQL’s sophisticated functionality, as the following simple example will demonstrate:

[1] Go to the default SPARQL Query Example on the EMBL-EBI SPARQL Query Service.

Sample Query seeking information about all classes in the Virtuoso RDBMS that underlies the EMBL-EBI Query Service

[2] Click on the “Submit Query” button to obtain query results, which will be shown further down in the same page.

Query Result (or Solution) in the form of a list of hyperlinks (HTTP URIs) that identify classes in the RDBMS

[3] As the query results indicate, the list of classes is so large that it spans several pages. Luckily, the capacity to handle such a large result set is baked into the SPARQL Query Language through its LIMIT and OFFSET clauses; i.e., you can page through query results in either direction.

Snippet from page that lists 25 Classes from OFFSET 125 in the total Query Result Set

[4] At this juncture, you have a launch point for deeper exploration of the data without ever writing a single line of code. Basically, just “follow your nose,” based on items that pique your interest.

A Web Page about 5-fluorouracil, a medication which is used in the treatment of cancer

[5] If you’ve installed the OpenLink Structured Data Sniffer browser extension, you can also open up a Query Editor that allows you to tweak and re-execute the underlying SPARQL Query that generated a page, if need be.

Setting Query Result offset to 75

[5a] Here’s the new page that is generated when you click the “Run” button:

New Query Results Page scoped to offset=75 in the query solution.
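The paging shown in the steps above can also be driven from a script. The following is a minimal Python sketch, not the service's own example query: the endpoint URL is an assumption (the post does not spell it out), the query wording is illustrative, and the function names are hypothetical. It only builds the request URLs; fetching them is left to whichever HTTP client you prefer.

```python
from urllib.parse import urlencode

# Assumed endpoint URL; the post does not spell it out.
ENDPOINT = "https://www.ebi.ac.uk/rdf/services/sparql"


def classes_query(limit: int = 25, offset: int = 0) -> str:
    """Build a query listing one page of the classes used in the data."""
    if limit <= 0 or offset < 0:
        raise ValueError("limit must be positive and offset non-negative")
    return (
        "SELECT DISTINCT ?class\n"
        "WHERE { [] a ?class }\n"
        "ORDER BY ?class\n"
        f"LIMIT {limit} OFFSET {offset}"
    )


def request_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Encode the query as the 'query' parameter of a SPARQL Protocol GET."""
    return endpoint + "?" + urlencode({"query": query})


# Offsets 75 and 125 correspond to the result pages described above.
for offset in (75, 125):
    print(request_url(classes_query(limit=25, offset=offset)))
```

The ORDER BY clause is included because, without a stable ordering, consecutive pages are not guaranteed to line up.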

UniProt

In similar fashion to EMBL-EBI, this project also provides access to a high-quality database published as a Semantic Web of Linked Data. The table below provides insight into the magnitude of this database.

Latest SPARQL UniProt edition (release 2017_09)

The following steps demonstrate the kind of data exploration and integration that UniProt provides:

[1] Go to the default SPARQL Query Service endpoint.

Default SPARQL Query Service Page that includes a collection of sample queries

[2] Pick the first example from the list presented (on the right).

Selected example query — return a page listing all instances of a Class identified by up:Taxon

[3] Click “Submit Query” to get to the Query Results Page, where the requested list of instances will be presented in tabular form.

Tabular list of hyperlinks (HTTP URIs) that identify instances of up:Taxon

[4] You can use the ◀︎▶︎ buttons to page through the results, if you like.

Query Results page that includes bi-directional paging.

[5] You can now start to explore the Semantic Web of Linked Data by clicking on any of the links presented.

An entity description page for the up:Taxon instance selected from the query results page

[6] If you have installed the OpenLink Structured Data Sniffer browser extension, you can now open up a query editor and alter the query that drives the page or alter the HTTP request parameters, e.g., by setting a different offset value.
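For readers who would rather consume the results programmatically than click through them, here is a minimal Python sketch that pulls the hyperlinks out of a response in the standard SPARQL JSON results format. The query text is illustrative and may be worded differently from the service's own first example, the sample response is hand-written rather than real service output, and the helper name `uris_for` is hypothetical.

```python
import json
from typing import List

# Illustrative query; the service's own first example may be worded differently.
TAXON_QUERY = """\
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?taxon
WHERE { ?taxon a up:Taxon }
LIMIT 25
"""

# Hand-written sample in the SPARQL JSON results format, not real service output.
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["taxon"]},
  "results": {"bindings": [
    {"taxon": {"type": "uri", "value": "http://purl.uniprot.org/taxonomy/9606"}},
    {"taxon": {"type": "uri", "value": "http://purl.uniprot.org/taxonomy/10090"}}
  ]}
}
"""


def uris_for(variable: str, response_text: str) -> List[str]:
    """Return the URI values bound to one variable, in result order."""
    bindings = json.loads(response_text)["results"]["bindings"]
    return [
        row[variable]["value"]
        for row in bindings
        if variable in row and row[variable]["type"] == "uri"
    ]


taxa = uris_for("taxon", SAMPLE_RESPONSE)
print(taxa)
```

Each URI returned this way is itself a link that can be followed, which is the scripted equivalent of clicking through the results table.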

EMBL-EBI and UniProt Conceptual Data Virtualization and Integration

Thus far, we’ve experienced the most basic data access and exploration features delivered by the EMBL-EBI and UniProt databases.

We will now take this one step further, and look at how a Semantic Web of Linked Data offers powerful Data Virtualization and Integration, via the following steps:

[1] Go to the EMBL-EBI SPARQL Query Service Endpoint.

[2] Select “FederatedQuery”, and “Query connecting Ensembl with UniProt endpoint” from the collection of example queries. Note the SERVICE clause highlighted in the image below, which identifies another SPARQL Query Service from which external data will be accessed.

Federated SPARQL Query Example where UniProt Query Service is referenced as a data provider in an EMBL-EBI SPARQL Query

[3] Click “Submit Query” to execute this query.

[4] As in the prior examples, you will be presented with a query results page, but this time the data presented has come from both the EMBL-EBI and UniProt databases, and has been semantically combined.

Federated SPARQL Query Results Page

[5] Click on any hyperlink in the query results (for instance, <http://purl.uniprot.org/uniprot/A0A024R752>), and you will be presented with a page that describes the selected Entity using values coherently sourced from both databases.

Entity description page that demonstrates conceptual data virtualization — i.e., a single entity description where attribute values originate from disparate data sources, coherently
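The shape of a federated query like the one used in the steps above can be sketched in a few lines of Python. This is a simplified illustration, not the example query shipped with the EMBL-EBI service: the UniProt endpoint URL is an assumption, the local graph pattern is hypothetical, and the function name is made up for this sketch. The variable shared between the local pattern and the SERVICE block (`?protein`) is what joins data from the two sources.

```python
# Assumed UniProt endpoint URL; the post does not spell it out.
UNIPROT_ENDPOINT = "https://sparql.uniprot.org/sparql"


def federated_query(local_pattern: str, remote_endpoint: str,
                    remote_pattern: str, limit: int = 10) -> str:
    """Join a local graph pattern with one evaluated by a remote service."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "PREFIX up: <http://purl.uniprot.org/core/>\n"
        "SELECT *\n"
        "WHERE {\n"
        f"  {local_pattern}\n"
        f"  SERVICE <{remote_endpoint}> {{\n"
        f"    {remote_pattern}\n"
        "  }\n"
        "}\n"
        f"LIMIT {limit}"
    )


query = federated_query(
    # Hypothetical local pattern: some resource cross-references a protein.
    local_pattern="?gene rdfs:seeAlso ?protein .",
    remote_endpoint=UNIPROT_ENDPOINT,
    # Evaluated remotely: the protein's type and mnemonic from UniProt.
    remote_pattern="?protein a up:Protein ; up:mnemonic ?mnemonic .",
)
print(query)
```

The printed query would be submitted to the local endpoint, which in turn contacts the service named in the SERVICE clause and combines the two sets of bindings.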

Why is this important?

Data access, integration, and management are the biggest challenges facing scientific research, which is expected to produce timely and cost-effective outcomes (new drugs, cures, etc.) and insights that broaden knowledge frontiers.

By adhering to the guiding principles of Linked Data, and using RDF as their Structured Data Representation Language, both EMBL-EBI and UniProt provide invaluable databases to a broad spectrum of practitioners.

By adopting Virtuoso for this function, both projects have successfully delivered on their fundamental goals without compromising performance or scalability, and deployed a Semantic Web of Linked Data that enriches the massive Linked Open Data Cloud.

Conclusion

A collection of open standards from the W3C is already in place and provides a powerful foundation for modern data access, integration, and management, as exemplified by the EMBL-EBI and UniProt database initiatives.

Virtuoso simply ensures that possibility becomes reality without any distractions associated with cost, performance, scalability, or security.
