How Bioinformatics Benefits from a Semantic Web of Linked Data
Recent years have seen an explosion of biological data produced and stored in disparate databases. Naturally, access to, integration of, and conceptual virtualization of this disparate data are of huge importance to practitioners associated with bioinformatics, biological and drug research, genetics, and clinical trials.
Custom reasoning and inference are less well known, but of no less critical importance to these communities. Thus, it is no coincidence that these communities have operated on the bleeding edge of adoption and exploitation of Semantic Web of Linked Data technologies.
In this post, I showcase how EMBL-EBI (the European Bioinformatics Institute of the European Molecular Biology Laboratory) and UniProt (the Universal Protein Resource) collectively exploit the power of a Semantic Web of Linked Data, courtesy of Virtuoso’s ability to provide high-performance and scalable data access via SPARQL Query Services.
About EMBL-EBI and UniProt
The EMBL-EBI and UniProt projects provide massive, high-quality databases made up of RDF Language sentences/statements that describe entities (individuals), entity types (classes), and entity relationship types (relations) associated with bioinformatics and computational biology. Both projects deploy data in line with Linked Data principles while also providing SPARQL Query Service endpoints for ad-hoc query access.
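To make the idea of RDF sentences concrete, here is a minimal Python sketch that models statements as subject/predicate/object triples. The two statements are illustrative assumptions, not data copied from either database, and the names `Statement`, `classes_of`, and `to_ntriples` are hypothetical helpers introduced only for this example.

```python
from typing import List, NamedTuple

# Standard RDF predicate that relates an entity to its class.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class Statement(NamedTuple):
    """One RDF sentence: subject, predicate, object (all IRIs here)."""
    subject: str
    predicate: str
    obj: str


# Illustrative statements (assumed for this sketch, not fetched from a database).
STATEMENTS = [
    Statement("http://purl.uniprot.org/uniprot/A0A024R752",
              RDF_TYPE,
              "http://purl.uniprot.org/core/Protein"),
    Statement("http://purl.uniprot.org/core/Protein",
              RDF_TYPE,
              "http://www.w3.org/2002/07/owl#Class"),
]


def classes_of(entity: str, statements: List[Statement]) -> List[str]:
    """Return the classes (entity types) asserted for an entity."""
    return [s.obj for s in statements
            if s.subject == entity and s.predicate == RDF_TYPE]


def to_ntriples(statement: Statement) -> str:
    """Serialize an IRI-only statement as one N-Triples line."""
    return f"<{statement.subject}> <{statement.predicate}> <{statement.obj}> ."


if __name__ == "__main__":
    for st in STATEMENTS:
        print(to_ntriples(st))
```

The same entity can appear as a subject in one statement and as an object in another, which is what lets a reader follow links from one description to the next.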
EMBL-EBI
The ability to query and explore a database published to the Web as a Semantic Web of Linked Data is a powerful feature of the SPARQL Query Language. By design, you don’t have to be a query language master to take advantage of SPARQL’s sophisticated functionality, as the following simple example will demonstrate:
[1] Go to the default SPARQL Query Example on the EMBL-EBI SPARQL Query Service.
[2] Click on the “Submit Query” button to obtain query results, which will be shown further down in the same page.
[3] As the query results indicate, the list of classes is so large that it spans several pages. Luckily, the capacity to handle such a large result set is baked into the SPARQL Query Language (via its LIMIT and OFFSET modifiers); i.e., you can page through query results in either direction.
[4] At this juncture, you have a launch point for deeper exploration of the data without ever writing a single line of code. Basically, just “follow your nose,” based on items that pique your interest.
[5] If you’ve installed the OpenLink Structured Data Sniffer browser extension, you can also open up a Query Editor that allows you to tweak and re-execute the underlying SPARQL Query that generated a page, if need be.
[5a] Here’s the new page that is generated when you click the “Run” button:
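For readers who prefer to see the paging behind these steps, the following Python sketch builds a SPARQL query that lists classes one page at a time using LIMIT and OFFSET. The query text is an assumption; the endpoint's default example may be worded differently, and `list_classes_query` is a hypothetical helper.

```python
def list_classes_query(page: int, page_size: int = 25) -> str:
    """Build a SPARQL query for one page of the class list.

    Pages are 1-based. ORDER BY keeps the pages stable between requests,
    since SPARQL gives no ordering guarantee without it.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    offset = (page - 1) * page_size
    return (
        "SELECT DISTINCT ?class\n"
        "WHERE { ?instance a ?class }\n"
        "ORDER BY ?class\n"
        f"LIMIT {page_size}\n"
        f"OFFSET {offset}"
    )


if __name__ == "__main__":
    # Print the query for the third page of 25 results.
    print(list_classes_query(page=3))
```

Moving backward or forward is just a matter of decreasing or increasing the page number, which changes only the OFFSET value.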
UniProt
In similar fashion to EMBL-EBI, this project also provides access to a high-quality database published as a Semantic Web of Linked Data. The table below provides insight into the magnitude of this database.
The following steps demonstrate the kind of data exploration and integration that UniProt provides:
[1] Go to the default SPARQL Query Service endpoint.
[2] Pick the first example from the list presented (on the right).
[3] Click “Submit Query” to get to the Query Results Page, where the requested list of instances will be presented in tabular form.
[4] You can use the ◀︎▶︎ buttons to page through the results, if you like.
[5] You can now start to explore the Semantic Web of Linked Data by clicking on any of the links presented.
[6] If you have installed the OpenLink Structured Data Sniffer browser extension, you can now open up a query editor and alter the query that drives the page or alter the HTTP request parameters, e.g., by setting a different offset value.
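The steps above can also be approximated in code. The sketch below builds a GET request URL using the `query` parameter defined by the SPARQL Protocol, and turns a response in the SPARQL JSON results format into table rows. The endpoint URL and the sample response are assumptions made for illustration; nothing here is fetched from UniProt, and the paging parameters used by the endpoint's own result pages are not modeled.

```python
import json
from typing import List, Optional
from urllib.parse import urlencode

# Assumed endpoint URL; check the project's site for the current address.
UNIPROT_ENDPOINT = "https://sparql.uniprot.org/sparql"


def build_query_url(endpoint: str, query: str) -> str:
    """Return a GET URL carrying the query in the standard `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


def rows_from_json(payload: str) -> List[List[Optional[str]]]:
    """Convert a SPARQL JSON results document into rows of plain values.

    Columns follow the order of head.vars; unbound variables become None.
    """
    doc = json.loads(payload)
    variables = doc["head"]["vars"]
    return [
        [binding.get(var, {}).get("value") for var in variables]
        for binding in doc["results"]["bindings"]
    ]


# Hand-written sample in the SPARQL JSON results format (not real output).
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["protein", "label"]},
  "results": {"bindings": [
    {"protein": {"type": "uri", "value": "http://purl.uniprot.org/uniprot/A0A024R752"},
     "label": {"type": "literal", "value": "example label"}},
    {"protein": {"type": "uri", "value": "http://purl.uniprot.org/uniprot/A0A024R752"}}
  ]}
}
"""

if __name__ == "__main__":
    print(build_query_url(UNIPROT_ENDPOINT, "SELECT * WHERE { ?s ?p ?o } LIMIT 5"))
    for row in rows_from_json(SAMPLE_RESPONSE):
        print(row)
```

To request JSON from a live endpoint, a client would typically send an `Accept: application/sparql-results+json` header with the URL built above.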
EMBL-EBI and UniProt Conceptual Data Virtualization and Integration
Thus far, we’ve experienced the most basic data access and exploration features delivered by the EMBL-EBI and UniProt databases.
We will now take this one step further, and look at how a Semantic Web of Linked Data offers powerful Data Virtualization and Integration, via the following steps:
[1] Go to the EMBL-EBI SPARQL Query Service Endpoint.
[2] Select “FederatedQuery”, and “Query connecting Ensembl with UniProt endpoint” from the collection of example queries. Note the SERVICE clause highlighted in the image below, which identifies another SPARQL Query Service from which external data will be accessed.
[3] Click “Submit Query” to execute this query.
[4] As in the prior examples, you will be presented with a query results page, but this time around, the data presented has come from both the EMBL-EBI and UniProt databases, and has been semantically combined.
[5] Click on any hyperlink in the query results (for instance, <http://purl.uniprot.org/uniprot/A0A024R752>), and you will be presented with a page that describes the selected Entity using values coherently sourced from both databases.
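The shape of such a federated query can be sketched in Python as follows. This is not the example query shipped with the endpoint: the graph patterns, the use of `rdfs:seeAlso` as the linking predicate, and the endpoint URL are all assumptions chosen to show where the SERVICE clause sits, and `federated_query` is a hypothetical helper.

```python
from typing import Sequence

# Assumed URL of the remote SPARQL Query Service.
UNIPROT_ENDPOINT = "https://sparql.uniprot.org/sparql"

PREFIXES = (
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "PREFIX up: <http://purl.uniprot.org/core/>\n"
)


def federated_query(variables: Sequence[str],
                    local_pattern: str,
                    remote_endpoint: str,
                    remote_pattern: str,
                    limit: int = 10) -> str:
    """Combine a local graph pattern with one evaluated by a remote service.

    Variables shared between the two patterns act as the join keys.
    """
    projection = " ".join(f"?{name}" for name in variables)
    return (
        f"{PREFIXES}"
        f"SELECT {projection}\n"
        "WHERE {\n"
        f"  {local_pattern}\n"
        f"  SERVICE <{remote_endpoint}> {{\n"
        f"    {remote_pattern}\n"
        "  }\n"
        "}\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Illustrative patterns only; the real data may use different predicates.
    print(federated_query(
        variables=["transcript", "protein", "mnemonic"],
        local_pattern="?transcript rdfs:seeAlso ?protein .",
        remote_endpoint=UNIPROT_ENDPOINT,
        remote_pattern="?protein a up:Protein ; up:mnemonic ?mnemonic .",
    ))
```

Because `?protein` appears both inside and outside the SERVICE block, the query engine joins local rows with remote rows on that variable, which is how results from the two databases end up in one table.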
Why is this important?
Data access, integration, and management are some of the biggest challenges facing scientific research, which is expected to produce timely and cost-effective outcomes (new drugs, cures, etc.) and insights that broaden knowledge frontiers.
By adhering to the guiding principles of Linked Data, and using RDF as their Structured Data Representation Language, both EMBL-EBI and UniProt provide invaluable databases to a broad spectrum of practitioners.
By adopting Virtuoso for this function, both projects have successfully delivered on their fundamental goals without compromising performance or scalability, and deployed a Semantic Web of Linked Data that enriches the massive Linked Open Data Cloud.
Conclusion
A collection of open standards from the W3C is already in place that provides a powerful foundation for modern data access, integration, and management, as exemplified by the EMBL-EBI and UniProt database initiatives.
Virtuoso simply ensures that possibility becomes reality without any distractions associated with cost, performance, scalability, or security.