SQL vs. TypeQL: Modelling and Querying of Biomedical Data

Writing 151 SQL lines in 4 lines

Tomas Sabat
Vaticle
8 min readDec 4, 2018

--

Using SQL to query relational databases is easy. As a declarative language, it’s straightforward to write queries and build powerful applications. Having been around for many decades, SQL has developed into a language that is extremely robust, reliable, fast and useful for applications of many sizes.

However, SQL struggles when working with complex data. By complex data, I mean data that contains many entity types and is highly interconnected. When querying such data in SQL, challenges arise in the modelling and querying of the data. For example, due to the large number of necessary JOINs, it forces us to write long and verbose queries. Not only do such queries take very long to run, but they are also difficult to write and prone to mistakes. When working with complex data in SQL, we can list the challenges as follows:

  1. Complex to write queries — analysing hierarchical data with many relationships leads to SQL statements containing many JOINs which easily become too difficult to understand.
  2. Slow query speed — as a consequence, queries with a large number of JOINs create complexity in the computing response and lead to high query response times.
  3. Complicated data models — domains with complex data require non-intuitive data models with (join) tables that create unnecessary complexity and reduce data quality.

The above challenges are especially pertinent when working with biomedical data (see more here). This is exactly the type of data that becomes difficult to model and query in SQL. It contains many entity types (e.g. various of proteins, genes, tissues) and many connections between those entities (interactions, associations, etc). This is why using a query language such as TypeQL can reduce the modelling and querying complexity by orders of magnitude (more on that below). That’s why in this article, I want to look at how TypeQL compares to SQL when working with biomedical data.

At the end of this article, I’ll show you how we can write a 151-line SQL query in just 4 lines in TypeQL.

An Introduction to TypeQL

If you’re unfamiliar with TypeQL, I want to first give a brief high level overview of the modelling constructs (see here for an in-depth introduction).

Just as SQL is the standard query language in relational databases, TypeQL is TypeDB’s query language. It’s a declarative language, and allows us to model, query and reason over our data. Crucially, it’s TypeQL’s expressivity that makes it the most readable query language in the world. This is really important because it means even non-programmers (such as biologists rather than bioinformaticians) are able to talk to the database and make sense of it.

TypeDB’s data model consists of three core types:

  • Entities are the main actors in your domain. These are usually the type of things you want to know about (e.g. proteins, genes, tissues, diseases)
  • A Relation describes how two or more things are in some way connected to each other. Each relation must relate to roles that are played by something else in the domain. In other words, relations are dependent on the existence of at least two other things (e.g. protein interactions, process localisation, transcription).
  • Attributes are small pieces of data that get attached to other concepts (e.g. uniprot-id, entrez-id, disease-name).

Below is how these concepts can be used to create a simple model (or schema). The example below specifically shows us how genes encode proteins and are associated to diseases.

Simple TypeDB model: Green nodes are entities, purple diamonds relations and red boxes attributes.

Finally, Rules are logical patterns that we encode in our model which allow us to reason over our existing data to create new instances of entities, relations and attributes. Effectively, this means that by imposing a certain logic over existing data, we can answer exponentially more questions from the same data.

For example, if our original data doesn’t include protein-disease relations, we could infer these protein-disease relations by looking at the protein’s encoding genes and connecting it with the diseases to which they are associated to. In other words:

If:

  • Genes encode proteins;
  • And genes are associated to diseases.

Then:

  • Create protein-disease associations.

This rule would be part of our schema, which we could then visualise as follows, where the dotted line represents the inferred relation:

The dotted line attached to the diamond shape represents the inferred relation.

SQL and TypeQL: Data Modelling

With this basic understanding of TypeQL (and assuming you understand SQL), let’s compare how we would represent a more complicated model in both TypeQL and SQL, and let’s find out which one is more intuitive.

For this comparison, I want to specifically look at disease networks. The entities I want to model include:

And the corresponding relations:

  • Genes encode proteins
  • Genes are associated to drugs
  • Genes are associated to diseases

Finally, I want to include sequence alignment data from a BLAST search (more on that here) that we previously did. For my model, I want to integrate the positivity and identicality scores that BLAST returns between match and target sequences. We model this as relations between sequences:

  • Target protein sequences have positivity and identicality scores with match protein sequences

Pictured below is how we would represent this domain in a relational model.

Relational model of our disease network. Yellow tables are join tables.

There are a few things to note:

  • There are three join tables to connect genes to diseases, drugs and proteins.
  • The sequence alignments are also modelled as a join table, where I used protein sequences as foreign keys to connect them with proteins. In this table I also added positivity and identicality scores.

Now that we’ve modelled this disease network in a relational model, how would we go about doing this in TypeDB? See below.

Visual representation of the TypeDB’s schema of the disease network domain. Green nodes are entities, purple diamonds relations and red shapes attributes.

It’s important to note that as TypeQL is so much more expressive than SQL, when modelling any domain in TypeQL, we need to think differently if we’re used to modelling in a relational world. For example, we should take into account TypeQL’s type hierarchy, and think of concepts which we later want to infer through the use of rules. With that in mind, for my model:

  • I sub-typed four protein types (kinase, ion-channel, nuclear receptor and GPCR), and created a parent type protein
  • I created a relation between two sequence attributes called sequence-sequence-alignment, and on that relation, I attached identicality and positivity attributes
  • I also created a protein-protein-alignment relation, so that I can create rules that infer this relation
  • Finally, I created a drug-disease-association relation, so that I can also infer this relation

SQL and TypeQL: Query Examples

Now that we’ve created these models, how would we go about querying these in SQL? And how would this compare in TypeQL?

First, let’s start with a simple question.

  1. Which drugs interact with the gene Ensembl-id ‘ENSG00000010404'?

SQL query:

TypeQL query:

Okay, it looks like TypeQL is more intuitive to read. Let’s look at a slightly more complicated query.

2. Which Kinases are associated to Diabetes?

SQL query:

TypeQL query:

The TypeQL query is two times shorter than its SQL equivalent!

Now, bear in mind that for this TypeQL query we leverage the rule that I previously created to infer protein-disease-associations if both entities are connected to the same gene. This allows us to ask a higher level question, and let the system figure out how to do the navigation. In SQL, as it doesn’t have a reasoning engine, we need to write the explicit query by doing three JOINs.

Finally, let’s look at an extreme example.

3. Which drugs are associated to the disease Asthma?

SQL query:

TypeQL query:

Wow! That’s quite a big difference — the TypeQL query is 37 times shorter than its SQL equivalent. Now, please note I’m really not trying to take the piss out of SQL! And I’m sure it can still be optimised by someone highly experienced in it. That being said, I hope to show here how an average person (such as me) would go about writing this.

Let me expand how we reduced a 151-line SQL query into a 4-line TypeQL query. Due to the underlying data structure and the logic we want to apply, the query itself leads to a combinatory explosion due to the number of permutations of the query. TypeQL’s type-based and rule-based reasoning abstracts this away, allowing us to ask the higher level query.

In the original data ingested, no drug-disease relations exist. Therefore, in order to answer this question, we want to apply the following logic:

If:

  • If diseases are connected to genes (in our example above we specifically asked for the disease Asthma);
  • And genes encode proteins (kinases, ion channels, nuclear receptors, GCPRs);
  • If any of those proteins has a sequence alignment with another protein containing a positivity higher than 0.98 and an identicality higher than 0.98,
  • and that protein is encoded by a gene,
  • which is associated to a drug

Then:

  • Create an association between asthma and the drug.

I stored this logic in the form of a rule (see below) in my TypeDB schema. This means that subsequently I can just query for drug-disease-relations and the TypeDB will figure out how to do the navigation for me.

The code of the rule that encodes the logic that we can later leverage in order to compress the SQL query.

However, as SQL doesn’t support reasoning, writing the SQL query results in the combinatory explosion that you saw above. Specifically, this is because there are four types of proteins that contain sequences that can either be a target-sequence or a match-sequence in a sequence-alignment relation connected to any of the four other protein types. In SQL we have to explicitly state all the paths that the query can take, whereas in TypeDB we just reason over our data, using rule-based and type-based inferencing.

Visualisation of the underlying logic that finds us drugs associated to Asthma. The green node is Asthma, the purple node is a drug, yellow nodes are genes, green nodes are protein types, and blue nodes are sequences (screenshot made using Workbase).

Final Words

SQL is undoubtedly the most battle-hardened query language in the world. And to this day powers most of world’s IT infrastructure. However, when it comes to handling complex data, SQL struggles:

  1. It’s complex to write queries;
  2. it leads to slow query speeds;
  3. and forces us to create complicated data models.

Fundamentally, these challenges are inevitable as the relational data model was not created to work with this type of data. Using a much more expressive query language that reasons over our data abstracts away this complexity so that we can focus on the higher level question.

If you and your development team face these types of challenges while working with SQL, using TypeQL will be inevitable. Luckily, it’s incredibly intuitive and easy to learn, so you’ll get up to speed in less than no time.

If you have any questions, comments or would like to collaborate, please shoot me an email at tomas@vaticle.com or tweet me at @tasabat. You can also talk to us and discuss your ideas with the TypeDB community.

(thanks to Soroush Saffari for helping with this article especially creating the SQL queries!)

--

--