Analyzing Genomes in a Graph Database

Sixing Huang
Geek Culture
Published in
11 min readMar 10, 2021

--

A genome is the sum of all genetic materials inside an organism, be it a virus, a bacterium, or a human. It is the book of life written in just four letters: A, T, C, and G. However simple is its alphabet, this book contains all the instructions needed for its owner to reproduce and to survive. Just like a book with many sentences, a genome contains many genes. With these genes, the cell can manufacture proteins to construct itself and fulfill various biochemical functions. Therefore, scientists can learn a lot about the organism by sequencing and analyzing its genes in its genome.

Scientists have figured out the functions of many genes through experiments. It is assumed that genes with similar DNA sequences shall fulfill the same function. So scientists group genes with similar sequences into clusters and name them in an ontological way. One of the ontologies widely used today is Kyoto Encyclopedia of Genes and Genomes (KEGG). Gene clusters are labeled with KEGG Orthology (KO) numbers such as K00001 or K02392. By labeling each gene with KO number in a genome, we can quickly get a sense of what the organism can do biochemically.

In fact, we can do more than that. Since a genome can contain many KO numbers and a KO number can appear in many genomes, we can on the one hand identify similar genomes if they share many common KO numbers (Figure 1). On the other hand, we can deduce the evolutionary history of a KO cluster based on where it occurs in the tree of life (Figure 2). However, scientists store the genome and KO information in relational databases such as MySQL, which makes it difficult to get the answers because it involves lots of “Join” operations between tables. In this case, the graph database comes to the rescue.

Figure 1. Use KO to identify similar genomes. In this case, Genome 1 and 2 are similar because they share more KO than Genome 3.
Figure 2. With phylogeny it is possible to infer the evolutionary history of KO clusters. In this case, K00001 appeared earlier in the evolution than K02032, because the latter is only shared by two close relatives while K00001 is even found in a remote genome.

In a graph database, data points and their relations are stored as nodes and edges, hence graph. Data are essentially represented as a network (Figure 3). It is much eaiser to formulate relational queries in a graph database than in a relational database. In fact, graph queries are structured very much in the form of “node-edge-node”. There are several popular graph databases on the market, such as AWS Neptune, DGraph and Neo4j. Among them, Neo4j is open-source, relatively mature and easy to use. It uses the Cypher query language.

Figure 3. A simple genome analysis in Neo4j

Today in this tutorial, I will show how to store the KO annotations of all the Proteobacteria on the KEGG website in Neo4j and use Cypher to get some quick insight into the data. The taxonomy of each genome is also modeled as connected nodes in the graph so that we can do some evolutionary analyses. All the code and data (accessed at 2021.01.19) for this project is hosted in my Github repository:

1. Prepare the data

First off we need to grab the KO annotation of all the genomes via the KEGG API. For this project, we need both the genome and each of the KO detail information. I have written a Python script to download them into two folders. Secondly, we need to process all those data into four CSV files. They are 1. the taxa CSV, 2. KO detail CSV, 3. the connections between taxa and 4. the connections between KO and genomes. The goal is to establish a network that models both the taxonomy and the KO-genome relationship like in Figure 4.

Figure 4. This tutorial aims to establish a network that models both the taxonomy and the KO-genome relationship.

For this tutorial, I only investigated the proteobacterial genomes. Proteobacteria is one of the most diverse and abundant bacterial phyla. It contains members such as Escherichia coli, Rhizobium leguminosarum and Legionella pneumophila that are highly relevant to our daily life.

2. Import data into Neo4j

Now it is time to move onto Neo4j. At the moment, there are at least four ways to start a Neo4j instance. They are “Community server”, Docker, “Neo4j Desktop” and Neo4j Aura. The first three do not cost money to run locally while the last one is a fully managed cloud service.

For this project, any of the local solutions suffice. First, create a project called “kegg”.

Figure 5. Create project

It is a specialty in Neo4j that only files in the “import” folder inside the database installation can be imported. Here in Neo4j Desktop, I can open it by clicking “…” -> “Open folder” -> “Import”. Once opened, copy all four CSV files there.

Figure 6. Import folder

Once ready, we click “Start” and a Neo4j Browser opened. We can run commands in the upper console area. Issue the following commands to import the taxonomy, KO and their relations into the database. They created two types of nodes: the “taxon” (the red ones in Figure 4) and “ko” (the blue ones in Figure 4) nodes. The higher taxa point to their next level lower taxa by the “has_taxon” relation. The genomes point to their KO with the “has_ko” relation. The “CREATE CONSTRAINT” commands are necessary because they accelerate the import significantly by declaring all the nodes unique.

The last command took some time. If you encounter an error “Java heap space” during the last import, you can uncomment the “dbms.memory.heap.max_size” and set it to 4G back in the neo4j.conf and the error should go away.

3. Get the overall statistics

Now the fun part begins. We can first have some statistics over our graph before we dive into more detailed case studies.

To get the total count of taxa in our graph, issue this command:

It returned 4131. That is the total number of taxon nodes. Notice the syntax is quite similar to SELECT in SQL. The “t” is an alias, while :taxon is the node type. However, those 4131 include everything from the “Bacteria” taxon, “Proteobacteria” taxon down to every genome. To only get the genome counts, issue:

This query gives you the first glance of the power of Cypher. The MATCH part itself succinctly reflects the relation in our data, that is, a taxon has kegg. So every node that satisfies this relation is returned. Unlike the last query, the “DISTINCT” keyword here is necesary because Cypher will double count the genomes repeatedly every time it goes through a genome -> kegg pair.

Similarly, to get the total counts of KO numbers, issue the following command:

Next we can calculate which KO numbers are the most common:

The k.ko_id and k.name are there to limit the output to these two columns. The query returns

As you can see, the list is dominated by ribosomal proteins. The only outsider that makes it into the top 10 is RNA polymerase subunit alpha. How about the rarest KO? Just by removing “DESC” in the last query, you will get

At first glance, the top 10 rarest KO numbers do not seem to have any pattern. In fact, there are 566 KO numbers that are found only once in one genome. They can be quickly retrieved by issuing:

This fequency list is very useful for future genome analyses. Given a new genome and its KO annotations, scientists often investigate its rare KO first instead of the common ones. To do this, they can simply look up the frequency of each its KO in the list and sort the results. By the same token, the genome frequency list can highlight which taxonomic groups were well sequenced and which awaited more attention. In summary, these frequency lists are genomic scientist’s guideposts. For example, the following query will list the top 20 rarest KO in the genome of Bdellovibrio exovorus JSS:

which returns:

As you can see, in our Proteobacteria-only dataset, Bdellovibrio exovorus JSS is the only genome that encodes the K21148 “[CysO sulfur-carrier protein]-thiocarboxylate-dependent cysteine synthase”. Following the list, it shares K05665 with Bdellovibrio sp. qaytius and Silvanigrellales bacterium RF1110005, two relatives. The third KO, K12351, JSS shares it with a relative Bdellovibrio bacteriovorus W. But curiously, the third owner is a not-so-close Arcobacter canalis, an Epsilonproteobacteria. Finally, JSS encodes the rare DNA topoisomerase VI.

4. Core genome analyses

The task of finding similar genomes can be boiled down to this: find out genomes that share most of their KO. These shared subsets of KO can be called their core KO or core genomes. Conversely, KO that are unique to only some genomes are called unique KO. Core KO and unique KO together form the pan KO or pan genome.

Figure 7. Core, unique and pan genomes

It is surprisingly easy and quick to generate the core and unique KO among genomes in a graph database. For example, the following command can quickly find all the core KO between Rhodopseudomonas palustris BisB18 and Bdellovibrio exovorus JSS.

Alternatively, with a slightly modified query, we can put the result into a visualization via Neo4j Bloom.

Figure 8. All the common KO between Rhodopseudomonas palustris BisB18 and Bdellovibrio exovorus JSS

With this query under our bell, it is quite easy to generate the unique KO for a genome now. Let’s say we want to investigate the genome of Chromobacterium sp. ATCC 53434. This bacterium belongs to the genus Chromobacterium and it has five other sister genomes in our graph. We want to know which KO are present in ATCC 53434 but not in the others. To do this, we can first generate a list of all KO in those sister genomes as “filter_list”. We then iterate each KO in ATCC 53434 and only return those absent on the filter_list. The Cypher version of this description is:

The “WITH” statement is just like the “|” operator in Unix console. It prepares the results from the first “MATCH” as input for the second “MATCH”. The “COLLECT” command just formats the results (DISTINCT k) into a list so that it can be used by the “IN” inclusion test in line 5.

This query quickly returns a list of 50 KO, below is the truncated version of them:

The results are highly interesting. Firstly, ATCC 53434 curiously has yersiniabactin related KO. Yersiniabactin, as its name suggests, is a siderophore (an iron carrier molecule) found in Yersinia or other enteric bacteria such as Escherichia coli and Salmonella enterica, which all belong to the Gammaproteobacteria. But ATCC 53434 is a member of the Betaproteobacteria. Secondly, the list contains K12241 and K12242 that apparently are related to pyochelin biosynthesis. Pyochelin is also a siderophore. Thirdly, K10829, K23227 and K23228 apparently are involved in the ferric hydroxamate transport, in other words, iron transport. Lastly, K13255 encodes ferric iron reductase protein. In summary, this list suggests that ATCC 53434 differs from its five sister Chromobacterium by having a unique repertoire of iron related transport proteins. As bonuses, the results also suggest that this bacterium uniquely has all the three subunits of acetone carboxylase (K10854, K10855 and K10856), two-component system (K18143 and K18144) and multidrug efflux pump (K18145 and K18146).

Also it is quite easy to find out how similar are these sister genomes to that of ATCC 53434 by sorting the total numbers of their shared KO:

And it returns:

Considering that ATCC 53434 has 1992 KO in total, it appears 93% of them are shared by the closest relative Chromobacterium vaccinii. That is around 7% more than the farthest relative Chromobacterium sp. 257–1.

Currently, there is little information about ATCC 53434. Nothing can be gleaned online about its habitat, phenotype and pathogenicity. So our “one query” analyses above can quickly generate some interesting hypotheses about its phylogeny and physiology. Certainly, only in experiment can such hypotheses be validated. But this hypothesis exercise can greatly focus our experiment and thus saves time and energy.

5. Visualization in Neo4j Bloom

Although Neo4j Browser has a built-in visualization output, it is quite limited in effect and styling. Worse, it hits performance bottleneck fairly quickly. Forgetting “LIMIT” in a query means a freeze because a large number of nodes can all appear on the output cell at once.

Previously, Gephi is the way to go for rendering a large Neo4j graph. But luckily, Neo4j Desktop now comes with Neo4j Bloom. Bloom can take over many of the rendering tasks. In addition, it provides some graph analysis functions such as shortest path between nodes. To generate Figure 8, open Neo4j Bloom in the Neo4j Desktop. Then click “Create Perspective” -> “Use Perspective”. In the “Perspective” sidebar, activate “Search phrases”.

Figure 9. Create a Search phrase in Neo4j Bloom.

In “Search phrase” enter “common KO between Rhodopseudomonas palustris BisB18 and Bdellovibrio exovorus JSS”; In “Cypher query” enter:

Then click “Save” and fold the sidebar. Now start typing in the search input “comm…” and you could use the autocomplete hint to select the phrase we just created. Press “enter” and you should see the interactive graph in its full glory. Finally, drag and drop the nodes to appropriate places and set the styles. Voila, you get a nice visualization of the graph.

A word about network visualization: a visualization with too many nodes and edges is going to nothing but a decoration picture, because it is too dense to show labels. No labels, no details. Therefore, setting a clever limit on the numbers of nodes and edges is the key.

Conclusion

This tutorial is quite long. But it is still too short to demonstrate all the power in Neo4j. I haven’t touched graph algorithm and graph machine learning. Also, its potential in genome analysis is huge. For example, we can do systems biology in a graph database because graph is a more natural way to model biochemistry than tables. Also, we can store our biological knowledge graph in Neo4j. We can even make it into an Arrowsmith, a tool that can connect two seemingly unrelated concepts in the biological literature, such as between magnesium deficiency and migraine, as told by David Epstein in his book “Range”.

So, please try to use graph database to supercharge your research today!

--

--

Sixing Huang
Geek Culture

A Neo4j Ninja, German bioinformatician in Gemini Data. I like to try things: Cloud, ML, satellite imagery, Japanese, plants, and travel the world.