Turn Neo4j into a Genome Browser

Gene cluster finding, annotation curation and seqeunce management all in one

Sixing Huang
CodeX
9 min readJun 7, 2021

--

If the 21st century is the Age of Biology (1), then genome sequencing is the harbinger. Genome sequencing basically turns DNA molecules into texts in computers. DNA sequences are stored in simple ASCII text files such as Fasta and Fastq. Biologists then run programs over them to discover proteins (open reading frame predictions). The functions of these predicted proteins can be guessed via their close relatives in the databases (protein annotations, please read my article here and here for more information). Biologists save all these works into GBK or EMBL files like this:

Figure 1. Example of an EMBL file. Image by author.

GBK or EMBL files are still just text files. Their data are in plain view, and that is the only advantage that I can think of. Biologists can import the files into tools such as the classic Artemis for gene visualization. But Artemis is only a GUI program and not a query engine. So it does not allow biologists to perform queries such as:

  1. show the GH13 protein sequences in the genome;
  2. show the five neighbor genes around each GH16 gene;
  3. show all the TonB-dependent receptor and SusD-like protein gene pairs as long as they are less than 10,000 base pairs apart.

In the age of big data, it is a common practice to ingest text files into data warehouses where they are stored, versioned and analyzed. Biologists turned to systems such as GenDB for the task. GenDB, a monolithic piece of software from 2003, relies on a complex web of relational tables to manage genomes, genes and annotations. GenDB highlights the drawbacks of relational databases in genome analyses. Firstly, GBK and EMBL files are semi-structured data and their data fields can vary among different submitters. This makes a rigid schema design difficult. The designer needs to change the schema frequently to accommodate the new fields. Or they resolve to create new tables and keep the old ones intact. As a result, gene properties are scattered on to many different places so that even a simple query requires several “JOIN”. Secondly, relational databases break the linear relations of genes into table entries. This makes it hard to formulate gene cluster finding queries. Last but not least, it scales poorly for large projects and it is not cloud-ready.

Bioinformaticians can certainly take matters into their own hands by crafting some Biopython Jupyter notebooks to answer those questions one by one. But that turns each query into a programming project and shuts many biologists out. Furthermore, this is not a long-term data warehouse solution.

Wouldn’t it be great if we can have a database that:

  1. allows the import and export of the EMBL or GBK files;
  2. reflects the linear relations among genes;
  3. stores gene properties in a NoSQL manner instead of tables;
  4. visualizes the query results;
  5. offers a GUI for manual data editing;
  6. has an easy but powerful query language;
  7. is scalable and cloud-ready.

Although the NoSQL databases can check several items in this list, they do not have a built-in mechanism to represent the linear relations among genes. Moreover, NoSQL returns query results as documents without any graphical representation. That is a deal-breaker for many biologists because they need to inspect the results visually.

The graph database Neo4j fits the bill. Readers of my previous articles (here, here and here) may have already known that Neo4j can power many analytic projects in the bioinformatics. In this article, I am going to show you that it can also serve both as a genome browser and a warehouse. EMBL files can be imported and exported. Graph is a natural and intuitive way to represent the gene arrangement in genomes. Gene properties are stored as key-value pairs inside each node. The graph app Neo4j Commander is a great GUI editor for manual gene annotation. Neo4j uses the simple and yet powerful Cypher language. With it, we can formulate complex queries in really simple statements. Its enterprise version contains Fabric that can shard the graph data for scalability. And for cloud-native users, Neo4j offers a Database-as-a-Service (DaaS) option Aura and some users can host their own database easily with virtual machines in the cloud.

The code for this project is hosted on my Github repository here.

1. Data preparation and import

I only use two genomes for this demonstration: the complete chromosome of Halorhabdus tiamatea SARL4B by Werner et al. and the complete genome of Formosa agariphila KMM 3901 by Mann et al.. Both EMBL files are converted into JSON formats via my script. You can find both the original files and the JSON files in my repository.

Create a project in Neo4j Desktop and enable the APOC plugin. Use the Open folder -> Import menu to open the import folder. Put the JSON files into the folder. Then go up one level to the conf folder, create a text file called apoc.conf and paste these two lines into it:

So your project folder should have a file structure like this:

Figure 2. File structure for this project. Image by author.

Now run the following commands to import the data:

The database has the following simple data structure (schema):

Figure 3. Schema for this project

The Contig node contains all the EMBL metadata, including taxonomy, publication and sequencing information. The Lead node is the first feature in the EMBL file, the source feature. The Gene nodes represent the genes and contain key-value pairs about their functions, genome coordinates and translated protein sequences. Below are two concrete examples:

Figure 4. The first four nodes for each of the two genomes. Image by author.

2. Genome statistics

Once the data are inside Neo4j, it is quite easy to calculate some complex genome statistics. For example, the following query counts the amount of GH in the chromosome of Halorhabdus tiamatea:

The query returns “42” and it agrees with the number reported by Werner et al..

It has been reported that the amino acid usages are different in AT- and GC-rich microbial genomes. We can investigate the raw amino acid counts between the two genomes like this:

The query uses the REDUCE function to aggregate the results across the two genomes. They clearly show that the two branched-chain amino acids leucine (L) and isoleucine (I) are more frequent in Formosa than in Halorhabdus. The two genomes do have something in common: the two sulfuric amino acids: methionine (M) and cysteine (C) appear at the bottom of the list for both.

3. Finding gene clusters

In genomics, the analysis of individual genes is of course important. But sometimes the analysis of gene clusters is more revealing. Gene clusters are genomic regions where functionally related genes are co-located. An operon is a classic gene cluster. Genes within an operon are transcribed and regulated together. Another example is the polysaccharide utilization locus (PUL). It is an ensemble of Carbohydrate-Active enZYmes (CAZymes, read this for an introduction) genes and the membrane transporter genes. Their gene products can depolymerize and mobilize carbohydrate molecules and they thus play a key role in the biorecycling of polysaccharides in our ecosystem.

Finding gene clusters is where Neo4j really shines. For example, the following query searches for gene clusters that are bookended by a SusD-like protein and a TonB-dependent receptor as long as the fragment is shorter than 10,000 base pairs in Formosa agariphila.

Figure 5. TonB-dependent receptors in F. agariphila. Image by author.

The query quickly retrieves 18 such clusters. The TonB-dependent receptor and SusD-like protein together form a transporter that allows the influx of glycans and other molecules. They are PUL components. The amount of such clusters within a genome gives hints about the polysaccharide degradation ability of the bacterium.

Isolated from the green algae Acrosiphonia sonderi, the bacterium F. agariphila can degrade many polysaccharides not only from the green but also from the brown algae. The GH16 CAZymes are responsible for many of its agarolytic activities. And GH16 genes often reside with other CAZymes or TonB-dependent receptors to form PULs. The query below will return all the GH16-centric gene clusters with five neighbors on each side in F. agariphila:

Figure 6. The five GH16 centric gene clusters in Formosa agariphila. Image by author.

The command returns five PULs. Three of them have exactly eleven genes. The fourth has twelve because two GH16 CAZymes are direct neighbors. Finally, a super PUL with 35 genes is revealed. This super PUL contains four GH16 CAZymes: two beta-porphyranases and two beta-agarases. In addition , GH2, GH28, GH29 and GH117 all are present twice in the vicinity. Because algal polysaccharides are sometimes sulfated, F. agariphila has to remove the sulfate group with sulfatases before their degradation. There are four sulfatases in the cluster. Finally, the TonB-dependent receptors, SusD-like proteins and ABC transporters are also scattered within the fragment.

In summary, these clusters contain many carbohydrate degradation genes that characterize the agarolytic lifestyle of F. agariphila. This query also shows us the scale of the enzymatic teamwork behind a simple “algal polysaccharide degradation” activity. Some further research into these clusters can deepen our understanding of the process or even lead to industrial application.

4. Data Editing

Neo4j not only displays information, it also allows editing. This is especially important for biologists because they often need to manually correct genome annotations. The app Neo4j Commander in the App Galary is a handy tool for this purpose.

Figure 7. The Neo4j Commander app adds editing functionality to Neo4j.

You can follow the instruction in Neo4j Desktop to install the app. Once installed and opened, we can add, remove and edit the gene properties in its GUI.

Figure 8. Gene annotation editing in Neo4j Commander.

As shown in Figure 8., we can first use a query to locate the gene entries. Then we can click on the property fields and start editing. And the changes will be written back to the database when we click “SAVE”.

5. Export

Finally, our Neo4j genome warehouse can export the sequences into Fasta or EMBL files as well. And it even supports cross-genome export. For example, if we want all the GH3 sequences of both H. tiamatea and F. agariphila for a comparative genomic analysis, the following command returns them in one go:

Figure 7. GH3 sequences are retrieved by a simple query. Image by author.

The query above capture the “GH3” keywords in product names via regular expression. We can export the results into a Json file by wrapping the query inside the apoc.export.json.query function:

The Json file is located in import folder as well. It can then be converted into a Fasta file via a Python script from my repository.

We can also export the genome as a whole back into an EMBL file after manual curation:

This command writes a Json file. My Python script json_2_embl.py can then convert it into an EMBL file.

Conclusion

This article demonstrates how Neo4j can become a powerful genome browser and a data warehouse at the same time. Currently, online databases such as the National Center for Biotechnology Information (NCBI) or the European Bioinformatics Institute (EBI) still host genome text files. Their websites allow some simple full-text searches. But complex queries such as the ones presented in this article are not yet possible there. If the two institutes can use Neo4j as their backend warehouse, users can execute more intricate queries and perform comparative analyses on the server side. In fact, they can even store metagenomes in Neo4j with the same design. I am convinced that this transition will accelerate new discoveries in genomics and metagenomics.

However, my Neo4j design is not perfect. I have attempted to store several MB worth of DNA sequence into the Contig nodes. To my surprise, the bloated Contig nodes degraded the database performance severely. Therefore I have decided to omit the import of the DNA sequences into the graph. That is the reason why the json_2_embl.py script needs the original EMBL file. I hope that the future version of Neo4j can address the performance issue.

Do you also use Neo4j for your genomic research? If yes, please share with me about your experience.

--

--

Sixing Huang
CodeX

A Neo4j Ninja, German bioinformatician in Gemini Data. I like to try things: Cloud, ML, satellite imagery, Japanese, plants, and travel the world.