Building a Bacterial Genome Language Model

Serge Mankovski
Apr 28, 2019


The bacterial genome does seem to have a language to it, and this language is shared across a large number of genomes from the bacterial kingdom.

For the past few weeks I have been exploring language models of the bacterial genome. I came across a GitHub repository shared by Karl Heyer that applies the ULMFit text classification technique to genomic sequence data.

ULMFit is a deep neural network architecture and training method developed by fast.ai. It achieved state-of-the-art results in text classification tasks circa early 2018. Karl Heyer’s Genomic-ULMFit repository on GitHub contains Jupyter notebooks comparing the performance of the fast.ai approach with other published results on promoter, metagenomic, and enhancer classification tasks. The results presented in the notebooks show that ULMFit is competitive with other published methods, sometimes exceeding them in classification accuracy.

The ULMFit method requires training a language model that is then employed for transfer learning in the classifier. A language model is a system that predicts the next word in a sequence of words. Such models can be trained on a corpus of text without any labels. Technically this falls into the category of unsupervised learning, but since the next word in a sequence can be used as a label, the labels are contained within the text itself, and the usual techniques of supervised machine learning can be applied. It is amazing that something as simple as that can capture properties of the language it is trained on. Language models are used extensively throughout NLP.
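To make the self-supervision concrete, here is a minimal Python sketch with a made-up token list. It shows that the training pairs come straight out of the text: the input is a window of tokens, and the label is simply the token that follows.

```python
# A made-up token sequence; in ULMFit the tokens would come from real text.
tokens = ["the", "genome", "has", "a", "language", "of", "its", "own"]

window = 3  # length of the input context
pairs = []
for i in range(len(tokens) - window):
    context = tokens[i:i + window]   # input: a window of tokens
    target = tokens[i + window]      # "label": simply the next token in the text
    pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)
```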

In the case of a genomic sequence, the next “word” would be the next n bases in the sequence. With respect to genomics, language models have appeared in the literature since 2015 in the context of quality control of genomic sequencing and as an instrument to aid sequence alignment.
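One common way to turn a raw DNA string into such “words” is to slide a fixed-length window (a k-mer) along the sequence. The sketch below illustrates the idea; the k-mer length and stride are arbitrary values chosen for illustration, not the settings used in Genomic-ULMFit.

```python
def kmer_tokenize(seq, k=5, stride=2):
    """Split a DNA sequence into overlapping k-mer "words".

    The values of k and stride here are illustrative; they control the
    vocabulary size and how much the resulting "words" overlap.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

print(kmer_tokenize("ATGCGTACCGTTAGC"))
# ['ATGCG', 'GCGTA', 'GTACC', 'ACCGT', 'CGTTA', 'TTAGC']
```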

Since a language model is required by the ULMFit method, Karl Heyer embarked on building several models for a set of 91 genomic sequences drawn from diverse families of bacteria taken from the NCBI Genomic Assembly Database. The models were trained in two distinct ways: (1) as a universal model across all 91 genomes and (2) as a model for a specific family of bacteria. Karl discovered that a language model trained on bacterial genome sequences from the same family achieves, on average, twice the accuracy of the universal language model. The language model built across multiple families of bacteria trained to an accuracy of 0.18, whereas the same model trained on genomes within a single family achieved an accuracy of 0.36.
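For orientation, training a language model of this kind can be set up with the fastai v1 text API roughly as sketched below. This is not Karl Heyer’s actual pipeline: it assumes the genomes have already been tokenized into space-separated k-mer “words” stored in a dataframe column named text, and the file names and hyperparameters are placeholders.

```python
import pandas as pd
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

# Assumption: train_fragments.csv / valid_fragments.csv are hypothetical files
# holding one pre-tokenized genome fragment per row, with the k-mer "words"
# separated by spaces in a column named "text".
train_df = pd.read_csv('train_fragments.csv')
valid_df = pd.read_csv('valid_fragments.csv')

data_lm = TextLMDataBunch.from_df(
    path='.', train_df=train_df, valid_df=valid_df, text_cols='text',
)

# Train an AWD-LSTM language model from scratch; there are no pretrained
# genomic weights to start from, hence pretrained=False.
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
learn.fit_one_cycle(1, 1e-2)   # one epoch; the reported accuracy is next-"word" accuracy
learn.save('bacterial_lm')     # these weights are what transfer learning reuses
```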

This result invites the conclusion that different families of bacteria share some genetic commonality, but that the families diverge significantly in their language. However, it is not clear whether the generic language model of the bacterial kingdom performs poorly because of inherent differences between the genomic sequences or simply because of a small sample size. Despite the seemingly low accuracy of the language models, Karl Heyer demonstrated that they are useful for transfer learning within a family of bacterial genomes.

I became curious and set out to find out what accuracy a universal bacterial language model can reach using 13,000 bacterial genome samples that I downloaded from the NCBI database. Training the model on a larger corpus creates technical difficulties. A larger computing cluster with many GPUs and CPUs would help, but I do not have access to one. I used a gaming computer with an Intel i7 CPU, an NVIDIA RTX 2070 GPU, and 32 GB of RAM. I managed to train one epoch on 500 genomes within 6 hours.
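The corpus therefore has to be fed to the model in pieces. The pattern below is a rough sketch of chunked training with fastai v1, not my exact pipeline: it assumes hypothetical chunk files of roughly 500 pre-tokenized genomes each, and it reuses the vocabulary built on the first chunk so that the saved weights remain compatible from one chunk to the next.

```python
# Hedged sketch of chunked training: not the exact pipeline, just the pattern.
# Assumes hypothetical files chunk_00.csv, chunk_01.csv, ... each holding
# roughly 500 pre-tokenized genomes, with k-mer "words" in a "text" column.
import pandas as pd
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

vocab = None
for i, chunk in enumerate(['chunk_00.csv', 'chunk_01.csv', 'chunk_02.csv']):
    df = pd.read_csv(chunk)
    split = int(len(df) * 0.9)           # simple 90/10 train/validation split
    data_lm = TextLMDataBunch.from_df(
        path='.', train_df=df[:split], valid_df=df[split:],
        text_cols='text', vocab=vocab,
    )
    vocab = data_lm.train_ds.vocab       # fix the vocabulary after the first chunk

    learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
    if i > 0:
        learn.load('bacterial_lm')       # continue from the previous chunk's weights
    learn.fit_one_cycle(1, 1e-2)
    learn.save('bacterial_lm')
```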

By the time the training reached four epochs on a corpus of 4,000 genomes, model accuracy had improved from 0.18 to 0.26. The training took about two weeks, with the RTX 2070 GPU running at 99% utilization.

I have not yet reached the goal of training on the full corpus of downloaded genomes, but I made some observations along the way.

  1. Every time I added another 500-genome chunk to the training pipeline, the accuracy of the resulting language model increased significantly. This runs counter to the intuition that genome variety is detrimental to language model quality. The model seems to generalize well from one chunk to another.
  2. It is possible that training on the full set of 13,000 genomes would match the accuracy of models built for a single family of bacteria, or even exceed it, simply because of the sheer amount of data used in training.
  3. This is intriguing and worth exploring further. I am certainly planning to continue training these models when I have more time and computing resources available.
