Discovering bacterial histones: Wrapping our heads around dogma-defying surprises in nature

Basecamp Research
Feb 8, 2023

Histones were previously thought to be a firmly eukaryotic feature. Two weeks ago, researchers from London and the US published evidence of histone proteins in bacteria.

We wanted to know how many more examples like this exist in nature, so we queried our huge metagenomic database of global biodiversity. Turns out, lots!

BaseData: Our proprietary knowledge graph.

In our last Medium article, we commented on the rapid progress in protein AI: the use of diffusion models for protein design, their application to docking in tools such as DiffDock, and conditional language models that make protein design more controllable.

There is one actor in all of this who has been keeping us on our toes at least as much as the progress in protein AI, and that is nature itself. For decades, histones, the proteins that wrap DNA around themselves to form chromatin, were thought not to exist in bacteria. That changed two weeks ago: a preprint published in January this year showed that histones do in fact encase DNA in bacteria, and provided structural evidence for it. A commentary in Nature called this a ‘dogma-defying discovery’, because it is just that.

Many of the dogmas we are taught in the Life Sciences, and the conventions we base our scientific process on, stem from what we have discovered about how life on Earth works. But these conventions hold only for what we have discovered so far, so it is worth pointing out what we have not. Estimates of the number of species on our planet go as high as 1 trillion. Compare that with NCBI, where two thirds of all microbial genomes originate from just 12 species.

Basecamp Research’s BaseGraph™ has been built to bridge this discrepancy: larger than public databases and with 4x greater diversity, it links novel genomes and proteins to their chemical and biological environments through more than 2.5 billion relationships. Going back to the ‘dogma-defying’ bacterial histones, we were therefore excited to see a large diversity of these newly characterised proteins in our knowledge graph, originating from more than 100 different bacterial assemblies in addition to the 2 described in the preprint. Below we show a subsection of the histones in our knowledge graph, alongside a structural superimposition of one of our bacterial histones onto the reference protein mentioned in the preprint.

A: Representative subset of BaseGraph displaying histone ORFs (yellow), the genomes they originate from (purple) and taxonomic annotations (cyan). B: Superimposition of 1 of the >100 bacterial histones from BaseGraph (green) onto the reference protein Bd0055 from Hocher et al. (2023, orange). Sequence identity is 49%; RMSD = 0.4 Å.
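For readers less familiar with the two numbers quoted in panel B, here is a minimal Python sketch of what they measure. It is not the pipeline behind the figure: the article does not say which alignment tool, identity denominator, or atom selection was used, so the function names, the gap-excluding identity convention, and the toy inputs below are all assumptions made for illustration. The RMSD function also assumes the two structures have already been superimposed and that the atoms are paired one to one.

```python
import math


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: gapped columns are excluded from the denominator. Other
    conventions (e.g. dividing by the shorter sequence length) give other values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def rmsd(coords_a, coords_b) -> float:
    """Root-mean-square deviation between paired 3D coordinates.

    Assumption: the structures are already superimposed, so no rotation or
    translation is fitted here. Units follow the input (Angstrom for PDB files).
    """
    if len(coords_a) != len(coords_b) or len(coords_a) == 0:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy inputs, invented for illustration only (not real histone data).
toy_identity = percent_identity("MKT-AYLA", "MKSGAY-A")
toy_rmsd = rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)])
print(f"identity: {toy_identity:.1f}%  RMSD: {toy_rmsd:.2f}")
```

A sequence identity of around 49% combined with an RMSD well under 1 Å, as in panel B, is the pattern you would expect when two proteins have diverged in sequence while keeping essentially the same fold.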

The discrepancy between the unimaginable vastness of undiscovered biodiversity, genomes, and (protein) sequence space and the redundancy of what we have seen so far means that we are probably in for a lot more ‘dogma-defying’ surprises. It also means that AI applications in the Life Sciences, which rely on the sparse data we have so far, need to be tied more closely to the discovery process itself. With 6x growth in the size of our knowledge graph during the 2nd half of 2022, and expeditions across multiple continents, Basecamp Research is on exactly this mission.
