Note: This article has been updated after bug fixes were done the day after I reported this on the mailing list. Please refer to the update at the bottom.
A well known method for indexing and retrieval of full text is the suffix tree, otherwise known as a patricia trie or radix trie. Among the many potential applications of the trie and suffix trie/tree they are especially useful in bioinformatics for efficient searching with mismatches. Being in a proteomics lab, suffix tries are a perfect data structure for indexing and searching proteomes today. Desktop computers can be equipped with 32gb of memory for little cost, which happens to be just enough to store a fast index of the entire proteome in memory. This allows for a one-time indexing and subsequent constant time searches.
Since I use mostly python in my scientific work, I set out to find a python library to use to create a suffix trie for my proteome. Pure python implementations of suffix tries are not memory-efficient enough, so most trie libraries are C/C++ with python wrappers. I recently came across a blog of several advanced data structure implementaions in python and decided to take the time to replace my suffix array implementation with a suffix trie. Here is a collection of tries that I attempted to use and the results.
BioPython For some reason, this implementation serializes to disk fine, but throws an exception on loading. Also, there’s a bug in the with_prefix function which I need to search the suffix trie with. I’m working on patching this myself.
CharTrie This library only supports strings as keys and ints as values. I mapped proteins to ints so that I could use this library. I couldn’t get my proteome to fit in memory with this library.
DATrie Insertions into this trie was too slow. For some reason when I got to the 3rd sequence, it took a few minutes to insert all suffixes. This would amount to months before the trie would finish building.
Marisa-Trie Difficult to build the trie since it’s read-only. This requires reading the entire proteome into memory, generating all suffixes and inserting them one by one into the trie with the correct sequence to protein mappings. There’s no point in pursuing this since it requires generating a suffix array and then inserting it into the trie.
Py-Radix Isn’t general purpose, only useful for IPv4/6 addresses.
It’s disappointing to discover that there doesn’t seem to be an implementation that works at the scale that indexing a proteome requires. At one point there was an issue filed to add a trie container to python, which in my opinion was rightfully rejected, but it seems like there isn’t a good library yet. Using Biopython’s implementation was the closest I got to building my trie. Time to dust off some of my old C and GDB knowledge and debug their implementation.
Update: Props to the original author of the Biopython trie module, Jeff Chang. He provided a fix the day after I reported this on the mailing list. I’ve come to use this implementation in my code. Serialization isn’t that big of an issue for me, as building the trie takes around 30 seconds.
I implemented my suffix trie with the trie included in Biopython. It’s flexible and fast. I’ve chosen to use lists as values since a suffix can map to the suffix of more than one sequence. I’m currently using it index around 100,000 protein sequences and it takes around 30 seconds to build the trie. It’s also reduced my memory consumption from ~22gb with a suffix array to ~10gb, more than a 50% reduction!
Originally published at kevinformatics.com on 2013–01–30