Member-only story
Burrows Wheeler in Python
The magical algorithm to index and compress large string data then find substrings quickly
The Burrows Wheeler Transform (BWT) was developed in 1994 by Michael Burrows and David Wheeler. In simple terms, BWT is a string transformation that acts as a preprocessing step for lossless compression. BWT has implementations that exhibit both a linear O(n) performance and space complexity. Originally designed to prepare data for compression with techniques like bzip2, BWT has found prominence in bioinformatics allowing the fast mapping of short reads paving the way for high throughput genetic sequencing.
In this article, we will implement a simple BWT in python then show how to find small subsegments with mismatches using a streamlined suffix array BWT.
BWT Algorithm:
- Rotate letters: apple becomes [‘eappl’, ‘leapp’, ‘pleap’, ‘pplea’, ‘apple’]
- Sort rotated words alphabetically: [‘apple’, ‘eappl’, ‘leapp’, ‘pleap’, ‘pplea’]
- Take last column: elppa
BWT of apple becomes elppa