TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Member-only story

Burrows Wheeler in Python

4 min readSep 9, 2022

--

Photo by SOULSANA on Unsplash

The Burrows Wheeler Transform (BWT) was developed in 1994 by Michael Burrows and David Wheeler. In simple terms, BWT is a string transformation that acts as a preprocessing step for lossless compression. BWT has implementations that exhibit both a linear O(n) performance and space complexity. Originally designed to prepare data for compression with techniques like bzip2, BWT has found prominence in bioinformatics allowing the fast mapping of short reads paving the way for high throughput genetic sequencing.

In this article, we will implement a simple BWT in python then show how to find small subsegments with mismatches using a streamlined suffix array BWT.

BWT Algorithm:

  1. Rotate letters: apple becomes [‘eappl’, ‘leapp’, ‘pleap’, ‘pplea’, ‘apple’]
  2. Sort rotated words alphabetically: [‘apple’, ‘eappl’, ‘leapp’, ‘pleap’, ‘pplea’]
  3. Take last column: elppa

BWT of apple becomes elppa

Simple Implementation of BWT

--

--

TDS Archive
TDS Archive

Published in TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Cody Glickman, PhD
Cody Glickman, PhD

Written by Cody Glickman, PhD

Currently a biological data scientist blogging about side projects and things learned through brute force. https://codyglickman.com/

No responses yet