Classic Computer Science Problems in ̶P̶y̶t̶h̶o̶n̶ Scala — Trivial Compression

Brian Schlining
Mar 19 · 2 min read

I recently purchased the book Classic Computer Science Problems in Python to help me brush up on my python. As I’m working through the Python code in the book, I’m taking some time to also port the Python to Scala. Below is an example the illustrates trivial compression for efficiently storing genes. A little science refresher, genes are typically made of Five nucleobases — adenine (A), cytosine ©, guanine (G), thymine (T), and uracil (U). Uracil (U) and thymine (T) are very similar except uracil is found in RNA, thymine (T) replaces uracil in DNA.

Below I’ll show the example code from the book for compressing the in-memory representation of a DNA sequence, which can have bases of A, C, G, or T. For those new to computer science, compression techniques allow you to use less memory to store information. Since we only have four possible nucleotide values we can encode them using 2 bits per character: 00, 01, 10, or 11. This will compress the same DNA information into at least 1/4 of the original memory, depending on the character set used for your String encoding.

Python

In this code, each nucleotide is converted to its corresponding 2-bit representation and prepended to bit_string . To decode it, we reverse the mapping; we decode each 2-bit chunk of the bit_string back to the corresponding character. Python makes this possible as its integer type just keeps expanding and can be as large as you need it to be. We can effectively add bits to it until our computer runs out of memory.

# Usage
from sys import getsizeof
original: str = "TAGGGATCTAACCGTTATATATATATAGCCATGGATCGATTATATAGGG" * 100
print("original is {} bytes".format(getsizeof(original)))compressed: CompressedGene = CompressedGene(original)print("compressed is {} bytes".format(getsizeof(compressed.bit_string)))print(compressed)print("original and decompressed are same size: {}".format(original == compressed.decompress()))

Scala

In Scala, I’m using essentially the same type of compression. Instead of a magically expanding integer value, like Python has, Scala provides a BitSet that stores arbitrary sequences of bits. A BitSet is an interesting class. You flip a bit by using its add method. For example, bitSet.add(1) flips the bit at position 1 in the BitSet. You then can query if a bit is flipped or not with the apply method, e.g. bitSet(1) == true. Note that in order to decode the bit set we have to track the number of nuleotides in our original DNA sequence. I use a case class to store the BitSet and nucleotide length with each other.

Disclaimer: I’ve used an imperative style below rather than a more recursive or functional style of programming. This was done to keep the code as conceptually-similar to the python as possible.

// Usage
val original = "TAGGGATCTAACCGTTATATATATATAGCCATGGATCGATTATATAGGG" * 100
val compressed = CompressedGene(original)println(compressed)

Brian Schlining

Written by

Polyglot coder. Deep-sea Researcher. Zazen aficionado. I think squids are pretty cool.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade