Genetics and CS, Part 3

Dictionaries

Jason Oswald
Learning Computer Science with Swift
9 min readNov 16, 2017

--

In this part, we are going to learn about a new data structure. Our first data structure was an array, and our new data structure, a Dictionary, shares some properties with arrays, but as you’ll see, it is a very different data structure.

Translating RNA to Protein

Resources

Project Rosalind, Apple Documentation on Dictionaries, We Heart Swift

This problem seems pretty straight forward: we need to, three at a time, go through a sequence of RNA and translate the three nucleotides into one amino acid. Since we are clever, we could set up an enumfor our amino acids:

Note that I’ve defined a Bioinformatics error here. You’ll need to add that, or your own, to make this code work.
// example usage
let myAminoAcid = AminoAcid.from( "M" )

Quick Side Bar

You’ll note the bizarre allValues property I’ve added. This makes the following possible:

for aa in AminoAcid.allValues {
print( aa )
}

Though, it isn’t particularly useful for this problem.

The Algorithm

Now this is pretty straight forward

func proteinString() throws -> [AminoAcid] {  // create an array that will contain AminoAcid values
// go through strand three values at a time
// figure out the amino acid for those three values
// append that amino acid to our array
// return the array
}

which we can quickly translate into

func proteinString() throws -> [AminoAcid] {
var proteinString = [AminoAcid]()
for // something {
// figure out amino acid
proteinString.append( aminoAcid )
}
return proteinString
}

We have two problems, the first is we haven’t ever really needed to get stuff three at a time. There are two different simple approaches, I think; I will demonstrate each

stride

for i in stride( from: 0, to: strand.count - 1, by: 3 ) {
let value1 = strand[i]
let value2 = strand[i+1]
let value3 = strand[i+2]
let codon = [value1, value2, value3]
}

Here the values of iwill be 0, 3, 6,… thanks to strideand we can get the intermediate values as shown.

prefix and removeFirst

var workingSequence = sequence 
while !workingSequence.isEmpty {
let codon = workingSequence.prefix(3)
workingSequence.removeFirst(3)
}

Here we just chunking off elements three at a time.

How you want to end up with an array of three nucleotides is up to you.

Mapping Values to Other Values

Now that we’ve defined our amino acids, and determined how to get a codon (we are representing this as an array of three nucleotides), we just have to figure out how to determine which amino acid results from which codon. We could actually solve this problem like this:

func aminoAcid( for codon: [Nucleotide] ) -> AminoAcid {
if codon == [.U, .U, .U] {
return .F
} else if codon == [.U, .U, .C] {
return .F
} // ... and so on
}

Or if you wanted to be a bit more clever:

func aminoAcid( for codon: [Nucleotide] ) -> AminoAcid {
switch codon {
case [.U, .U, .U]: return .F
case [.U, .U, .C]: return .F
// ...and so on
default: //?
}
}

And there is really nothing wrong with that, though I think performance is an issue (every search for a codon requires sifting through all of the cases, which takes time). What we’d rather be able to do is do what we’d do with the codon table, just look up the codon and see what the amino acid is. We can achieve this sort of functionality for the computer using a dictionary.

From Welcome to Swift

A dictionary stores associations between keys of the same type and values of the same type in a collection with no defined ordering. Each value is associated with a unique key, which acts as an identifier for that value within the dictionary. Unlike items in an array, items in a dictionary do not have a specified order. You use a dictionary when you need to look up values based on their identifier, in much the same way that a real-world dictionary is used to look up the definition for a particular word.

In our case, the key for us is our codon– a GeneticSequence created from an array of three nucleotides– and the value is the associated amino acid. Here’s how it works:

// declare our dictionary along with <keyType, valueType>
var codonTable: Dictionary<GeneticSequence, AminoAcid>
// here we can initialize the dictionary with some values
codonTable = [
GeneticSequence( kind: .RNA, fromStrand: [.U,.U,.U]): .F,
GeneticSequence( kind: .RNA, fromStrand: [.U,.U,.C]): .F
// and so on...
]
// we can also add values
codonTable[GeneticSequence( kind: .RNA, fromStrand:[.U,.U,.A])] = .L
// and retrieve values
let codon = GeneticSequence( kind: .RNA, fromStrand:[.U,.U,.A])
let aminoAcid = codonTable[codon] // aminoAcid should be .L

Unfortunately, to get this to work, we have to make some small changes to GeneticSequence so that it is able to act as a key into a dictionary. I’ll go into the details below, but all you need to do is add this as a new file to your project:

All that’s left is set up your table, complete the algorithm, and troubleshoot. You can set up your table either in the function call itself, as a property, or in a new struct.

Uh, I Have a Question

Wouldn’t it make sense to define an enum or a struct for RNA Codons?

It might, though I decided that logically connecting nucleotide’s with an array was sufficient for this problem.

Common Ancestor

Resources

Project Rosalind

In this problem, you take a bunch of sequences and construct a new sequence that is made up of the most common nucleotide at each position. You also need to print out a table of the results at the end.

At first, the need for a dictionary might not be apparent. Indeed, this problem is solvable using only arrays, since ultimately what we need to do is construct a table (2D array) and determine maximums. However, much of your code will be spent translating positions in those arrays to nucleotides. Not impossible, but not elegant either.

Further, it doesn’t make sense for us to shoehorn this functionality into our DNAStrand. Rather, this is a new concept– we are constructing a profile of a bunch of DNA strands. A new noun, typically means a new struct or enum (or class or protocol).

init

Let’s set up our profile data structure.

Think about how this problem will work… we are going to look at a bunch of sequences, store how many times each nucelotide occurs at a given position, determine the most frequent occuring nucleotide at each position, and then construct a new sequence from those most frequently occuring nucleotides.

What this implies is that we need to store a set of nucleotide counts for each position of a strand and that the order matters. The most sensible data structure to choose when order matters is an array, plus that makes it easy for us to map a position in a sequences strand to the position in the counting data structure (since they are both arrays of the same length).

But, what data type will we use to store the counts?

positionCounts = [WHAT TYPE](repeating: INITIAL VALUES, count: length)

The data type needs to store counts for each nucleotide. At a minimum, it needs to store four Intvalues. We could use an array:

positionCounts = [Int](repeating: [0,0,0,0], count: length)

but then we’d have know which position in the array mapped to which nucleotide and it wouldn’t be obvious to someone reading the code what was happening.

A much more sensible solution is to use a Dictionary which maps a Nucleotide to an Int.

positionCounts = [Dictionary<Nucleotide,Int>](repeating: [.A:0, .C:0, .G:0, .T:0], count: length)

You also need to add the positionCounts property to the struct (outside of the initializer):

var positionCounts: [Dictionary<Nucleotide, Int>]

add

mutating fund add( _ sequence: GeneticSequence ) 

When we are given a GeneticSequence, we need to go through it one position at a time and increment the appropriate nucleotide counter. This is now a pretty straightforward problem:

for i in 0..<positionCounts.count {
let nucleotide = sequence.strand[i]
positionCounts[i][nucleotide] += 1
}

The second line looks like a call to a two dimensional array, but it is actually a call to a one dimensional array that contains a dictionary. We use ito reference the dictionary for a given position, and then the nucleotide to reference the appropriate nucleotide.

This code will fail to compile, however.

Optionals

We’ve carefully avoided the concept of optionals up to this point, and, for the most part, we will continue to do so. However, in using dictionaries, we must deal with optionals. The issue is that we can pass whatever key we want into a dictionary to get information out, even if the key isn’t part of the dictionary. The question that must be addressed is: what happens when we pass in a key that isn’t in the dictionary? Different languages handle this situation differently, but the Swift way is to return nil. Without getting too much into, nil is a value that isn’t a value (see, told you you didn’t want to get too much into it) and it needs to be treated specially.

Another way to put this is, when you ask a dictionary for a value with a key, it optionally gives you one. If the key is present, you get a value, if they key isn’t present, it gives you nil.

Instead of handling nil values, however, swift gives us the option to assure the compiler that we won’t be encountering them. Since we know that all of the keys we would be asking for are already in the dictionary , we can force unwrap the value coming out of the dictionary using an exclamation point.

for i in 0..<positionCounts.count {
let nucleotide = sequence.strand[i]
positionCounts[i][nucleotide]! += 1 // force unwrap
}

Of note, if we force unwrap something that could be nil and the dictionary actually returns nil, even though we assured the compiler it wouldn’t, the program will crash.

profileTable

This computed value produces a table of all the counts for each nucleotide at every position. Nothing too fancy here.

consensusString

Here you need to find the maximum value at each position, and the amino acid that it relates to. This is not an area of strength for dictionaries, but the problem is easily solved using some bizarre swift syntax:

aDictionary.max{ a, b in a.value < b.value }.key

The issue here is that a dictionary doesn’t necessarily know what you mean by largest. The code inside of the curly braces says that, if you have two values, a and b, then a is less than b if a’s value is less than b’s. So, we look through all of the keys, compare their values in this way, and then return the key .key.

Curio

An interesting side note is that we got lucky in that we are mapping Nucleotide keys to Int values. The reason is that the computer already knows how to compare integers. Since anything can be a value, we could end up in a situation where the computer wouldn’t know how to compare two values. In this case, we’d have to be even clearer with our code. We will address this problem another day.

Good luck!

Refactoring and Swift Oddities

Now that we’ve slogged through learning about dictionaries and drug our code to a functioning level, let’s do some things to make our lives easier and learn about some of Swift’s little quirks along the way.

subscript

Swift programmers, and other folks, calls using the square brackets subscripting. The term comes from the math world, but it really means referring to a specific part of a data structure. With arrays, the subscript is the index, and with dictionaries, the subscript is the key. However, swift gives us the power to subscript anything. So, for instance, we could clean up our code a bit by defining a subscript for our GeneticSequence struct.

subscript( index: Int ) -> Nucleotide {
return strand[index]
}

The functionality isn’t any different, but using our struct just got a lot easier. Instead of having to type something like mySequence.strand[index] we can now type mySequence[index]. It also hides the underlying implementation details from the users of our structures. This notion is called encapsulation, and we’ve been the beneficiaries of it frequently. We often don’t know exactly how certain classes are working, but we know how to use them. For most people who would use our GeneticSequence structure, it would be annoying to know that they have to reference an array to get at an element of the sequence. For us, it just saves us a bit of typing.

Side bar

If you did want to update your GeneticSequence struct to utilize the subscript function, you might also want to add a computed variable called count which would simply return strand.count. That way, a user won’t have to deal with this oddity:

for i in 0...mySequence.strand.count {
print( mySequence[i] )
}

But can instead enjoy the fully encapsulated experience of:

for i in 0...mySequence.count {
print( mySequence[i] )
}

As it turns out, though, you can subscript anything. A much more tidy use of subscripting is for the codon table you constructed. Assuming you created a struct to house your codon table,

// something like
struct CodonTable {
var table: Dictionary<GeneticSequence, AminoAcid> = [
GeneticSequence( kind: .RNA, fromStrand: [.U,.U,.U]): .F,
GeneticSequence( kind: .RNA, fromStrand: [.U,.U,.C]): .F
// and so on...
]
}

You could now add the subscript function to pass in a codon and return the appropriate AminoAcid so that you could use your CodonTable like so:

let myCodonTable = CodonTable()
let codon = GeneticSequence( kind: .RNA, fromStrand: [.U,.U,.U] )
let aminoAcid = myCodonTable[codon] // potentially dangerous

instead of messing around with the aminoAcid( for: ) function. You will need to figure out how to deal with nil values, though, since you are now referencing the dictionary directly.

--

--