The Art in Our Genes
An Exploration of Evolutionary & Academic Diversity
How a TED talk led to scientific (re‑)discovery, and revealed a delicate artistic representation of our evolutionary history.
THERE ARE FEW CONCEPTS that don’t interest me, and I strongly believe that a broad exposure to an eclectic mix of ideas is the most important education that one can obtain. Perhaps paradoxically, diversity provides the ingredients at the core of novelty; there may be nothing new under the sun, but we still have much work to do when it comes to mixing and matching.
It was thanks to Johnny Lee’s Wii-remote hacks that I discovered TED talks — video complements to my incessant search for knowledge-tidbits that was, before TED, only satiated by the likes of Dawkins and Gladwell. From time to time I’ve had the joy of discovering a real gem of a talk.
Watch the quintessential mad scientist or the garden-hose clarinet (which I was fortunate enough to experience in person).
Chris Domas is a cybersecurity researcher who showcased his approach to unraveling the meaning behind enigmatic binary data. Humans are masters of pattern detection — so much so that we find patterns where none exist — and our visual cortices are virtuoso performers in this domain. Domas’ insight lay in harnessing this capability to decode masses of otherwise incomprehensible information.
Some Googling revealed that his approach involved mapping neighbouring pieces of data to geometric objects, the simplest example being the scatter plot. The text that you are reading is, to a computer, nothing more than a sequence of numbers; although the mapping from each letter can be arbitrary, standards exist that allow for interoperability between platforms. An early standard, ASCII, represents TED as 84–69–68 (or, with each value written as a 7-bit binary number and run together, 101010010001011000100). In a simplified version of Domas’ approach, each pair of neighbouring values becomes a point, so TED would be a 128×128 pixel image with points at the coordinates (84,69) and (69,68). Although this hardly makes for an exciting image, watch his talk to see the wonderful patterns that he revealed in massive data sets.
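The TED example can be reproduced in a few lines. This is only a minimal sketch of the simplified mapping described above, assuming plain 7-bit ASCII input; the function names `digraph_points` and `to_binary` are hypothetical and are not taken from Domas’ own tools.

```python
def digraph_points(text: str) -> list[tuple[int, int]]:
    """Map each pair of neighbouring characters to an (x, y) point."""
    codes = [ord(ch) for ch in text]  # numeric code of each character
    if any(code > 127 for code in codes):
        raise ValueError("expected 7-bit ASCII input")
    # Pair every value with its right-hand neighbour.
    return list(zip(codes, codes[1:]))


def to_binary(text: str) -> str:
    """Write each character as a 7-bit binary number and run them together."""
    return "".join(format(ord(ch), "07b") for ch in text)


print(digraph_points("TED"))  # [(84, 69), (69, 68)]
print(to_binary("TED"))       # 101010010001011000100
```

Plotting every such point for a large file, with brightness showing how often a pair occurs, gives the kind of 128×128 image described above.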
WE HAVE REACHED A POINT in our exploration of the genome at which reading the code is no longer the problem; rather, it is understanding the complexity of the DNA recipe that continues to puzzle us. This is, in part, due to the sheer volume of data: in its molecular form the human genome holds roughly as much data as a single audio CD, and a typo that inverts a single bit (a binary 0 or 1) may have dire consequences (you may know these typos as mutations, but as this term has a specific meaning I will use the more pedantic and broad term, variants).
How accurate is this measure? How big is the human genome? Reid J. Robison puts it succinctly: it depends.
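As a rough back-of-the-envelope check of the CD comparison, here is a minimal sketch. The figure of about 3.1 billion base pairs for a single copy of the genome and the 700 MB nominal capacity of a CD are assumptions brought in for illustration, not numbers from this article, and the true size depends on what you choose to count.

```python
BITS_PER_BASE = 2  # four possible bases (A, C, G, T) fit in two bits


def genome_megabytes(base_pairs: int) -> float:
    """Size of a sequence in megabytes at two bits per base."""
    return base_pairs * BITS_PER_BASE / 8 / 1_000_000


ASSUMED_BASE_PAIRS = 3_100_000_000  # assumption: approximate single (haploid) copy
ASSUMED_CD_MEGABYTES = 700          # assumption: nominal capacity of a typical CD

size = genome_megabytes(ASSUMED_BASE_PAIRS)
print(f"{size:.0f} MB, about {size / ASSUMED_CD_MEGABYTES:.1f} CDs")
```

Under those assumptions the answer comes out at 775 MB, a little over one CD.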
It’s moments like these that I rely on the conceptual diversity to which I referred earlier (and on Minties; it’s an Aussie thing). A quick mock-up to replace digital data with its genetic counterpart and a secret is revealed.
The simplest question probes what we see, but the interesting question probes why we see this very obvious pattern — first let’s understand how.
In the same way that computers encode data in a binary (i.e. base-2) format, the genome is encoded in base-4, and the two are easily interchangeable because each base can be written as two bits. The mapping is again arbitrary, but we can consider the genetic bases to be A=00, C=01, G=10, and T=11. The image to the left is a mapping of the dystrophin gene with each pixel’s intensity representing the relative frequency of neighbouring pairs of codons (i.e. triplets of the bases, such as ATG=001110=14); the bottom left pixel corresponds to the frequency of AAA followed by TTT. It turns out that the pale banding corresponds to all of the 2×3-nucleotide regions containing CG.
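The encoding can be sketched as follows. This is an illustration under stated assumptions rather than the code behind the image: it assumes an upper-case sequence containing only A, C, G and T, reads codons in a single non-overlapping frame starting at the first base, and leaves the choice of which index goes on which axis to the plotting step. The names `codon_index` and `codon_pair_counts` are hypothetical.

```python
BASE_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def codon_index(codon: str) -> int:
    """Read a three-base codon as a 6-bit number, e.g. ATG -> 001110 -> 14."""
    value = 0
    for base in codon:
        value = value * 4 + BASE_BITS[base]  # shift by two bits, add the base
    return value


def codon_pair_counts(sequence: str) -> list[list[int]]:
    """Count how often codon i is immediately followed by codon j."""
    counts = [[0] * 64 for _ in range(64)]
    # Assumption: non-overlapping codons; trailing bases that do not fill
    # a whole codon are dropped.
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    for first, second in zip(codons, codons[1:]):
        counts[codon_index(first)][codon_index(second)] += 1
    return counts


print(codon_index("ATG"))  # 14
counts = codon_pair_counts("ATGAAATTT")
print(counts[codon_index("AAA")][codon_index("TTT")])  # 1
```

Dividing each count by the total number of pairs gives the relative frequency that sets a pixel’s intensity in the 64×64 image.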
Moving from left to right (or similarly top to bottom) we cycle through codons AAA, AAC, AAG… in much the same way as we count 000, 001, 002… The essential difference compared to ‘regular’ decimal counting is that in this form (i.e. base-4) we ‘roll over’ the greater digits after reaching 3 rather than 9. The thick pale bands are codons that begin with CG, whilst the thin bands are codons that end in it; consider how counting 120, 121, 122, 123 results in successive 12 (CG) numbers (codons), whereas counting 111, 112, 113 isolates the CG amongst its opaque counterparts.
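The roll-over counting can be checked directly. This is a minimal sketch, assuming the digits A=0, C=1, G=2, T=3 given earlier and looking only at CG within a single codon (CG pairs that straddle two neighbouring codons are ignored here).

```python
from itertools import product

BASES = "ACGT"  # base-4 digits 0, 1, 2, 3

# Counting in base-4 with three digits: AAA, AAC, AAG, AAT, ACA, ...
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

starts_with_cg = [i for i, codon in enumerate(codons) if codon.startswith("CG")]
ends_with_cg = [i for i, codon in enumerate(codons) if codon.endswith("CG")]

print(codons[:5])      # ['AAA', 'AAC', 'AAG', 'AAT', 'ACA']
print(starts_with_cg)  # [24, 25, 26, 27]
print(ends_with_cg)    # [6, 22, 38, 54]
```

The four codons beginning with CG sit side by side, which gives one thick band, while the four ending in CG are sixteen positions apart, which gives four thin ones.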
The image creator is hosted on my website so anyone can experiment. The source code is available from my GitHub repository.
This answers what, but why is it that we see this so-called CG paucity? The short answer is that this pairing predisposes to errors: the variants (or typos) to which I referred earlier (a long answer exists as well). Variants are what make individuals unique, but not all such diversity is as safe as a change in eye colour; a variant in the wrong region can severely limit an organism’s lifespan or propensity to procreate (this is the more rigorous definition of a mutation: a disease-causing variant). Thus, as Charles would teach us, natural selection should place negative evolutionary pressure on the CG pairings, which is what we are seeing.
But “hey evolution is just a theory!” [end sarcasm].
I HAD NEVER SET OUT to reveal this phenomenon; it was simply an exploration of something interesting. Alan Turing is loosely quoted as having said that the point is “the sheer fun of the thing” (although I can’t find a reliable source for this). At the very least I have created an interesting gift for your biology teacher; feel free to make a print, or contact me for a higher resolution image.
Afterthoughts
BEFORE I FOUND SOMETHING more entertaining to pursue, I had considered crowd-sourcing a graphical exploration of the entire human genome in the form of a game (I would also need to devise alternate image mappings otherwise most would simply display the same bands). Players would be shown random genomic regions and be asked whether or not a pattern exists — a scoring system makes for a competitive incentive. This led to an interesting dilemma.
As I alluded to, humans have a propensity for detecting patterns where they don’t necessarily exist. Is that a lion lurking amongst the long savannah grass? Yet again Charles (Darwin) comes to the rescue — mistakenly detect a lion-pattern where none exists and you experience some incidental, beneficial exercise, but flippantly overlook the legitimate lion and you become lion hors d’oeuvres (along with your potential offspring). Plainly asking players if a pattern exists in a graphical representation of genetic data is thus bound to result in scores of false positives.
A simple solution exists in asking “will everyone else claim that a pattern exists in this image?”. This shortcuts the innate, passive ‘gut’ reaction and activates an active, thought-out response (consider reading Thinking, Fast and Slow if this interests you).
How do you assign a score when you don’t know the correct answer? The solution lies in the same ‘everyone else’ proposition. An answer’s score is merely the proportion of other players with the same response.
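That scoring rule is simple enough to sketch. The function name `score_answers`, the player names and the yes/no answer format below are hypothetical, chosen only to illustrate the rule for a single image.

```python
from collections import Counter


def score_answers(answers: dict[str, bool]) -> dict[str, float]:
    """Score each player by the proportion of OTHER players who agree."""
    tally = Counter(answers.values())
    others = len(answers) - 1
    if others < 1:
        # A lone player has nobody to agree with.
        return {player: 0.0 for player in answers}
    # Subtract one so that a player does not count as agreeing with themselves.
    return {
        player: (tally[answer] - 1) / others
        for player, answer in answers.items()
    }


votes = {"ann": True, "bob": True, "cat": False, "dan": True}
for player, score in score_answers(votes).items():
    print(player, round(score, 2))
```

Here the three players who answered yes each match two of the other three players, while the lone dissenter matches none of them.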