Arrival of the Fittest

I read a thought-provoking book recently on evolution that I’m going to sneak into this blog by stretching and making an analogy with agile development. The book was “Arrival of the Fittest: Solving Evolution’s Greatest Puzzle” by Andreas Wagner. The “puzzle” Wagner references in the title is the question of why natural selection is so effective at generating diversity. That is, there is no argument that the key underlying mechanism for evolution is natural selection, but why should it be so effective? Is there something else going on in the structure of biological systems that makes this mechanism work so well in generating the enormous diversity we see around us? The arguments Wagner makes share some underlying themes with a couple of other books I’ve discussed in the past, “At Home in the Universe: The Search for Laws of Self-Organization and Complexity” by Stuart Kauffman and “The Plausibility of Life: Resolving Darwin’s Dilemma” by Kirschner and Gerhart.

Wagner analyzes this from the perspective of multiple biological levels; I’ll focus on his discussion of protein structure. Let’s consider a protein of 100 amino acids (actual proteins are typically significantly longer). Since there are 20 different kinds of amino acids used in proteins, the state space of possible protein sequences is of size 20¹⁰⁰ (a figure Wagner calls “hyperastronomical”). Each protein is effectively a node in this hyperastronomical state space, with edges connecting it to the 1900 other proteins that differ by a single amino acid (19 alternatives at each of 100 positions).
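To make those numbers concrete, here is a quick back-of-the-envelope calculation in Python (purely illustrative; the constants just restate the figures above):

```python
# Back-of-the-envelope numbers for the protein sequence state space.
LENGTH = 100      # amino acids in our hypothetical protein
ALPHABET = 20     # kinds of amino acids used in proteins

# Every position can independently hold any of the 20 amino acids.
state_space = ALPHABET ** LENGTH        # 20**100, about 1.27e130 sequences

# A single-amino-acid mutation changes one of the 100 positions to one
# of the 19 alternatives, giving each sequence its one-mutation neighbors.
neighbors = LENGTH * (ALPHABET - 1)     # 100 * 19 = 1900

print(f"sequences in the state space: {state_space:.3e}")
print(f"one-mutation neighbors per sequence: {neighbors}")
```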

So a mutation is a traversal over one of these edges to a new node in the state space. The mutation is neutral if the function of the protein doesn’t change and non-neutral (either positive or negative) if it does. The actual function of a protein is determined by how it folds and exposes various active binding surfaces, not simply by the order of the amino acids; the function does not change if the protein still folds the same way and exposes the same active binding surfaces. Determining a protein’s folding behavior from its amino acid sequence alone is an immensely complex computation, so our understanding of the higher-level characteristics of this state space has been limited. Some questions one might ask about this higher-level structure are: 1) do mutations tend to be neutral or non-neutral? 2) how far through the state space can we travel following a path of neutral mutations? 3) if we travel along a neutral path, are the non-neutral mutations available from the new location similar in function to those available where we started, or different?

Because modeling protein folding behavior is so expensive, the explorations that have been done to date involve sampling: looking at specific sequences and exploring the structure of the network from that specific location. What these studies seem to indicate is that you can in fact explore vast sections of the state space by following neutral pathways. In addition, the functional behavior of non-neutral mutations changes as you move through the state space. That is, if I look at the functional behavior of the non-neutral mutations from one location and then move through neutral pathways to another location, the functional behaviors available from there will be different.

This has a couple of consequences. When we look at a population of individuals with a distribution of alleles that differ in a neutral way, that doesn’t represent simple irrelevant random genetic drift; it means that as a population they are “close to” (a single mutation away from) many more functional states than they would be if there were only a single allele in the population. For a single amino acid sequence, it also means that in order to reach some distant functional state (e.g. one that requires mutating 5 amino acids), it does not have to follow a path where every single one of those mutations is positively selected for. Essentially, if we view the landscape of functional states as a bumpy one with various local but sub-optimal maxima (or maxima that have become sub-optimal because something in the environment has changed), we can treat these neutral pathways as “footpaths” that can lead us from a spot on one peak to another without having to traverse the valley. So neutral mutations, or “genetic drift”, rather than being some wart on the side of the theory of evolution, end up playing a key role in the dynamics of creating diversity because of this higher-level structure of the overall state space.
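As a loose illustration of how far neutral drift can carry you, here is a toy model I made up for this post (nothing like real protein folding): only a handful of invented “functional sites” determine the sequence’s function, and a random walk accepts only neutral mutations:

```python
import random

# Toy model: a 4-letter alphabet stands in for the 20 amino acids, and an
# invented set of "functional sites" stands in for folding behavior.
ALPHABET = "ACDE"
LENGTH = 100
FUNCTIONAL_SITES = (3, 17, 42)   # hypothetical positions that fix the "function"

def phenotype(seq):
    # The toy "function" is just the residues at the functional sites.
    return tuple(seq[i] for i in FUNCTIONAL_SITES)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

random.seed(1)
start = [random.choice(ALPHABET) for _ in range(LENGTH)]
current = list(start)

for _ in range(10_000):
    pos = random.randrange(LENGTH)
    candidate = current[:]
    candidate[pos] = random.choice(ALPHABET.replace(current[pos], ""))
    if phenotype(candidate) == phenotype(start):   # take only neutral steps
        current = candidate

# The walk never changes "function", yet ends up far from where it started.
print("positions changed:", hamming(start, current), "of", LENGTH)
```

In this toy, the walk typically drifts to a sequence differing in roughly three quarters of the unconstrained positions while keeping the same “function”, which is the flavor of the neutral-network results Wagner describes.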

Perhaps you can see where I’m going to take this in bringing it back to software development. Under one development practice, major functional changes require disruptive (and implausible) overhauls of the entire system at one time. There are no valid intermediate states; we need to throw the system up in the air and reassemble it on the way back down. You might argue that this was historically our methodology for large software projects.

Alternatively, following practices of continuous integration and continuous stability, you can change a system by making a series of transformations or refactorings that are neutral with respect to the external behavior of the system but that then enable you to make a substantive change in a single, shorter, more predictable step.
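A hypothetical miniature of that pattern in code (the names here are invented for illustration, not drawn from any particular system): first a refactoring that leaves external behavior unchanged, then the small substantive step it enables.

```python
# Before: the output format is hard-wired into the report exporter.
def export_report(rows):
    return "\n".join(",".join(str(v) for v in row) for row in rows)

# Neutral step: extract the formatting decision behind a seam. External
# behavior is unchanged; callers still get the same CSV output.
def format_csv(row):
    return ",".join(str(v) for v in row)

def export_report_refactored(rows, format_row=format_csv):
    return "\n".join(format_row(row) for row in rows)

# Substantive step: with the seam in place, adding tab-separated output
# is one small, predictable change rather than a system-wide rewrite.
def format_tsv(row):
    return "\t".join(str(v) for v in row)

# The neutrality of the refactoring is directly checkable.
assert export_report([[1, 2], [3, 4]]) == export_report_refactored([[1, 2], [3, 4]])
```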

I am making an analogy here; I am not arguing that there is some underlying mathematical reason why the higher-level structure of the protein state space should be analogous to the structure of software systems. Additionally, nature has no alternative but to take these incremental pathways; we could decide that it was more efficient to “just start over” in altering a software system. However, for large software systems, my judgment has certainly shifted over time to the belief that this continuous model of integration and stability is the only way to reliably, predictably, and efficiently enable major new functionality.

You get three kinds of inefficiencies when you throw everything up in the air. The first is usually called the long tail of integration and stabilization. Essentially, this is all the stuff you didn’t plan for because you were so focused on building the cool new part. A continuous approach forces you to take this into account along the way. This is not necessarily an inefficiency in the cost of execution, but it is a significant inefficiency in planning, which might actually be more critical. The second form of inefficiency comes when doing programming in the large with many developers. It is just way too easy to externalize the cost that the instability you are introducing imposes on the rest of the team. As you make the system less stable, all other development and testing gets more expensive and less predictable. This is a huge and largely hidden cost. The third form of inefficiency comes from backloading all the actual validation and learning that comes from real usage and deployment. By making the system unusable for long periods of time, you have essentially halted all ability to learn from real usage.

I am always fascinated by books that take a set of facts you thought you understood and provide an entirely new perspective on them. “Arrival of the Fittest” definitely qualified.