A few months ago, I singled out an article in BioCoder about the appearance of open source biology. In his white paper for the Bio-Commons, Rüdiger Trojok writes about a significantly more ambitious vision for open biology: a bio-commons that holds biological intellectual property in trust for the good of all. He also articulates the tragedy of the anticommons, the nightmarish opposite of a bio-commons in which progress is difficult or impossible because “ambiguous and competing intellectual property claims…deter sharing and weaken investment incentives.” Each individual piece of intellectual property is carefully groomed and preserved, but it’s impossible to combine the elements; it’s like a jigsaw puzzle, in which every piece is locked in a separate safe.
We’ve certainly seen the anticommons in computing. Patent trolls are a significant disincentive to innovation; regardless of how weak the patent claim may be, most start-ups just don’t have the money to defend themselves. Could biotechnology head in this direction, too? In the U.S., the Supreme Court has ruled that naturally occurring human genes cannot be patented. But that ruling doesn’t apply to genes from other organisms, and arguably doesn’t apply to modifications of human genes. (I don’t know the status of genetic patents in other countries.) The patentability of biological “inventions” has the potential to make it more difficult to do cutting-edge research in areas like synthetic biology and pharmaceuticals (Trojok points specifically to antibiotics, where research is particularly stagnant).
The free-software and open source movements have done a lot to enable innovation in computing. We have a rich “commons” of software (Linux, Apache, MySQL, Hadoop, to say nothing of the many tools from the GNU project). This software commons forms the technological basis for just about every technology company in existence today, including Facebook, Google, Apple, and even Microsoft. Can the same ideas be equally productive for biology?
I believe so. But exactly how to apply those ideas isn’t clear. As tempting as the analogy is, biology isn’t computing. What does (or should) open source mean for biology? We don’t yet have an answer to that question. Yes, it’s reasonably easy to patent or copyright a long string of As, Ts, Cs, and Gs. And for similar reasons, we could apply any of the open source software licenses to that sequence. But is that sufficient? And what does that mean? I’d like to push on those questions a bit harder.
An open source genome?
In computing, the notion of “open source” has a clarity that doesn’t necessarily extend to biology. We know what source code means: it’s a more-or-less complete expression of what a computer program does. The source code may be a couple of lines long, or millions, but when you run the code, the computer does what it’s told to do. We don’t yet have that kind of understanding in biology, and it’s possible we never will. It’s a truism to say that DNA is a programming language that we don’t understand. While we understand (to a limited extent) how DNA encodes proteins, we’re far from understanding the complexity of that mapping. One modification to DNA may have many interacting effects, some benign, some fatal. Our notion of “effects” and “side effects” confuses the issue; side effects are just the effects we don’t like. As far as the organism is concerned, though, there are only effects. And we are far from understanding all the effects of any modification on all but the simplest biological systems.
So, what does it mean to say that DNA sequences are a kind of genetic “source code” for living organisms? The process by which DNA is used to build proteins is extremely complex. The code is read in both directions, and there’s a logic to gene expression that we don’t completely understand. If the same genetic information is present in all cells, why are some cells muscle and others liver? Genes encode proteins, and (to use a programming analogy) they’re sort of like assignment statements. But you can’t build a program if you only have assignments. You need conditional logic and other control structures. We are far from understanding DNA’s control structures and how they work. So, while we can call DNA a “program,” open sourcing biology is qualitatively different from open sourcing a program written in Java or C. We really don’t yet understand what the biological program means. What is an open source gene? What is an open source protein? Those are important questions, and we don’t yet know the answers.
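To make the assignment-statement analogy concrete, here is a toy sketch in Python. It uses a deliberately partial slice of the standard genetic code to translate a made-up DNA sequence, and it reads the opposite strand by taking the reverse complement. The sequence, the function names, and the choice of codons are illustrative assumptions, and nothing in it models the regulatory logic that decides when, or whether, a gene is expressed.

```python
# Toy illustration only: a small subset of the standard genetic code.
# Codons missing from this table are reported as "?".
CODON_TABLE = {
    "ATG": "M",  # methionine (also the usual start codon)
    "GCC": "A",  # alanine
    "AAA": "K",  # lysine
    "TTT": "F",  # phenylalanine
    "TTA": "L",  # leucine
    "GGC": "G",  # glycine
    "CAT": "H",  # histidine
    "TAA": "*",  # stop
    "TAG": "*",  # stop
    "TGA": "*",  # stop
}

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna: str) -> str:
    """Return the sequence of the opposite strand, read in its own direction."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def translate(dna: str) -> str:
    """Map each three-letter codon to one amino acid, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "?")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


gene = "ATGGCCAAATTTTAA"  # hypothetical sequence, not a real gene
print(translate(gene))                      # forward strand
print(translate(reverse_complement(gene)))  # opposite strand
```

The same fifteen letters yield different amino acids depending on which strand is read, and the lookup table is the whole program: it contains assignments, with no conditionals or loops that belong to the biology itself.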
A common language
Software developers have one key advantage over biologists: software developers speak a common language. Well, more realistically, many common languages; but the differences between Python and FORTRAN are small enough that Python programmers and FORTRAN programmers can meaningfully communicate with each other. DNA may be a programming language, but that won’t help us communicate if we don’t understand its syntax.
As Trojok says in the white paper, “a future bio designer should be able to code the properties of a living system…by describing the desired features in a biological programming language.” That programming language could be DNA, properly understood; but a better analogy might be to see DNA as the machine language, the 1s and 0s, of biology. While the pioneers of computing dealt directly with 1s and 0s, we now describe a program’s “desired features” in high-level languages like Python; programming in binary only happens in a few special circumstances.
I doubt that we’ll end up with a single biological language; just as in computing, we will probably end up with dozens (if not hundreds or thousands). But whether there’s one or many, we need those languages to exist. And we need those languages to be part of the commons, not proprietary creations as they were in the “dark ages” of computing. Today, there are very few programming languages that don’t have an open source implementation, and it’s very difficult to imagine a new programming language that doesn’t start as an open source project (Swift being a significant exception). High-level languages for biology will be the same: to succeed, they must be part of an intellectual commons. Proprietary languages are no good for sharing ideas.
In the last few years, we’ve discovered that computing isn’t as clear-cut as we thought it was. In 1990, it was relatively easy to look at a program and say that we understood what it did. Now, when almost all significant applications run on complex distributed systems, tens to thousands of computers that are “in the cloud,” it’s much more difficult to reason about what a program can or can’t do. Look at the Shellshock bug in the Bash shell: that bug might have existed when Bash was first developed, but it would have been meaningless, unexploitable. In 1989, our computer networks were primitive. We didn’t have web servers, and distributed systems were exotic, experimental beasts. It was relatively simple to understand all (or almost all) of the situations in which a program could execute.
Modern computer systems are much more like biological systems than the computers of the 80s and early 90s. Both biologists and software developers have to deal with extremely complex systems, emergent behavior, and unintended consequences. Open source hasn’t been immune to the problems that arise when you place software in new contexts, and biologists have to be extremely careful about the consequences of introducing unforeseen changes into organisms, or releasing organisms into the wild.
The Bio-Commons has a bio-ethics subgroup (currently mostly empty) for discussing ethical issues. How do we manage systems that defy deterministic understanding? What do biological systems mean, and how can we use them? What responsibilities do researchers have for their creations?
It’s interesting that the Bio-Ethics group lists “the definition of individuality” as one of its concerns. Identity and individuality are certainly an important concern in software, but those issues rarely appear in the context of open source software. You write software; you apply a license; you use software in accordance with that license. What stake does individuality have in the software you write or use? Perhaps open source software and the future bio-commons can learn from each other.
When Richard Stallman founded the Free Software Foundation, his goal was to preserve the freedom to share software. Sharing was fundamental to the culture of computing in the 1970s, but it was threatened by the shift that brought about the start-up booms of the 1980s: computing itself became a commodity, and software became monetizable. Developers stopped sharing their work (in many cases, were no longer allowed to share their work) because software was something you wrapped in a package and sold. Software faced the threat of the anticommons; the free-software and open source movements are a reaction to that threat. And indeed, the open source movement has won.
While “sharing knowledge” has always been a scientific ideal, many outside of the sciences would be surprised at just how little knowledge is actually shared. Results are locked up in journals, which live behind carefully maintained (and extremely expensive) paywalls. Papers share results, but rarely share the actual data or the software used to analyze the data. Papers describe experiments, but rarely describe them accurately enough for their results to be duplicated reliably.
As we’re engaging in research, we need to share data, we need to share code, we need to share experimental designs. But we don’t yet have standard languages for sharing that information, or repositories in which to store it. Much of the data collection in the sciences is fairly haphazard. We’re limited by tools and methodologies that were developed when data was hard to get and data storage was expensive. Now that multi-terabyte disk drives are cheap (this morning, I see a 3-terabyte external disk drive for $120 retail), and you can fill those drives using automated instruments controlled by an Arduino or Raspberry Pi, we have the ability to generate and store data in bulk. We have the ability to instrument and monitor every stage of an experiment in detail; but that’s not happening in biology, at least not on a regular basis.
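As a rough sketch of what instrumenting every stage of an experiment might look like, the Python below appends each reading to a log as one JSON object per line, a format that is easy to archive and to share. The field names, the stage and sensor labels, and the in-memory log are all assumptions made for illustration; a real setup would read values from an instrument and write to a file or a shared repository.

```python
import io
import json
import time


def log_reading(stream, stage, sensor, value, unit, timestamp=None):
    """Append one reading to the log as a single line of JSON."""
    record = {
        "timestamp": time.time() if timestamp is None else timestamp,
        "stage": stage,    # hypothetical label for the step of the experiment
        "sensor": sensor,  # hypothetical label for the instrument
        "value": value,
        "unit": unit,
    }
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def read_log(stream):
    """Parse every non-empty line of the log back into a dictionary."""
    return [json.loads(line) for line in stream if line.strip()]


# An in-memory buffer stands in for a real log file here.
log = io.StringIO()
log_reading(log, "incubation", "thermometer", 37.1, "C", timestamp=0.0)
log_reading(log, "incubation", "thermometer", 36.9, "C", timestamp=60.0)

log.seek(0)
records = read_log(log)
print(len(records), records[-1]["value"])
```

Because every line is self-describing, someone who receives the log can tell what was measured, when, and at which stage, without needing the program that produced it.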
This is an area where biologists can learn from software developers. Modern software systems throw off gigabytes of data, and we have built tools to monitor those systems, archive their data, and automate much of the analysis. There are free and commercial packages for logging and monitoring, and it continues to be a very active area of software development, as anyone who’s attended O’Reilly’s Velocity conference knows.
One critical goal of the Bio-Commons is to facilitate sharing. And I’m excited that they realize how little we know about sharing. We can talk about “open source” biology, but we don’t really know what we mean. Are we talking about some genetic code? Are we talking about proteins? Are we talking about experimental procedures (protocols)?
In addition to the Bio-Commons, we see start-ups like Synbiota working on cloud-based repositories for storing and sharing biological data, much as GitHub serves as a repository for source code.
I’ve often said that the revolution in biology depends on a revolution in tooling. That revolution is also under way; I’ve come across many start-ups working on tools for biologists, ranging from the extraordinarily ambitious to the humble, and looking at customers from huge industrial laboratories to small bio-hacking spaces.
Again, it’s important that the tooling biologists use be part of the biological commons. You can see it in software projects like Cytoscape and BioPython. You can also see the tooling revolution in the OpenPCR project, the low-cost homebrew PCR described in the new issue of BioCoder, and the open sourced laboratory robotics platforms from Modular Science and OpenTrons.
We’re making tremendous progress in our understanding of life; we’re clearly at the start of a revolution in biology. But for that revolution to get going in earnest, and to avoid settling into a dystopian anticommons, we need to improve our ability to share. The computer revolution arguably started in the 1960s, but it really didn’t get going until we understood the importance of shared code. The biological revolution will be similar, but with one big advantage: we can see what the open source movement has done. Many of the problems we face have already been solved, or are being solved.
We are building a biological commons. Whether that’s the Bio-Commons that Rüdiger Trojok and his collaborators are building, or something that hasn’t yet started to take shape, its time has come. It’s the fermentation vessel in which the revolution will grow.