iDNA: Why Apple (and Google) Want Your Genetic Information

Alex Senemar
Published in Sherbit News · 10 min read · Aug 31, 2015

Apple is turning smartphones into powerful tools for medical surveillance. Using the HealthKit platform, patients can send readings from home medical monitors to their health-care provider, and receive personalized advice from the Watson Health Cloud in return. With ResearchKit, scientists are building software for conducting large medical studies, collecting data from smartphone sensors and surveys ‘pushed’ to users’ phones. Now, after an exclusive report by the M.I.T. Technology Review, rumors are swirling that Apple’s next major health initiative will involve genetic information collected from iPhone-owning volunteers. So far, the historically secretive company has refused to comment on the speculation — but why would Apple be interested in your genome?

Genomics 101

In most human beings, every somatic cell has 23 pairs of chromosomes (one set from each parent) packed tightly together in the cell nucleus, and each chromosome contains one DNA molecule. DNA is a nucleic acid, a polymer made up of smaller, repeating molecular units called nucleotides. Each nucleotide is composed of a five-carbon sugar (called deoxyribose, which gives DNA its name), a phosphate group, and one of four nitrogen-containing bases: adenine (A), thymine (T), cytosine (C), or guanine (G). DNA has the structure of a ‘double helix’ — chains of sugars and phosphates form the main supports, coiling around each other, linked together by pairs of nitrogenous bases: A always bonds with T, and G always bonds with C. These links are called “base pairs,” and the order of these pairs (or “base sequence”) is the “code” that stores biological information.
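
To make the pairing rule concrete, here is a small Python sketch of our own (nothing from Apple or the researchers) that builds the complementary strand of a short base sequence using the A-T and G-C pairings described above:

```python
# Complementary base pairings in DNA: A bonds with T, G bonds with C.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complementary_strand(sequence: str) -> str:
    """Return the strand that would pair with `sequence`, base by base."""
    return "".join(PAIRING[base] for base in sequence.upper())

if __name__ == "__main__":
    strand = "ATGCGTTA"
    print(strand)                        # ATGCGTTA
    print(complementary_strand(strand))  # TACGCAAT
```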

The “human genome” is the complete set of base sequences contained in the 46 chromosomes. It is estimated that 99.5 percent of the DNA sequence for all humans is identical — the half of one percent of the sequence that varies between us accounts for the astonishing scope of human diversity. Two percent of the genome is involved in encoding proteins, chains of amino acid residues that perform a vast array of essential cellular functions. There are twenty kinds of amino acids, and they can be placed in different orders to produce more than 100,000 kinds of proteins; specific nucleotide sequences contain the ‘instructions’ for amino acid sequences that build specific proteins. In the past, most studies of genetic variation in humans focused on “Single Nucleotide Polymorphisms” in these ‘coding’ regions, where a single nucleotide — A, T, C, or G — differs in the sequence.
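
As a toy illustration (our own, with made-up sequences), here is how a Single Nucleotide Polymorphism shows up when two aligned sequences are compared in Python:

```python
# Toy illustration of a Single Nucleotide Polymorphism (SNP): two aligned
# sequences that differ in a single base at one position.

def snp_positions(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (position, base_in_a, base_in_b) wherever the two sequences differ."""
    assert len(seq_a) == len(seq_b), "sequences must be aligned and of equal length"
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

if __name__ == "__main__":
    reference = "ATGGCATTCA"
    sample    = "ATGGCGTTCA"   # single-base difference at position 5 (A -> G)
    print(snp_positions(reference, sample))  # [(5, 'A', 'G')]
```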

To understand how these variations are linked to disease susceptibility and the effectiveness of medical treatments, an international research consortium began the Human Genome Project in 1990, a nearly $3 billion endeavor to map and sequence the 3 billion nucleotides of a ‘reference’ genome. Researchers developed a method called ‘shotgun sequencing,’ in which the genome is copied and shredded into random pieces (like the blast pattern of a shotgun). Large fragments are inserted into artificial chromosomes that can be grown in E. coli bacteria — when the bacteria replicate, a ‘library’ of DNA clones is created. These clones are then broken up into even smaller fragments, to be sequenced using the ‘traditional’ method of tagging nucleotides with fluorescent dyes. Finally, a computer reconstructs the complete genome by tracing where each of the millions of fragments overlaps with another, and painstakingly piecing them back together — like a massive, electronic jigsaw puzzle.
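
The ‘electronic jigsaw puzzle’ can be sketched in a few lines of Python. The toy assembler below greedily merges the pair of fragments with the longest overlap until one sequence remains; real assembly software copes with sequencing errors, repeats, and billions of reads, which this deliberately ignores:

```python
# Toy 'shotgun assembly': greedily merge the pair of fragments with the
# longest suffix/prefix overlap until a single sequence remains.

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    for size in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def assemble(fragments: list[str]) -> str:
    """Repeatedly merge the best-overlapping pair of fragments."""
    frags = list(fragments)
    while len(frags) > 1:
        size, i, j = max(
            (overlap(a, b), i, j)
            for i, a in enumerate(frags)
            for j, b in enumerate(frags)
            if i != j
        )
        merged = frags[i] + frags[j][size:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags[0]

if __name__ == "__main__":
    reads = ["ATGGCA", "GCATTC", "TTCAGG"]   # overlapping fragments
    print(assemble(reads))                   # ATGGCATTCAGG
```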

It took scientists ten years to complete the first draft map of the reference genome. Since then, sequencing methods have advanced at an astonishing pace — in January 2014, California-based Illumina unveiled a system that sequences more than five genomes per day, at a cost of one thousand dollars per genome. Empowered by the rapidly declining cost of sequencing, geneticists poured their research grants into sequencing and comparing the DNA of ‘sick’ and ‘healthy’ people, hoping to discover the culprit “diabetes genes” or “depression genes” responsible for illness. But they gradually began to realize that most diseases (with some rare exceptions) aren’t caused by a single genetic “error” — rather, a person’s risk is linked to combinations of hundreds, and perhaps tens of thousands, of variations in the base sequence.
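
A rough way to picture ‘risk linked to combinations of variations’ is a weighted score over many variant sites. The sketch below is entirely hypothetical (the variant IDs and effect sizes are invented), but it shows the arithmetic behind such polygenic scores:

```python
# Illustrative, entirely hypothetical polygenic risk score: disease risk is
# modeled as a weighted sum over many variants rather than one "culprit gene".
import math

# Hypothetical effect sizes (log-odds) for a handful of variant sites; real
# scores combine thousands of variants with weights measured in large studies.
EFFECT_SIZES = {"rs0001": 0.12, "rs0002": -0.05, "rs0003": 0.30, "rs0004": 0.02}

def risk_score(genotypes: dict[str, int], baseline_log_odds: float = -2.0) -> float:
    """Convert per-variant allele counts (0, 1, or 2) into a probability."""
    log_odds = baseline_log_odds + sum(
        EFFECT_SIZES[site] * count for site, count in genotypes.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))  # logistic transform

if __name__ == "__main__":
    person = {"rs0001": 1, "rs0002": 2, "rs0003": 0, "rs0004": 1}
    print(f"estimated risk: {risk_score(person):.1%}")
```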

In order to crack statistical problems of this scale, geneticists will require substantially larger databases — and more computing power — than previously imagined. As Arthur Toga, an Alzheimer’s disease researcher at U.S.C., predicted: “There’s going to be an enormous change in how science is done, and it’s only because the signal-to-noise ratio necessitates it. You can’t get your result with just 10,000 patients — you are going to need more. Scientists will share now because they have to.” How are researchers and physicians going to get access to exabytes of genomic information, on millions of patients from around the world? That’s where Apple, and the other internet giants, get involved.

The “Internet of DNA”

Imagine that you are diagnosed with cancer — your physician may order DNA tests on your tumor, knowing that every cancer is caused by specific genetic mutations in the cancerous cell. What if there were a global database of genomic information that your physician could query? She could look up every patient who shared your tumor’s specific mutations, what drugs those patients took, and how long they lived, providing some clues into the most effective treatment plan. Major internet companies are now competing for their slice of this potentially multi-billion dollar research industry: an “Internet of DNA” for the “Internet of Things.”
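
The query in that scenario might look something like the sketch below. The schema, drug outcomes, and survival figures are invented, and SQLite stands in for the hypothetical global database:

```python
# Sketch of the kind of query a clinician might run against a shared tumor
# mutation database. The schema and all data are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tumor_cases (
        patient_id TEXT, mutation TEXT, drug TEXT, survival_months INTEGER
    );
    INSERT INTO tumor_cases VALUES
        ('p1', 'BRAF V600E', 'vemurafenib', 28),
        ('p2', 'BRAF V600E', 'dacarbazine', 11),
        ('p3', 'BRAF V600E', 'vemurafenib', 31),
        ('p4', 'KRAS G12D',  'gemcitabine',  9);
""")

# "Look up every patient who shared your tumor's specific mutations,
#  what drugs those patients took, and how long they lived."
rows = conn.execute("""
    SELECT drug, COUNT(*) AS patients, AVG(survival_months) AS avg_survival
    FROM tumor_cases
    WHERE mutation = ?
    GROUP BY drug
    ORDER BY avg_survival DESC
""", ("BRAF V600E",)).fetchall()

for drug, patients, avg_survival in rows:
    print(f"{drug}: {patients} patients, {avg_survival:.1f} months average survival")
```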

In 2013, Google announced a new cloud computing service called “Google Genomics.” In collaboration with genomic scientists, software engineers at Google built an interface that allows researchers to easily migrate large amounts of genetic information onto Google’s servers; there, scientists can perform experiments with the same database analysis technology that powers Google’s web crawlers (which index millions of websites) and ad networks (which track billions of consumers). Although “BigQuery” was initially marketed as a ‘business analytics’ tool, Google engineers quickly seized on its scientific applications. “We saw biologists moving from studying one genome at a time to studying millions,” said Google Genomics director David Glazer. “The opportunity is how to apply breakthroughs in data technology to help with this transition.”
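
If that description is accurate, a researcher’s session against such a service might look roughly like this sketch, which uses the google-cloud-bigquery Python client; the project, dataset, and column names are placeholders we made up, not real resources:

```python
# Rough sketch of querying a variant table with BigQuery's Python client.
# `my-project` and `my_dataset.variants` are placeholders, not real resources,
# and the column names are assumptions made for illustration.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

query = """
    SELECT reference_name, start, reference_bases, alternate_bases,
           COUNT(*) AS call_count
    FROM `my-project.my_dataset.variants`
    WHERE reference_name = '17'        -- chromosome 17
    GROUP BY 1, 2, 3, 4
    ORDER BY call_count DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(dict(row))
```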

https://youtu.be/ExNxi_X4qug

“Big genomic data on Google Cloud Platform”

Running fourteen sequencing machines at once, the Broad Institute of M.I.T. and Harvard can decode one genome every 32 minutes, generating 200 TB of raw data — in other words, it would take the institute two months to produce the equivalent of what gets uploaded to YouTube in one day. With Google’s resources, scientists at Broad have developed a Genome Analysis Toolkit, freely available to other genomics scientists, for use on Google’s Cloud Platform: from a command line, a researcher can now easily identify the common variations among a set of genomes, or do a ‘quality control’ check for poorly mapped sections of a genome dataset. When Atul Butte, a bioinformatics expert at U.C.S.F., heard Google announce its platform last year, he said he now understood “how travel agents felt when they saw Expedia.”
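
The Genome Analysis Toolkit itself is a Java toolkit run from the command line; as a simplified stand-in, here is a short Python sketch of the kind of quality-control pass described above, dropping low-confidence calls from a VCF-style variant list (the records are invented):

```python
# Minimal quality-control pass over VCF-formatted variant records:
# keep calls whose QUAL field (column 6) meets a confidence threshold.
# A simplified stand-in for tools like the Genome Analysis Toolkit.
import io

VCF_TEXT = """\
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t10177\t.\tA\tAC\t99.0\t.\t.
1\t10352\t.\tT\tTA\t12.4\t.\t.
2\t45895\t.\tG\tA\t87.5\t.\t.
"""

def filter_by_quality(vcf_lines, min_qual: float = 30.0):
    """Yield variant records whose QUAL value is at least `min_qual`."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if float(fields[5]) >= min_qual:
            yield fields

if __name__ == "__main__":
    for record in filter_by_quality(io.StringIO(VCF_TEXT)):
        chrom, pos, _, ref, alt, qual = record[:6]
        print(f"chr{chrom}:{pos} {ref}->{alt} (QUAL {qual})")
```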

Amazon has pursued a similar strategy with Amazon Web Services and its database analysis toolset “Redshift.” Now, Google and Amazon are engaged in a ‘price war’ to store genetic information in bulk, as the two begin competing for major clients: Google has partnered with the Autism Speaks foundation (to study the genomes of affected children and their parents) and Tute Genomics (which owns a database of 8.5 billion human DNA variants, annotated with their relative frequency, associated traits, and correlated drug effectiveness), while Amazon has scored contracts with the Multiple Myeloma Research Foundation, the Alzheimer’s Disease Sequencing Project, and various pharmaceutical companies. Neither company will disclose how much genomics data it holds, but a recent Reuters report speculated that Amazon’s “genome cloud” may be bigger.

For now, the major clients are mostly academic researchers and pharmaceutical companies — but many in the industry believe the focus will soon shift to clinics. Earlier this year, President Obama proposed a $215 million “precision medicine” initiative, to advance the use of genomic information by health-care providers as described in the cancer diagnosis scenario above. A major component of the initiative is a national study of one million volunteers’ genomic profiles: participants in this ‘superstudy’ will have their electronic medical records turned over to researchers, and they’ll be equipped with additional sensors and activity trackers to monitor their behavior and environment in greater detail. The study will also combine data from more than two hundred ongoing genomic studies, by encouraging greater “interoperability” between disparate medical record systems and gene databases.

But there is a fundamental problem with ‘precision medicine’ that has no clear solution: privacy. Because genome sequences are totally unique, researchers have found that it is effectively impossible to ‘anonymize’ genetic information. Of course, it’s possible to encrypt databases, strip datasets of ‘personally-identifiable’ information, and design interfaces that display only the minimally necessary level of detail — but clear standards and procedures for securing DNA do not yet exist. “I’ve heard almost nothing from any of the proponents of the Precision Medicine Initiative as to what structures they’re going to construct and put in place to protect this information,” the president of the Council for Responsible Genetics recently told Al Jazeera. “I’m sure they’ll make all kinds of promises that they are appreciative of the issue.”
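
A toy example makes the point: even with names stripped, a handful of variant genotypes can act as a fingerprint linking a ‘de-identified’ research record back to an identity held elsewhere. Everything in this sketch is invented:

```python
# Toy re-identification: a few variant genotypes act as a fingerprint that
# links a "de-identified" research record to a named record held elsewhere.
# All identifiers and genotypes below are invented.

# Research database: names stripped, genotypes kept.
deidentified_records = [
    {"record_id": "R-001", "snps": {"rs0001": "AA", "rs0002": "AG", "rs0003": "CC"}},
    {"record_id": "R-002", "snps": {"rs0001": "AG", "rs0002": "GG", "rs0003": "CT"}},
]

# Separate source that pairs the same markers with identities
# (e.g., a consumer genetics profile someone shared publicly).
named_profiles = {
    "Jane Doe": {"rs0001": "AG", "rs0002": "GG", "rs0003": "CT"},
}

def reidentify(records, profiles):
    """Match 'anonymous' records to named profiles by exact genotype overlap."""
    for record in records:
        for name, genotype in profiles.items():
            if record["snps"] == genotype:
                yield record["record_id"], name

if __name__ == "__main__":
    for record_id, name in reidentify(deidentified_records, named_profiles):
        print(f"{record_id} matches {name}")   # R-002 matches Jane Doe
```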

https://youtu.be/A0b1xjxuFRo

“Supreme Court OKs Unfettered DNA Collection”

In this pioneering field, there isn’t much experience to draw from — but the case of 23andMe, a Google-funded ‘personal genomics’ company, illustrates the stakes. For $99, 23andMe offered DNA testing kits (a tube for ‘spit samples’) to be analyzed at the company’s labs; the company would then genotype the sample and return a detailed assessment of the customer’s risks for common diseases. In 2013, the Food & Drug Administration banned the test out of concern that the analyses weren’t accurate. But 23andMe’s real business is mining DNA information, not analyzing it — and 80 percent of its customers have agreed to ‘donate’ their DNA data for research purposes. This ‘donated’ data is highly lucrative: Genentech recently paid $60 million for access to the genomes of 3,000 Parkinson’s Disease patients, and 23andMe has arranged to share its database with Pfizer for undisclosed millions. (Since the F.D.A. ruling, the company has continued to ship testing kits in Canada while selling more ‘limited’ ancestry tests in the U.S.)

The company defended these agreements in a statement to Al Jazeera: “23andMe customers consent to research. Ultimately, 23andMe customers own their data. Customers can decide to stop participating in research at any time and remove all of their information from our database.” Leaving aside for now the ethics of selling genomic information to unscrupulous, for-profit pharmaceutical firms, it’s clear that future public health programs will have to develop flexible protocols that give research participants a say in how their information is shared, and the ability to revoke permissions in the future if necessary. To that end, the proposed Obama initiative will offer patients more access to data generated about them than is typical of research studies; the hope is that maintaining lines of communication will enable scientists to follow up with research subjects with ‘interesting’ genomic profiles — impossible in most large medical studies of the past.
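
One way to picture such revocable permissions is a per-purpose consent record that a participant can grant and later withdraw. The sketch below is our own illustration, not any company’s actual system:

```python
# Minimal sketch of revocable, per-purpose research consent. Not any company's
# actual system; it only illustrates permissions a participant can grant and
# later withdraw.
from dataclasses import dataclass, field

@dataclass
class ConsentRecord:
    participant_id: str
    granted: set[str] = field(default_factory=set)   # purposes, e.g. "parkinsons_study"

    def grant(self, purpose: str) -> None:
        self.granted.add(purpose)

    def revoke(self, purpose: str) -> None:
        self.granted.discard(purpose)

    def allows(self, purpose: str) -> bool:
        return purpose in self.granted

if __name__ == "__main__":
    consent = ConsentRecord("participant-42")
    consent.grant("parkinsons_study")
    print(consent.allows("parkinsons_study"))   # True
    consent.revoke("parkinsons_study")          # participant withdraws later
    print(consent.allows("parkinsons_study"))   # False
```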

And there lies perhaps the most compelling argument for Apple’s rumored venture into genomics: the success of the five existing ResearchKit apps — studying Parkinson’s, breast cancer, diabetes, asthma, and heart disease — has demonstrated Apple’s ability to rapidly recruit volunteers who will willingly share their biometric information, and become active participants in research projects. To join one of the planned studies, you must agree to submit a gene-test ‘spit kit’ (similar to those used by 23andMe) to an Apple-approved gene-sequencing center (the first of which are said to be at U.C.S.F. and Mount Sinai). The planned studies would look at a “gene panel” of some one hundred medically important genes related to common diseases, rather than a person’s entire genome — these targeted tests, at a large scale, may cost only a few hundred dollars each.

Stephen Friend, a former pharma industry researcher, joined Apple after years of promoting ‘open-source’ health research projects and ‘patient-centric’ controls for regulating access to medical information; he claims he chose to work on ResearchKit because of Apple’s relatively superior track record on user privacy. Google and Facebook “make their power by selling data,” said Friend in a recent interview. “They get people information about other people. Apple has said, ‘We will not look at this data.’ Could you imagine Google saying that?” For now, it seems the data will be stored in a cloud, to be maintained by researchers, while insights from the research are pushed to users’ iPhones — but the writer who broke the story speculated that, one day, “it’s possible consumers might swipe to share ‘my genes’ as easily as they do their location.”

The Technology Review report concludes with this important observation: “No law stops individuals from sharing information about themselves. Thus one reason to ‘empower patients,’ as rhetoric has it, is that if people collect their own data, or are given control of it, it could quickly find wide use in consumer apps and technologies, as well as in science.” This echoes one of the principal motivations behind the Sherbit platform, the possibility of ‘empowering’ individuals to regulate how their own data is used — in particular, in pursuit of the meaningful scientific insights to be gained by studying this information in detail. In the absence of legislation effectively regulating how internet companies handle your information, the only option is to build tools for doing good things with this data, rather than simply allowing it to be exploited by advertisers. As the prospect of an “iDNA” platform has shown, we must tread lightly — and always put the end-user first.

Originally published at www.sherbit.io on August 31, 2015.
