Is Reidentifiability a Risk?
Open versus restricted access for summary statistics
Mark J. Daly
In 2008, Homer et al. published the first analysis establishing that, in some scenarios, an individual sample can be ‘re-identified’ as a participant in a study for which summary allele frequency information was provided. This analysis was later extended methodologically (Jacobs et al. 2009; Im et al. 2012) to test statistics from case-control studies, although there have been shown to be practical bounds on how large and complex a mixture of individuals, or a meta-analysis of studies, can be while still permitting such reidentification (Visscher & Hill 2009; Sankararaman 2009). Rather than detail and debate the maths, for the majority of this piece let us assume that reidentification may be possible, though many data sets today are beyond the boundaries suggested above. The aim of what follows is to define terms surrounding reidentifiability and summary statistics from genetic studies, and to articulate the potential risks and established benefits that should be considered in defining principles of summary data sharing.
First, it is important to distinguish ‘reidentifiable’ from ‘identifiable’. Information that might be used to identify an individual is often described as PII (Personally Identifiable Information) and, in the context of health care, as PHI (Protected Health Information) subject to the HIPAA Privacy Rule. These categories include all manner of information: addresses and phone numbers; social security, medical record and account numbers; and specific dates (birth, medical procedures), in addition to names and photographs. Of note, DNA is not per se considered identifiable, since there is no way to look at a DNA sample and determine the individual it came from unless a database containing names and DNA fingerprints of individuals were available [Footnote 1]. For this reason DNA is not at present considered a HIPAA identifier.
The papers referenced at the outset address an entirely different point: if you were given a DNA sample and genotyped or sequenced it, could you determine whether or not that sample was included in a published genetic study? Certainly, given access to individual-level data from a genetic study, one can readily match any two DNA samples and determine the identity of the individuals or their specific degree of relatedness out to relatively distant cousins. And with the exception of individuals who explicitly consent to have their DNA profiles made publicly accessible, individual-level genetic information is always considered to be under ‘restricted access’; that is, it is made available to researchers with Institutional Review Board (IRB) approval for specific and stated research purposes that are consistent with the informed consent signed by the DNA contributors.
The issue raised by the aforementioned papers is that, given a DNA sample together with either a reference database of allele frequencies from a limited population sample or a set of summary statistics from a published genetic study, it may be possible to identify whether or not the new DNA sample is from an individual who is in the DNA mixture. While DNA mixture analysis has a role in forensics, it is unclear, beyond establishing the mathematical feasibility of such a mixture decomposition, what purpose such an activity would serve as it pertains to published summary statistics from a genetic research study (e.g., the p-values and effects from a genomewide association study (GWAS) comparing cases and controls), or how much ‘risk’ it constitutes. The most extreme scenario seems to be that a DNA sample could be a) obtained from you; b) sequenced; and then c) used to establish that you had contributed to a genetic study as a case (thereby revealing your ‘case’ status for that disease). Whether this would be considered a risk depends somewhat subjectively on one’s perspective on the following questions: what parties, presumably without access to your medical information that would include your diagnostic status, would nonetheless acquire your DNA and speculatively and illicitly sequence it at great cost? And why would they then conduct a complex analytic decomposition to discover the unlikely fact that you had participated in a published genetic study as a case or population control?
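For readers who want a concrete picture of the mixture test described above, the following is a minimal simulation sketch in the spirit of the distance statistic reported by Homer et al.: for each SNP, it compares how far the individual's genotype sits from a reference panel's allele frequency versus from the study's published allele frequency. Everything specific here is an illustrative assumption rather than anything taken from the cited papers: the function name `membership_statistics`, the parameters `n_pool`, `n_ref` and `n_snps`, the uniform range of allele frequencies, and the idealized setting of independent SNPs in a single homogeneous population.

```python
import random


def membership_statistics(n_pool=50, n_ref=50, n_snps=5000, seed=0):
    """Return (member_stat, outsider_stat) for one simulated study.

    Idealized assumptions: independent SNPs, one homogeneous population,
    and a reference panel drawn from that same population.
    """
    rng = random.Random(seed)
    # Hypothetical population allele frequencies, one per SNP.
    freqs = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]

    def person():
        # A genotype is 0, 1 or 2 copies of the allele, stored as 0, 0.5 or 1.
        return [((rng.random() < p) + (rng.random() < p)) / 2 for p in freqs]

    def allele_freqs(people):
        # Per-SNP allele frequency across a group: the "summary statistic".
        return [sum(ind[j] for ind in people) / len(people)
                for j in range(n_snps)]

    pool = [person() for _ in range(n_pool)]       # study participants
    reference = [person() for _ in range(n_ref)]   # independent reference panel
    outsider = person()                            # someone in neither group

    pool_freq = allele_freqs(pool)
    ref_freq = allele_freqs(reference)

    def stat(y):
        # Mean over SNPs of |y - reference| - |y - pool|.
        # Positive values mean the genotype sits closer to the study pool.
        return sum(abs(y[j] - ref_freq[j]) - abs(y[j] - pool_freq[j])
                   for j in range(n_snps)) / n_snps

    return stat(pool[0]), stat(outsider)


member, outsider = membership_statistics()
print(f"study member: {member:+.4f}")
print(f"non-member:   {outsider:+.4f}")
```

In this toy setting the study member's statistic is pulled above zero because their own genotype makes up 1/n_pool of the published frequency, while a non-member's statistic fluctuates around zero. That also means the signal shrinks roughly in proportion to 1/n_pool, so raising `n_pool` with `n_snps` held fixed lets the gap sink into the sampling noise, which is one way to see why pool size and complexity place practical bounds on reidentification.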
In material many of us encounter as a routine part of human subjects training, Utah’s Jeffrey R. Botkin sagely points out that while it is debatable whether DNA information can ever truly be considered anonymous, “there is little incentive for researchers to re-identify data or specimens other than to demonstrate that they can do so. Accordingly, to date, there are no instances of re-identification of data or specimens for illicit motives. Further, researchers obtaining access to some de-identified datasets, like the National Institutes of Health’s (NIH) dbGAP resource, are required to guarantee that they will not attempt to re-identify data. Any attempt to re-identify data or specimens would be a serious breach of research ethics unless explicitly authorized by an IRB for a critical purpose (for example, notifying a subject of a clinically relevant genetic finding).” [CITI Training — CITI Genetics Workgroup]
What is inarguable is that the confusion over the potential scenarios (not to mention the complex mathematics surrounding them) had an immediate chilling effect on the sharing of results from genetic studies. NIH websites and those of many disease consortia immediately removed summary statistics while this risk was being considered and evaluated. While the concerns have waned over time, there is still uncertainty and inconsistency across the research community about whether summary statistics can be shared fully publicly or should be treated as ‘restricted access’ data along with individual-level information. The two scenarios are quite distinct, however. Complete individual genetic profiles connected to medical information obviously carry a great deal of information that is rightly considered sensitive, whereas identifying that an individual is a member of a sequenced population cannot be connected to any individual information beyond the shared label of the group members (e.g., population of origin or disease group).
While the distinction between ‘open access’ and ‘restricted access’ may seem a technicality, in practice it is not. Placing summary statistics in the same restricted-access category as individual-level data comes at a hefty cost. Today the most valuable genetic data sets are those compiled by consortia of researchers, often involving clinical samples and researchers from in excess of 50–100 sites and summarizing data from tens to hundreds of thousands of participants. The insights from such experiments have transformed our understanding of the biology of numerous immune-mediated, psychiatric, cardiovascular and neurodegenerative diseases, and free access to these data and the clues they hold is critical to biomedical and pharmaceutical development: the precise goals that research participants hope to advance through their contributions. Thus, in considering the question of open versus restricted access for meta-analysis summary statistics, we should weigh a) the likelihood that any individual harm or contravention of informed consent could occur if such data are open against b) the damage done by slowing or preventing access to these data should access be restricted.
While summary genetic information is uniformly considered non-identifiable, the spectre of reidentifiability persists as a bogeyman of uncertainty for IRBs and researchers. It is worthwhile, however, to consider the ramifications of restricting access to these summary data, as this idea raises some very challenging issues. In the meta-analysis scenario above, who should be responsible for maintaining and distributing such data? Presumably a single institutional IRB, or an agency resource such as the NIH’s dbGaP, could take on the task for a consortium of clinical sites (saving all contributing sites from having to adjudicate every request for summary statistics to which their site contributed samples). But in this case, could the data then be made available only to investigators with a proposal consistent with the consent of every individual in the meta-analysis? This rapidly becomes deeply problematic: as most meta-analyses include samples with different individual consents, would the non-identifiable meta-analysis results then be available only for a restricted set of uses consistent with all consents?
One scenario of note involves informed consent documents that included a statement prohibiting use by for-profit entities. This was not infrequently a provision in consent documents several decades ago, and if we were to restrict access to summary statistics of meta-analyses that include any samples with this provision, it would follow that pharmaceutical companies could not access these data and so could not use them to inform their development pipelines. In considering this example, it is perhaps valuable to consider the intent of the consent statement. Quite obviously, summary meta-analyses, the biological insights into disease they hold, and unusual scenarios where such data might be re-identified were all far beyond the view of any of us 20 years ago. The pressing concerns of the time, however, were that DNA discoveries might be patented and that companies might attempt to own DNA sample banks in order to monetize those resources and/or the biological insights that might come from them. It is highly unlikely, considering the proven altruism of research participants and their desire to have researchers studying their conditions and developing diagnostics and therapies, that researchers or subjects choosing to limit ‘industry’ access to their samples would ever have intended to withhold the biological insights developed from those samples from precisely the industry that could develop those therapies.
For my part, the answer to this question is clear. Given that summary statistics are uniformly considered non-identifiable data, contrived scenarios leading to as yet unrealized potential risks of reidentifiability are not sufficient to preclude free, open, public access to summary statistics from genomewide association studies, given the fundamental importance of such data and the clear value of keeping them open access. I am therefore heartened that many disease consortia have consistently made such data completely publicly available. Notable longstanding examples include the International IBD Genetics Consortium (studying Crohn’s and colitis) and the Psychiatric Genomics Consortium (studying schizophrenia, bipolar disorder, MDD, ADHD and autism), as well as those studying type 2 diabetes, myocardial infarction and many more.
Moreover, these groups have uniformly endorsed sharing of allele site and frequency information from exome and genome sequencing studies in order to create valuable public resources such as the Exome Aggregation Consortium (ExAC), which have revolutionized our ability to interpret genome variation in clinical and research settings. It is theoretically possible to determine that an individual for whom you have independent genetic data is part of such a resource, but the aggregated multi-phenotype nature of such resources means that knowledge of an individual’s membership carries little or no information about their health status (in the case of ExAC, membership implies only being an adult without a rare or serious pediatric disease), and thus poses extremely minimal, if any, risk to participants.
As noted above, despite a commitment to this open sharing of summary site, frequency and association information that in some cases spans a decade, no unusual, untoward or adverse events have arisen from this sharing activity. What has arisen is that these data sets have been widely used in the development of clinical genetic paradigms, the advancement of biomedical research into these specific diseases, the development of foundational advances in statistical and population genetics, and the discovery of fundamental relationships between diseases, or between diseases and quantitative intermediate phenotypes and biomarkers that might serve as valuable surrogate endpoints. In short, enormous benefit to biomedical endeavors and patients has accrued directly because of the free availability of these summary data.
Many other disease consortia and IRBs, however, have not made these data open access, as there is no clear consensus in the field on the issues discussed in this note, nor is it obvious who is responsible for determining the rules by which non-identifiable summary data should be shared. While some groups may be using the uncertainty as a smokescreen to maintain control over summary statistics, this is likely a minority. My observation is that many would like to share but have received specific instruction from IRBs, ethics panels or funding agencies that they should not, just to be on the safe side, even while as many or more have received the green light from the same kinds of bodies at other agencies and institutions. Here, whilst I have steered clear of it in this document, an updated look at the realities of the mathematics of reidentifiability might be in order if there are still groups with genuine concerns about reidentifiability. While the original work in this area dealt with idealized scenarios (recognizing which of a small number of samples from a homogeneous population were present in the summary statistics of a single GWAS), most disease consortia are now sharing composite meta-analysis results compiled from tens of individual studies of differing, if not diverse and/or admixed, ancestry. In such a scenario no human being would reliably match the composite frequencies from such a mixed population. Particularly when the precise underlying ancestry mixtures are not provided, the risk that an individual sample could be traced back to one of the many studies that contributed summary statistics to a global meta-analysis would seem exceedingly low, provided that only the global meta-analysis summary statistics were made open access.
Open sharing of results from large-scale genetics research efforts has become an essential tool for advancing clinical, molecular and pharmaceutical research. My hope is that the genetics research community can thoughtfully work towards a greater consensus on the many issues I have only touched on in this note, and I would propose that our professional societies (such as the ASHG) convene more discourse on this topic to provide non-binding but thoughtful guidance to IRBs and researchers who continue to wrestle with these issues. Were we to move to a more restrictive model in which the largest and most valuable frequency resources and summary association statistics were not open access and freely available to clinicians, academic researchers and industry researchers around the world, progress in biomedicine could be dramatically curtailed. Instead of relying entirely on subjective interpretation of how individual consent applies to group summary statistics, I would urge us also to consider the intent of those individuals consenting to genetic research, and their hopes that their participation contributes to swift progress in biomedicine.
Thanks to many for helpful discussions on this topic over a long period of time and in particular to Joel Hirschhorn, Daniel MacArthur and Ben Neale for critical comments and to Carole Ober for a valuable discussion earlier this year.
Footnote 1 — Technically, it has been formally demonstrated that in some cases, the unique relationship between male Y chromosomes and last names (both inherited reliably along the paternal lineage) might enable the identification of an individual should the Y chromosome sequence and last name each be rare and previously conclusively linked (Gymrek et al. 2013). This scenario relates only to individual level data access and not the summary statistics sharing described herein.