An Ethical Outlook of Big Data, Privacy, and Genetics

Rafe Batchelor
Computers and Society @ Bucknell
12 min read · May 1, 2020

— Rafe Batchelor and Jeff Lee

The progression of technology is rooted in an understanding of the fundamental concepts behind each development. In other words, to progress, we must first develop a deep, specific understanding of some field. This knowledge, of course, is the product of experimentation: countless hours studying the conditions of some system to gather information, or data, about its underlying mechanics. Yet we’ve now reached a point in technological progression where who interprets the data, and who possesses that high level of understanding, is shifting. It is no longer a question of who possesses this information, but what.

The name “information age” alone should convey one idea about our society: that more information exists today than at any other time. This should certainly be the case given the advancements in computational efficiency, ability, and storage throughout the past few decades. In general, we’re aware that this information exists and may be used for good. Yet the ramifications of moving forward into an era guided by the collection and processing of massive amounts of data are still unknown. It is our duty now to identify where these concerns will emerge and to be preemptive in addressing them. What processes and interprets the data is no longer us, but the computers we own; we are merely here to collect the data and make sense of the results.

Ethics of Big Data in Genetics

Within the past ten years, we’ve witnessed significant strides in the processing of massive amounts of data. Developments in computer hardware and mathematics have allowed machine learning algorithms to be applied in almost every field that exists. In particular, data science techniques have found a prominent niche in genetic analysis and engineering. Given the magnitude of data that exists in even the simplest creature’s genome (in the form of nucleotide base pairs, written with the letters A, T, C, and G), human processing of these sequences, which can run to billions of base pairs, is evidently impossible without computational assistance. We depend on the rapid computational ability of today’s hardware to make sense of the risk factors, malignancies, and diseases that exist in genomic datasets. It is truly an incredible time to witness the growth of the medical field in this regard; yet the dangers of a “full steam ahead” attitude cannot be ignored.

Source: genome.gov
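To put the scale described above in rough numbers, here is a minimal back-of-the-envelope sketch in Python. It assumes about 3.2 billion base pairs for one copy of the human genome (an approximate, commonly cited value that is our assumption, not a number from the sources cited here) and 2 bits per base, since there are only four possible letters. The function names are hypothetical and purely illustrative.

```python
# Back-of-the-envelope sketch: how much raw data is one genome?
# Assumption: ~3.2 billion base pairs for one copy of the human genome
# (an approximate, commonly cited figure; not taken from the article).
HUMAN_GENOME_BASE_PAIRS = 3_200_000_000

BITS_PER_BASE = 2  # four possible bases (A, T, C, G) fit in 2 bits


def genome_size_bytes(base_pairs: int) -> int:
    """Minimum bytes needed to store `base_pairs` bases at 2 bits each."""
    total_bits = base_pairs * BITS_PER_BASE
    return (total_bits + 7) // 8  # round up to whole bytes


def years_to_read(base_pairs: int, bases_per_second: float = 1.0) -> float:
    """Years a person would need to read every base at a steady pace."""
    seconds_per_year = 365 * 24 * 60 * 60
    return base_pairs / bases_per_second / seconds_per_year


if __name__ == "__main__":
    size_mb = genome_size_bytes(HUMAN_GENOME_BASE_PAIRS) / 1_000_000
    years = years_to_read(HUMAN_GENOME_BASE_PAIRS)
    print(f"Packed size: {size_mb:.0f} MB")
    print(f"Reading time at 1 base per second: {years:.0f} years")
```

Under these assumptions, a single packed genome takes less than a gigabyte, which is trivial for a computer, yet reading it at one base per second would take a person roughly a century. Multiply that by the thousands of patients in a research database and the dependence on hardware becomes clear.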

You may be curious about exactly who is behind the research, handling, and investigation of genetic data at such a large scale. You might also be curious whether these developments could directly affect you. For this we briefly turn our attention to Boston, Massachusetts, for a few examples. Boston is home to many of the world’s top medical facilities and research centers, so these examples highlight the reality of health research today at a local scale. We hope they convey just how prevalent genetic information is in today’s world.

Boston hospitals that actively incorporate genetic testing and information into their studies include Massachusetts General, Brigham & Women’s (BWH), and Boston Children’s (BCH); these institutions regularly contribute to the advancement and innovation of genomics, making Boston a prime example of risks and rewards that come from genetic data collection. Two examples of recent institutional developments include:

  • Boston Children’s Hospital’s “BCH Connect,” founded in 2016, acts as the operating system behind the hospital’s predictive genomics platform. BCH has built a database of over 7,000 patients that continues to grow; it increasingly contains data correlated with epilepsy and inflammatory bowel disease, two conditions that have recently been found to be of genetic origin [1].
  • Brigham & Women’s Hospital’s Preventive Genomics Clinic, founded in August 2019. Dr. Robert C. Green, director of the new clinic, said, “…For over two decades, our NIH-funded, randomized trials in translational genomics have generated consistent evidence that there are more potential medical benefits and fewer risks in genetic medicine than previously considered. It is time for this technology to be offered in a clinical context, under the care of genetics experts, to individuals who wish to be proactive about their health” [2].

Returning from this local scope to the big picture: the use of massive amounts of individual-specific genetic information raises ethical concerns regarding the distribution, fair use, and protection of this sensitive data. We are aware that health institutions throughout the country and the world seek to improve healthcare for everyone; we often regard the researchers behind this progress as moral actors, and we must continue to extend that trust as we move into an even more data-driven society. However, the practices of institutions that handle data at such a large scale must be continuously reviewed and questioned, as the cost of this genetic data falling into the wrong hands can be overwhelming.

Reason for Concern

These risks stem from just how lucrative genetic information can be. As previously mentioned, the genetic information studied by institutions is individual-specific. The traits, conditions, and diseases of an individual can often be identified from just a fraction of their genetic information. Although it may be initially unclear who might want this information, the demand certainly exists. Consider, for example, insurance companies: leaked or stolen genetic data could prove extremely lucrative if sold to them. The data could potentially be used for genetic discrimination, which includes denying mortgages and loans or raising insurance costs based on discovered genetic conditions. The money insurers would save this way would be significant enough to warrant a high price for the data, driving demand ever higher. We could easily turn our gaze towards much darker examples of genetic discrimination in the future; there is certainly reason to worry. However, it is best to maintain focus on where we are now so that policy is in place for our protection when these times come [3].

Let’s consider some of the recent cases in which massive amounts of genetic information were leaked or stolen:

  1. In June 2018, DNA testing service MyHeritage was notified by an independent security researcher that a file entitled “myheritage” containing email addresses and hashed passwords had been discovered on a private server outside of the company. Upon investigation, MyHeritage confirmed that the file held the credentials of over 92 million accounts, putting individuals’ personal accounts and information at risk [4].
  2. In July 2019, it was revealed that DNA testing service Vitagene Inc. had left over 3,000 customer health reports exposed for several years on servers hosted by Amazon Web Services. The data was very revealing, linking individuals’ names and birth dates to their likelihood of developing different medical conditions based on genetic sequence data. The breach was one of the most severe to date, with analysts noting the significance of the genetic information jeopardized. One analyst remarked, “This is the first time I’ve heard that genetic data is implicated, which raises a host of privacy issues for the individuals” [5].
  3. In August 2019, Massachusetts General Hospital fell victim to what was most likely a targeted phishing attack. In the breach, the data of over 9,900 patients in the Department of Neurology were exposed. This data has been speculatively linked to later cases of identity theft and insurance discrimination [6].

These are just three examples of the many large healthcare data leaks we’ve seen in the past two years alone. The figures below give an idea of the scale of these breaches from a broader viewpoint.

This first figure depicts the annual number of data breaches in the United States from 2005 to 2019, alongside the number of records exposed (in millions). It should be immediately apparent that this is a rapidly growing issue, especially in the past seven years. We notice a near-exponential increase in the number of data breaches from 2012 to 2017, paired with a less severe but still increasing number of records exposed.

Source: Statista [7]

This second figure gives a more detailed view of the growth shown in the figure above. The table contains the number of United States data breaches by industry from 2013 to 2019. Notice the growth of breaches in the medical field over time; from this industry-specific view, we see that the medical field has the second-highest number of data breaches overall and is likely soon to have the highest. Additionally, growth within the medical field is almost linear until a spike in 2019. Until more data for 2020 is released, we can only hypothesize whether this spike marks the start of the continued rise that exponential growth would predict. What we can say with certainty is that an overall trend is present: data leakage in the medical field is a growing issue.

Source: Statista [8]
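The question of whether a series of yearly counts looks more linear or more exponential can be checked with a simple comparison of fits. The following is a minimal sketch in Python under stated assumptions: the function names are hypothetical, the exponential model is fitted as a straight line through the logarithm of the counts (which requires every count to be positive), and the series used here are synthetic, generated from known formulas. They are not the Statista figures.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sse(ys, predictions):
    """Sum of squared errors between observed and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions))


def compare_growth(xs, ys):
    """Return 'linear' or 'exponential', whichever model has lower error."""
    a, b = fit_line(xs, ys)
    linear_pred = [a + b * x for x in xs]
    # Exponential model y = c * exp(k*x), fitted as a line through log(y).
    log_c, k = fit_line(xs, [math.log(y) for y in ys])
    exp_pred = [math.exp(log_c + k * x) for x in xs]
    if sse(ys, linear_pred) <= sse(ys, exp_pred):
        return "linear"
    return "exponential"


if __name__ == "__main__":
    years = list(range(7))  # year offsets, e.g. 0 for 2013 through 6 for 2019
    # Synthetic series only: these are NOT real breach counts.
    steady = [100 + 50 * t for t in years]
    compounding = [100 * 1.5 ** t for t in years]
    print(compare_growth(years, steady))
    print(compare_growth(years, compounding))
```

With only seven yearly points, a comparison like this is suggestive at best; a single spike such as the one in 2019 can tip the result either way, which is why more years of data are needed before drawing a conclusion.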

Let’s recap what we’ve discussed so far. We’ve seen that there is a market for stolen genetic data and who its stakeholders are: the healthcare institutions that handle the data, those who steal and sell it, and those who buy it. We’ve seen that the downstream effects of genetic data leakage can be significant and harmful to citizens. And we’ve seen that genetic data leaks occur often enough, and at a large enough scale, to warrant thought about potential policies for protection.

Managing Risk

Now let’s discuss what has already been done to manage the risks inherently associated with genetic information, and what must be done as we move forward into an even more data-driven world.

With some foresight into how data would shape the world, policy makers in the United States have enacted a few major laws to protect citizens from the downstream effects of healthcare data mishandling. Such policies include the Genetic Information Nondiscrimination Act of 2008 (GINA) and the Health Insurance Portability and Accountability Act of 1996 (HIPAA).

While both of these policies were enacted to protect the privacy of citizens’ healthcare information, they do so in different ways:

  • HIPAA takes the more general approach, encompassing policies for maintaining the privacy and security of individually identifiable health information. This covers essentially any health information associated with an individual; you might recognize this act from the HIPAA forms you sign when seeing a new doctor. At its core, the act intends to ensure that your healthcare information is seen only by those you have authorized [9].
  • GINA, a more modern take on protecting citizens from the downstream effects of genetic data leakage, comprises federal protections against genetic discrimination in health insurance and employment. We’ve seen that a major source of demand for healthcare data lies with insurance companies, which could factor it into their risk analysis and decision making; this act intends to protect every individual’s access to healthcare, coverage, and employment regardless of their genetic composition and potential for disease [10].

While these acts certainly target a few of the downstream effects of compromised healthcare data, we should be well aware that genetic discrimination does exist and will continue to exist until more measures are put into place; where demand exists, supply will follow. Thus, while these acts (GINA in particular) seek to mitigate the downstream risks of stolen data, we suggest that more measures be taken to protect citizens at the upstream level.

To that end, we must move forward with novel ideas to manage these risks. We believe that these should come in the form of new laws and policies that heighten cybersecurity and strengthen the protocols for institutions that maintain massive genetic information databases, such as the genetic testing companies at risk of exposing millions of genetic records. The need for such a detailed framework is apparent, as is the need for consequences when institutions leave their trusting customers unprotected. In summary, we propose three solutions for policy makers and private institutions to consider:

  1. General-use cybersecurity frameworks for genetic databases that are regularly tested for potential holes. Several accepted frameworks should exist, each protecting massive amounts of genetic data in a different fashion, so that the system is not “one hack breaks all.” These frameworks must be supplied by government-contracted companies and strive to maintain the highest degree of protection.
  2. Heightened consequences for handlers of genetic data after a leak occurs. These could include required settlements for customers and heavy fines.
  3. Heightened consequences and tracking for hackers of genetic data. While cybersecurity laws certainly exist already to punish those who illegally breach private information, we believe that the effort put into discovering and punishing those who exploit leaks must be increased.

From this discussion of what’s in place, as well as how we believe the U.S. should move forward, we can extract best-, likely-, and worst-case scenarios for the future of genetic information handling.

  • Best-case: we exist in a world of incredible healthcare, a time of overwhelming discovery, and very few cases in which the identity of individuals is harmed by data loss. We exist in a society free from angst concerning the protection of our data as it lies in the hands of unknown individuals; we are confident in its maintenance due to the low numbers of reported breaches. Policy makers have been highly preemptive in their thinking.
  • Likely-case: we exist in a world of tolerance; we tolerate genetic data losses and manage them as they occur, much like the current approach, even though today’s number of exposures is relatively low compared to what we predict for the future. For this to occur, some preemptive action must still be taken.
  • Worst-case: attempts to remedy the issue through policies and punishments are made too late; the markets for data are too large and lucrative, and we sink into a seemingly unmanageable war against data distributors. A significant portion of individuals are discriminated against due to their genetic composition, and basic housing, loan, employment, and insurance rights are no longer guaranteed. Recovery would demand far more effort to level the playing field and restore equilibrium, and whenever a need for strong government action arises, things often become highly volatile.

The key to this discussion is the idea of preemption. We know the current state of big data, technology, and privacy; we can infer what is to come. We must act upon these inferences now, even if these actions place restrictions on potentially beneficial aspects of our current state of data management, simply for the sake of the greater good. While the promise of big data in genetics is overwhelming, there are certainly ethical dilemmas that must be at the forefront of our minds moving forward. The privacy and freedom of all individuals are at the root of our culture, life, and country. As this topic presents a danger to these freedoms, action must be taken sooner rather than later.

We live in a time of significant change and progression. Thinking ahead, just slightly, can protect the safety and health of millions.

Resources

  1. Boston Children’s Hospital accelerates genomic sequencing to expand existing genomic database. (2018, August 7). Retrieved April 23, 2020, from https://www.eurekalert.org/pub_releases/2018-08/bch-bch080618.php
  2. New Preventive Genomics Clinic Launches at the Brigham. (2019, August 19). Retrieved April 23, 2020, from https://www.brighamandwomens.org/about-bwh/newsroom/press-releases-detail?id=3410
  3. Chen, A. (2018, June 6). Why a DNA data breach is much worse than a credit card leak. Retrieved March 29, 2020, from https://www.theverge.com/2018/6/6/17435166/myheritage-dna-breach-genetic-privacy-bioethics
  4. MyHeritage Statement About a Cybersecurity Incident. (2018, August 8). Retrieved March 29, 2020, from https://blog.myheritage.com/2018/06/myheritage-statement-about-a-cybersecurity-incident/
  5. Grant, N. (2019, July 9). DNA testing service Vitagene exposed thousands of customer records online for years. Retrieved March 29, 2020, from https://www.latimes.com/business/la-fi-vitagene-dna-privacy-exposed-20190709-story.html
  6. Landi, H. (2019, August 23). Massachusetts General Hospital privacy breach exposed 10,000 patients’ records, genetic information. Retrieved March 29, 2020, from https://www.fiercehealthcare.com/tech/massachusetts-general-privacy-breach-exposed-10-000-patients-records-genetic-information
  7. Clement, J. (2020, March 10). U.S. data breaches and exposed records 2019. Retrieved April 6, 2020, from https://www.statista.com/statistics/273550/data-breaches-recorded-in-the-united-states-by-number-of-breaches-and-records-exposed/
  8. Clement, J. (2020, March 10). U.S. data breaches by industry 2019. Retrieved April 6, 2020, from https://www.statista.com/statistics/273572/number-of-data-breaches-in-the-united-states-by-business/
  9. Health Insurance Portability and Accountability Act, Pub. L. No. 104-191 (1996)
  10. Genetic Information Nondiscrimination Act, S. 358, 110th Cong. (2007)
  11. Coller, Barry S. “Ethics of Human Genome Editing.” Annual Review of Medicine, Vol. 70, Jan. 2019, pp. 289–305, https://doi.org/10.1146/annurev-med-112717-094629
  12. Donaldson, Molla S., and Kathleen N. Lohr. Health Data in the Information Age: Use, Disclosure, and Privacy. National Academy Press, 1994.
