Know Your Enemy, or the Ethical Aspects of Sensitive Data Collection
I recently started a Data Science Immersive course with General Assembly in New York and have been exposed to limitless streams of new information on a daily basis. One of the many remarkable lessons was a lecture on ethical issues in data science given by Wes Bosse (General Assembly Local Lead in Los Angeles). The ethics of data collection and processing are quite complex: the issues have emerged rapidly and encompass a full spectrum of conflicting opinions.
Some aspects of data collection make many people raise their eyebrows even though they are incredibly common, such as the collection of birth dates or gender. I cannot stop thinking about a specific example Wes presented. The COMPAS recidivism prediction algorithm (source: https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm) used demographic data to predict repeat offenses as part of routine criminal court procedures for assessing ongoing cases. It turned out the algorithm was seriously racially biased: it assigned Black defendants much higher “predicted” recidivism rates while giving white defendants much lower ones.
The students were asked whether or not this algorithm was ethical. Overall, the audience was significantly skewed towards deeming it unethical. Wes followed up with hypothetical questions about whether including additional types of demographic information (e.g. zip codes) would make the use of COMPAS more ethical; in the audience’s view, it did not. I was quite puzzled by a number of opinions explicitly stating that the mere collection of demographic information is itself unethical and that this type of information should not be amassed at all. While I can understand the need to be cautious, this point of view seems a bit short-sighted to me.
I recently came across a brilliant article by an Oxford University researcher named Reuben Binns (http://theconversation.com/its-not-big-data-that-discriminates-its-the-people-that-use-it-55591), whose main point is that (big) data cannot discriminate against anyone by itself; it is the people and entities who misuse it that do. He wrote so elegantly that I cannot resist the temptation to cite some of his conclusions: “So the argument that sensitive attributes should be stripped from the datasets we use to train predictive models is too simple. Of course, collecting sensitive data should be carefully regulated because it can easily be misused. But misuse is not inevitable, and in some cases, collecting sensitive attributes could prove absolutely essential in uncovering, predicting, and correcting unjust discrimination.” In other words, we cannot blame a gun for the actions of the person firing it. Moreover, paradoxically, a gun can be a valid peacekeeping tool.
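To make Binns’ point a bit more concrete: the disparity ProPublica reported is only measurable if the sensitive attribute was recorded in the first place. Below is a minimal sketch in Python, using a tiny made-up table (the column names and numbers are my own illustrative assumptions, not COMPAS data), that computes a risk model’s false positive rate separately for each racial group.

```python
import pandas as pd

# Tiny, purely illustrative dataset: each row is a defendant with the model's
# "high risk" flag, the observed outcome (reoffended within two years or not),
# and the sensitive attribute whose collection is being debated.
df = pd.DataFrame({
    "race":       ["black", "black", "black", "black", "white", "white", "white", "white"],
    "high_risk":  [1, 1, 1, 0, 0, 0, 1, 0],   # the model's prediction
    "reoffended": [0, 1, 0, 0, 0, 1, 1, 0],   # what actually happened
})

# False positive rate per group: among people who did NOT reoffend,
# what fraction did the model still label "high risk"?
did_not_reoffend = df[df["reoffended"] == 0]
fpr_by_race = did_not_reoffend.groupby("race")["high_risk"].mean()
print(fpr_by_race)
```

If the two rates diverge sharply, the model is treating comparable people differently, which is exactly the kind of disparity ProPublica documented; remove the race column and this audit becomes impossible.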
I would also like to offer another example of racial data use, this time from my own experience. I know someone who was diagnosed, almost by accident, with a rare form of extremely malignant cancer. The prognosis was a life expectancy of about two months if left untreated. He decided to start an ultra-aggressive chemotherapy regimen, as it was his only chance. After the first round of treatment, he was given the chilling verdict that the treatment had not worked. Unbeknownst to him at the time, it was only his doctor’s curiosity that eventually saved his life. It turned out his response to one of the main therapy components was atypical and very similar to a reaction frequently observed in patients of Hispanic descent. A simple switch to an analog of a drug he was already taking was all it took to get him back onto a long and very troubled recovery path. Without a database containing quite sensitive information and his doctor’s ability to access it, this guy would likely not have survived.
This guy was me.