Transparency in Data Science

On Trusting Machines

--

Dr. David Herman and Dr. J. Galen Buckwalter

As the work of data scientists analyzing big data with machine learning becomes increasingly commonplace in our daily lives, it’s incumbent on us to define what guides our decisions. Along the entire data processing chain, from collection to management to analytics to interpretation, we have a responsibility to define and share our goals and findings with those from whom we collect and analyze information.

The issue here is not solely what guides our decisions about which data to capture, but also, critically, what we do with that data once we have captured and analyzed it.

How and what we predict certainly matters, but our current focus is what we divulge about the data a person generates, how we let them interact with and change it, and how we share that data with others. This is the most critical issue where the consumer is concerned.

The People Behind the Machines

Data scientists are focused on description and prediction: we want to understand people and, from that, predict and provide what they want and need. From a scientific perspective, we want to know people better so we can improve our predictions, and thus improve what we provide to you in terms of products and services. This is why we ask you questions and ask you to give us access to your social media accounts: we want to give you curated, personalized information and services that are consistent with your purpose and goals in life, all as a means of enriching your life.

Imagine us as the bartender at your local bar.

When you walk in, I hope not just to remember your usual drink but, as your data scientist/bartender, to understand the myriad factors that determine what drink may please you on that particular day. Of course, for me to do this, I’d love to know your mood in the context of all you have done that day, as well as what remains on your calendar for the evening. I, your data scientist/bartender, use predictor variables based on the information you provide about your personality, life context and taste preferences, along with your mood, to give you what you want, be it your usual drink or a variation on it.

With enough information about you, such as what you drank over the last year, your current glucose levels, and how that relates to the taste preferences of millions of other people, all analyzed with the appropriate machine learning algorithm, I may be able to give you something entirely new that you might never have thought of on your own. Hopefully, it enhances your day. This is in part how discovery works at this point, with technology at our side. A toy sketch of the idea follows.
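
To make the analogy concrete, here is a minimal, hypothetical sketch in Python of what “predictor variables” feeding a recommendation might look like. The patrons, features and drinks are invented for illustration; a real system would use far richer data and a trained model rather than this simple nearest-neighbor lookup.

```python
# Hypothetical sketch: recommend a drink from predictor variables by looking
# at the favorite drinks of the most similar patrons. All data here is invented.
from collections import Counter
from math import sqrt

# Predictor variables per patron: (openness, sweetness preference, mood today, time of day)
patrons = {
    "alice": ([0.8, 0.2, 0.6, 0.75], "negroni"),
    "bob":   ([0.3, 0.9, 0.4, 0.80], "whiskey sour"),
    "carol": ([0.7, 0.3, 0.7, 0.70], "dry martini"),
    "dave":  ([0.4, 0.8, 0.5, 0.85], "old fashioned"),
}

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recommend(features, k=3):
    """Return the most common favorite drink among the k most similar patrons."""
    nearest = sorted(patrons.values(), key=lambda p: distance(features, p[0]))[:k]
    return Counter(drink for _, drink in nearest).most_common(1)[0][0]

# A new patron whose profile resembles alice and carol more than bob or dave.
print(recommend([0.75, 0.25, 0.65, 0.72]))  # prints "negroni" (ties broken by similarity order)
```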

On Expectations

As we move beyond curated experiences, we begin to anticipate your next need by collecting all manner of data along the way that informs our recommendations, based on what we’re learning about you. In addition, we build insights about relationships within your data, your behaviors, that are not immediately obvious. These insights lead to the curation of new things we predict you will like or that will engage you, adding depth and breadth to your experiences.

From a user’s perspective, we expect our needs to be intuited at this point, whether we realize it or not. An increasingly technically savvy population is no longer impressed that companies can pop up advertisements for products related to the last page we visited; at the very least, we expect to be guided beyond our last data-creating experience toward something new. Nor do we want to run a search as if it were our very first, or look for a restaurant and have to wade through volumes of comments from people who seem to share nothing in common with us. We want our next move to be guided by who we are as individuals and by our past behavior, coupled with recommendations that will actually benefit us.

But it doesn’t stop there. What we will soon come to expect as well are machines that learn about us personally from iterative interaction, getting to know us better as we change.

The Paradox of Transparency

For data scientists to provide people with choices that reflect their core essence as individuals, including their unique characteristics, traits and beliefs, and their goals for life and for the moment, we need more data than the simple click behavior that continues to be the gold standard for most consumer-focused marketing big data efforts. Data is the lifeblood of online science, and with the advent of social networks for pictures, ancestry, family, employment, love, health and more, that data covers more and more areas of each of our lives at a depth never imagined even a few years ago. Data is changing from clicks and likes to deeper behaviors and insights into our lives and psychology.

But here’s the dilemma: users want the accuracy, comprehensiveness and personalization that come with sophisticated analyses of the data they have provided, and in exchange for the products and services that industries build on that data, they want transparency. As data scientists, we want to provide usable results from our analyses, and we want access to every bit of relevant data that exists to better understand the user. Having it both ways is difficult: complete data leads to comprehensiveness, but this requires deep visibility into people’s lives.

If data science is trying to learn what people care about in order to drive engagement through traditional click behavior, then as data scientists we think it’s worth noting that some of your clicks don’t necessarily guide you to better places; in fact, they can lead you down a path you may not want. From impulsive purchases to depressing additions to your Pandora feed, technology introduces temptations, and resisting temptation can be hard.

Algorithms already power hundreds of interactions for each of us every day, through our clicks and other actions on our computers, phones, homes and cars, guiding us toward experiences designed by companies that offer little or no transparency into their motivations. As a result, people often end up sharing more than they realize and may be steered toward behaviors that aren’t in their best interests.

These systems are designed to streamline our lives and improve our environment, and they are creating more data by the second, but there is no agreement, even among data scientists themselves, as to how and when our findings are shared.

We believe that data and information are very powerful tools, belonging to the person whose movements created the information that data scientists use. But obviously, there is ample space for unscrupulous use of all of our information.

Where it Works Well

Some areas of our lives already have rules that attempt to govern the way our data is shared and analyzed. For example, with health data, we already have some privacy protections in place. The Health Insurance Portability and Accountability Act, or HIPAA, requires the protection and confidential handling of consumers’ protected health information. The Consumer Financial Protection Bureau, or CFPB, requires financial institutions to provide an initial privacy notice when they establish a customer relationship. It even goes a step further, mandating that another copy of the privacy notice be sent each year. Though these protections are in place, nothing is guaranteed to be 100% reliable.

It seems unlikely that regulatory systems will be able to keep pace with the speed at which systems capture data containing information about us. Frankly, at this point in the development of data science, things are moving faster than any governing body can possibly move. The flow of data we can aggregate increases exponentially every day, and the behavior that generates it is bound by few restrictions. In this context, how can we ensure personalization for users while protecting everyone’s privacy?

We believe every data scientist accepts the ethical responsibility to treat data as an extension of the person whose behavior created the information. Responsible data scientists ought to, along with physicians and psychologists, vow to do no harm to the person whose data they use. Just as the Hippocratic Oath binds doctors to the best interests of their patients, a data science oath could perform the same function.

The point is transparency: if we are using the data for the purposes clearly outlined in the “Terms and Conditions” of any relationship, there should never be any trouble or concern. But users don’t always know that data is being collected about them, and friction results when people sense that their movements are being analyzed for information and, in the process, their privacy violated.

This is especially relevant as new areas of focus on data emerge daily. If the scientist has found a brand-new use for the data, even one that seems to be without risk and with only benefits for the user, the data continues to be the property of the user and they should be informed of the new use.

While fully empowered and informed consumers are ideal, it’s pragmatically unlikely that we as consumers would or could ever want to be so mentally involved with every use case of our data. Much as we often find new uses for the ideas and things that surround us in the non-digital world, new ways of interpreting and using data emerge just as frequently. Big data regulation would begin to address this, unburdening individuals who would otherwise be constantly confronted with queries about the use of their data for each new application.

If we’re going to accept that data science is at the foundation of many of our institutional systems (and it is, growing in value with each click), we must insist that transparency of process is at its core.

As a result, data scientists are charged with doing double duty: collecting, analyzing and recommending, while communicating their goals and remaining open to the questions of an increasingly digitally engaged community. Through a combination of privacy and transparency, good data scientists are building trust so we can begin to treat machines the way we treat people, and interact with them honestly so they provide truly valuable content for us to work with.

Ultimately, the “solution” to our big data concerns is multifaceted. Data scientists need to act ethically with data, regulatory systems need to continue to grow with the changing technological landscape and consumers need to have explicit rights with regard to how their data is used and shared.

It All Starts With Communication

Educating the public about the valuable role data is playing in their lives is part of the mandate of good data scientists. This requires us to reveal our objectives and connect them to the benefits people receive. As with so many things, good communication is at the heart of what we do.

There is a win-win in this relationship, and it can be demonstrated through the transparency of what we’re doing. When we share our findings, people can see how our work enables their decisions rather than dictating them.

In order for data science to continue to be as effective as possible in the future, it needs to be as valuable to individuals as it is to scientists, so that both sides can be confident it’s accurate and fair. The truth is central to trust, and honesty is at the core of this. If we don’t trust our interactions with the machines around us, both algorithms and robotics will be unable to provide us with the accurate information needed to power these advancements. Without trust, we won’t accept these powerful tools in our lives and both the data and the experiences will be of no use to us.

In order for data to be transparent, we’re going to need to think of data science as more of a dynamic system integrating with users, rather than just gathering information and providing recommendations and assessments. This system will be required to interact with us in much the same way as humans communicate with one another, crossing the spectrum of human communications: incorporating our behaviors, the world we see with our eyes as well as the vocal and written communications we engage in every single day.

Algorithmic systems provide value, and they need to engage us in the same way we engage when we interact with other humans, exchanging ideas and beliefs and creating trust in the system, much as we trust one another. This is what will provide truly valuable experiences, so we can embrace the technological, personalized age we are already in and leverage these machines to help us change our lives for the better, rather than feeling judged by them.

The future need not be the Orwellian judgment of Big Brother; it can be a supportive friend who is actually looking out for us, not trying to ensnare us. We aren’t losing autonomy; we are gaining options, enhanced by people whose aim is to streamline and improve our digital interactions.

--

Galen Buckwalter, PhD
payoff

Algorithmic personality assessment and personalization. Inventor of the "Love" patent. CEO of psyML and contributor to Digital Humanity.