Thinking Outside the Black Box
This week, I gave a talk at the O’Reilly Strata + Hadoop World Conference on the need for greater transparency in data science. Nearly every company we work with draws on vast amounts of data and uses advanced analytics to make sense of it, so this topic is central to much of what we do at Ekistic.
As many people know, the world is increasingly shaped by algorithms: sets of operations that computers perform in response to certain inputs or triggers. These algorithms dictate the rates we are offered on car loans, the ads we see when browsing the internet, which movies are recommended to us on Netflix, and how many police officers are deployed to our neighborhoods. They are more powerful than ever thanks to historic growth in computing power, real-time streaming data, sensors, and more accurate tools for analyzing historical data.
When they work, the impact can be transformative. Take, for example, a recent program at Georgia State University. The school knew that many students, especially those from traditionally underserved groups, were at high risk of dropping out before graduation. To identify those students, Georgia State implemented a Graduation and Progression Success (GPS) Advising Program, tracking over 800 risk factors for each student. Algorithms would then trigger simple, proactive interventions based on those risk factors, such as making sure students were registering for classes that kept them on track to graduate.
The results have been fantastic. Over the past three years, Georgia State’s overall graduation rate has increased by six percentage points, with the biggest gains among traditionally underserved groups. Most impressive is that, for the first time, low-income, first-generation, black, and Latino students graduated at rates at or above those of the student body overall.
This is a great example, but what happens when these algorithms don’t work?
Recent examinations of predictive intervention programs have revealed serious problems that generally fall into one of two categories:
- Bias in input data fed into algorithms
- Discrimination within the algorithm itself
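The first category is worth making concrete. A minimal sketch, using entirely made-up numbers: suppose two groups have the same true rate of some behavior, but the behavior is recorded in the data twice as often for one group (say, because of heavier policing). Any model trained on the recorded data will inherit that skew before a single line of its own logic runs.

```python
import random

random.seed(0)

# Hypothetical illustration (not real data): both groups share the SAME
# true underlying rate, but events in group A are recorded twice as
# often as in group B.
TRUE_RATE = 0.30
RECORD_PROB = {"A": 0.8, "B": 0.4}  # chance an event enters the dataset

def observed_rate(group, n=100_000):
    """Rate of *recorded* events -- what a model would be trained on."""
    recorded = 0
    for _ in range(n):
        occurred = random.random() < TRUE_RATE
        if occurred and random.random() < RECORD_PROB[group]:
            recorded += 1
    return recorded / n

for g in ("A", "B"):
    print(f"group {g}: observed rate = {observed_rate(g):.3f}")
# Identical true rates, yet group A appears roughly twice as risky in
# the data the algorithm actually sees.
```

Nothing in the downstream model needs to be "wrong" for the output to be biased; the damage is done at data collection.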
A great example of these biases in practice can be seen in COMPAS, an algorithm developed by a company called Northpointe Inc. that provides risk assessment scores for recidivism. The scores have come under heavy scrutiny, as a ProPublica analysis found the algorithm was twice as likely to wrongly classify black defendants as future violent criminals compared to white defendants. White defendants were also misclassified as low risk more often than black defendants. Part of the reason for this failure may be the secrecy of the algorithms themselves. Algorithms are highly complex, and even small errors can have outsize impact. Without understanding the assumptions and inputs that make up an algorithm, people (and governments) are unable to evaluate its strengths and weaknesses. This carries huge risk, especially as algorithms are increasingly used to determine public health priorities, criminal sentencing, and more.
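The kind of audit ProPublica performed does not require access to the algorithm's internals, only its predictions and the eventual outcomes: compare error rates across groups. A minimal sketch, with records that are entirely fabricated for illustration (each is `(group, predicted_high_risk, reoffended)`):

```python
# Fabricated example records -- NOT the COMPAS data.
records = [
    ("black", True,  False), ("black", True,  False),
    ("black", False, False), ("black", False, False),
    ("black", True,  True),  ("black", True,  True),
    ("white", True,  False), ("white", False, False),
    ("white", False, False), ("white", False, False),
    ("white", True,  True),  ("white", False, True),
]

def error_rates(records, group):
    rows = [r for r in records if r[0] == group]
    # False positive rate: flagged high risk among those who did NOT reoffend.
    neg = [r for r in rows if not r[2]]
    fpr = sum(r[1] for r in neg) / len(neg)
    # False negative rate: rated low risk among those who DID reoffend.
    pos = [r for r in rows if r[2]]
    fnr = sum(not r[1] for r in pos) / len(pos)
    return fpr, fnr

for g in ("black", "white"):
    fpr, fnr = error_rates(records, g)
    print(f"{g}: false positive rate {fpr:.2f}, false negative rate {fnr:.2f}")
```

In these made-up records the false positive rate is higher for black defendants and the false negative rate higher for white defendants, the same asymmetry ProPublica reported. The point is that simple, transparent metrics like these can surface problems even when the scoring model itself is a black box.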
“Every secret creates a potential failure point. Secrecy, in other words, is a prime cause of brittleness — and therefore something likely to make a system prone to catastrophic collapse. Conversely, openness provides ductility.” — Bruce Schneier (2000)
So, how can we take advantage of the opportunities afforded by algorithmic decision-making, while avoiding the pitfalls?
We believe the path forward requires three things:
- Transparency (open algorithms and data)
- Awareness (interpretable, user-friendly tools)
- Proactive management (clear, transparent metrics)
We think the future will look like CrimeScape, a 2015 research project that provides hourly risk assessments for city-level violence, powered by open algorithms and data. The initial pilot looks at the link between weather and criminal activity, a link that academic research has established but that police departments do not currently account for. The code behind CrimeScape is available on GitHub. This is a transparent approach to smart police deployment that uses public data sets and community input.
Regulation and market forces are driving commercial analytics tools toward openness already, and soon transparency will be seen as a competitive advantage. It’s time to move away from the age of the black box, and toward a better data-driven future. This means showing your work, making sure inputs are well understood by all stakeholders, and soliciting feedback from more than just engineers.