You Say Data, I Say System

In the fall of 2009, I wrote a pair of algorithms to place nearly 3,000 names on the 9/11 memorial in Manhattan. The crux of the problem was to design a layout for the names that allowed for what the memorial designers called ‘meaningful adjacencies’. These were requests made by next-of-kin for their family members to appear on the memorial next to, or as close as possible to, other victims. Siblings, mothers and daughters, business partners, co-workers: these connections represented deep affinities in the real world. There were nearly 1,400 of these adjacencies that a layout of the names would ideally honour.

In December of that year, I flew to New York to meet with some of the project’s stakeholders and to present the results of the algorithms that I’d developed. I came into the meeting disheveled and nervous. Disheveled because I’d flown into La Guardia that morning, having spent much of the plane ride revising and re-revising my presentation. Nervous because I had found out the day before that another team had also been working on the layout problem: a group of financial analysts (‘quants’), almost certainly all of whom had at least one PhD.

It must’ve been a strange sight. A small army of besuited financial professionals, across the table from a long-haired artist from Canada with an old, broken laptop. The quants went first: they’d run permutation after permutation on their server clusters, and they were confident they’d found the optimal solution for the adjacencies: a maximum of about 93 percent of them could be satisfied. They’d asked to speak first because they wanted to ‘save us all some time’, since they knew, mathematically, that they had found the most highly optimized solution.

It was a persuasive argument. I let them finish, then I turned my laptop around on the table to show them a layout that I’d generated about a week before — one that was 99.99% solved.

The lesson here is not ‘don’t get a math PhD’. Nor is it (specifically) ‘hire a long-haired data artist from Canada’. The lesson is to not look just at the data, but at the entire system that the data is a part of. Taking a systems approach to data thinking allows you not only to solve problems more efficiently, but to more deeply understand (and critique) the data machinery that ubiquitously affects our day-to-day lives.

An over-simplified and dangerously reductive diagram of a data system might look like this:

Collection → Computation → Representation

Whenever you look at data, whether as a spreadsheet, a database view or a visualization, you are looking at an artifact of such a system. What this diagram doesn’t capture is the immense branching of choice that happens at each step along the way. As you make each decision (to omit a row of data, to implement a particular database structure, or to use a specific colour palette), you are treading down a path through this wild, tall grass of possibility. It will be tempting to look back and see your trail as the only one that you could have taken, but in reality a slightly divergent you who’d made slightly divergent choices might have ended up somewhere altogether different. To think in data systems is to consider all three of these stages at once, but for now let’s look at them one at a time.

Collection

Any path through a data system starts with collection. As data are artifacts of measurement, they are beholden to the processes by which we measure. This means that by the time you look at your .CSV or your .JSON feed or your Excel graph, it has already been molded by the methodologies, constraints, and omissions of the act of collection and recording.

The most obvious thing that can go wrong at the start of a data system is error, which is rife in data collection. Consider the medical field: A 2012 study of a set of prestigious East Coast hospitals found that only 3% of clocks in hospital devices were set correctly, meaning that any data carrying a timestamp was fundamentally incorrect. In 2013, researchers in India analyzed results from the humbly analogue blood pressure cuff in hospitals and clinics and found the devices carried calibration errors in the range of 10% across the board.

These kinds of measurement errors are pervasive, inside of hospitals and out. Errors may be unintended, the results of mis-calibrated sensors, poorly worded surveys, or uncounted ballots. They can also be deliberate, stemming from purposeful omissions or applications of heavy-handed filters or conveniently beneficial calibrations.

Going further back from how the data is collected, you should also ask why, or why not. Artist and data researcher Mimi Onuoha, whose practice focuses on missing data, tells us that the very decision of what to collect or what not to collect is political. “For every dataset where there’s an impetus for someone not to collect”, she writes, “there’s a group of people who would benefit from its presence”. Onuoha distilled the importance of understanding collection to the understanding of a data system as a whole in her recent talk at the Eyeo Festival in Minneapolis: “If you haven’t considered the collection process”, she stated neatly, “you haven’t considered the data.”

Computation

After collection, data is almost certainly bound to be computed upon. It may be rounded up or down, truncated, filtered, scaled or edited. Very often it’ll be fed into some kind of algorithmic machinery, meant to classify it into meaningful categories, to detect a pattern, or to predict what future data points from the same system might look like. We’ve seen over the last few years that these algorithms can carry tremendous bias and wield alarming amounts of power. But this isn’t another essay about algorithmic bias. There are many other aspects of computation that should be considered when taking the measure of a data system.
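To make those small operations concrete, here is a minimal sketch in Python of how rounding, truncating and filtering can each move the same average. The readings are invented purely for illustration, and the function name summarize and the cutoff of 1.0 are assumptions of mine rather than anything drawn from a real dataset.

```python
import math
from statistics import mean

# Hypothetical sensor readings, invented for illustration only.
readings = [0.6, 1.5, 2.5, 3.7, 4.9]

def summarize(values):
    """Return the mean of the same values after four different treatments."""
    values = list(values)
    return {
        # No processing at all.
        "raw": mean(values),
        # Python's round() sends exact halves to the nearest even integer,
        # so 1.5 and 2.5 both become 2. That is itself a quiet decision.
        "rounded": mean(round(v) for v in values),
        # Truncation simply drops the fractional part.
        "truncated": mean(math.trunc(v) for v in values),
        # An arbitrary (assumed) filter that discards readings below 1.0.
        "filtered": mean(v for v in values if v >= 1.0),
    }

summary = summarize(readings)
for treatment, value in summary.items():
    print(f"{treatment:>9}: {value:.2f}")
```

Four treatments of five numbers give four different averages, and nothing in the final figure reveals which path was taken to reach it.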

In Jacob Harris’s 2015 essay Consider the Boolean, he writes about how seemingly inconsequential coding decisions can have extraordinary impact on the stories our data might ultimately tell. Harris proposes that the harsh true-false logic of computation and the ‘ideal views’ of data that we endeavour to create with code are often insufficient to represent the ‘murky reality the data is trying to describe’. Importantly, he underlines the fact that while computational bias can come from big decisions, it can also come from small ones. While we urgently need to be critical of the way we author machine learning systems, we also need to pay attention to the impact of procedural minutiae, like whether we’re storing a data point as a boolean or a string.
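As a rough illustration of the boolean-or-string question (this is my own sketch, not an example taken from Harris’s essay), consider a survey field stored two ways. The answers below are invented, and the helper names as_boolean and as_category are hypothetical.

```python
from collections import Counter

# Hypothetical survey answers, invented for illustration only.
answers = ["yes", "no", "", "declined", "yes", "unknown"]

def as_boolean(answer):
    """Store the answer as True/False.

    Anything that is not an explicit "yes" silently becomes False,
    so blanks and refusals are counted alongside a real "no".
    """
    return answer == "yes"

def as_category(answer):
    """Store the answer as a string with a separate 'missing' bucket."""
    if answer in ("yes", "no"):
        return answer
    return "missing"

bool_counts = Counter(as_boolean(a) for a in answers)
category_counts = Counter(as_category(a) for a in answers)

print(dict(bool_counts))      # two True, four False
print(dict(category_counts))  # two yes, one no, three missing
```

Stored as a boolean, the field reports four people answering no. Stored as a string, it reports one no and three people about whom we know nothing, which is a rather different story.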

Representation

As you’ve seen, the processes of collection and computation are rampant with decision points, each of which can greatly increase or greatly limit the ways in which our data systems function. When we reach the representation stage, and begin to decide how our data might tell its story to humans, possibility space goes critical. Each time you pick a chart type or colour palette or a line weight or an axis label, you’re trimming the possibility space of communication. Even before that, the choice of a medium for representation has already had a predestinatory effect. A web page, a gatefold print, a bronze parapet — each of these media is embedded with its own special opportunities, and its own unavoidable constraints.

Whatever the medium, many of the points that Mimi Onuoha makes about collection can be mapped directly to visualization: questions about what is shown in a visualization and how it is shown must be paired with questions about what isn’t shown and why someone has chosen not to show it. In a quest to avoid the daunting spectre of bias, data visualization practitioners often style themselves as apolitical. However, the very process of visualization is necessarily a political one; as I’ve said for years to my students at NYU, the true medium of data visualization is not colour or shape; it’s the decision.

By being mindful of the decisions we’re making when we’re authoring visualizations we can make better work; by seeing these decisions in work made by others we can be more usefully critical of the data media that we consume.

The problem with the way that most of us operate within a data system is that we’ve designed our roles such that we can almost never see the whole thing from where we are. Those who are tasked with collecting the data are rarely involved in its representation. Conversely, the visualization professional sits so far away from measurement that most times the nuances of how the data was collected are completely lost. No matter where you might reside on the collect/compute/represent continuum, it will do you service to stand on high ground, to stretch your vision as far as you can towards the opposite edge.

Which brings us back to the memorial, to the algorithm, to all of those names.

Where the quants failed, I think, was in not considering the physicality of the memorial itself. They looked at the data, but not at how the data was to be represented. While their model generalized the problem, considering each name as an equal unit in a simulated system, mine considered each name as a unique unit in a real system. My model rendered each name using the typeface in which the name would ultimately be inscribed in bronze. It considered the half-inch expansion joints between the memorial’s parapets, and how the individual characters in each specific name might allow it to cross that expansion joint neatly. It included the oddly-shaped triangular corners of the memorial along with the long rectangular parapets. While each of these strange characteristics of the physical memorial might seem like a constraint, they also gave my system an elasticity that couldn’t exist in a simplified model.
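To give a feel for the difference between those two ways of modelling a name, here is a deliberately tiny Python sketch. It is not the algorithm used for the memorial, and it is only one guess at what an equal-unit simplification could look like. The placeholder names, the per-character width standing in for real typeface metrics, the gap, and the panel width are all assumptions made up for the example.

```python
# Assumed, illustrative measurements (inches); not the memorial's real values.
PANEL_WIDTH = 15.0
CHAR_WIDTH = 0.5   # crude stand-in for true typeface metrics
GAP = 1.0          # assumed spacing between neighbouring names

# Placeholder names, invented for illustration only.
names = [
    "Al Wu",
    "Bo Li",
    "Cy Ng",
    "Di Ko",
    "Maximiliana Featherstonehaugh",
]

def rendered_width(name):
    """Approximate a name's width from its character count."""
    return len(name) * CHAR_WIDTH

def slots_per_row(panel_width, slot_width):
    """Equal-unit model: every name occupies one identical slot."""
    return int(panel_width // slot_width)

def pack_row(candidates, panel_width):
    """Real-width model: place names left to right until one no longer fits."""
    placed, used = [], 0.0
    for name in candidates:
        needed = rendered_width(name) + (GAP if placed else 0.0)
        if used + needed > panel_width:
            break
        placed.append(name)
        used += needed
    return placed

# A uniform slot has to be wide enough for the longest name.
uniform_slot = max(rendered_width(n) for n in names)
equal_unit_capacity = slots_per_row(PANEL_WIDTH, uniform_slot)
real_row = pack_row(names, PANEL_WIDTH)

print(equal_unit_capacity, len(real_row))
```

Under these made-up numbers the uniform-slot model fits a single name on the panel, while the model that measures each name fits four. The particulars are invented, but the shape of the result is the point: treating every name as its own object opens up room that a generalized model cannot see.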

Since my work on the memorial, I’ve tried to cultivate a data systems approach in all of the work that I’ve done with my many collaborators. I’ve gone deep into systems of collection by building sensor stations on glaciers, narrowly escaping hippo attacks and riding in submersibles to the bottom of the ocean. I’ve built web tools and machine learning systems to compute upon billions of records of web advertising placements to provide means for collective action against discrimination. I’ve explored the outer limits of data representation through sound, sculpture, performance, and participatory practice.

But embracing a data systems approach doesn’t need to be so involved; it can be a simple act of word replacement. The next time you read a story with the word data in the headline, swap it out with data system. When you see a data visualization, think of it instead as a data system visualization. If the government proposes new policies around personal data, think about them instead as policies about people, and the data systems which they inhabit. Widening your thinking in this fashion will also allow you to engage in broader criticism of data systems and those who are authoring and exercising them.

After all, it’s not enough for us to be critical of Uber’s booking algorithm, or FOX News’ most recent infographic. We need to expand our attention to the systems that these mechanisms support; systems in which our participation is often both transparent and involuntary. By taking a systems approach to data I believe we can make better things. And we might also find deeper and more meaningful questions: questions that are as much about how these things work (or don’t work) as why they exist in the first place.

The images in this post are by Harold Fisk, and were made in 1944. They show the historic meanderings of the Mississippi River. They are the result of years of research, and are, as Kyle Hill writes, “a combination of speculation, interpretation, and extrapolation”. Find out more about them here: http://nerdist.com/harold-fisks-incredible-maps-track-the-ghosts-of-the-mississippi/