Turning Data Around

Nov 18, 2016

On a September morning in 2013, the students at Hunter College High School filed past security and into the hallways, to find that their school had been labeled the saddest place in Manhattan.

Five months earlier, researchers in Cambridge, Massachusetts had pulled more than six hundred thousand tweets from Twitter’s public API and fed them through sentiment analysis routines. If a tweet contained words that were deemed to be sad (maybe “cry” or “frown” or “miserable”), an emotional mark for sadness would be placed on a map of the city. As more tweets classified as sad happened near a specific area, more marks would be placed on the map, and more sadness would be ascribed to that particular place. The result was a kind of sedimentary layer of emotion on top of New York City. If you looked at the map that came out of the study, you’d probably ask about a deep purple spot of lament just to the right of Central Park’s reservoir; if you looked that spot up in Google Maps, you’d find that it sat right on top of HCHS. If you then thought about when the tweets had been collected (in the spring), you might arrive at a hypothesis similar to that of the New England Complex Systems Institute’s president, Yaneer Bar-Yam:

“I checked the high school calendar and found that the spring vacation period in 2012 was April 9–13, so that students would be returning to school on the 16th, just during the period of the data collection, April 13–26, 2012. This provided a rationale for the low sentiment there.”

Sad students returning from spring break: it seemed to Bar-Yam like an interesting finding. The press agreed. Stories about the study, and about this sad Upper East Side school, appeared in Nature, then in The New York Times. In advance of their story, the city’s newspaper of record dispatched a reporter to talk to the students at this “Saddest Spot in Manhattan”, to gauge their reaction to the study.

“I mean, I can see why it could make sense,” one fourteen-year-old student told the reporter. “The school has no windows, so being inside can seem dark and depressing. And some kids do get stressed out from the workload.”

Unwittingly, the staff and students at Hunter College High had found themselves inside a microcosm of the “Big Data” world that we’ve all been made to inhabit. It’s a world in which we are all being data-fied from a distance, our movements and conversations processed into product recommendations and sociology papers and watch lists, where the average citizen doesn’t know the role they are playing, surrounded by machinery built by and for others.

It’s a world that flows in one direction: data comes from us, but it rarely returns to us. The systems that we’ve created are designed to be unidirectional: data is gathered from people, it’s processed by an assembly line of algorithmic machinery, and spit out to an audience of different people — surveillors and investors and academics and data scientists. Data is not collected for high school students, but for people who want to know how high school students feel. This new data reality is from us, but it isn’t for us.

So how can we turn data around? How can we build new data systems that start as two-way streets, and consider the individuals from whom the data comes as first-class citizens?

1. Design data systems for the well-being of the people from whom the data is taken

Of all of Big Data’s oversold promises, perhaps the most dangerous is that it is passive. It’s very easy to get carried away with the long-distance magic of APIs and machine learning, to use these technologies to scan from afar. But our scanning isn’t harmless; our analyses not without effect. Those high school students that you’ve classified as sad feel something when they read the results of your study.

One thing that we can do to turn data around is to start our work by saying these three words aloud: consider the humans. Every part of a data system (the mechanisms for collection, the storage and parsing machinery, and the modes of representation) should be designed and built with two central questions in mind: How might my work benefit the people from whom the data came? How could my work harm those same people?

We can direct these questions toward the methods by which data is gathered and computed upon. Do I have permission to collect the data, and to use it in the way that I intend? It’s true that Twitter’s blanket user agreement grants you and me and anyone else the right to read a person’s tweets, but it’s an entirely different ethical act to label a high school student with an emotion, then to publish this in a public forum. Perhaps terms-of-use agreements need to cover not only how much data you can download, but also the ways in which the data can be used and presented. Until then, though, it’s up to us as authors and architects of data systems to be critical about what we’re collecting and what we are doing with it.

We can also point the same questions about benefit and harm at representation. Data visualizations might seem inert, but there are many ways they can cause harm. A visualization might bring unwanted attention to a person or a group (or a high school). A map can trivialize human experience, by reducing a life to a dot or a vector. Representations of violent or tragic events can be traumatic to people that were directly or indirectly involved. We’re well trained to be aghast at a truncated Y-axis or an un-squared circle; we need to expand our criticality to include the possible social impacts of “well-made” visualizations.

2. Wherever possible, provide mechanisms for feedback

When the celebrated urbanist Jane Jacobs was considering the sad state of cities in America in the late 1950s, she realized that there weren’t many working mechanisms for feedback. She saw that the processes of the modern city could and did run out of control, because the outputs of these processes were not tightly tied to the inputs. In particular she recognized that financial success tended to reduce neighbourhood diversity, and that there was no mechanism in the city to control this decline. Neighbourhoods followed a downward diversity spiral until they were bleak and unproductive shadows of their former selves. Data systems that don’t directly engage with the individuals and communities from whom the data came carry similar risks: without mechanisms for correction they will tend to careen off in destructive directions.

We might start by asking: who has agency in the systems that we are creating? As authors, we certainly do: we choose from whom we collect data, which pieces of information we collect and exclude, what algorithms we use to process the data, and how it is represented. Are any of these instruments for control offered to the people who reside inside the structures that we design?

I’ve often said that the true medium of the data artist, or the data visualizer, or the data analyst, is the decision. Each decision that we make — which colour to use or which rows to exclude or which sentiment analysis algorithm to choose — changes the work fundamentally. Each time we make a decision we place our project into a completely different possibility space, changing the way that it functions and the way that it can be received.

In The Death and Life of Great American Cities, Jacobs not only pointed at the need for feedback, she wondered how it would manifest. “What can we do with cities,” Jacobs asked, “to make up for this omission?” With data systems, where agency is so closely tied to the choice, we can provide mechanisms for feedback by giving people the power to make decisions. In granting this ability to decide, we implicitly tell the people living in data that they are active citizens.

3. Honor the complexity of individual and community realities

In 1947 John Kirtland Wright, a librarian, gave the presidential address to the American Geography Society. In his speech, he coined a term meant to act as a counter to the very word geography, and what it had come to represent, both in the field and to the general public. Where geography was compounded from the Greek words for “earth” and “description”, Wright’s new word, geosophy, translated to “earth knowledge”. This new field, he proposed, would be focused “on the results that knowledge produces on the face of the earth, rather than on the geographical nature of knowledge itself.” He went on to say that geosophy would move past scientific geographic knowledge to consider “the geographical ideas, both true and false, of all manner of people — not only geographers, but farmers and fishermen, business executives and poets, novelists and painters, Bedouins and Hottentots.” While the information of geography came from objectivity and precision, the wisdom of geosophy would stem from subjectivity, and deep consideration of lived experience.

It’s true that terms like “subjectivity” and “lived experience” seem in opposition to our accepted data philosophies. As the makers of data systems, our deeply entrenched Tuftean ideals put us tightly in line with the geographers. We seek the truth through precise measurement and steady denial of biases. The conservatism of this approach, though, leaves us focused on a few selected ways of speaking to a very particular audience.

As the builders of data systems, I believe we have been deeply lacking in imagination. As Wright might say, we’ve been “thickly encrusted in the prosaic,” too busy exploring what we can do with the tools in our hands to think about what others may do with those same tools. Or to consider what kinds of tools others might create were they given the means. Our lack of imagination has made it hard to envision how others might live in the systems we’re creating. By adopting a “datosophy” approach, and embracing subjectivity, we might find that data becomes a tool not only for reduction and generalization, but for empathy and understanding.

4. Create real, functioning data publics

Of the various adjectives that we’ve tacked onto the word data, one of the most common is “public”. If we’ve become more aware in recent years of the pervasive dark side of data systems, public data is often held up as the light. We even hear it offered to us as a reasonable reward for all of the sticky, painful, and downright destructive effects of the big data constructs we’re made to inhabit. And yet, most public data is not public, in any real way.

Think first about the White House, which is public, but surrounded by a tall fence and patrolled by armed guards. You might get to go inside on a brief tour, but unless you’re extremely lucky or the president, they probably won’t let you wander around. Now consider a library, which is public because it is free and accessible to anyone. There’s a wheelchair ramp by the front door, there’s a TTY line to talk to a librarian, there’s a school group walking in the front door. Placed on this library → White House axis, I’d argue that almost every public data project we’ve been building has a tall iron gate in lieu of an open door; a security guard standing in the place of a librarian.

Our classic fora for data (the research report, the textbook, the data blog, the policy paper) are by their nature exclusive. To manifest real data publics we need to place data in real, functioning public places. When is the last time you saw a bar graph in a park? Read a scatter plot in a public square? Listened to a sonification in a museum?

By bringing our work out into shared spaces, we can force data into a new public role. We can also drag our work past the tech elite that we are used to as an audience, to put it in front of what John Kirtland Wright would have called “all manner of people.”

On an unseasonably warm November morning last year, I stood in the lobby of Hunter College High School, waiting to speak with Lisa Siegmann, the school’s assistant principal. I was interested in how the New York Times article had been received by students, and what the aftermath (if any) was of being so publicly labeled as the saddest spot in the city.

Actually, I wanted her reaction to the school being labeled as sad and then unlabeled three weeks later. As it turned out, the researchers from New England had made a big mistake. Their geocoding code, the part that turns a place name or an address into a point on a map, was faulty. Hunter High was not the saddest place in the data set; it was merely sad-adjacent, unlucky to be located near a single Twitter account that had been posting a lot of content that was getting labeled as unhappy. If that wasn’t bad enough, the scientists had missed a deeper error in their “sad high school” hypothesis that left the premise completely indefensible.

The assistant principal was running late, so I stood by the security desk with a small group of parents and waited. I couldn’t help but look at the students that walked by and try to assess their emotional state. When Siegmann arrived she led me upstairs, past a small lineup of students waiting to see her, and into her office. Siegmann seemed wary of dredging the article up again, but she was candidly direct about how she and the school had reacted.

“Nobody believed the article,” she said. “First, this is not a sad school,” she added, taking a minute to explain the various activities that the students participate in and the awards that the school and the students had won.

“Second,” she said, “no one in this school uses Twitter.”

As it turns out, HCHS doesn’t permit students to use social media while they’re in the building. The administration knows that students are breaking the rules, and they also know which social platforms they are surreptitiously using: Snapchat, Facebook, and Instagram. But not Twitter. Twitter, as Siegmann explained to me, “is for old people.”

The Hunter College High fiasco is a perfect example of how data systems can fail end-to-end: a retracted story about a false premise, fed by a faulty algorithm, feeding on bad data. All balanced on top of an impossible premise. It’s also a reflection of the kinds of big data stories that we’re so eager to believe: where a large data set combined with novel algorithms shows us some secret that we would not otherwise have seen.

What brought me to the principal’s office that morning, though, was not a wish to find another way to critique a study or to place blame on the researchers or their broken algorithms. I was there because I can remember so clearly what it was like to be in high school; to be vulnerable and afraid and powerless. How being labeled could feel like being struck. How nothing seemed to be under my control, and no one seemed to hear my voice. I can also remember how I found agency in the face of all of that overwhelming possibility through, of all things, computer programming. How could it be that the very same thing that offered me escape thirty years ago was now being used to make the lives of high school students worse?

On the subway home, I resolved to double down on The Office for Creative Research’s data humanist mission. This year we’ve released a citizen science tool that allows 10,000 chronic pain sufferers to make and share hypotheses about pain and weather correlations. We built a sixty-foot-long walk-through histogram of public data in front of Manchester’s town hall. Next year we’re releasing a new version of Floodwatch, a tool that allows people to monitor their exposure to web advertising, and to donate their data to researchers investigating discriminatory practices. In February, we’re taking over an abandoned school in one of St. Louis’s most poverty-stricken and racially divided neighborhoods to make the Map Room, a community space for map-making, data exploration and dialogue.

We’re doing all of this because we believe there is a better way forward with data. To find it, we’ll have to leave the utilitarian rhetoric of “Big Data” behind, and replace it with Human Data. We’ll need to deconstruct the systems we’ve created and rebuild them so that they no longer flow downstream from people and communities, but upstream towards them. In doing so, we can help to author a new data world that is liveable for everyone.

This essay will be published in The Office For Creative Research’s quasi-annual journal, which is available for pre-order now.

Memo (random)

Missives from the many-folded boundaries between data, art and culture.


Written by


Jer Thorp is an artist, writer & teacher. He is Innovator-in-Residence at the Library of Congress. His book Living in Data will be published in 2020 by MCDxFSG.
