Dataclysm — Or: How a Social Scientist Learned to Stop Worrying and Love the Quant


In Charles and Ray Eames’ iconic 1977 film Powers of Ten, the viewer traverses space by orders of magnitude, opening with a couple on a neat, rectangular picnic blanket and zooming out steadily by powers of 10 until we reach the fuzzy, dark depths of outer space. For many, this elusive, nebulous place is where Big Data resides: it’s out there, but it remains amorphous and unclear.

Christian Rudder’s Dataclysm: Big Data and The Stories It Tells aims to dispel this intergalactic mystery, putting Big Data squarely in the focus of the public telescope. Written by one of OkCupid’s founders, the book takes the reader through the dating site’s seemingly inexhaustible data (and incorporates further information from Google, Facebook, Twitter and other tech giants) to unveil hidden stories embedded in huge data sets. It shows that Big Data is not antithetical to more “humanistic” qualitative research but can, in fact, be an ally in understanding the deepest of human thoughts and emotions. But as powerful and insightful as Big Data may be, its reductive modus operandi leaves many questions unanswered, questions that can be resolved when Big Data is coupled with qualitative research.

For example, Rudder discusses how researchers using Google Trends found that searches for gay porn were consistent across the United States, hovering around 5% of all porn searches. While some individuals may be reluctant to discuss their sexual orientation with a pollster, Big Data can provide insights that contradict stereotypes, like the assumption that most gay people live in big cities. It also points to intolerance in states like North Dakota, where less than 2% of the population self-identifies as gay despite search patterns in line with the rest of the country. Here Big Data serves to give us a more accurate sense of our nation’s population, circumventing any social stigma that may alter poll, survey or interview responses. The implications for research, public policy, and marketing are monumental.

But what is the lived experience of a gay person in North Dakota? From the depths of Big Data space you can miss the nuances of Earth. Big Data can tell you about the probability of there being a picnic basket in any specific corner of the universe, but it can’t tell you about the conversation the (potentially gay?) couple is having.

Similarly, Rudder uses OkCupid’s user information and Google Trends data to reveal our complicated relationship with race in America. Analyzing OkCupid users’ attraction ratings, Rudder found that African Americans received the lowest attraction ratings among all racial categories, even when rated by fellow African Americans. Likewise, by analyzing the frequency of Google searches for “nigger,” the author is able to plot the ebb and flow of racial sentiments against the backdrop of President Barack Obama’s 2008 campaign, showing how searches including the word rose alongside his poll numbers and the growing prospect of a black president, and how searches eventually dropped after Obama’s inauguration in January 2009 (to 25% below the pre-Obama index). Here is one of Rudder’s best illustrations of a story hidden in numbers: a nation plagued with remnants from a segregated past grapples with the possibility, and ultimately the reality, of a black president who literally changed the national conversation on race. Coupled with the OkCupid numbers, this data reveals what people think and feel when they are least guarded, sitting alone at their computers.

And sure, Big Data can tell us about the searchers’ locations and other census-level demographics, but what exactly is driving these individuals to wake up on a Tuesday morning, fire up their laptops, and search for the most inflammatory racial epithet in America? What are these individuals hoping to find? And to what end? Yet again, Big Data gives us big-picture insights, but leaves us with more questions about the drivers at work.

Barack Obama debating during his 2008 campaign.

Perhaps the best examples of when Big Data can serve as a roadmap for further research come from Rudder’s discussion of communities on Reddit and Twitter. When we think of groups of people, we traditionally imagine physical communities in a given place, but increasingly, community is found online. Rudder gets a gold star for referencing Benedict Anderson’s Imagined Communities to explain how, like a nation, these online groups have a sense of unity and a common identity even though most of their members never meet one another. Rudder used an algorithm to map out the 200 most popular topics, or “subreddits,” on the popular website Reddit. In the figure below, he plots a “geography of interests,” showing community sizes and their degree of integration with other groups: the darker the red, the more isolated the community, and vice versa. Not only do we know what communities exist online, but we also understand with whom they communicate. Rudder similarly maps out distinct communities on Twitter using linguistic trends and hashtags. While anthropologists and other social scientists are already online, having maps like these can help them get a better sense of the “terrain,” pre-sorted by common interests, views, influences, and even linguistic nuances.

Courtesy of Dataclysm: Who We Are When We Think No One’s Looking
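The exact algorithm behind this map is not described here. As a rough sketch only, the “degree of integration” idea mentioned above could be approximated by asking what share of a community’s members are also active in at least one other community. Everything in the snippet below is an assumption made for illustration: the community names, the user IDs, and the integration() function are all invented, and this is not Rudder’s actual method.

```python
# Toy data, invented for illustration: community name -> set of active user IDs.
communities = {
    "gaming": {"ann", "bob", "cat", "dan"},
    "movies": {"bob", "cat", "eve"},
    "knitting": {"fay", "gus"},
}


def integration(name, groups):
    """Share of a community's members who are also active in at least one other community."""
    members = groups[name]
    # Pool together everyone who belongs to any community other than this one.
    others = set().union(*(users for other, users in groups.items() if other != name))
    return len(members & others) / len(members)


# A low score means an isolated community (a darker red on a map like the one above).
scores = {name: integration(name, communities) for name in communities}

# Print the most isolated communities first.
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: size={len(communities[name])}, integration={score:.2f}")
```

In this toy data the knitting group shares no members with anyone else, so it would sit at the most isolated end of the scale, while the movies and gaming groups overlap with each other. A real analysis would need far more data and a more careful definition of overlap.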

While there are many great applications of Big Data that can increase our understanding of people, the practice as a whole is still reductive in nature. Numbers suggest trends about large groups of people, but they do not speak for the individual as a whole. Even when our personal data is part of the larger analysis, it’s only a part of who we are and it’s reduced to little dots that together form a whole of “us.” When computers try to “make digital sense of an analog world” (217) what you get is pointillism — snapshots of communities that help orient, but ultimately need to be filled in with more color and texture from other forms of research and analysis.

Additionally, while Big Data is numbers-based, that does not mean it lacks a subjective agent that can skew the data. When comparing qualitative and quantitative research, there is often the gross misconception that the latter is more accurate because no human processes and analyzes the information: the calculator, computer or sophisticated algorithm carries out the analysis. But the human hand is involved in any kind of research, and subjectivity comes in when selecting what methods to use, which data sets to analyze, and how to communicate that information. Big Data is no exception. In fact, for every example I related above, there could have been myriad interpretations of the causes of the phenomena Big Data identified.

Of course, Big Data is still very young and we will get better at dealing with and interpreting large amounts of data. It will also become a more powerful tool once longitudinal data becomes more available.

The limitations I’ve described therefore do not discredit the use of Big Data, but a general awareness of them will lead to better, more robust research, particularly when Big Data is coupled with qualitative research methods that zoom in as Big Data zooms out.

Written by Miranda M. Garcia.