Of scripts, scraping and quizzes: how data journalism creates scoops and audiences

Tow Center · 7 min read · Jan 28, 2014

As last year drew to a close, Scott Klein, a senior editor of news applications at ProPublica, made a slam-dunk prediction: “in 2014, you will be scooped by a reporter who knows how to program.”

His statement had already been borne out by numerous examples, including the ones linked in his post, but two fascinating stories published in the month since demonstrate just how creatively a good idea and a clever script can be applied. A third highlights why the New York Times is investing in data-driven journalism and journalists in the year ahead.

Tweaking Twitter data

One of those stories went online just two days after Klein’s post was published, weeks before the new year began. Jon Bruner, a former colleague and data journalist turned conference co-chair at O’Reilly Media, decided to apply his programming skills to Twitter, randomly sampling about 400,000 accounts over time. The evidence he gathered showed that amongst the active Twitter accounts he measured, the median account has 61 followers and follows 177 users.

[Figure: histogram of follower counts among sampled Twitter accounts]

“If you’ve got a thousand followers, you’re at the 96th percentile of active Twitter users,” he noted at Radar. This data also enabled Bruner to make a widely-cited (and tweeted!) conclusion: Twitter is “more a consumption medium than a conversational one–an only-somewhat-democratized successor to broadcast television, in which a handful of people wield enormous influence and everyone else chatters with a few friends on living-room couches.”
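To make the arithmetic concrete: given a flat file of follower counts from such a sample (a hypothetical export, not Bruner's published data), the median and the share of accounts under any threshold take only a few lines of numpy.

```python
import numpy as np

# Hypothetical export of sampled follower counts, one number per line.
followers = np.loadtxt("followers.txt")

print(np.median(followers))             # Bruner's figure: 61
print((followers < 1000).mean() * 100)  # share of accounts under 1,000
                                        # followers; his figure: ~96%
```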

How did he do it? Python, R and MySQL.

“Every few minutes, a Python script that I wrote generated a fresh list of 300 random numbers between zero and 1.9 billion and asked Twitter’s API to return basic information for the corresponding accounts,” wrote Bruner. “I logged the results–including empty results when an ID number didn’t correspond to any account–in a MySQL table and let the script run on a cronjob for 32 days. I’ve only included accounts created before September 2013 in my analysis in order to avoid under-sampling accounts that were created during the period of data collection.”
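For readers curious what that looks like in practice, here is a minimal sketch of the approach Bruner describes, not his actual code: the bearer token is a placeholder, the table layout is an assumption, and sqlite3 stands in for his MySQL table to keep the example self-contained.

```python
import random
import sqlite3
import requests

# Assumes a valid app-only bearer token and Twitter's v1.1 users/lookup
# endpoint, which accepts up to 100 IDs per request.
BEARER_TOKEN = "YOUR_TOKEN_HERE"  # placeholder
LOOKUP_URL = "https://api.twitter.com/1.1/users/lookup.json"

db = sqlite3.connect("twitter_sample.db")  # sqlite3 standing in for MySQL
db.execute("""CREATE TABLE IF NOT EXISTS accounts
              (id INTEGER PRIMARY KEY, followers INTEGER,
               friends INTEGER, created_at TEXT)""")

def sample_accounts(n=300, max_id=1_900_000_000):
    """Draw n random IDs between zero and 1.9 billion and log the results."""
    ids = random.sample(range(max_id), n)
    for start in range(0, n, 100):  # the endpoint caps each call at 100 IDs
        chunk = ids[start:start + 100]
        resp = requests.get(
            LOOKUP_URL,
            params={"user_id": ",".join(map(str, chunk))},
            headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        )
        users = resp.json() if resp.ok else []
        for u in users:
            db.execute("INSERT OR REPLACE INTO accounts VALUES (?, ?, ?, ?)",
                       (u["id"], u["followers_count"],
                        u["friends_count"], u["created_at"]))
        # Log empty results too: IDs with no account still count in the sample.
        for missing in set(chunk) - {u["id"] for u in users}:
            db.execute("INSERT OR IGNORE INTO accounts "
                       "VALUES (?, NULL, NULL, NULL)", (missing,))
    db.commit()

if __name__ == "__main__":
    sample_accounts()  # scheduled every few minutes via cron, as Bruner did
```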

A reporter who didn't approach the dynamics of Twitter this way would be left with the Herculean task of clicking through and logging attributes for 400,000 accounts by hand.

That's a heavy lift that would strain even the best-staffed intern corps on the planet to deliver in a summer. Bruner, by contrast, told us something we didn't know and backed it up with evidence he gathered. Set his approach against that of commentators who opine about Twitter without data or much experience, and it's easy to score one for the data journalist.

Reverse engineering how Netflix reverse engineered Hollywood

[Figure: the Netflix genre generator interactive]

Alexis Madrigal showed the accuracy of Klein's prediction right out of the gate when, on January 2, he published a fascinating story on how Netflix reverse engineered Hollywood.

If you've ever browsed Netflix's immense catalog, you've probably noticed the remarkable number of personalized genres that exist there. Curious sorts might wonder how many genres there are, how Netflix classifies them, and how those recommendations that come sliding in are computed.

One approach to that would be to watch a lot of movies and television shows and track how the experience changes, a narrative style familiar to many newspaper column readers. Another would be for a reporter to ask Netflix for an interview about these genres and consult industry experts on “big data.” Whatever choice the journalist made, it would need to advance the story.

As Madrigal observed in his post, assembling a comprehensive list of Netflix microgenres “seemed like a fun story, though one that would require some fresh thinking, as many other people had done versions of it.”

Madrigal's initial exploration of Netflix's database of genres, as evidenced by the sequential numbering of the uniform resource locators (URLs) in his Web browser, taught him three things: there were a LOT of them, they were organized in a way he didn't understand, and manually exploring them wasn't going to work.

You can probably guess what came next: Madrigal figured out a way to scrape the data he needed.

“I’d been playing with an expensive piece of software called UBot Studio that lets you easily write scripts for automating things on the web,” he wrote. “Mostly, it seems to be deployed by low-level spammers and scammers, but I decided to use it to incrementally go through each of the Netflix genres and copy them to a file. After some troubleshooting and help from [Georgia Tech Professor Ian] Bogost, the bot got up and running and simply copied and pasted from URL after URL, essentially replicating a human doing the work. It took nearly a day of constantly running a little Asus laptop in the corner of our kitchen to grab it all.”
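You don't need UBot Studio to replicate the idea. Here is a sketch of the same incremental walk in Python; the URL pattern and the title parsing are invented for illustration and are not Madrigal's setup.

```python
import csv
import time
import requests

# Walk sequentially numbered genre URLs and record whatever title each
# page reports. The URL pattern below is an invented placeholder.
BASE = "https://movies.netflix.com/WiAltGenre?agid={}"

with open("genres.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["genre_id", "page_title"])
    for genre_id in range(1, 100001):  # increment through the ID space
        resp = requests.get(BASE.format(genre_id), timeout=10)
        if resp.ok and "<title>" in resp.text:
            title = resp.text.split("<title>")[1].split("</title>")[0]
            writer.writerow([genre_id, title.strip()])
        time.sleep(1)  # be polite; the real run took nearly a day
```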

What he found was staggering: 76,897 genres. Madrigal then did two other interesting things.

First, he and Bogost built the automatic genre generator that now sits atop his article in The Atlantic, giving visitors something to play with (a toy sketch of the idea appears below). That sort of interactive would be possible neither in print nor without collecting and organizing all of that data.

Second, he contacted Netflix public relations about what he had found, and the company offered him an interview with Todd Yellin, the vice president of product at Netflix who had created the company's tagging system. The interview Madrigal scored and conducted gave him, and us, his dear readers, much more insight into what's going on behind the scenes. For instance, Yellin explained to him that “the underlying tagging data isn't just used to create genres, but also to increase the level of personalization in all the movies a user is shown. So, if Netflix knows you love Action Adventure movies with high romantic ratings (on their 1–5 scale), it might show you that kind of movie, without ever saying, ‘Romantic Action Adventure Movies.’”

The interview also enabled Madrigal to make a more existential observation: “The vexing, remarkable conclusion is that when companies combine human intelligence and machine intelligence, some things happen that we cannot understand.”
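Back to that genre generator for a moment: at its core, it is simple combinatorics over the grammar Madrigal identified, roughly an adjective, a noun genre and a qualifier. A toy sketch, with invented word lists rather than Netflix's actual vocabulary:

```python
import random

# Toy sketch of the generator's combinatorial idea. These word lists are
# invented placeholders, not Netflix's actual tagging vocabulary.
ADJECTIVES = ["Critically-acclaimed", "Gritty", "Romantic", "Visually-striking"]
GENRES = ["Dramas", "Thrillers", "Comedies", "Documentaries"]
QUALIFIERS = ["Based on Real Life", "From the 1980s", "For Hopeless Romantics"]

def random_genre():
    """Assemble a plausible-sounding microgenre from its parts."""
    return " ".join(random.choice(part)
                    for part in (ADJECTIVES, GENRES, QUALIFIERS))

print(random_genre())  # e.g. "Gritty Thrillers From the 1980s"
```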

Without the data he collected and created, it's hard to see how Madrigal or anyone else could have published this feature.

That is, of course, exactly what Scott Klein highlighted in his prediction: “Scraping websites, cleaning data, and querying Excel-breaking data sets are enormously useful ways to get great stories,” he wrote. “If you don’t know how to write software to help you acquire and analyze data, there will always be a limit to the size of stories you can get by yourself.”

Digging into dialect

The most visited New York Times story of 2013 was not an article: it was a news application. Specifically, it was an interactive feature called “How Y’all, Youse, and You Guys Talk,” by Josh Katz and Wilson Andrews.

[Figure: a question from the New York Times dialect quiz]

While it wasn't a scoop, it does tell us something important about how media organizations can use the Web to go beyond print. As Robinson Meyer pointed out at The Atlantic, the application didn't go live until December 21, which means it generated all of those clicks (25 per user) in just eleven days.

The popularity of the news app becomes even more interesting when you consider that it was created by an intern: Katz hadn't yet joined the New York Times full-time when he worked on it. As Ryan Graff reported for the Knight Lab, in March 2010 Katz was a graduate student in the Department of Statistics at North Carolina State University. (He has since signed on to the Times's forthcoming data-driven journalism venture.)

Katz made several heat maps using data from the Harvard Dialect Survey and posted them online. That attracted the attention of the Times and led to an internship. Once ensconced at the Old Grey Lady, he built a dialect quiz to verify and update the data, testing 140 questions on some 350,000 people to determine the most telling ones. With that data in hand, Katz worked with graphics editor Wilson Andrews to create the quiz that's still online today.
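How might one winnow 140 candidate questions down to the most telling? The Times hasn't published Katz's exact method, but one standard approach is to score each question by the mutual information between its answers and respondents' regions; a sketch under that assumption:

```python
import math
from collections import Counter

def mutual_information(responses):
    """Score one question: responses is a list of (region, answer) pairs.
    Higher mutual information between answer and region = more telling."""
    n = len(responses)
    joint = Counter(responses)
    regions = Counter(r for r, _ in responses)
    answers = Counter(a for _, a in responses)
    return sum((c / n) * math.log2((c / n) /
                                   ((regions[r] / n) * (answers[a] / n)))
               for (r, a), c in joint.items())

def most_telling(survey, k=25):
    """Rank questions; survey maps question -> list of (region, answer)
    pairs (a hypothetical data layout, not the Times's)."""
    return sorted(survey, key=lambda q: mutual_information(survey[q]),
                  reverse=True)[:k]
```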

What this tells us about data-driven journalism is not just that newsrooms demand skills in D3, R and statistics: it's that there is huge demand for the news applications that people with those skills can create. Such news apps can find, or even create, massive audiences online, an outcome that should be of considerable interest to the publishers who run the media companies that deploy them.

On to the year ahead

All of these stories should cast doubt on the contention that data-driven journalism is a “bogus meme” fated to sit beside “hyperlocal” or blogging as would-be saviors of journalism. There are several reasons not to fall into that way of thinking.

First, journalism will survive the death or diminishment of its institutions, regardless of the flavor of the moment. (This subject has been studied and analyzed at great depth in the Tow Center’s report on post-industrial journalism.)

Second, data-driven journalism may be a relatively new term, but it is the evolutionary descendant of a much older practice: computer-assisted reporting. Journalists have been using statistics and tables in the news for many decades. Interactive features, news applications, collaborative coding on open source platforms, open data and data analyses that leverage machine learning and cloud computing are newer additions to that landscape, and you can expect more of them from Nate Silver's rebooted FiveThirtyEight in the months ahead.

Finally, the examples I’ve given show how compelling, fascinating stories can be created by one or two journalists coding scripts and building databases, including automating the work of data collection or cleaning.

That last point is crucial: by automating tasks, one data journalist can increase the capacity of everyone they work with in a newsroom and create databases that can be used for future reporting. That's one reason (among many) that ProPublica can win Pulitzer Prizes without employing hundreds of staff.
