3 Smart Data Journalism Techniques that can help you find stories faster

Jan 26, 2018

Note: All views expressed in this post are within my capacity as a student at Columbia Journalism School, not as a data scientist on the business side of the New York Times.

Text processing has never been easier or more powerful. Across industries, analysts increasingly complement close reading with computational approaches to gain insight from large volumes of text. Companies, for instance, assess customer sentiment from millions of reviews or follow topics discussed on social media in real time.

Meanwhile, the volume of documents available for journalistic inquiry has exploded: reams of information on government operations (Wikileaks Cablegate: 200,000 pages), private wealth shelters (Paradise Papers: 13.4 million pages), and public figures’ communication (Sarah Palin’s emails: 24,000 pages) leak, it seems, almost monthly.

The public benefits when journalists quickly extract newsworthy information from these documents. Yet the sheer size of many dumps means that newsrooms often expend tremendous resources, cannot thoroughly analyze the documents by themselves, or, worst case, miss information.

In this post, I’ll describe computational techniques that I think can allow journalists to quickly gain insight into large sets of documents and select specific documents to read. The body of this post is broken into 3 sections:

Natural Language Processing techniques. (Approach #1: Look at the words being used.)
Topic-modeling techniques. (Approach #2: Look at the topics.)
Classification techniques. (Approach #3: I’ll know it when I see it.)

I’ll demo each approach on a publicly available Wikileaks corpus. In future posts, I hope to give a more technical description, along with the code and infrastructure that I think are worth maintaining to support such explorations.

Before we start.

First, a bit about me: I’m a data scientist on the business side of the New York Times, where I have spent years implementing machine learning research for business-side and newsroom stakeholders. I am also a student at Columbia University’s School of Journalism.

At Columbia this fall, a professor asked me to speak to a class about how machine learning techniques can be used by investigative reporters. I found few existing explanatory resources for data journalists, and scant evidence of techniques (which are now standard in other industries) used in practice.

Hence, the idea for this post: to demonstrate how machine learning techniques might be used to explore large document dumps and find leads.

This post does not require coding knowledge. However, code for the demo can be found here: https://github.com/alex2awesome/investigative-reporting-lecture.

Wikileaks Documents

The example I will use to guide this discussion is from the Wikileaks Cablegate. The dump is massive, and I unarchived about half of it for this demo. This included over 300,000 files, although some were duplicates. Because of computation limits (i.e., my computer is slow), this analysis is performed over a sample.

The median length of these documents is about 2,000 words, but some are as long as 10,000.

Here is an example of what one of the cables looks like, chosen at random:

Department of state 1964-66 Central Foreign Policy File File: POL 33-4 ARG --------------------------------------------- ----  E.O. 12958: DECL: DECLASSIFIED BY NARA 09/02/2009 TAGS: EFIS PBTS ARSUBJECT:  EXTENDED NATIONAL JURISDICTIONS OVER HIGH SEAS  REF: STATE 106206 CIRCULAR; STATE CA-3400 NOV 2, 1966   1.  PRESS REPORTS AND VARIETY EMBASSY SOURCES CONFIRM NEW ARGENTINE LEGISLATION UNILATERALLY CHANGING SEAS JURIS- DICTION NOW UNDER ADVANCED REVIEW.  REPORTEDLY LAW WOULD ESTABLISH SIX MILE TERRITORIAL SEA, PLUS ANOTHER SIX MILES OF EXCLUSIVE FISHING JURISDICTION, PLUS ANOTHER EXTENDED ZONE OF "PREFERENTIAL JURISDICTION" FOR FISHING PURPOSES.  DRAFT- LAW UNDER CONSIDERATION IN ARGENTINE SENATE BEFORE JUNE 28 COUP WOULD HAVE DEFINED ZONE OF PREFERENTIAL JURISDICTION AS "EPICONTINENTAL SEA OUT TO 200 METER ISOBAR".  IN SOUTHERN ARGENTINA THIS ZONE SEVERAL HUNDRED MILES WIDE AND BLANKETS FALKLAND ISLANDS.  2.  NAVATT STATES ARGENTINE NAVY THINKING OF PREFERENTIAL JURISDICTION OUT TO 200 MILES (AS IN PERU, ECUADOR, CHILE) RATHER THAN EPICONTINENTAL SEA.  200 MILE LIMIT DOES NOT RPT NOT REACH FALKLANDS.  ARGENTINE NAVY OFF TOLD NAVATT "200 MILE LIMIT SOON WILL BE STANDARD THROUGH HEMISPHERE".  3.  FONOFF OFFICIALS REFERRING TO RECENT BRAZILIAN AND US LEGISLATION HAVE INFORMALLY INDICATED DECISION ALREADY FINAL RE SIX MILE TERRITORIAL SEA PLUS SIX MILE EXCLUSIVE FISHING JURISDICTION, BUT THAT "PREFERENTIAL JURISDICTION" STILL UNDER STUDY.  TWO FONOFF MEN VOLUNTARILY AND INFORMALLY SOUGHT EMBASSY REACTION TO POSSIBLE EXTENDED PREFERENTIAL JURISDICTION BY SUGGESTING THAT US IN FACT HAS ACCEPTED UNILATERALLY CREATED ECUADORIAN, PERUVIAN AND CHILEAN 200 MILE LIMITS.

These documents have already been used in reporting.

Approach #1: Look at the words being used.

This approach is the quickest and easiest: we’ll simply look at the most frequent words. Below, we count every word in the corpus: “health-care,” “Iraq,” “war,” for instance. (To avoid many of the most frequent, boring words like “the,” “and,” or “or,” we’ve cut out words that appear in more than 80% of the documents.)

Overall count of words that appear in 80% or less of the documents.
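For readers who do want to see code, here is a minimal sketch of this kind of count using only Python's standard library. The `tokenize` and `frequent_words` helpers and the three sample "cables" are made up for illustration; this is not the code behind the chart.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z][a-z\-']*", text.lower())

def frequent_words(documents, max_doc_share=0.8, top_n=10):
    """Count words, skipping any word found in more than max_doc_share of documents."""
    tokenized = [tokenize(doc) for doc in documents]
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    limit = max_doc_share * len(documents)
    counts = Counter(
        word for tokens in tokenized for word in tokens if doc_freq[word] <= limit
    )
    return counts.most_common(top_n)

# Hypothetical stand-ins for cables.
cables = [
    "The government said the president discussed security.",
    "The embassy reported that security talks with the government stalled.",
    "The president met the minister and the delegation.",
]
print(frequent_words(cables, top_n=3))
```

Here “the” appears in all three documents, so it is dropped, while “government,” “security,” and “president” each survive with a count of two.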

We can see some contours emerging. We know that “government” is discussed. The word “security” is used a lot too, as well as “president.” But let’s delve deeper.

Part-of-Speech Tagging

We want to gain insight into principal actors and actions in this corpus. To do this, we need to determine whether a word refers to an actor or an action. A first step in this direction is part-of-speech (POS) tagging.

A visualization of a single sentence with part-of-speech tags. (Image courtesy of: *http://nlpforhackers.io/training-pos-tagger/*)

Identifying a word’s part of speech tells us whether it is a noun (NN), a determiner (DT), or a verb (VBZ), for instance. Then we can count, say, only the verbs. (On methods to perform this tagging, stay tuned for a later post.)
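Once the tags exist, the counting itself is simple. The sketch below assumes a tagger has already produced (word, tag) pairs using Penn Treebank tags like the ones above; the `tagged_tokens` list is invented by hand for illustration and `count_by_tag` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical output of a part-of-speech tagger: (word, Penn Treebank tag) pairs.
tagged_tokens = [
    ("Karzai", "NNP"), ("says", "VBZ"), ("the", "DT"), ("embassy", "NN"),
    ("provides", "VBZ"), ("aid", "NN"), ("and", "CC"), ("Washington", "NNP"),
    ("provided", "VBD"), ("security", "NN"),
]

def count_by_tag(tokens, tag_prefix):
    """Count lowercased words whose tag starts with tag_prefix (e.g. "VB" for verbs)."""
    return Counter(word.lower() for word, tag in tokens if tag.startswith(tag_prefix))

verbs = count_by_tag(tagged_tokens, "VB")          # VB, VBD, VBG, VBN, VBP, VBZ
proper_nouns = count_by_tag(tagged_tokens, "NNP")  # NNP, NNPS
print(verbs.most_common())
print(proper_nouns.most_common())
```

Note that this counts surface forms, so “provides” and “provided” are tallied separately. Reducing each verb to its base form (lemmatizing) first would merge them into “provide.”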

Here are the counts of verbs in the corpus:

The most common Verbs being used in the corpus.

This gives a vague sense of what the actors are doing. “Say” is used often, as is “note,” and “tell”: all variations of communicating. I’m interested in “provide,” or “give”. Who is giving to whom? What is going on here?

Maybe we can gain some further insight by looking at proper nouns:

The most common Proper Nouns being used in the corpus.

“Iran” and “Washington” emerge as common proper nouns. But proper nouns include countries, people, and titles. We need a more sophisticated tool if we want to determine who are the people most frequently discussed.

Named Entity Recognition

Enter named-entity tagging. This tool distinguishes between kinds of proper nouns: “geo-political entity” (GPE), “organization,” or “person,” for instance.

An example sentence tagged with named-entities. “GPE” refers to geo-political entity.

Now we can look, for example, at the most frequently discussed people:

The most common people discussed in the corpus.

This yields some fascinating insight. Besides “al” and “de,” common name-parts, the most discussed person in this corpus is “Qaddafi.” “Obama” is also used very frequently.

As a final step in this exploration, let’s look at the most commonly discussed countries:

The most common geo-political entities discussed in the corpus.

“Iran” is discussed most, followed by “Brazil.” Already we have a more detailed view on the foci of this corpus. We know the primary actors, we know the main countries, and we have a sense of common actions.

Filtering

So, we can now intelligently target our reading. Compelling stories are about people, so let’s explore interactions between people. Since we now know some contours of this corpus, let’s filter to the documents with the following words:

Person 1: “Obama”
Verb: “provide”
Person 2: “Karzai”

This combination of search terms helps us filter our corpus down to 14 cables, out of our starting sample of 4,000. This is a reduction of more than 99%.
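A filter like this can be sketched in a few lines. The helper names and sample cables below are hypothetical, and the match is on exact whole words, which is stricter than matching on a verb's base form.

```python
import re

def mentions_all(text, terms):
    """True if every term occurs in the text as a whole word, ignoring case."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return all(term.lower() in words for term in terms)

def filter_documents(documents, terms):
    """Keep only the documents that mention every search term."""
    return [doc for doc in documents if mentions_all(doc, terms)]

def reduction(kept, total):
    """Share of the corpus that no longer needs to be read."""
    return 1 - kept / total

# Hypothetical stand-ins for cables.
cables = [
    "Obama will provide Karzai with tools for security.",
    "Karzai met the delegation in Kabul.",
    "Obama and Clinton discussed Iran; France may provide support.",
]
hits = filter_documents(cables, ["Obama", "provide", "Karzai"])
print(len(hits), "of", len(cables), "cables match")
print(f"{reduction(14, 4000):.2%}")  # the numbers from this post: 14 of 4,000
```

With the figures from the demo, keeping 14 of 4,000 cables works out to a reduction of about 99.65%.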

Delving into these 14 cables gives us some interesting reads. In this cable, a U.S. ambassador records a conversation between himself and Afghanistan President Hamid Karzai: he mentions that Barack Obama is providing Karzai tools for “security,” “accountable government,” and “a working economy.” Another cable states that Karzai may provide the U.S. with an “open door” to engage with Iran. These both might be worth exploring more.

We soon stumble into a shortcoming of this technique: a cable in our filtered set does indeed mention the words “Obama,” “Karzai,” and “provide,” but it is about a conversation between then-Secretary of State Hillary Clinton and a French official. Karzai is mentioned, but only in passing. This shows that we missed the context these words were used in. How do we address that?

Approach #2: Look at the Topics.

In section 1, we looked at word frequencies to gain insight into the contours of the corpus. With this insight, we narrowed our search and focused on potentially interesting documents.

However, this approach has shortfalls, as we saw. We might know which set of documents contains “Karzai” and “provide,” but we won’t know how these words are used without delving into the documents. Word frequency alone doesn’t tell us much about the contexts words are used in.

One approach might be to count words that appear only in this small set of documents, and compare them to overall frequencies. But this can become time-consuming and messy. There’s a better way: enter… topics.

Topic Modeling

Analyzing topics can give us tremendous insight into a corpus, and narrow our document-search. But first: what do I mean by “topic”?

A topic, on a high level, is a certain pattern among the words used in documents. Within each topic, some words in the vocabulary are more important, others less. And each document is made up of a mixture of topics.

We can model this. Below is an example. Listed are the top words in example topics extracted from our corpus:

topic example # 1
china   chinese   cable   said   africa   beijing   reference   new   ...

topic example # 2
said   cable   israel   iran   would   turkey  israeli   president   reference ...

topic example # 3
russia   russian   cable   moscow   putin   french   political   would   us   france ...

topic example # 4
north   korea   would   missile   said   dprk   kosovo   defense   cable   states ...

topic example # 5
women   ipr   mexico   violence   gender   mexican   organizations   victims   protection   justice ...

topic example # 6
gas   oil   bp   company   energy   production   gazprom   said   would   billion ...

These 6 examples were extracted simply by passing the corpus into the algorithm — little information was specified beforehand. You can see how each set of words captures a pattern we’d recognize in a real-world topic.

For instance, we see the words “Israel,” “Iran” and “Turkey” weighted highly in example #2: we might think of this as a Middle-East topic. This suggests that, in this corpus, Iran tends to be discussed alongside Turkey and Israel.

Two important insights a topic-model gives us are:

What topics are contained in a corpus.
How each document expresses those topics.

A bit more on the second point: each document is a mixture of the topics. For example, one document might be 100% about a single topic, while another might be split 50%/50% between two. It turns out we can use a document’s topic breakdown to get a better understanding of that document before reading it.

(The topic model I am using here is Latent Dirichlet Allocation (LDA). For more information, see this post, or stay tuned for subsequent posts in this series. When one passes documents into a topic-modeling algorithm, the only piece of information one specifies is the number of topics. There are models where this does not need to be specified, but LDA is widely implemented and easy to use. Typically for corpus exploration, a journalist’s main use case, one might increase the number of topics until the topics stop making sense.)
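To make the idea concrete, here is a toy LDA written from scratch with collapsed Gibbs sampling, one common way of fitting the model. It is a sketch for intuition only, not the implementation used for this demo: the function name `fit_lda`, the priors `alpha` and `beta`, and the four tiny documents are all assumptions, and for real work a library is the better choice.

```python
import random

def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy LDA via collapsed Gibbs sampling.

    docs is a list of token lists. Returns (topic shares per document,
    top words per topic).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens of doc d in topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # uses of word v in topic k
    topic_total = [0] * n_topics
    assignment = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignment.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignment[d][i]
                # Remove this token, then resample its topic given all the others.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignment[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed share of each topic in each document; every row sums to 1.
    shares = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    top_words = [
        [vocab[v] for v in sorted(range(n_words), key=lambda v: -topic_word[t][v])[:5]]
        for t in range(n_topics)
    ]
    return shares, top_words

# Hypothetical, already-tokenized mini corpus.
docs = [
    "gas oil gazprom energy pipeline oil".split(),
    "oil energy production gas billion".split(),
    "missile korea defense dprk missile".split(),
    "korea defense missile north dprk".split(),
]
shares, top_words = fit_lda(docs, n_topics=2)
for words in top_words:
    print(words)
```

The only modeling decision passed in is `n_topics`. Everything else the model has to work out from which words tend to appear together.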

Let’s use topics to find interesting documents. For instance, let’s look at example #6, the “energy” topic:

topic example # 6
gas   oil   bp   company   energy   production   gazprom   said   would   billion ...

We can now look at how much each document expresses this topic. Most documents in the corpus, as we can see, have nothing to do with energy:

The count of documents throughout the entire corpus based on how much they discuss energy.

There is a large number of documents that are zero percent about energy — most documents do not discuss energy at all. How many, though, do?

The same histogram, except zoomed in to only include documents that are more than 0% about energy. The count of documents that are more than 40% about energy is shown.

We can see that the number of documents that have something to do with this topic is in the tens, and, furthermore, there are just 26 documents that are more than forty percent about the energy topic. So, out of a corpus of thousands, we can quickly narrow down to these 26 documents.
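Given each document's topic mixture, this narrowing is a simple threshold. The sketch below uses made-up mixtures for four hypothetical cables, with index 0 standing in for the energy topic.

```python
def documents_about(doc_topic_shares, topic, min_share=0.4):
    """Return (document id, share) pairs whose share of `topic` exceeds min_share, highest first."""
    hits = [
        (doc_id, shares[topic])
        for doc_id, shares in doc_topic_shares.items()
        if shares[topic] > min_share
    ]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical topic mixtures: each row sums to 1; index 0 stands for the energy topic.
doc_topic_shares = {
    "cable_a": [0.72, 0.18, 0.10],
    "cable_b": [0.00, 0.55, 0.45],
    "cable_c": [0.41, 0.09, 0.50],
    "cable_d": [0.05, 0.90, 0.05],
}
ENERGY = 0
print(documents_about(doc_topic_shares, ENERGY))
# [('cable_a', 0.72), ('cable_c', 0.41)]
```

Notice that `cable_c` clears the 40% bar but is still half about something else, which is worth knowing before deciding to read it.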

Because of the richness of topics, we can already have some expectation of what these documents will contain: discussions of energy and also, likely, money (the word “billion” shows up as important in this topic).

One cable, for instance, details the billions of dollars in subsidies that Gazprom and other Russian oil companies provide to other countries through discounted resources. Another cable discusses the likelihood that Gazprom will buy an oil refinery in the Baltic region, and the economic impact of this move.

A third, slightly longer cable discusses a conversation between German and Russian leaders, mentioning Gazprom but not focusing on it. This cable might be off-topic for us (no pun intended). Is there a way to avoid this?

Examining this cable’s topic mixture before reading might have helped us identify that it was mainly about something else. But no matter how hard we try, we will read off-topic pieces. Is there a way to capture what we learn after reading, and use this information to refine our search?

Approach #3: I’ll know it when I see it.

In the first section, we examined word counts, and then focused on specific types of words. In the second section, we introduced the concept of a topic-model, and used it to explore and filter documents by topic.

Both of these approaches helped us find interesting documents by giving us insights into the corpus, and letting us narrow our search from there. In the third section, we’ll take the opposite approach.

First, we’ll find interesting documents, then we’ll use these to identify interesting insights, and use these to organize the rest of our corpus.

Classification

As the graphic below describes, classification is an approach whereby the journalist “seeds” a model with examples, letting the model learn insights from these examples. Then the model uses these insights to categorize the rest of the corpus. In this process, the journalist uses the model to replicate their decision-making and cover more ground.

As an example, I went through this process for our corpus.

Training the classifier

I started reading cables and, as I read, I labeled them as “Interesting” or “Not Interesting.”

One of the first cables I came across was written in 2004 and detailed European Union (EU) concerns about Turkey’s application for membership. I decided to find out more about this, so I labeled it “Interesting” and kept an eye out for similar stories. The next several I read had nothing to do with this, so I labeled them “Not Interesting.” I soon came across another cable, written in 2007, about an Iranian purchase of German computers through Turkey, and labeled this as “Interesting” too.

I labeled some more documents and, having compiled a small set of examples, I trained a classifier. (For features, I used word-counts for each word. For a classifier, I used logistic regression. See subsequent posts for more information.)

A classifier is, in broad terms, a model that takes as input a datapoint (in our case, a document) and tries to label it (in our case, as “Interesting”, or “Not Interesting”). We train a classifier to label by giving it examples.
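Here is a minimal, from-scratch sketch of that idea: logistic regression over word counts, trained with stochastic gradient descent. The labeled examples are invented summaries, the function names are hypothetical, and this is not the code behind the demo, where a machine learning library would normally do this work.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Features: how many times each word appears in the document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def score(text, weights, bias):
    """Estimated probability that a document is Interesting."""
    z = bias + sum(weights.get(w, 0.0) * n for w, n in word_counts(text).items())
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, learning_rate=0.1):
    """Fit logistic regression on (text, label) pairs, where label 1 means Interesting."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            # Nudge the weights of this document's words toward the correct label.
            error = label - score(text, weights, bias)
            for word, count in word_counts(text).items():
                weights[word] = weights.get(word, 0.0) + learning_rate * error * count
            bias += learning_rate * error
    return weights, bias

# Hypothetical hand-labeled examples: 1 = Interesting, 0 = Not Interesting.
labeled = [
    ("EU concerns about Turkey membership application", 1),
    ("Iranian purchase of German computers through Turkey", 1),
    ("Argentine legislation on fishing jurisdiction", 0),
    ("Visa fraud statistics for the quarter", 0),
    ("Agricultural export figures for the region", 0),
]
weights, bias = train(labeled)

unlabeled = [
    "Fishing jurisdiction statistics",
    "Turkey frustrated by EU rejection",
]
# Rank the unread documents, most likely Interesting first.
ranked = sorted(unlabeled, key=lambda doc: score(doc, weights, bias), reverse=True)
print(ranked[0])
```

With only five examples the model is crude, but it has already learned that words like “Turkey” and “EU” push a document toward “Interesting.”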

Using the classifier

After training my classifier, I asked it which documents in the rest of the corpus were likely to be interesting.

A cable about Turkish frustration towards EU rejection popped out, as did a cable about Cyprus blocking Turkey’s NSG Chairmanship, and another about a pro-Turkey EU Ambassador. Many uninteresting cables also came up.

My classifier is not very good, yet. It is trained on too few examples and is not a sophisticated model. But some of the documents it surfaced were immediately useful, and it already helped narrow my search path.

Next Steps

To improve our classifier, we might label more examples. Or, we could look at which features the classifier learned were important. This might suggest ways to tweak the features, or ways to manually use the features, like in approaches #1 and #2. Lastly, we could choose a different kind of classifier.

There are many ways to design classification tasks. However, all follow the approach we took: designing labels, choosing input features, and selecting a model. An input feature can be any aspect of the document: word frequencies, or topics, for example. A label can be binary (“Interesting” vs. “Not Interesting”) or multi-class (if we wanted to classify lawsuits, say, we might design labels like “Malpractice,” “Homicide,” etc.). As you can see, the freedom to choose labels and features makes this tool applicable to many different tasks in data journalism.

One variation on the approach outlined is: instead of labeling documents at random (in successive rounds of labeling), be strategic. Choose documents that confuse the model the most. This is known as active learning.
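One simple way to pick the documents that “confuse the model the most” is uncertainty sampling: choose the ones whose predicted probability is closest to 50%. A sketch, with made-up probabilities for five hypothetical cables:

```python
def most_uncertain(probabilities, k=3):
    """Return the k document ids whose predicted probability is closest to 0.5."""
    return sorted(probabilities, key=lambda doc_id: abs(probabilities[doc_id] - 0.5))[:k]

# Hypothetical classifier output: probability that each cable is Interesting.
probs = {
    "cable_1": 0.97,
    "cable_2": 0.52,
    "cable_3": 0.08,
    "cable_4": 0.41,
    "cable_5": 0.66,
}
print(most_uncertain(probs, k=2))
# ['cable_2', 'cable_4']
```

The journalist would read and label these two next, retrain, and repeat.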

Conclusion

We’ve explored a set of approaches to gain insight into large document sets and to surface interesting documents. We’ve described each approach, and shown how using the approach on a demo document set yielded interesting results.

The first approach involved two techniques called Part-of-Speech tagging and Named-Entity Recognition. Using these two techniques, we looked at the most frequent words among various word-types, and answered questions about who the principal actors in a corpus were, what actions were taken, and what settings were discussed.

The second approach involved a technique called Topic-Modeling. As described, a topic-model identifies “topics,” or patterns among words that roughly correspond to real-world topics. We started by looking at these topics to gain insights into the corpus, and then used them to narrow down the set of documents we read.

The third approach involved a technique called Classification. Classification involves choosing labels we’re interested in, and labeling documents. For this approach, we went through and labeled a small set of documents as “Interesting” or “Not Interesting,” then we trained a classifier to predict labels on the rest of the corpus. Using these predictions, we retrieved documents in the corpus most likely to be interesting.

Data scientists will often use these techniques together, along with other techniques, as part of an analysis. While a good review always involves much manual engagement with the data, these computational approaches allow us to gain more contextual insight than if we had simply read each document individually. By iteratively switching between statistical analysis and pinpointed reading, an effective data journalist can refine their analysis and cover more ground quickly.

In my own story-telling experience, named-entity recognition has been a powerful tool. For this Times article, 19 Countries, 43 States, 327 Cities: Mapping The Times’s Election Coverage, I wanted to get a sense of how wide our 2016 election reporting was, by location. This required analyzing heaps of unstructured text: datelines are structured, contributor lines are not. Rather than manually analyzing contributor lines in every election article published, I extracted names and locations automatically, like in the demo, and counted.

Existing Software

There are open-source applications — i.e. non-coding software — that can do parts of what was discussed in the post. Here are a couple:

Open Calais, at Thomson Reuters, aims to perform many different kinds of word-tagging, like part-of-speech tagging and named-entity recognition.

Jonathan Stray has developed an open-source project for parts of document analysis, called Overview. Overview has robust search and visualization features and topic-model-style clustering.

Many more can clean data well, or visualize, or manage documents.

However, part of exploratory analysis, for me, always involves a kitchen-sink approach. And that fundamentally can’t happen in a constrained application environment. Mike Bostock says it best: “Code is often the best tool we have because it is the most general tool we have; code has almost unlimited expressiveness.” So while these tools might be a great way to familiarize oneself with these techniques, I believe it’s optimal to move the analysis into Python, or R, eventually. Stay tuned for a later post on implementation.

Other ideas about machine learning in newsrooms:

These are some writers talking about other ways machine learning can be useful in newsrooms:

This talk focuses on how auto-summarization can summarize corpuses. (I have not used summarization personally for my work, so can’t comment on the utility.)
This is a paper on slug generation, auto-summarizing, data gathering and trend-detection. This is slightly outside of the data-analysis scope of this piece, and I have not used their techniques, so I cannot comment.
This piece talks about machine learning for fake news detection, which is also outside of the scope of this piece.

Theoretical Background:

For background on the theory behind Part-Of-Speech tagging and Named Entity Recognition, here is a link to Dr. Michael Collins’s Natural Language Processing course at Columbia University. This is an introductory course I took while an undergraduate, and I found it very informative. Here are other courses openly available from Stanford and Massachusetts Institute of Technology. My favorite library to perform these tasks is the spaCy package, but stay tuned for a future article on implementation details.

For background on the theory behind topic-modeling, here is a talk by Dr. Blei. Dr. Blei is a leader in the field, and his homepage contains many usable code packages. Here is an article describing ways the New York Times Data Science team uses topic modeling. My favorite library to perform these tasks is the Gensim library. Again, stay tuned for implementation details.

Classification is an enormous field, and for background, it’s best to start with an introductory machine learning class. Andreas Mueller at Columbia University has a good applied course with materials available. Massachusetts Institute of Technology offers a course online. Here is a broad technical overview of text classifiers, and here is a more implementation-focused paper. I would not recommend neural network-based approaches for this task, since they require a fair amount of tuning and will be more time-consuming than is worthwhile for an exploratory task like this. My favorite library to perform classification tasks is scikit-learn. Specifically, Pipelines. But more later.

Here is an overview of active learning from Dr. Burr Settles. To hammer in how awesome active learning is, here’s a comparison of data selected for labeling using active learning vs. random selection (courtesy of Dr. Settles’s paper):

Accuracy of binary label “baseball” vs. “hockey” text classification task using Active Learning approach (via uncertainty sampling) versus a random selection of documents.

As you can see, in the range of 10–20 data points (a good starting place for analysis of document dumps), there is a 10-point accuracy increase when documents are selected for journalist-labeling using active learning vs. random-choice.

A final note: I code everything in Python. Python is intuitive and powerful, and many of the best new implementations of the tools discussed are in Python.