How the Wall Street Journal is using deep learning to inform content strategy

Data scientists are working alongside journalists to explore how well-established machine learning methods can help surface gaps in editorial coverage.

Francesco Marconi
WSJ Digital Experience & Strategy
4 min read · Oct 11, 2019

--

Coverage Map is a tool that uses deep learning to surface opportunities in WSJ’s editorial coverage and audience impact.

Only a few years ago, using artificial intelligence in journalism was cutting edge; today it is quickly entering the workflow of a growing number of news organizations. Knowledge of machine learning methods is becoming more accessible, and the cost of AI projects has fallen: many models are readily available online and can be implemented even by small newsrooms.

Using these models, our team of data scientists took on what at first seemed an enormous challenge: analyzing and deriving insights from a decade’s worth of WSJ articles. By using well-established language analysis methods like Doc2Vec, we were able to quickly reveal what the Journal has been reporting on, pinpoint gaps in coverage, and identify topics that were of particular value to our readers.

Doc2Vec translates text documents such as news articles into numerical representations.

This approach enabled us to translate a large set of documents (WSJ news articles) into a list of representative numbers, known as “vectors.” Think of it as a map on which articles are placed according to their numerical value. The more similar two articles are, the closer they are located to each other on the map.
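The "closer on the map means more similar" idea can be made concrete with a small sketch. The Journal's pipeline uses Doc2Vec, which learns dense vectors from text (typically via a library such as gensim); the stand-in below uses simple bag-of-words counts and cosine similarity so it stays dependency-free. All article text and names here are invented for illustration.

```python
# Minimal sketch of "articles as vectors." Doc2Vec learns dense embeddings;
# here, plain bag-of-words counts stand in so the example is self-contained.
# All article snippets are invented.
from collections import Counter
from math import sqrt

articles = {
    "fed_rates":  "fed raises interest rates as inflation climbs",
    "fed_policy": "fed signals interest rates will stay high to fight inflation",
    "tech_china": "chip export limits escalate the us china tech war",
}

# Shared vocabulary: one dimension per distinct word across all documents.
vocab = sorted({w for text in articles.values() for w in text.split()})

def vectorize(text):
    """Map a document to a vector of word counts over the shared vocabulary."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    """Cosine similarity: higher means the two articles sit closer on the map."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

vecs = {name: vectorize(text) for name, text in articles.items()}
sim_fed = cosine(vecs["fed_rates"], vecs["fed_policy"])
sim_mixed = cosine(vecs["fed_rates"], vecs["tech_china"])
print(sim_fed > sim_mixed)  # the two Fed stories are closer than Fed vs. tech
```

The same comparison works on learned Doc2Vec vectors; only the vectorization step changes.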

This XY coordinate space represents the distribution of topics covered by the Wall Street Journal. By plotting individual news articles on this space we can create a sort of visual signature for the Journal.

Leveraging Deep Learning to identify what information is relevant to our readers.

The results of our cluster analysis showcase the benefits of using machine learning as an input in our strategy:

  1. The coverage map was able to cluster over 20,000 Journal articles into 361 hyper-granular clusters with more nuance than traditional metadata allows. The articles contained in these clusters are all semantically similar — that is, they use similar words in similar ways. Looking through the “heat map,” journalists can see very specific clusters about, for example, US monetary policy, the US-China tech war, immigration in Europe, personal finance related to healthcare, or green energy markets.
  2. Looking at clusters with high conversion rates, meaning topics that prompted many readers to subscribe to the Journal, helps us identify coverage opportunities that can grow readership and drive engagement.
  3. By pairing the resulting granular topic information with additional metrics like story length and scroll depth, we can discover which areas of coverage our readers are responding to best. This, in turn, allows us to develop content strategy briefs for different sections of the Journal that align with the information needs of people who read our publication.
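The steps above can be sketched end to end: cluster article vectors, then attach a metric like conversion rate to each cluster. The real pipeline grouped 20,000+ articles into 361 clusters; the toy version below uses a plain k-means loop on invented 2-D points, with made-up conversion flags, purely to show the shape of the analysis.

```python
# Toy sketch of the cluster analysis described above: group article vectors
# with k-means, then compute a conversion rate per cluster. The real system
# used 20,000+ articles and 361 clusters; all points and flags are invented.

# Each article: (2-D position on the "coverage map", did this view convert?)
articles = [
    ((0.1, 0.2), True), ((0.2, 0.1), True), ((0.15, 0.25), False),  # near origin
    ((5.0, 5.1), False), ((5.2, 4.9), False), ((4.9, 5.0), True),   # near (5, 5)
]

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, centroids, iters=10):
    """Plain k-means with fixed starting centroids (deterministic)."""
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return labels

points = [pos for pos, _ in articles]
labels = kmeans(points, centroids=[(0.0, 0.0), (5.0, 5.0)])

# Conversion rate per cluster: share of articles that prompted a subscription.
for c in sorted(set(labels)):
    conv = [took for (_, took), l in zip(articles, labels) if l == c]
    print(f"cluster {c}: {sum(conv)}/{len(conv)} converted")
```

In practice the inputs would be high-dimensional Doc2Vec vectors rather than 2-D points, and the per-cluster metric could equally be scroll depth or story length, as the list above describes.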

Why is this important for editors?

The Journal publishes hundreds of articles a day, and it's impossible for editors to retain knowledge of every story produced over the past weeks, months, or years. And while human understanding of an article's nature is very sophisticated, it does not scale to thousands of articles. These clusters, then, allow editors to augment their editorial intuition.

WSJ’s lead technologist John West along with data scientist Mark Secada and ML scientist Eric Bolton built an explorer tool, a navigable digital platform that allowed editors and audience analysts to examine unique subsets of data. While traditional analytics dashboards give newsrooms insights about recent performance, this project strengthens the “long-term memory of the newsroom”.

Articles clustered through deep learning allow editors to scale their editorial intuition, and the explorer tool allows them to dig deeper. They can explore a cluster's average word count, specific article headlines, the context in which stories were written, or collections of articles with similar themes. Editors can see how we’ve covered a story over time — looking at the type of story, length, topic and more.
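One of the explorer-tool views mentioned above, average word count per cluster, reduces to a simple aggregation. The sketch below groups invented (cluster, word count) rows; the cluster names and counts are hypothetical, not the tool's actual data model.

```python
# Sketch of one explorer-tool view: average word count per cluster.
# Cluster labels and word counts are invented for illustration.
from collections import defaultdict

# (cluster label, article word count) pairs, as the tool might store them
rows = [
    ("us-monetary-policy", 950), ("us-monetary-policy", 1200),
    ("us-china-tech-war", 700), ("us-china-tech-war", 820),
]

totals = defaultdict(lambda: [0, 0])  # cluster -> [word-count sum, article count]
for cluster, words in rows:
    totals[cluster][0] += words
    totals[cluster][1] += 1

for cluster, (s, n) in totals.items():
    print(f"{cluster}: avg {s / n:.0f} words over {n} articles")
```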

“Pair these insights with audience data and editors can get a pretty good sense of how to promote a story, or understand what packaging they should give to a story when they post it on social media,” says John West, Lead Technologist at WSJ R&D.

The insights that the editors surface from this tool are regarded as opportunities for potential experimentation. Data science is being used to inform the journalistic process, not automate it.

“We saw where the Journal’s coverage was, and where it could go after combining techniques in deep learning and statistics,” explains Mark Secada, data scientist at the Wall Street Journal’s R&D team.

Machine informed. Human led.

Through this approach, we are surfacing a number of hidden insights that are shared with editors so they can take the data and decide how to use it.

The platform lets editors and audience analysts explore the story map themselves and identify the clusters most relevant to their work.

  1. Artificial intelligence and machine learning allow journalists to analyze data, identify patterns and trends from multiple sources and uncover hidden insights — this project being one example.
  2. The knowledge around AI has been democratized: many machine learning models are available online and can easily be implemented by any newsroom.
  3. While these tools help augment journalism, they will never replace it. AI might aid in the reporting process, but journalists will always need to apply their editorial judgment.

Interested in journalism and AI? WSJ is hiring a Machine Learning Engineer to join the newsroom R&D team.
