Staying on Topic — Building an Automated Topic Model of WSJ News Coverage

Tess Jeffers
WSJ Digital Experience & Strategy
6 min read · Mar. 26, 2021

When some people pick up a copy of The Wall Street Journal, they expect to read about one topic: business. But the Journal covers a lot more than business — including major world events, the latest from Capitol Hill, headphones for your work-from-home life, or what to make for dinner.

As the Journal continues its digital transformation with increased focus on our MACU (members, audience, customers, users), our Digital Experience and Strategy group is digging in to further understand our audience and deliver more personalized and curated news experiences. As part of this effort, the News Insights team within DXS has been building a new automated topic model to help annotate and organize our content.

Why does the Journal need an automated topic model?

Our team is often asked to help understand which types of content resonate with our audiences. To do so, we rely on metadata attributes, or “tags,” added to each article as it is published. The majority of article metadata is driven by traditional divisions in the newspaper, like an article’s section of the paper (e.g. “Business”) or relevant keywords mentioned within the story (e.g. “stone mining, motor vehicles, alternative fuel vehicles”). These tags, while extremely powerful, may not always tell us everything we need to know — and indeed, teams at WSJ have spent quite a bit of time manually adding more.

By contrast, a topic model can automatically learn topics we write about that no one ever thought to label — and can help us summarize each article into one label that describes “what it’s all about.” We use these topic labels as an analytics tool to group content together and answer content performance questions (e.g. Which topics are most engaging on the WSJ app on weekdays vs. weekends?). Grouping articles by machine-generated topics instead of hand-labeled ones might reveal insights we wouldn’t have otherwise found.
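Once every article carries a topic label, questions like the weekday-vs-weekend one above become a simple aggregation. A hypothetical sketch in pandas, with made-up column names and numbers (not the WSJ's actual schema or data):

```python
# Hypothetical analytics sketch: group engagement by machine-generated topic.
# "topic", "day_type", and "engagement" are illustrative column names.
import pandas as pd

articles = pd.DataFrame({
    "topic":      ["Tax Policy", "Recipes", "Tax Policy", "Recipes"],
    "day_type":   ["weekday", "weekday", "weekend", "weekend"],
    "engagement": [120, 40, 60, 95],
})

# e.g. Which topics are most engaging on weekdays vs. weekends?
by_topic = articles.groupby(["day_type", "topic"])["engagement"].mean()
print(by_topic)
```

The same groupby works whether the labels came from hand-applied tags or from the model, which is what makes the two directly comparable in analysis.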

In addition to our data science use cases, we’re also thinking about how topics are relevant for our MACU and their experiences with the Journal. First, automated topic detection and labeling could help the Journal create dynamic topic landing pages, to help the reader navigate on-site or “follow” relevant topics of interest. Second, topic labels show potential to be useful as a feature for content recommendations, as we personalize what Journal audiences can read next.

Features of a good topic model for WSJ

The ideal topic model for WSJ should have a few key features. It should be dynamic, meaning it can gracefully accept new articles as they are published each day. It should be hierarchical, meaning that for a specific piece of content, we could assign multiple topics at multiple levels of granularity, like “Politics” (coarse-grained) or “Trump-China Tax Policy” (fine-grained). An ideal model would also generate human-interpretable labels, and of course it must be precise and useful in the product use cases outlined above.

These technological requirements posed a big challenge. From our experience, good topic models are relatively easy to create “in batch” and very hard to deploy in production. Being dynamic was the most challenging requirement to meet, as most popular approaches in the literature, like Latent Dirichlet Allocation (LDA), are not dynamic and cannot easily accept new vocabulary over time. In fact, we’d argue that LDA meets none of the requirements above.

How our model works

As the director of data science here, I took on this project in earnest in August 2020. First, I collected over two years of Journal articles from the WSJ content management system. In total, my model is trained on over 67,000 articles across all sections of the Journal, like Markets, Business, World, U.S., Politics, Life, and Arts. I trained a custom Doc2Vec neural network on these articles to return the joint word and document embeddings. These document embeddings convert the article text into a vector of numbers representing an article’s relative location in semantic space.

Next, I projected this high-dimensional model into a lower dimensional space with UMAP, and applied clustering with HDBSCAN on the lower dimensional projection to identify areas of high density (Figure 1). Each high-density area represents a topic cluster, where many articles were written using words with similar semantic meaning. This feature allows the topic model to differentiate and make sense of the same word used in different contexts. For example, although both articles discuss taxes, an article about how to save on your 2020 taxes is correctly pushed into a topic cluster separate from an article about Maryland’s new digital advertising tax.

Data visualisation of WSJ articles from 2019 & 2020, clustered by density and colored by affiliation
Figure 1. 2-Dimensional representation of Journal articles from 2019 & 2020, clustered by density and colored by affiliation to one of 15 topics.

At the end of this process, the model generates a series of topics, each described by a list of keywords that try to summarize what the topic is about. Every article receives a probability score describing how “close” it is to each of the N topics. For now, I assign each article only to its topic of “best fit.” The model is deployed in the cloud on AWS EC2, and ingests new article text from our CMS API as stories are published. The new topic labels are appended to the article data and written to a lookup table in Google BigQuery.

This approach has a number of advantages. First, with the Doc2Vec neural network approach, very little time is spent pre-cleaning the data with stemmers, tokenizers, etc. Second, the clustering step allows articles to be labeled at arbitrary levels of granularity, from coarse to fine-grained (Figure 2), which is helpful downstream when deciding how many topic pages to create or how best to summarize the content WSJ produces.

Screenshot of an article, with a list of inferred topic labels
Figure 2. Example article with the coarse-grained topic label “Urban Centers” and the fine-grained label “2020 California Prop 22.”

Finally, the Doc2Vec approach can gracefully accept new articles long after training and still generate relevant topic assignments. This won’t always work for brand-new topics or words entering the news, but at the coarse-grained level, the model can hum along in the cloud without issue.

And finally

Here’s a sneak peek at some of the results we’ve only now been able to visualize after building this model. While each section publishes approximately the same number of articles each week, topic-wise publication volume ebbs and flows over time (Figure 3), giving us an improved lens into what we cover, and for whom.

Data visualisation of 15 line charts showing topic publication volume over time
Figure 3. Topic publication volume from Jan. 1, 2019, to Dec. 31, 2020, for 15 example topics.

Similarly, we’ve had a lot of success in identifying content themes that perform well on different platforms but could be covered more to reach wider audiences. We continue to dig in to better understand how this model behaves over time, particularly as new topics emerge in the news. Meanwhile, our topic labels are already being used as inputs to content strategy and product roadmaps, as well as features in our other data science models.

What do we do next? Keep building.

Tess Jeffers is Director of Data Science on the News Insights team, a part of the Journal’s Digital Experience & Strategy Team, responsible for actionable product, content, and audience insights and productionized data science models delivered at scale.

Director of Data Science @ The Wall Street Journal. Former Insight Data Science Fellow, Princeton PhD in Quantitative Biology. Data wrangler, cat squeezer.