Wrangling Audience Usage Data for Reproducible Analytics

Brian Turner
WSJ Digital Experience & Strategy
Feb 22, 2021

In its raw, unfiltered state, The Wall Street Journal’s usage data is intimidating. One month of data about how readers consume and interact with platforms like the Journal can be several terabytes, or if you want to think of it in a spreadsheet, picture more than one trillion rows and over a thousand columns.

On the News Insights team at the Journal, we took a three-pronged approach to making it easier for our news, product and data teams to analyze our user data: teaching an introductory course on SQL (a programming language used to access data), teaching an introductory course on web analytics and onsite tracking concepts, and preprocessing the usage data into more manageable pieces.

In this post, I’ll explain how we took the final step, preprocessing the data into smaller, manageable pieces, by creating what we call “Fact Tables.”

There are millions of reader actions on our platform every day. We look to readers’ actions to understand what they value in our experience and what we can do better. Those actions create a “clickstream” feed each day, which is managed by Dow Jones data engineers. Every morning, the previous day’s data feed is deposited in raw form into Google BigQuery, a high-powered cloud data warehouse with standard SQL syntax.
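To make that concrete, here’s a minimal sketch of what pulling a day of that raw feed from BigQuery can look like using the Python client library. The project, dataset, table and column names are hypothetical stand-ins, not our actual schema:

```python
from google.cloud import bigquery

# A sketch of sampling the previous day's raw clickstream feed.
# Project, dataset, table and column names are hypothetical.
client = bigquery.Client(project="dj-data")

query = """
    SELECT event_timestamp, event_type, page_url
    FROM `dj-data.clickstream.raw_events`
    WHERE DATE(event_timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    LIMIT 100
"""

for row in client.query(query).result():
    print(row.event_type, row.page_url)
```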

It’s great to have the data stored in one place, but making it usable in the newsroom is a significant undertaking that we on the News Insights team have been working to simplify. The data is accessed using SQL queries, and our first step in making it more accessible was recognizing that we were writing more SQL queries than necessary. Data analysts and data scientists across the organization were frequently writing similar, if not identical, complex and verbose SQL queries, which was inefficient, could introduce errors and consumed a lot of compute running those queries over large amounts of data.

This amounted to having too many people wading into our data lake, so to speak, looking for the same thing. It was costly, time-consuming and error prone.

We set out to address these shortcomings by creating Fact Tables, which together form a miniature data lake containing tables of cleaned-up events, or “facts,” and tables of pre-computed aggregated metrics. Before the Fact Tables, a common SQL query for a routine analysis might have run more than 200 lines: a cascade of “case” statements, concatenations, regular expressions, “where” clauses, “union” clauses and all manner of SQL joins.

We can now aggregate the same data in a dozen lines of code or fewer.

The picture below speaks a thousand VARCHARS (I get to throw in at least one SQL pun).

Here’s the size of a routine SQL query on raw data before Fact Tables:

[Image: blurred screenshot of a lengthy SQL query on raw data]

Here’s a comparable SQL query after Fact Tables:

[Image: screenshot of a much shorter SQL query]
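The screenshots don’t reproduce well here, but as a rough illustration of the “after” side, a routine aggregation against a Fact Table might boil down to something like the sketch below. The table and column names are hypothetical:

```python
from google.cloud import bigquery

# Illustrative only: a short aggregation against a hypothetical fact table.
client = bigquery.Client()

sql = """
    SELECT section, COUNT(*) AS pageviews
    FROM `project.fact_tables.wsj_pageviews`
    WHERE event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    GROUP BY section
    ORDER BY pageviews DESC
"""

for row in client.query(sql).result():
    print(row.section, row.pageviews)
```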

The Fact Tables save query time and compute by breaking Journal events out of the broader Dow Jones data (which also includes MarketWatch and Barron’s) and segmenting important types of onsite user actions into distinct sets of tables: pageviews, interactions (for example, sharing an article to social media) and conversions (for example, a subscription purchase).
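As a hedged sketch of what that segmentation could look like, here is one way to break Journal pageviews out of a raw event feed with BigQuery DDL run from Python. All names here are made up for illustration:

```python
from google.cloud import bigquery

# Hypothetical sketch: break Journal pageviews out of the raw
# Dow Jones event feed into their own partitioned fact table.
client = bigquery.Client()

client.query("""
    CREATE OR REPLACE TABLE `project.fact_tables.wsj_pageviews`
    PARTITION BY event_date AS
    SELECT event_date, event_timestamp, visitor_id, session_id,
           page_url, page_type, section
    FROM `project.clickstream.raw_events`
    WHERE brand = 'wsj'            -- Journal only, not MarketWatch or Barron's
      AND event_type = 'pageview'  -- interactions and conversions get their own tables
""").result()
```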

Furthermore, more complicated “pathing” metrics, such as click-through rates (navigating from a landing page to an article) and recirculation rates (navigating from one article to another), are pre-aggregated daily.
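For instance, a daily recirculation aggregate could be pre-computed with a window function over the pageview fact table. This is a sketch with hypothetical names and a simplified metric definition, not our production logic:

```python
from google.cloud import bigquery

# Hypothetical sketch: the share of article pageviews whose next
# pageview in the same session is another article, rolled up daily.
client = bigquery.Client()

client.query("""
    CREATE OR REPLACE TABLE `project.fact_tables.daily_recirculation` AS
    WITH ordered AS (
        SELECT event_date, page_type,
               LEAD(page_type) OVER (
                   PARTITION BY visitor_id, session_id
                   ORDER BY event_timestamp
               ) AS next_page_type
        FROM `project.fact_tables.wsj_pageviews`
    )
    SELECT event_date,
           SAFE_DIVIDE(
               COUNTIF(page_type = 'article' AND next_page_type = 'article'),
               COUNTIF(page_type = 'article')
           ) AS recirculation_rate
    FROM ordered
    GROUP BY event_date
""").result()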

[Image: flow chart showing the query process before and after Fact Tables]

A quick overview of the process: The Fact Table codebase is hosted on an Amazon cloud-computing service (EC2) and executes daily via cron, a utility that runs scheduled jobs automatically. That codebase interacts with BigQuery through a Python API to programmatically create tables, execute queries and verify that the necessary upstream processes have completed.
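As a rough sketch of that setup (the file paths, table names and schedule are all hypothetical), the daily job might look something like this, triggered by a crontab entry on the EC2 host:

```python
# build_fact_tables.py -- a minimal sketch of the daily job.
# A crontab entry such as the following could trigger it each morning:
#   0 6 * * * /usr/bin/python3 /opt/fact_tables/build_fact_tables.py
from google.cloud import bigquery

client = bigquery.Client()

# Each step builds one fact table from a SQL file (names are hypothetical).
STEPS = [
    ("wsj_pageviews", "sql/pageviews.sql"),
    ("wsj_interactions", "sql/interactions.sql"),
    ("wsj_conversions", "sql/conversions.sql"),
]

def run_step(table_name: str, sql_path: str) -> None:
    """Execute one table-building query and block until it finishes."""
    with open(sql_path) as f:
        job = client.query(f.read())
    job.result()  # raises on failure, so a broken step halts the pipeline
    print(f"built {table_name}")

if __name__ == "__main__":
    for table, path in STEPS:
        run_step(table, path)
```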

The work is paying off. The Fact Tables were widely used to facilitate an extensive analysis of Journal content last year, and continue to serve as the underlying foundation for the News Insights team’s reporting and analytics needs. An intern who worked with us commented that the Fact Tables were the single most useful tool he worked with during his project.

Most importantly, we hope the Fact Tables have improved data consistency and trust while ramping up productivity for future generations of data connoisseurs at The Wall Street Journal.

With smarter pathways to our data, we can better serve our readers.

Brian Turner is a data scientist at The Wall Street Journal.
