How News Agencies Can Leverage Data For Editorial Decision Making

Have you ever wondered how news agencies decide what stories their journalists cover? To produce in-demand content, it is crucial for any news agency to allocate journalists to stories and topics that are of most interest to its readers. However, it’s not an easy task to grasp where readers’ interests lie now, let alone predict where readers’ interests will lie in the future.

In Columbia Business School’s “Analytics in Action” class, we (a team of three MBA students and two engineering students) collaborated with The Associated Press (“AP”) to tackle this fundamental challenge for news agencies by putting the data analytics tools we’ve learned at CBS into action (as the course name suggests). We investigated patterns in readers’ activities (a proxy for reader interest) on AP’s news platforms to create a tool that can help AP make smarter decisions about where to allocate its journalists. Here is how we approached this challenge and our findings after three months of “analytics in action”.

The Problem

AP is a not-for-profit news agency that offers fact-based news on (1) its end-user facing mobile app and website for free (the “B2C platform”) and (2) its business-customer facing AP Newsroom website via a paid subscription (the “B2B platform”). Examples of B2B customers are international and local news networks that distribute AP’s content on their own platforms. As globalization and digitalization expand the breadth of news coverage, AP’s B2B customers (the source of AP’s revenue) increasingly demand in-depth story coverage. In other words, AP faces a resource allocation problem: where should AP focus its journalistic resources in order to deliver maximum value to its B2B customers?

The Data

To tackle this problem, we received three months (June to August 2020) of user activity data for AP’s B2B and B2C platforms. The data included the number of page views and screen views for each article on the B2C platform, as well as the number of previews and downloads for each article on the B2B platform. Both B2C and B2B data included a timestamp for each page view, screen view, preview, and download. In addition, for each article we received data on attributes such as subject type (e.g. sports) and content type (e.g. investigations).

NOTE: We acknowledge that 2020 has been an outlier year as COVID-19 has received extensive coverage from news agencies across the world, including AP. Although insights drawn from our analysis of this given dataset may not entirely be representative of broader trends in general reader interest, we believe that the analytic tools developed through this project will continue to be valuable for AP in making data-driven decisions in the post-pandemic world.

Our Analysis

Although B2B customers are a primary focus for AP’s business model, richer data is available for the B2C platform. Thus, our initial thought was to see whether we could use activity on the B2C platform as a proxy or predictor of activity on the B2B platform. If there were a strong correlation or leading/lagging relationship between B2B and B2C activity, then we could potentially use the larger B2C activity dataset to draw insights for B2B activity, specifically regarding what stories AP should focus on in the near future.

We first set out to clean up and merge the datasets we received to analyze the correlation between B2B and B2C user activity across different subjects, content types, and times of activity. We found that B2C activity is positively correlated with B2B activity for certain content types and subjects. However, the magnitude of correlation varied across topics, and the correlation, though positive, was generally not very strong. For example, the correlation between B2B activity and B2C activity is stronger for “Business” news than for “Sports” news, but the correlation coefficients in each case did not exceed 0.7, generally considered the threshold for a “strong” correlation.
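As a sketch of this step, the per-subject correlation can be computed along the following lines with pandas. The column names and data below are illustrative, not AP’s actual schema: we assume one row per user event with a timestamp and the article’s subject.

```python
import pandas as pd

# Hypothetical event logs: one row per page view (B2C) or preview (B2B).
b2c = pd.DataFrame({
    "timestamp": pd.to_datetime(["2020-06-01 09:00", "2020-06-01 10:00",
                                 "2020-06-02 11:00", "2020-06-03 12:00"]),
    "subject": ["Business", "Business", "Business", "Business"],
})
b2b = pd.DataFrame({
    "timestamp": pd.to_datetime(["2020-06-01 09:30", "2020-06-02 10:30",
                                 "2020-06-02 13:00", "2020-06-03 14:00"]),
    "subject": ["Business", "Business", "Business", "Business"],
})

def daily_counts(df, subject):
    # Count events per calendar day for one subject.
    rows = df[df["subject"] == subject]
    return rows.set_index("timestamp").resample("D").size()

def b2b_b2c_correlation(subject):
    # Align both platforms on a common daily index, then take Pearson r.
    c = daily_counts(b2c, subject)
    b = daily_counts(b2b, subject)
    joined = pd.concat([c.rename("b2c"), b.rename("b2b")], axis=1).fillna(0)
    return joined["b2c"].corr(joined["b2b"])
```

The same loop over subjects and content types yields the per-topic correlation coefficients discussed above.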

Left: B2C-B2B Correlation For “Business” / Right: B2C-B2B Correlation For “Sports”

We also did not observe significant variation in correlation coefficients across days of the week or times of day; that is, the strength of correlation was consistent over time. Below (left) is an example showing B2B-B2C correlation remaining relatively constant across all days of the week for “General News” and “Government and Politics” articles. Below (right) shows similar trends across the time of day, except for late-night hours (0–5 am) when the correlation is lower. These trends (or lack thereof) were consistently found across most news subjects and content types.

Left: Correlation across Day of Week / Right: Correlation across Time of Day
NOTE: The x-axis represents the number of hours that B2C activity leads/lags B2B activity, and the y-axis represents correlation strength
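The lead/lag analysis behind these charts can be sketched as follows: shift the hourly B2C counts against the hourly B2B counts and compute a correlation coefficient at each offset. This is a minimal NumPy sketch under that assumption; the actual analysis ran on AP’s hourly activity series.

```python
import numpy as np

def lagged_correlations(b2c_hourly, b2b_hourly, max_lag=12):
    """Correlate hourly B2C activity against B2B activity at each offset
    from -max_lag to +max_lag hours. A positive lag means B2C leads B2B."""
    b2c_hourly = np.asarray(b2c_hourly, dtype=float)
    b2b_hourly = np.asarray(b2b_hourly, dtype=float)
    n = len(b2c_hourly)
    out = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            # Pair B2C at hour t with B2B at hour t + lag.
            x, y = b2c_hourly[:n - lag], b2b_hourly[lag:]
        else:
            # Negative lag: B2B leads B2C instead.
            x, y = b2c_hourly[-lag:], b2b_hourly[:n + lag]
        out[lag] = np.corrcoef(x, y)[0, 1]
    return out
```

Plotting the resulting dictionary (lag on the x-axis, correlation on the y-axis) reproduces the shape of the charts above.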

As a result, we needed to pivot our focus and find a way to identify topics that were rising in popularity solely from the B2B dataset. We started by brainstorming metrics that could be beneficial for our analysis and came up with the following three:

  1. Current popularity (as measured by previews per article)
  2. Growth in popularity (as measured by growth in previews per article over the past month)
  3. Current level of investment (as measured by the number of articles written)

Our reasoning was: if we could identify segments that are both currently popular and growing in popularity but underinvested in, then we could recommend that AP focus its resources on those topics.
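A minimal sketch of how these three metrics could be computed per topic with pandas follows; the table schema, topic names, and numbers are illustrative, not AP’s data.

```python
import pandas as pd

# Hypothetical per-article table: topic, publication month, B2B previews.
articles = pd.DataFrame({
    "topic":    ["Education"] * 4 + ["Sports"] * 4,
    "month":    ["2020-06", "2020-06", "2020-07", "2020-07"] * 2,
    "previews": [10, 20, 40, 50, 30, 30, 25, 35],
})

def topic_metrics(df, prev_month, cur_month):
    def per_topic(g):
        cur = g[g["month"] == cur_month]
        prev = g[g["month"] == prev_month]
        popularity = cur["previews"].mean()            # previews per article now
        growth = popularity - prev["previews"].mean()  # month-over-month change
        investment = len(cur)                          # articles written now
        return pd.Series({"popularity": popularity,
                          "growth": growth,
                          "investment": investment})
    return df.groupby("topic").apply(per_topic)
```

Each row of the result maps directly onto the dashboard: popularity on the x-axis, growth on the y-axis, and investment as bubble size.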

Around this time, through our discussions with AP, we learned about AP’s coverage priorities for 2021 (such as “COVID” and “Education”) and that AP was keen to understand which of these priority topics (“macro-topics”) and angles (“subtopics”) would be of most interest to its customers. We received a new, detailed list of keyword tags for each article in our existing dataset, and using this list, we were able to classify articles into the eight macro-topics that AP has designated as priorities. For example, articles tagged with keywords such as “Pandemic” or “Virus” were mapped to the “COVID” macro-topic.
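The tag-to-macro-topic mapping can be sketched as a simple keyword lookup. The tag sets below are illustrative examples only, not AP’s actual keyword list.

```python
# Illustrative keyword sets per macro-topic (not AP's real mapping).
MACRO_TOPICS = {
    "COVID": {"pandemic", "virus", "coronavirus", "vaccine"},
    "Education": {"school", "university", "students", "teachers"},
}

def classify_article(tags):
    """Return every macro-topic whose keyword set overlaps the article's tags."""
    tags = {t.lower() for t in tags}
    return [topic for topic, keywords in MACRO_TOPICS.items() if tags & keywords]
```

An article can carry several tags, so this mapping naturally allows one article to fall under more than one macro-topic.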

Furthermore, we were able to identify subtopics within the macro-topics by using a k-means clustering algorithm that groups articles into “clusters” based on calculated similarity between the vectorized (i.e. numerical representations of) keywords. We further automated the algorithm to compare four different cluster-fitness measures and select the “optimal” K (i.e. the number of subtopic clusters for each macro-topic).

Once we classified each article to its corresponding macro- and subtopics, we then started thinking about how to visualize user activity. After much debate on the ideal output format, we decided on a Tableau dashboard as shown below.

The x-axis represents a topic’s current popularity (i.e. the average number of previews per article), the y-axis represents its growth in popularity (i.e. the change in previews per article over the selected time period), and the size of each bubble represents AP’s current investment in that topic (i.e. the number of articles written on it). In this format, small bubbles in the upper-right quadrant of the chart represent high-interest, high-growth topics that have been underinvested in. From the example above, we identified “Education” as a relatively high-growth, high-interest topic in July 2020, and given its relatively small bubble size (i.e. small investment), suggested that AP could invest more resources there to better meet customer demand.

However, what types of “Education” articles, specifically, should AP write more of? Our dashboard allows users to zoom into each macro-topic and see the same output for each subtopic (created by our k-means clustering algorithm). See the example below for a zoomed-in view of the “Education” macro-topic, where our k-means clustering algorithm identified three subtopics. Each bubble is labeled with the 10 most relevant keywords of the clustered articles. For the red bubble on the top right, we can judge from the labels that this subtopic most likely relates to the coronavirus pandemic, and given its high interest and growth, we can conclude that “Education” articles related to the pandemic would benefit from further coverage by AP.

In Closing

Going forward, AP can continue to use our dashboard to support future editorial decisions, as both our dashboard and underlying code can be updated to reflect new emerging news topics. We are excited to see AP use this tool to help tackle its challenge of journalist allocation.

For us, it’s been a challenging yet valuable experience to actually use the analytics tools learned at Columbia to solve real-world business problems. At the same time, working with a team of students with such diverse skill sets has taught us the importance of leveraging each other’s strengths and learning from one another.

Last but certainly not least, we’d like to thank Ken Romano and Ava Tang at the AP for their support throughout this project, and Professors Daniel Guetta and Brett Martin and TAs Muye Wang and Arunesh Mittal for their helpful feedback and guidance throughout the semester.

The Columbia team that worked on this project came from various backgrounds across data science, consulting, and finance.
