Data Science for a Smart News Experience
This article is about how data science and Google’s DNI Fund helped NZZ, Switzerland’s newspaper of record, on its quest to become a truly data-driven publisher.
One of the most exciting places for data scientists to be these days is not the Googles and Facebooks and Ubers of the world, but companies like The Washington Post, The New York Times or Neue Zürcher Zeitung — traditional publishers reinventing their roles in a digital society while their old business model erodes. Make no mistake: transforming traditional media companies into data-savvy ones is a matter of survival.
A news stream that upholds journalistic standards and still increases relevance through personalization.
One particular area within news publishing that can greatly benefit from a more data-driven approach is content distribution. How can we use data and algorithmic technologies to help our readers be better and more efficiently informed? Unlike many personalization efforts in e-commerce or advertising, our aim at NZZ is not to provide recommendations optimized to increase click rates. Rather, we have been looking into engineering a personal news stream that upholds journalistic standards: a news stream best described as a personal news companion.
Data-Driven. From the beginning.
From today’s perspective this idea seems obvious, as everyone in the media business is talking about personalization. However, back in 2015, when we started brainstorming innovation projects, the idea for a personalized news stream was not quite as obvious, and really emerged from a few observations we made at NZZ:
- The vast majority of article views come from our landing page. And this is true even for subscribers. What does that mean? It means that people apparently rarely bother to actually browse the department pages and prefer one “feed” for their daily news consumption. And currently this feed is optimized in a “statistical sense”: articles that are relevant to the most people, and thus have the greatest potential reach, get featured, even if “most” means just 51% of all readers.
- Landing pages are highly volatile. If you are not constantly browsing the landing page of your favorite news site, then, due to the rapidly changing selection of featured articles on that very limited space, you’ll be missing out on a lot of interesting stories — especially the ones that might be relevant to you.
- There is no “the reader”. Every reader is different. And every reader values different aspects of the journalistic service we provide. It’s our job to help you, personally, to be well informed. And this means also to provide a news selection that fits your personal understanding of relevance.
- There is no “one interest”. Even if we look at an individual level of readers, there is no “one interest”. Interest, and what readers consider relevant, is dynamic. In the morning you might be only reading the finance news, in the evening only sports, and on the weekend maybe mostly arts and culture. And to make things worse: it’s not only time, but context in general that determines the reader’s current information need.
- Mobile First = Small Screens First. This is a huge challenge. And one that I find is always underestimated when talking about mobile first. Physical newspapers have the advantage of space. In a fraction of a second my eye scans all the article headers on a page and my brain decides which articles to actually read. It’s perfectly efficient. On a large enough monitor, given a suitable page layout, this will work as well (and epapers are good examples of that), but on a small mobile screen: no chance. The linear stream is the only thing that is technically feasible. This implies that the selection of articles we show our readers is absolutely key to user satisfaction. Our readers are not going to scroll/swipe for 15 seconds until they find something interesting.
NZZ News Companion & Beyond
In 2016 we were lucky to receive a Google Digital News Initiative grant to start working on this vision, the NZZ News Companion. For us, the NZZ News Companion is the archetypal smart media product: providing everyone with their own smart, adaptive and context-aware news stream. Perfect information. No unnecessary distraction, but also no filter bubble — a news stream that is highly relevant and upholds journalistic standards.
However, from a technical perspective, this personalized news stream is only part of the story. We see the NZZ News Companion in a wider picture of what I call data products for smart content-delivery. Smart products are basically services that help us to automate and optimize content distribution using data:
- Personalized content curation. This includes all-in-one news streams like the personal news companion, personalized newsletters, personalized push, etc.
- Group-based automatic content curation. Based on certain interest groups, we can automatically curate our content. One example is recommendations for certain geographic groups, which we already do very successfully for our growing German readership.
- Topic-based automatic content curation. One example is automatic curation of content lists by topics or topic-based newsletters.
- Context-based automatic content curation. Context is tough, and something that is on our to-do list. Examples include detecting a commuting context and automatically recommending an article selection that fits the expected commute time.
With the help of the Google DNI grant we were able to lay the groundwork for our internal data product platform and beta-test an initial version of the NZZ News Companion. Basically, we built an app for a group of beta testers in order to learn more about the needs and usage patterns around a fully personalized news product. Read this article by my colleague Rouven for a more general summary of the project.
Technology & Algorithms
I’d like to give some insight into the technology, and maybe more importantly, the “algorithms” that we use to build smart media products. If you have any more detailed questions, feel free to reach out to me personally.
Over the last months we built a general “data product platform” that enables us to deploy many kinds of smart content delivery solutions. This platform consists of three pillars:
- A data-pipeline for importing/streaming in-house and external data into our data lake
- An internal code library for extraction, transformation and manipulation of this data to calculate the data products
- A highly scalable and customizable RESTful API to serve these data products
Number one, a reliable data pipeline, is the basis for any data-driven product. We made sure to have a stable, reliable and (as close as we can get to) 100%-available data pipeline (over the last 8 weeks we achieved an availability of 98%).
Number two, our internal code library, is our secret sauce. It covers all kinds of things: data transformations, statistics, machine learning, natural language processing, and building “the algorithms” to actually calculate something useful, such as article recommendations.
Number three, the API, connects the results of number two to the outside world and is the crucial part that allows our data products to be consumed by our frontend servers (or really any other service).
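To make the third pillar concrete, here is a minimal sketch of what serving precomputed recommendations over HTTP could look like, using Python’s standard WSGI interface as an illustrative stand-in. The endpoint path, query parameter and in-memory data store are invented for this example and are not NZZ’s actual implementation.

```python
import json
from urllib.parse import parse_qs

# Hypothetical precomputed recommendations, keyed by user id.
# In production these would come from a data store fed by the
# recommendation jobs, not an in-memory dict.
RECOMMENDATIONS = {
    "user-42": [{"article_id": "a1", "score": 0.91},
                {"article_id": "a7", "score": 0.83}],
}

def app(environ, start_response):
    """WSGI app serving GET /recommendations?user=<id> as JSON."""
    if environ.get("PATH_INFO") != "/recommendations":
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    params = parse_qs(environ.get("QUERY_STRING", ""))
    user = params.get("user", [""])[0]
    body = json.dumps(RECOMMENDATIONS.get(user, [])).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

Such an app can be run under any WSGI server (e.g. `wsgiref.simple_server` from the standard library), so frontend servers and other services can consume the data products over plain HTTP.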
We are fully in the cloud and solely use open-source software; there are no proprietary dependencies. We are especially happy with Apache Spark: it’s the general-purpose parallel computation platform we have been looking for. And even if, in most cases, we are not actually operating on “big data”, calculating recommendations for tens of thousands of customers is time-consuming, and being able to comfortably parallelize these operations lets us compute them in a performant and timely manner.
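The pattern here is an embarrassingly parallel map over users. We use Spark for this in practice; as a self-contained illustration of the same idea, the sketch below fans out a dummy per-user scoring function with Python’s standard `concurrent.futures`. The scoring logic and data shapes are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def score_user(user_sections, articles):
    """Toy scorer: rank articles by whether their section was read by the user."""
    read = set(user_sections)
    return sorted(articles, key=lambda a: a["section"] in read, reverse=True)

def recommend_all(histories, articles, workers=8):
    """Compute recommendations for all users in parallel.

    With Spark this would be a map over a DataFrame/RDD of users on a
    cluster; here a thread pool stands in for the cluster to show the
    one-independent-task-per-user structure.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {uid: pool.submit(score_user, hist, articles)
                   for uid, hist in histories.items()}
        return {uid: fut.result() for uid, fut in futures.items()}
```

Because each user’s recommendation list depends only on that user’s history and the shared article pool, the computation scales out naturally, which is exactly what makes Spark a comfortable fit.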
The most important part when it comes to data products is the underlying algorithm, i.e. the way we choose to compute the article recommendations. These days, when people talk about “algorithms”, what they mostly mean are algorithms for statistical inference, i.e. algorithms that detect patterns in data. But these methods can be somewhat blackbox-y, in the sense that you put data in and get some result out without knowing why. Considering our goal of building a news stream that upholds journalistic standards, this is somewhat disconcerting. More critically, these types of algorithms are not designed to shape the future but to replicate the past. This makes them more prone to creating effects like the filter bubble.
Machine learning algorithms are not designed to shape the future, they are designed to replicate the past.
This is why, in the case of the NZZ News Companion, we first spent much more time on algorithm design, which means understanding the desired user experience, and only then decided which methods and algorithms could achieve it in an automated way. In our case we envisioned the following user experience:
- Readers should be recommended articles for up to two days, if the articles remain relevant
- Readers should see articles that match their interest early after publication and high up in their recommendations
- Readers shouldn’t miss relevant news, even if it is not in their “proven realm of interest”
- Readers should always feel well informed and never fear being caged in a filter bubble
Boiling all of this down, the challenge translates into finding good measures for two principal concepts: general relevance and personal relevance. Once these were quantified, we could combine them with some additional business logic into a recommendation algorithm to create the NZZ News Companion.
We found that general relevance, i.e. the idea that some content is of such great general interest that it should at least be acknowledged by every reader, is hard to define algorithmically. How is a machine (without general AI) supposed to know what kind of content might be relevant given all the current and past context? I am not saying that it is generally impossible to create such a decision algorithm, it’s just impractical at this point in time, especially given that we have journalists creating our stories with exactly this knowledge of relevance in mind. So we opted to create a metric for general relevance that we call editorial relevance. It’s based on quantifying how articles have been placed on our website by our editors, and deducing relevance from this placement. Simplified, one could say: the more prominently and the longer a certain article appears on our website, the more editorial relevance we assign to it. We found this to be a rather simple, yet powerful concept.
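A minimal sketch of such a placement-based metric might look as follows. The slot names, weights and log format are invented for illustration; the article only states the principle that more prominent and longer placement means more editorial relevance.

```python
# Hypothetical prominence weights per placement slot on the landing page.
SLOT_WEIGHT = {"top": 3.0, "middle": 1.5, "bottom": 1.0}

def editorial_relevance(placements):
    """Sum prominence-weighted time on the page.

    `placements` is a list of (slot, hours_on_page) tuples describing
    where and for how long the editors featured an article.
    """
    return sum(SLOT_WEIGHT.get(slot, 0.0) * hours
               for slot, hours in placements)
```

For example, an article featured two hours in the top slot and five hours in the middle slot would score `editorial_relevance([("top", 2), ("middle", 5)])` → `13.5`, while a never-featured article scores `0.0`.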
Personal relevance is no less tricky, but at least we can use some machine learning here. The idea is simple: we observe what you have shown interest in so far (which stories you have read) and learn your interests. Always reading news from the “finance” section in the morning and “international” in the evening? Great, we can pick up that pattern and adjust your recommendations accordingly. This becomes much more complex once you take more features into account, like the author, the length of an article, tags, and more. Also, using some advanced natural language processing techniques, we were able to match interests more closely, beyond pure meta information. This allows us to recommend articles that match a user’s interests not only at a high level, but in a more fine-grained fashion.
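In its simplest form, the finance-in-the-morning pattern described above can be captured by counting reads per (section, time-of-day) bucket and scoring candidate articles against those counts. This is a toy illustration of the idea, not NZZ’s actual model, which uses many more features and NLP:

```python
from collections import Counter

def build_profile(reading_log):
    """Count reads per (section, daypart) bucket, e.g. ("finance", "morning").

    `reading_log` is a list of (section, daypart) tuples, one per read article.
    """
    return Counter((section, daypart) for section, daypart in reading_log)

def personal_relevance(profile, section, daypart):
    """Share of the user's past reads matching this section and daypart."""
    total = sum(profile.values())
    return profile[(section, daypart)] / total if total else 0.0
```

A reader with three morning finance reads and one evening international read would get a score of 0.75 for a morning finance article and 0.0 for a section they never touch; richer models refine exactly this kind of signal.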
What did we learn so far?
The biggest learning really concerns the goal of building a one-size-fits-all news solution. In German we have a funny word for this: eierlegende Wollmilchsau (an egg-laying, wool-bearing, milk-giving sow), something that is impossible to achieve. It’s hard to build a product that suits everyone, even if “everyone” is a set of readers that in principle has an affinity for our content. The different ways people read our content, with different information needs at different times, simply prohibit a one-size-fits-all solution. There needs to be some sort of choice on the customer side. And the most straightforward way to achieve this is to create many more specialized “data products”, products optimized for different reading situations, that the user can choose from. This is what we will do at NZZ, and our new data product platform is the basis for that.
How will we go on?
We are going to continue development of the data product platform and deploy more smart content delivery solutions. Based on this system we have already launched an automatically curated newsletter, optimized for a certain group of readers — outperforming all other newsletters in terms of click rate. And there are more things to come — so stay tuned and be surprised.