Unveiling the first generation data architecture of a newspaper

Cristina Kadar
Published in NZZ Open · 9 min read · Jun 26, 2023

In this article we describe how NZZ, Switzerland’s German-language newspaper of record, developed and improved its first cloud data architecture powering various data products: use-case driven, iterative, and modular.

Authors: Paweł Kaczorowski (former Lead Data Engineer) and Cristina Kadar (Machine Learning Product Owner & Senior Data Scientist)

Introduction

In today’s digital age, where automation and data-driven decision-making reign supreme, the underlying architecture that powers the increasing number of such use cases keeps evolving. As we find ourselves in the midst of a revamp of our internal data strategy and corresponding data platform, it is worth pausing for a bit to explore the beginnings of our data journey.

In this blog post, we will delve into the first-generation data architecture of a newspaper, and shed light on its intricacies and challenges. We focus on the data lake powering our internally developed AI/ML and advanced analytics products.

Evolution

Our cloud journey commenced in 2016, when NZZ was awarded a Google Digital News Initiative (DNI) grant to build its first news recommender system. It was during this period that our data team started leveraging the capabilities of the Google Cloud Platform (GCP) and began developing its initial data pipelines. With a subsequent DNI grant in 2017/2018 for a machine learning model predicting user propensity to subscribe, we continued to extend our capabilities in the cloud.

Ever since, we have iteratively improved and expanded the cloud architecture to support many new data products, including:

  • A wealth of recommender systems such as personalized article feeds in our customer-facing digital products (websites, apps, and newsletters);
  • Predictive models supporting our marketing efforts such as user propensity to churn;
  • Dashboards used by our editorial teams to monitor, optimize, and report online article performance;
  • Other internal tools, such as an NLP-powered system to semantically annotate our content and power downstream data products.

Fast-forwarding to 2022, our data processing capabilities have expanded significantly, enabling us to handle massive amounts of data across various dimensions:

  • Behavioral data tracking: We track events on about 80 different technical domains and receive about 500 requests per second on average, with peaks of up to 2000 requests per second.
  • Behavioral data volume: The data platform stores about 200GB of data per day and this volume continues to grow.
  • Automated jobs: Our data platform encompasses around 100 scheduled pipelines for data processing, transformation, and analysis.
  • Users: About 100,000 unique users visit our digital products daily.
  • Read articles: Across all platforms, our users read about 28K distinct articles daily.

Challenges

The main challenges we have faced while developing the data platform can be broadly classified into the following categories:

  • Data volume: We initially started at an average of 20GB of behavioral data ingested per day. However, as our platform evolved, we experienced exponential growth, and we currently manage 10 times that amount at 200GB per day, steadily increasing.
  • Traffic peaks: Our data inflow experiences irregular patterns. While we can predict daily traffic peaks in the morning when most users come online after waking up, unexpected breaking news events can cause immediate traffic spikes regardless of the time of day.
  • Process scalability: Our data architecture comprises a multitude of interconnected processes. It is crucial for us to efficiently handle both large-scale data sets and smaller data volumes, emphasizing cost-effectiveness and optimal resource allocation.
  • API reliability: The value of even the most sophisticated data product diminishes if client systems, such as front-end apps and websites, cannot consistently and reliably consume its output.
  • Code maintainability: This has been a critical requirement, prompting frequent architectural modifications. Our diverse team comprises Data Analysts, Data Scientists / Machine Learning Engineers and Data Engineers, without dedicated administrators overseeing every detail. As a result, we must ensure all components work seamlessly. Balancing specialization and proficiency across multiple technologies is crucial, as our engineers cannot become experts in every single tool.

Cloud architecture overview

High-level architecture overview.

Our architecture encompasses key components commonly found in a data processing stack:

  • an input layer which serves as the primary ingestion point and brings in data from various sources, including the behavioral data from the digital products and the text data from the content management system (CMS);
  • a storage layer which offers flexibility in terms of storage options based on specific goals and requirements;
  • a data processing layer which is the core component of the system. It performs all essential processing tasks, from transforming and aggregating the raw ingested data to applying AI/ML to derive actionable outcomes;
  • an output layer which serves the output of our data products to the downstream systems.

To ensure the smooth operation and collaboration of these core layers, the following components are in place: scheduling is responsible for job automation, continuous integration/continuous deployment (CI/CD) enables rapid prototyping and stable deployment of new features and updates, and finally, monitoring provides visibility into the system’s status, enabling proactive identification and resolution of any issues or anomalies.

Input and output

Real-time ingestion of behavioral data.

In the input layer, the most complex task is the real-time ingestion of the user behavioral data from our news web pages and mobile applications. The data follows the W3C Digital Data Layer standard and is formatted in JSON.
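
To make this concrete, here is a minimal sketch of what such an event could look like and how nested JSON can be flattened into columns before storage. The field names follow the spirit of the Digital Data Layer standard but are illustrative, not our actual schema:

```python
import json

# A simplified event in the spirit of the W3C Digital Data Layer standard.
# Field names and values are illustrative, not NZZ's real tracking schema.
raw_event = json.dumps({
    "pageInstanceID": "article-123-prod",
    "page": {
        "pageInfo": {"pageID": "article-123", "language": "de"},
        "category": {"primaryCategory": "wirtschaft"},
    },
    "user": [{"profile": [{"profileInfo": {"profileID": "u-456"}}]}],
})

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts/lists into dotted column names."""
    out = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            out.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            out.update(flatten(value, f"{prefix}{i}."))
    else:
        out[prefix.rstrip(".")] = obj
    return out

flat = flatten(json.loads(raw_event))
print(flat["page.pageInfo.pageID"])                    # article-123
print(flat["user.0.profile.0.profileInfo.profileID"])  # u-456
```

Flattening like this turns deeply nested tracking payloads into flat records that are easy to store columnar (e.g., as Parquet) and to query downstream.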

Our data ingestion service is a REST API built with Java and the Spring Framework and deployed on a Kubernetes cluster. To handle unexpected traffic peaks, it automatically scales horizontally by spawning new pods. Fluentd, an open-source data collector also deployed on Kubernetes, buffers the incoming events and writes them to Cloud Storage, providing a logical separation between the ingestion web service and the storage layer.

With the introduction of GKE Autopilot, we can now deploy Kubernetes artifacts without manual infrastructure configuration, transferring administration to GCP.

Additional Python scripts and Java applications in the input layer ingest data from external REST APIs, FTP servers, and other data sources. Similarly, in the output layer, our REST APIs and other applications serve the final results of our data products. These are consumed both in customer-facing products and in internal products used by different business units.

Storage

We employ various storage solutions to cater to different requirements within our data infrastructure, as opting for a single storage solution is not feasible. Here’s a brief overview of our main storage options and the rationale behind their selection:

  • Cloud Storage is our distributed hard drive, storing data at different processing stages and integrating seamlessly with Apache Spark, our main processing tool. We prefer saving data in Parquet format, partitioned by date.
  • For real-time access with low latency, we rely on Cloud SQL — GCP’s fully managed relational database service for PostgreSQL. It separates writers (distributed data processing jobs) from readers (APIs) by using a main read/write instance and a read replica. This configuration ensures a smooth reader experience, even during high data write peaks. Although there may be some replication lag, it is generally negligible based on our experience.
  • Acting as a key-value store and caching layer, Redis sits between the REST API applications and Cloud SQL. It improves response times through time-based cache eviction or manual invalidation, for example for daily reports.
  • With a focus on advanced full-text search capabilities, Elasticsearch is our preferred choice for data products that require sophisticated search functionality. We also use it to store and query additional NLP metadata, such as word embeddings.
  • For selected use cases, we use BigQuery as our cloud data warehouse. Nonetheless, we are in the process of merging it with our on-premises data warehouse into one new cloud lakehouse powering all our AI/ML and analytics use cases (see the “The road ahead” section below).

Data processing

Different data processing options.

We rely heavily on Apache Spark as our main distributed computing engine for processing large data volumes. It is an open-source solution with a robust community and wide adoption across the major cloud vendors. Within GCP, we leverage Spark through the Dataproc service, which simplifies cluster creation and job execution. For smaller pipelines that run on a single machine, we utilize Python scripts and libraries such as pandas. For both approaches, we prioritize short-lived resources for efficient scaling and cost-effectiveness. Each job is executed within its own dedicated environment, such as a single Airflow executor or an Apache Spark cluster, and resources are promptly released after job completion.
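
For the single-machine case, a small pipeline can be as simple as a pandas aggregation. The event columns below are made up for illustration, not our real schema:

```python
import pandas as pd

# Toy behavioral events -- columns are illustrative only.
events = pd.DataFrame({
    "date": ["2022-05-01", "2022-05-01", "2022-05-02"],
    "article_id": ["a1", "a2", "a1"],
    "user_id": ["u1", "u1", "u2"],
})

# Daily article performance: unique readers per article and day.
daily = (
    events.groupby(["date", "article_id"])["user_id"]
    .nunique()
    .reset_index(name="unique_readers")
)

# In production, the result would typically be written back to Cloud Storage
# as Parquet partitioned by date, e.g.:
# daily.to_parquet("gs://<bucket>/daily/", partition_cols=["date"])
```

The equivalent Spark job looks almost identical in PySpark, which keeps the mental model consistent across the small and large pipelines.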

Additionally, GCP’s introduction of serverless Dataproc mode for Spark has further simplified our workflow. With serverless capabilities, we no longer need to worry about cluster configuration, allowing us to focus solely on the processing code. We also started to embrace serverless technologies like Vertex AI for machine learning. These fully serverless products align with our long-term preferences, providing convenience and eliminating the need for infrastructure management.

Scheduling

For workflow orchestration we utilize Apache Airflow, which has become an industry standard over the years. It offers several key advantages:

  • Pipelines as code: Unlike other tools, Airflow pipelines can be fully defined in code. Compared to UI-based pipeline definitions, this approach ensures maintainability, especially for complex graphs;
  • Python programming language: Using standard Python as the pipeline definition language allows easy comprehension for all members of our data team;
  • Strong adoption: Airflow is widely adopted by leading companies, and its continued development is backed by Google’s Cloud Composer, which is built on top of it.

We have over 100 pipelines scheduled in Airflow, which is itself deployed on Kubernetes. This flexible setup enables us to manage an uneven schedule distribution and diverse pipeline execution environments effectively.

CI/CD

Code management and CI/CD processes.

To facilitate efficient collaboration among developers, we store all code and configurations (except passwords) in separate GitHub repositories. Our rule is that code merged into the master branch is ready for deployment into production. Code reviews are mandatory through GitHub’s pull request mechanism. We prioritize test-driven development and strive for good test coverage across various components. While we deploy code as needed, we follow simple guidelines like avoiding Friday or end-of-day deployments.

For CI/CD, we rely on Jenkins. Despite its age, Jenkins meets our requirements and offers a user-friendly UI. We use declarative scripts for pipelines, treating them as code stored in repositories and subject to code review. Jenkins handles package building, testing, and end-to-end deployment pipelines.

Monitoring

Monitoring tools.

The data team is responsible for monitoring all data products throughout their lifecycle, from conception to end-of-life. To ensure effective monitoring in production, we rely on a combination of metrics collection frameworks, visualization tools, and alerting mechanisms.

Our key communication platform for system failures is Slack. A lack of errors on Slack generally indicates a healthy system. We have dedicated Slack channels to immediately report any software failures, regardless of whether they occur in pipelines, apps or APIs. Most of our investigations start from these alerts.

For long-lived applications, we expose metrics in Prometheus, which serves as our main metrics database. This data can be visualized and further investigated in Grafana dashboards. Prometheus, with the help of an alert manager, sends alerts (e.g., to Slack) when specific metrics exceed defined thresholds.
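
As a small sketch of such instrumentation with the Python Prometheus client library (the metric and endpoint names are made up for illustration):

```python
from prometheus_client import Counter, Histogram, generate_latest

# Illustrative metrics for a long-lived API process -- names are not our
# production metric names.
REQUESTS = Counter("api_requests_total", "Total API requests", ["endpoint"])
LATENCY = Histogram("api_request_seconds", "Request latency in seconds")

@LATENCY.time()  # records each call's duration into the histogram
def handle_request(endpoint):
    REQUESTS.labels(endpoint=endpoint).inc()
    return "ok"

handle_request("/recommendations")
handle_request("/recommendations")

# In production, prometheus_client.start_http_server(8000) would expose
# these metrics on /metrics for Prometheus to scrape.
exposition = generate_latest().decode()
```

Prometheus then scrapes the exposed endpoint periodically, and alert rules on these metrics (request rates, latency percentiles) feed the Slack alerts described above.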

Although we have monitoring infrastructure in GCP, we recognize the risk of internal failures or bugs causing notification issues. To mitigate this risk, we use an additional external monitoring tool called Checkly, which independently validates the state of our critical APIs, regardless of our infrastructure.

The team

The architecture described in this blog post was developed by our talented Data Engineering team and its core contributors at the time.

The road ahead

While the data lake described above lives entirely in the cloud and relies on our in-house tracking, our analytics stack evolved separately and currently consists of two components:

  • an on-premises MSSQL data warehouse for commercial data (such as subscriptions), with a reporting layer on top via Tableau;
  • Adobe Analytics for tracking and self-service reporting of our behavioral data.

We have noticed that this two-pronged architecture leads to process inefficiencies and data inconsistencies, which we want to address in our second-generation data architecture. Specifically, we are working toward a single source of truth for all our data: a cloud lakehouse powered by Snowflake. This way, both our AI/ML and analytics solutions will source from the same, consistent data layer.

Furthermore, we will keep taking full advantage of GCP: for instance, by finishing the migration of our Spark workflows to the serverless mode in Dataproc, deploying our custom Python AI/ML pipelines to Vertex AI, and maturing our MLOps processes.

Watch this space for updates on our second generation data architecture!

If you liked this article, hit the applause button below, share it with your audience, and follow us for more insights.
