Taking the grind out of article history investigation with a Pub/Sub-centred architecture

How The Telegraph used an event-driven system to save time and keep track of our stories’ publication

Georgios Makkas
The Telegraph Engineering
4 min read · Oct 11, 2022


Credits to https://unsplash.com/@krystagrusseck. Used under the Unsplash License

When old news is bad news

A major, multi-week event was underway when a call came in from Editorial: it looked like an old version of a story was appearing on a syndication partner’s feed, but which version?

A bit of context: at The Telegraph, as well as publishing news across our in-house print and digital platforms, we syndicate our stories to publications around the world. A vital job of the Platforms team is to make sure that our articles reach our syndication partners safely. With stories evolving on a 24-hour basis, it’s also imperative that the content we send out keeps pace with our own reporting.

On receiving the call from Editorial, we found two things: firstly, the article did appear to be out of sync; secondly, the root cause lay buried in a myriad of logs, so we could not provide immediate answers. The scenario highlighted the need for an altogether better solution. It marked the birth of the Content Status project.

A simple event-driven solution

Keen to move on from a web of log entries, we decided to keep the design simple. Each microservice would emit events carrying metadata about the articles being published to our partners. Those events would be captured by a service that would store them, in a suitable format, in a dedicated store. The same service would provide the relevant endpoints to expose the information.
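To make this concrete, a status event need carry little more than the article identifier, the version, the partner it was sent to and the outcome. A minimal sketch of such a payload (the field names here are illustrative assumptions, not our exact schema):

```java
import java.time.Instant;

/**
 * Hypothetical shape of a content-status event emitted by a publishing microservice.
 * Field names are illustrative; the real schema carries whatever metadata each
 * service needs to report (success/failure, error message, etc.).
 */
public record ContentStatusEvent(
        String articleId,           // identifier of the article being syndicated
        int articleVersion,         // version of the article that was sent
        String syndicationPartner,  // which partner feed the article was published to
        String status,              // e.g. SUCCESS or FAILURE
        String errorMessage,        // populated only on failure
        Instant occurredAt          // when the publishing step happened
) {}
```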

Having worked with Google Cloud’s Pub/Sub before, we felt confident that it would serve as an event queue that would allow us to decouple the event processing from the article publishing. Such a decoupling was crucial; we did not want to impede the high performance of the microservices serving our syndication partners.

For storage, we opted for an Elasticsearch cluster. Given the volume of events we expected, Elasticsearch could scale without issue, handling a considerable number of documents while providing the search capabilities it is famous for.
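Since the processing service is a Spring Boot application, one natural (though by no means the only) way to map events to Elasticsearch documents is Spring Data Elasticsearch. A sketch, assuming a content-status index and the same illustrative fields as above:

```java
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.repository.ElasticsearchRepository;

import java.time.Instant;
import java.util.List;

// Document stored in Elasticsearch for every event consumed from Pub/Sub.
@Document(indexName = "content-status")
class ContentStatusDocument {
    @Id
    private String id;
    private String articleId;
    private int articleVersion;
    private String syndicationPartner;
    private String status;
    private String errorMessage;
    private Instant occurredAt;
    // getters and setters omitted for brevity
}

// Spring Data derives the query from the method name: all events for an article,
// newest first.
interface ContentStatusRepository extends ElasticsearchRepository<ContentStatusDocument, String> {
    List<ContentStatusDocument> findByArticleIdOrderByOccurredAtDesc(String articleId);
}
```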

As for the event processing service, we went with a familiar Spring Boot setup, deployed in a Kubernetes cluster. The horizontal scaling capabilities native to Kubernetes would allow us to handle the message load (or so we thought at that moment. More on that later…).

To expose the information, we decided to use an internal tool called the Toolbox. It is an in-house solution used to display all kinds of information regarding our articles, as well as provide several functionalities for different departments.

The following is the finalised system design.

Content Status System Design

To make publishing events easy, we created a small library and put it to use in our microservices. Through a simple interface, information about article processing could be sent along with any useful metadata (success/failure, error message, etc.).
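In spirit, such a library is a thin wrapper around the Pub/Sub client: it serialises the event and publishes it asynchronously, so the calling microservice never waits on the result. A rough sketch using the Google Cloud Pub/Sub Java client and the event record above (class and topic names are assumptions, not the actual library):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

import java.io.IOException;

// Illustrative wrapper used by the publishing microservices to emit status events.
public class ContentStatusPublisher {

    private final Publisher publisher;
    private final ObjectMapper mapper = new ObjectMapper();

    public ContentStatusPublisher(String projectId, String topicId) throws IOException {
        this.publisher = Publisher.newBuilder(TopicName.of(projectId, topicId)).build();
    }

    /** Serialise the event and publish it; the call returns immediately while
     *  the client batches and sends messages in the background. */
    public void publish(ContentStatusEvent event) throws IOException {
        PubsubMessage message = PubsubMessage.newBuilder()
                .setData(ByteString.copyFromUtf8(mapper.writeValueAsString(event)))
                .build();
        publisher.publish(message); // returns an ApiFuture<String>; fire-and-forget here
    }

    public void shutdown() {
        publisher.shutdown();
    }
}
```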

All of the events were published to a single topic, to which the event processing microservice was subscribed. The Google Cloud Console provides excellent visibility into the number of messages published, acknowledged, failed and so on.

Too much information? Multithreading to the rescue

All seemed ready, so we flipped the switch.

At that moment, a major issue became apparent.

Events waiting to be processed. Count in millions.

The load on the event processing microservice was huge. Even multiple replicas of the service could not handle the volume of events being published. Production conditions can be a very unpleasant surprise.

We decided to increase the number of processing threads. As those of you familiar with multithreading will know, the right number of threads is not obvious. To play it safe, we simulated an approximation of production traffic in our staging environment and found the sweet spot.
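For those using the Pub/Sub Java client, the relevant knobs are the parallel pull count, the executor thread count and the flow-control limits. A sketch with purely illustrative numbers (the values worth using are whatever your own traffic simulation suggests):

```java
import com.google.api.gax.batching.FlowControlSettings;
import com.google.api.gax.core.InstantiatingExecutorProvider;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;

public class ContentStatusSubscriber {

    public static Subscriber build(String projectId, String subscriptionId, MessageReceiver receiver) {
        ProjectSubscriptionName subscription = ProjectSubscriptionName.of(projectId, subscriptionId);

        return Subscriber.newBuilder(subscription, receiver)
                // Pull from Pub/Sub over several streams in parallel.
                .setParallelPullCount(4)
                // More worker threads to process messages concurrently.
                .setExecutorProvider(InstantiatingExecutorProvider.newBuilder()
                        .setExecutorThreadCount(16)
                        .build())
                // Cap outstanding messages so the service is not overwhelmed while catching up.
                .setFlowControlSettings(FlowControlSettings.newBuilder()
                        .setMaxOutstandingElementCount(1_000L)
                        .build())
                .build();
    }
}
```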

Events actually being processed. Count no longer in the millions.

After the change, the number of unacknowledged events started to drop sharply. Soon, the service was operating without issues.

The answers we need in one place

With the service and UI implemented, tracking an article’s syndication history became a breeze. Armed with an article ID, users can easily see, under the History tab, exactly what has happened.

Capture from our internal Toolbox

Using the tool, we are able to identify:

  • If the article was published to one of our Syndication partners
  • The latest version of the article that was published
  • An initial indication of any failures
  • Immediate feedback on the state of an article regarding our Syndication partners
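Behind the History tab, the service only has to return the stored events for a given article, newest first. A sketch of the kind of endpoint the Toolbox might call, reusing the illustrative repository from the storage sketch above (paths and names are assumptions):

```java
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.List;

// Illustrative endpoint exposing an article's syndication history to the Toolbox UI.
@RestController
@RequestMapping("/content-status")
class ContentStatusController {

    private final ContentStatusRepository repository;

    ContentStatusController(ContentStatusRepository repository) {
        this.repository = repository;
    }

    // Every stored event for the article, newest first.
    @GetMapping("/articles/{articleId}/history")
    List<ContentStatusDocument> history(@PathVariable String articleId) {
        return repository.findByArticleIdOrderByOccurredAtDesc(articleId);
    }
}
```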

Now, if Editorial or a Syndication partner needs to know what happened to an article, we can provide a much quicker and more precise answer. As a happy side-note, we can also verify far more easily that a release is not affecting article processing.

So, from the pain of trawling through logs, we have arrived at a system that provides at-a-glance clarity and reassurance for The Telegraph and its partners, while offering insight into any potential errors down the line.

Georgios Makkas is a Platform Engineer at The Telegraph
