Leveraging serverless Google Cloud Platform features for fun and profit

How Telegraph Engineering used the benefits of the serverless paradigm to get rid of technical debt and reduce operational costs

Georgios Makkas
The Telegraph Engineering
7 min read · Dec 6, 2023


Photo by Hazel Z on Unsplash. Used under the Unsplash License

Just a simple service

When object caching is mentioned, engineers tend to think of static assets: images, videos, text. All of these can be combined to make reports, stories, opinion pieces and much more. Put it all together and it is easy to see an online newspaper on the other end.

As a news organisation, the Telegraph relies on layers upon layers of caches to serve content all around the world.

As many engineers (Martin Fowler among them) have observed, cache invalidation is one of the hard problems of computing. In the Telegraph’s context, certain actions require caches to be invalidated on demand. For that purpose, the Platforms team built a cache flush service to clear several cache layers on request.

Seems simple enough: a service that clears caches on request. Yet the years have not been kind to it. Feature creep, dead code and quick patches took their toll, and maintaining the service was becoming a real chore.

Tech debt is never fun

A rolling stone gathers no moss, goes the old proverb, and it captures the need for an organisation to keep moving, lest it fall behind its competitors and, more importantly, its customers. The Telegraph is no different in that regard. Just as the torrent of news never stops, the organisation needs to stay at the cutting edge of technology to secure and improve upon the competitive advantage that makes this news organisation unique and successful.

Striving to be ahead of the current has great benefits, but there are a lot of challenges as well, readily apparent for anyone who has worked in technology for a while. One of the most glaring ones is the accumulation of technical debt.

Every engineer who has been around the block knows the feeling. The next task arrives and there is a need to improve upon a project with no commit in years. Documentation is non-existent. A defeated sigh, a couple of hours (or days) of struggling, and the new feature is there. The feature is shipped, and the project is forgotten for a while. A new feature request comes. Now, the changes take a bit longer. A patch here and there, no reason for a full solution now. Then, a new feature request.

A couple of feature requests down the line, patch upon quick hack, it becomes practically impossible to modify the functionality without re-writing the whole thing. And new feature requests keep coming.

In hindsight, we should …

One of the benefits of working at the Telegraph is that when such a situation is detected, it is dealt with as a serious concern. As such, engineers are encouraged to re-evaluate architectural decisions of the past and, more importantly, request resources to deal with those issues.

Of course, no organisation wants to be bogged down by the constant modernisation of the technical stack, which is truly an exercise in futility; technical debt is part of the reality of software development. However, great organisations take into account concerns of such nature and, when there is an opportunity, give engineers the go-ahead to fix those issues and bring the technical stack up to speed.

A fresh technical solution, time-boxed, with undeniable benefits, is hard to turn down.

Examining the old

Re-architecture starts, as one would expect, by examining the service in question.

One must approach the old service with respect, trying to understand what made that service successful in the first place. It is important to keep the good parts and avoid the not-so-good ones.

A microservice deployed on Kubernetes, details abstracted for clarity

A RESTful Spring Boot microservice with a MySQL database is as standard a microservice as you can get. But even that basic design needs to be re-examined.

Since clients communicate with the service using REST, that contract should remain the same in the new design as well. After all, communication contracts should only change if there is an absolute need for it.

Regarding the MySQL database, examining the service code reveals that it exists to manage communication with third-party cache providers. Rate limits are always a concern, so requests need to go out at a steady rate. To achieve that, the database works as a queue, with the service polling it for stored jobs. From this, we can infer two points: one, rate limits are a concern; and two, some clients do not mind asynchronous communication.
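To illustrate the database-as-queue pattern described above, here is a minimal sketch. The table, column names and payloads are hypothetical, and SQLite stands in for MySQL to keep the example self-contained; the original service's schema is not public.

```python
import sqlite3
from typing import Optional

# Hypothetical jobs table standing in for the real MySQL queue.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, payload TEXT, done INTEGER DEFAULT 0)"
)

def enqueue(payload: str) -> None:
    """A client stores a cache-flush job instead of waiting for the result."""
    db.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))

def poll_once() -> Optional[str]:
    """Fetch the oldest unprocessed job and mark it done.
    The real service would call this on a timer, keeping a steady
    request rate towards the third-party cache providers."""
    row = db.execute(
        "SELECT id, payload FROM jobs WHERE done = 0 ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    job_id, payload = row
    db.execute("UPDATE jobs SET done = 1 WHERE id = ?", (job_id,))
    return payload

enqueue("/news/front-page")
enqueue("/sport/results")
first = poll_once()   # oldest job first
second = poll_once()
empty = poll_once()   # queue drained, returns None
```

The pattern works, but it keeps a full database online around the clock just to hold a handful of items.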

Other issues become apparent when observing the service in operation. Daily request volume is low, between 30 and 200 requests; yet the service operates 24/7, consuming resources and accumulating costs. As a result, the MySQL queue sits empty most of the time, peaking at a measly two items. Maintaining a full MySQL database for this load is a waste of resources.

Infrastructure-wise, the service itself is deployed on a Kubernetes cluster, with the database deployed using the CloudSQL product. The Platforms team is doing an excellent job in maintaining an operational, up-to-date cluster, but as any DevOps team would attest, the inherent complexity of maintaining a Kubernetes cluster can be frustrating at times. Avoiding that complexity would be ideal.

To recap:
+ RESTful
+ Rate Limits
+ Async
- 24/7 Operation
- Database Queue
- Infrastructure complexity

Who needs servers?

An important part of the day-to-day operations at the Telegraph is the Google Cloud Platform. As the cloud provider of preference, many of our services and operations rely on Google's building blocks to streamline development and operations.

Evaluating the offerings available, we realised we could leverage the serverless offerings to turn our requirements into a modern, low-cost architecture.

A new perspective

The low rate of daily requests meant that the service did not need to run 24/7. For that purpose, the team decided to turn the deployment into a Cloud Run service that scales down to zero. In that way, compute resources would only be used when needed. As an added benefit, it meant no more struggling with Kubernetes major version upgrades.

However, going that route meant the Spring Boot application needed reconsidering. Spring Boot has a notoriously slow startup time, and the JVM is not well suited to the strict cold-start demands of scaling down to zero. The decision was made to rewrite the service in Python using FastAPI, since the team is experienced in Python development and both cold-start and response times are well within acceptable limits. Rewriting the service would also give the team the opportunity to simplify the software design and drop any functionality that had outlived its use.

The next challenge to tackle was the rate limits of the third-party APIs. As the investigation of the original service showed, daily request volume comes nowhere near those limits, but the design needs to be resilient against a burst of incoming requests. Such a scenario can easily happen on a breaking-news day.

Given that clients are fine with asynchronous requests, meaning they can request a cache clear without waiting for the result, an asynchronous path can be added to the design to take care of rate limiting. To avoid using a database as the old architecture did, the team turned to an offering called Cloud Tasks to hold the processing queue. Cloud Tasks acts as a buffer, with a configurable delivery rate and default behaviour that slows down dispatch when a 429 Too Many Requests error surfaces. All that remained was a small additional service to handle scheduling into the queue, also implemented on Cloud Run using FastAPI.
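As a sketch, handing a job to Cloud Tasks with the google-cloud-tasks client looks roughly like this. The queue name, project, region, target URL and payload shape are all placeholders; the payload builder is kept separate from the enqueue call so the latter, which needs GCP credentials, stays optional.

```python
import json

def build_flush_task(target_url: str, paths: list) -> dict:
    """Build a Cloud Tasks HTTP task payload as a plain dict,
    a form the google-cloud-tasks client accepts."""
    return {
        "http_request": {
            "http_method": "POST",
            "url": target_url,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"paths": paths}).encode(),
        }
    }

def enqueue_flush(project: str, location: str, queue: str, task: dict):
    # Requires the google-cloud-tasks package and GCP credentials,
    # hence the local import; names below are placeholders.
    from google.cloud import tasks_v2

    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path(project, location, queue)
    return client.create_task(parent=parent, task=task)

task = build_flush_task("https://cache-clear.example.com/flush", ["/news/front-page"])
# enqueue_flush("my-project", "europe-west2", "cache-flush-queue", task)
```

Note that the dispatch rate and the back-off on 429 responses are configured on the Cloud Tasks queue itself, not in this code, which is exactly what makes the buffering service so thin.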

Finally, clients should have the option of calling the cache clear service directly or going through the asynchronous path. For that purpose, both services would be exposed using Load Balancers configured to handle TLS and drop unauthenticated traffic.

After careful consideration of the points above, here is the resulting design.

Simple, modern, cost-effective design

Out with the old one, in with the new

After the new design was reviewed and implemented, it was deployed to production for a dry run. Once that was verified, the proxy was pointed at the new services, diverting traffic to the new infrastructure. Monitoring with Cloud Logging, we verified that everything was operating as expected.

As expected, requests are scarce, confirming that the scale-to-zero Cloud Run design was a fitting choice.

Cloud Run — Cache Clear service

The Cloud Tasks queue demonstrates that our assumption of 1–2 concurrent tasks was accurate and easily manageable.

Cloud Tasks — Queue

With the new design, we were able to:

  • Reduce the cost of operation for the service by scaling compute to zero
  • Reduce the maintenance overhead of the service by leveraging Google-managed infrastructure
  • Shed most of the technical debt of the old service

Moving forward, we will consider all opportunities that the serverless paradigm offers to design architectures that are more reliable, modern and cost-effective, all the while keeping an eye out for old services in need of a new coat of paint.

Georgios Makkas is a Senior Platform Engineer at The Telegraph Media Group
