The road from Akamai to GitOps

Like most large organisations, Volvo employs a Content Delivery Network (CDN) to distribute its web-based services. One of the biggest challenges with such a shared service is involving the teams that use it without them causing problems for each other, or for the CDN team in charge of the entire system.

A naive setup…

The CDN solves a wide range of problems that would be difficult, or outright impossible, for almost any company not specialised in content delivery to handle on its own. Without going into too much detail, imagine you are serving data from a web server you have set up on a machine in your local office.

Your service is exposed to the internet, just plain, as is, and the entire world is invited to browse your system. Connections will thus be direct, unfiltered and unbound by any restrictions.

World map illustrating connections from many locations to a single server

Initially, everything works just fine, but as interest grows, so does the load on your system. You try to manage it by ramping up your server capacity, adding more servers and setting up load balancing between them, splitting your service into parts which are served from different systems and so on, but the load keeps going up. Maintenance time and costs skyrocket, and on top of that, you are starting to get other problems.

Not everyone out there is playing nice. The world, unfortunately, is full of people with criminal intent, continuously probing the internet for soft targets. You quickly experience all sorts of problems, ranging from scraping to DDoS-ing to attempts to install malware or ransomware. To handle this, you need to ramp up your cyber security, investing in a desperately needed new skillset and installing defences. In reality, you will likely suffer recurring bouts of downtime while you try to handle the onslaught. Maybe you’ll keep up with the attackers, maybe not.

CDN-enabled setup

A better approach then is to make use of a Content Delivery Network, a CDN.

Illustration of CDN-enabled service

In this model you no longer expose your services directly to the general public. Instead, users connect to an “edge server” located relatively close to them; the CDN provider uses DNS to direct every request on a managed domain to the nearest suitable server. For most of the bigger providers that distance can usually be measured in tens of kilometres at most, and a lot less if you live in a bigger city. These edge servers in turn connect to a network of other servers, which ultimately connect to your system. As requests come in, the servers reaching your origin cache as much data as possible and distribute it to the other servers in the CDN. That means that once the first request has been made from somewhere in the world, your server is unlikely to be queried for the same data again for quite some time. At Volvo, static data usually reaches a cache ratio of 92–97%, so-called “offload”.

Obviously the CDN cannot cache all types of data. Some of it is marked as non-cacheable (typically through standard Cache-Control response headers from the origin), usually API calls, but handling that traffic on your own is a lot easier than handling all the rest as well. In addition, the CDN knows the shortest “distance” from a user to your servers and can route traffic the most effective way, accelerating the calls, and may even supply other mechanisms for improving response times for API-type traffic.

So what about the hackers? Well, they can no longer connect directly to your servers; the servers’ identity is not advertised, so the hackers’ only option is to attack the CDN, either directly against the edge servers or by trying to find a way through them to your system. In our case, our provider, Akamai, keeps a continuously updated Web Application Firewall along with a Bot Manager running at the edge level. These attacks go on continuously, day and night, all year round, yet so far none has ever managed to harm our systems.

Our main site

As for our main site, well, it is by no means a simple system consisting of a simple data store delivering static HTML pages.

No, it comprises dozens of servers connected through a large set of rules (100+) that divert and augment requests based on various criteria. Mainly we employ path matching to find a specific server, an “origin”, to which requests are directed, but as with any complex system, we also filter on headers, cookies, etc. Furthermore, we can manipulate request headers and other parameters, both going to and coming from the origins.
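To give a feel for what such a rule boils down to: in the simplified notation shown later in this article, a path-matching rule might look roughly like this (the section, path and origin hostname are made up for illustration):

section: Landing
name: Example campaign pages
criteria:
  - regex: intl,M/campaign
behaviours:
  - origin:
      domain: campaign-origin-test.example.com
      host header: origin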

In addition, we manage some 20,000+ redirects in the system. A redirect is where you request one URL but are diverted to another; these are mainly used for backwards compatibility when systems have been updated or replaced by other services.

As you might expect, we also manage quite a few other domains. In fact, some 20+ systems are handled as part of the CDN services we utilise. Some of these overlap in functionality (like Test, QA and Prod). Others are unique, supporting differing usages for different teams.

This setup might sound rather complex, and indeed it is, but at the same time it only takes a small team of three people to manage it all.

The problem?

All of this sounds great, so what is the problem?

Our CDN provider is Akamai, where all configurations are managed through a web GUI that provides a wide range of services beyond just the distribution of web-based data.

The problem here is the interface to the system. Ideally, we would invite the various development teams who provide services through the system to edit their rules themselves, but as we all know, that’s not really an option. The problem is actually problems, plural…

  • While not rocket science per se, the system does take quite some time to learn how to manage.
  • The system is all managed through a web GUI, which also takes time to learn.
Snippets of GUI-based configurations
  • With so many origins and associated rules in the system, it is rather sensitive to trivial mistakes. A simple typo can break part of the site, or even all of it.
  • The user interface is very much single-user oriented, or at best “very-few-users” oriented.
  • Clear communication about who is doing what at any time is a must. There is no way of locking a configuration, which means that with several users, we always run the risk of overwriting each other. With a small number of users, that is manageable. With many, it becomes a lot harder.
  • The access control system is of the “all or nothing” kind: once you have access to make a change in one place, you can make any change you like anywhere else. There is no way for us to grant a team rights exclusively to the small set of rules it “owns”. It doesn’t matter that a team might never intentionally mess with other parts of the system; the point is that it can.

What does all this mean, then? Well… it means the CDN management team becomes a silo. Truly an anti-DevOps pattern. Distasteful, but there it is.

As a result… well… you can guess.

  • We suffer from bottlenecking. As we, the CDN team, are the only people able to make changes, every team that needs work done has to wait for us to handle it. Sometimes that can take a few days.
  • The teams have pretty much no insight into their configurations, as these are locked away in a system they cannot reach. While we could grant read-only rights, they would still need help understanding what they see, and that is time we really don’t have to spare.
  • The GUI-based setup itself makes understanding rule relationships and other parts of the setup difficult, even for those of us who work with it daily. There is a LOT of scrolling and juggling of multiple windows going on!
  • The “truth”, the definition of the system, is all in the web GUI. While it does have a versioning system, it is cumbersome to use and, again, locked away from most people.
  • As a result, we get functional drift between configurations, most notably between Test, QA and Prod, and between the basic configurations of the various systems, even though these could have been more or less identical. We also suffer misunderstandings with teams, and sometimes mistakes go unnoticed for long periods of time.

Let’s fix it!

The solution, then? Well, we want GitOps, of course, or at least a process that gets us as close as possible to a Continuous Deployment cycle. We want a way to store the configuration in Git to achieve Infrastructure as Code (IaC) and make those configurations available to anyone who needs to handle them. On top of that, we could add automated pipelines for delivery or at least control some of the steps more efficiently.

The most immediate problem was the redirects. They needed to be entered one by one as multi-option rules in one of two web interfaces, either as rather elaborate individual rules in the main configuration:

Single redirect rule

or as noted in the rules, in a slightly more compact format in a separate tool, along with a reference in the main configuration:

Excerpt of redirect rules

We soon found out that, due to the lack of transparency and a general lack of interest in keeping the redirects up to date, we had a massive list of some 70–80 thousand (yes, note the dimension of uncertainty here) entries stored in various systems, many of which simply hadn’t worked in years… A determined effort to clean them up was obviously a priority! It didn’t take long to realise that we would be spending all our time, and then some, just managing these redirects, particularly as we were facing a restructuring of our web site.

The special redirection service would make it easier, but it would still require a lot of work. If we were to employ our support team for the job, they would need special training to learn the tool, and besides, they would still be hampered by a lack of overview (scrolling through hundreds of meters of virtual lists).

On top of the immense list of old redirects to manage, hundreds of new ones would now be coming in every week… an overwhelming task!

After realising that the lists of redirects could be exported and imported as CSV files in a very compact format from the redirect service, I saw a possible way forward. I created an easier text-based format, where one redirect is represented by a single line, and wrote a program which could translate our syntax into the Akamai format and upload it using their API.
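The exact syntax we use isn’t important here, but to illustrate the idea: picture one redirect per line, something like “/en-gb/v40 /intl/v/cars 301”, and a small translator that turns such a file into a compact CSV for the redirect service. The sketch below is a hypothetical illustration of that approach, not the actual tool:

import csv
import sys

def parse_line(line):
    # One redirect per line: <source path> <target> [status], whitespace-separated
    parts = line.split()
    source, target = parts[0], parts[1]
    status = parts[2] if len(parts) > 2 else "301"
    return source, target, status

def to_csv(infile, outfile):
    # Translate the one-line format into a CSV file suitable for import
    with open(infile) as src, open(outfile, "w", newline="") as dst:
        writer = csv.writer(dst)
        for line in src:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            writer.writerow(parse_line(line))

if __name__ == "__main__":
    to_csv(sys.argv[1], sys.argv[2])  # uploading the CSV via the provider's API is a separate step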

Now we, along with our support team, could take control of the redirects and, with the sudden ability to actually see and compare what was in the system, trim them down from the original 70–80k to the current number of roughly 20k, split them into rule sets and even give development teams access to them by placing them in Git (IaC, but see below). Still a huge number of redirects, but finally manageable in a realistic way.

Unfortunately, the process of uploading the redirects was still a bit complicated, as changes were needed in both of the two web interfaces. One part was now handled, with the redirects being uploaded to the special redirect management service, but the main configuration still needed an entry stating which of these redirect sets should be applied for a specific path, the main problem being manually writing the criteria that select the proper redirect set.

Legal redirect selection snippet

To solve the problem, I decided to develop the redirect management tool further. I started by making it possible to enter the redirect configuration in a single operation against both services; the tool now automatically generates the required (rather massive) regular expression needed for redirect set selection. From there I expanded the tool to other functions as well. As a result, we now have a rather specialised tool tailored to manage all our configurations, using the Akamai API for updates.
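The selection criterion is essentially one big alternation over all the source paths in a redirect set. As a rough sketch of the idea (the real tool does considerably more, and the paths below are made up):

import re

def selection_regex(source_paths):
    # Join all source paths in a redirect set into one alternation,
    # so a single match rule selects the correct redirect set.
    escaped = sorted(re.escape(p) for p in set(source_paths))
    return "^(" + "|".join(escaped) + ")$"

print(selection_regex(["/en-gb/v40", "/en-gb/v60", "/support/manuals"]))
# prints something like: ^(/en-gb/v40|/en-gb/v60|/support/manuals)$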

Using that tool, we can now also extract individual configuration rules as JSON or YAML. The format is up to the preferences of the teams who will own them.

> akamaier property extract -g web testlxp.volvocars.com "landing/ndc car configurator"
++ Waiting for property versions for group Web.
{
  "version": "akamaier v2.1.3",
  "type": "akarule",
  "property": "web:testlxp.volvocars.com",
  "section": "Landing",
  "tag": "C77E5446-F8A5-4C77-8B55-6929E9898FE0",
  "name": "NDC car configurator",
  "description": "Car Configurator, car configurator team",
  "criteria": [
    {
      "regex": "intl,M/build"
    },
    {
      "query parameter": "json does not exist"
    }
  ],
  "behaviours": [
    {
      "origin": {
        "host header": "origin",
        "cache key": "incoming",
        "domain": "cc-origin-test.com",
        "gzip": true
      }
    }
  ]
}
> akamaier property extract -g web testlxp.volvocars.com "landing/ndc car configurator" -o yaml
++ Waiting for property versions for group Web.
version: akamaier v2.1.3
type: akarule
property: web:testlxp.volvocars.com
section: Landing
tag: C77E5446-F8A5-4C77-8B55-6929E9898FE0
name: NDC car configurator
description: 'Car Configurator, car configurator team'
criteria:
  - regex: intl,M/build
  - query parameter: json does not exist
behaviours:
  - origin:
      host header: origin
      cache key: incoming
      domain: cc-origin-test.com
      gzip: true

As Akamai lacks a way of clearly identifying the constituent rules which make up the larger configuration, I added a simple way of tagging them as part of the comment section associated with every rule. That way, I can ensure that rules don’t get overwritten, lost or duplicated when deployed back into the Akamai configuration.
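Conceptually, deployment then becomes a matter of walking the Akamai rule tree, finding the rule whose comment carries the tag, and replacing it in place while leaving everything else untouched. A minimal sketch of that idea, assuming a simplified dict-based rule tree (this is not the actual akamaier code):

def deploy_rule(rule_tree, tag, new_rule):
    # Replace the child rule whose comment carries the given tag; return True if found.
    for i, child in enumerate(rule_tree.get("children", [])):
        if tag in child.get("comments", ""):
            rule_tree["children"][i] = new_rule  # overwrite in place, never duplicate
            return True
        if deploy_rule(child, tag, new_rule):  # recurse into nested rules
            return True
    return False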

The rules are simplified versions of the actual rules, reduced to reflect our usage and make them more generic and easy to handle.

These rules get stored in Git, making a GitOps process possible.

Gitops diagram

Individual teams are granted write access to their rules, so they are now in control of the content as and when it needs to change. They also gain insight into their own configurations and the other teams’ setups, as everyone can read, but not write, everybody else’s files.
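The repository layout itself is not shown here, but conceptually it boils down to one file per rule or redirect set, grouped so that ownership maps cleanly to teams; a made-up example:

rules/
  landing/ndc-car-configurator.yaml   (owned by the car configurator team)
  support/owner-manuals.yaml          (owned by the support team)
redirects/
  legal.redirects                     (maintained by the IT support team)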

This means the truth is now in Git, which means we have achieved IaC.

Unfortunately, due to the sensitivity of the system, the CDN team still needs to be in charge of deploying the files. The teams modify the files and create a pull request, which the CDN team inspects before deploying it using the special tool. This ensures system functionality while still cutting processing time significantly, on top of the obvious benefits of transparency.

Where are we now?

Currently, the project is in a beta state.

  • We are extracting configurations as quickly as possible, assigning them to the teams which should own them.
  • We are dogfooding, using the configurations as the source of truth for any changes.
  • We already keep all redirect rules in Git, mainly updated by our IT support team.
  • We are about to invite a few teams for testing; mainly they will be reading their configurations and the associated documentation, but obviously also making whatever changes they need from that point onwards.
  • The last step will be a demo of the system for a larger audience, along with a general roll-out to all teams.

In the future, we envision further automating the process: a pipeline could be set up so that, on PR approval, the changes are actually deployed as well. But for now, we are happy to have made parts of the configuration available to the development teams.
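As a sketch of what that might look like, assuming a GitHub Actions style pipeline and a hypothetical deploy subcommand in the tool (neither of which exists today), something along these lines could run when an approved pull request is merged:

# Hypothetical CI workflow; runs when an approved PR is merged to the main branch
name: deploy-cdn-rules
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy changed rules to Akamai
        # "akamaier property deploy" is a made-up subcommand, used purely for illustration
        run: akamaier property deploy -g web testlxp.volvocars.com rules/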

Ronny Wikh, Senior DevOps engineer, team lead Digital DevOps @volvocars.com