CDN Outage Post-mortem

Ben Robison · Launch, by Adobe · Oct 10, 2019

Adobe Experience Platform Launch is a tag management system. At its core, Launch produces a JavaScript file that you (the user) can use to control the marketing, analytics, and advertising technologies of your choice. When a client device (browser or mobile app) requests content from you, that file is retrieved and executed.

We built Launch to give you a simple way to manage the contents of that file, but you must host it somewhere for your site visitors.

Many Launch users have these files delivered to their own servers or download them through a browser. Many others choose to have Launch manage the hosting for them. For this second group of users, the Launch team maintains a relationship with a 3rd-party CDN. When Launch produces a JavaScript file, it gets shipped off to the CDN so that it is available to your site visitors.

During June and July, we experienced frequent disruptions of this part of the Launch service, particularly with delivering files to the CDN, having those files replicated correctly, and making them quickly available for hosting.

In the spirit of honesty and transparency, we want to provide an accounting of those service disruptions. Today I will explain what went wrong, what we did to fix it, and what we are doing to ensure that it never happens again.

We’ll start with a quick summary of the issue for those who are short on time. Next, we’ll do a super simple explanation of how CDNs work and how Launch uses one. Then we’ll discuss the outages and the impact they had. After that, I’ll explain the root causes and outline the steps we’ve taken to fix things.

Fishbone diagrams are useful for root cause analysis

tl;dr

Launch and DTM use a 3rd-party CDN to host customer JavaScript libraries.

During June and July of 2019, the connection between Launch and our CDN origin had several disruptions that prevented us from delivering new files to the CDN.

The delivery of tags — including analytics measurement tags — from the CDN to user devices and browsers was not affected at any time.

A detailed root-cause analysis revealed that four factors were interacting to cause the issues we were seeing:

1) High hits/sec from CDN edge to CDN origin servers

2) High SFTP operations/sec

3) SFTP connection limits

4) Low isolation between tenants

The Launch team and our CDN provider have remediated the immediate issues to bring the system back to stability. Further actions are planned to ensure future scalability.

This topic is technical by nature. I’ll do my best to explain in simple terms but consider this sentence a friendly notice of incoming techno-jargon. You’ve been warned.

CDNs

The primary purpose of a CDN is to move content closer to the end-user so that it can be retrieved more quickly by a client device. The CDN achieves this by making copies of the content available on geographically distributed “edge” nodes around the globe.

How DTM and Launch use a CDN for file distribution to client devices

When a client device requests a piece of content, the closest edge node responds. In this scenario, “close” is measured by time, not distance.

The edge node caches a local copy of the content for an amount of time defined by the “time to live” or TTL. If the edge node receives a request for that content and the TTL has not expired, it serves the copy it already has. If the edge node receives a request and the TTL has expired, it retrieves a fresh copy from the “origin” and caches it for the next request.
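To make that concrete, here is a minimal sketch of the decision an edge node makes on each request. The names (`CacheEntry`, `fetchFromOrigin`, `handleEdgeRequest`) are hypothetical and heavily simplified; this is not the CDN's actual implementation.

```typescript
// Simplified sketch of edge-node TTL logic (hypothetical names, not real CDN code).
interface CacheEntry {
  body: string;
  cachedAt: number; // epoch milliseconds
  ttlMs: number;    // time to live
}

const edgeCache = new Map<string, CacheEntry>();

// Stand-in for a request from the edge back to the origin.
async function fetchFromOrigin(path: string): Promise<string> {
  return `contents of ${path}`;
}

async function handleEdgeRequest(path: string, ttlMs: number): Promise<string> {
  const entry = edgeCache.get(path);
  const fresh = entry && Date.now() - entry.cachedAt < entry.ttlMs;

  if (entry && fresh) {
    // TTL has not expired: serve the local copy without touching the origin.
    return entry.body;
  }

  // TTL expired (or never cached): fetch a fresh copy from the origin and cache it.
  const body = await fetchFromOrigin(path);
  edgeCache.set(path, { body, cachedAt: Date.now(), ttlMs });
  return body;
}
```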

For further scale (and not represented above), in real-world applications, there is more than one origin server. The origin servers replicate content between themselves. When an edge node needs a fresh copy, it makes a request to the origin for that content. The closest origin responds using the same mechanism client devices use with edge nodes.

Finally, there is a mechanism to invalidate the edge cache so that on the next request, it will be forced to retrieve a fresh copy from the origin.

If you are having Launch manage the hosting for you, publishing a library involves:

  1. Compiling all your configuration and the code needed to run it into your library file(s).
  2. Pushing the library to the CDN origin.
  3. Invalidating the edge cache so that new requests will get the latest content that was just published.

All of these things must complete successfully before your build is marked as `succeeded` in Launch.
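As a rough sketch, that publish flow looks something like the following. The helper functions (`compileLibrary`, `uploadToOrigin`, `invalidateEdgeCache`) are hypothetical placeholders for the real steps, but the ordering and the all-steps-must-succeed rule are exactly as described above.

```typescript
// Hypothetical sketch of the Launch-managed publish flow; not the actual Launch code.
type BuildStatus = "succeeded" | "failed";

// Placeholder implementations standing in for the real steps.
async function compileLibrary(libraryId: string): Promise<string[]> {
  return [`launch-${libraryId}.min.js`, `launch-${libraryId}.js`];
}
async function uploadToOrigin(files: string[]): Promise<void> {
  /* push files to the CDN origin over SFTP */
}
async function invalidateEdgeCache(files: string[]): Promise<void> {
  /* ask the CDN to purge the edge copies of these paths */
}

async function publishLibrary(libraryId: string): Promise<BuildStatus> {
  try {
    const files = await compileLibrary(libraryId);   // 1. compile configuration + code into files
    await uploadToOrigin(files);                     // 2. push the library to the CDN origin
    await invalidateEdgeCache(files);                // 3. invalidate the edge cache
    return "succeeded";                              // marked succeeded only if all three steps complete
  } catch {
    return "failed";                                 // any failure along the way fails the build
  }
}
```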

Outages and Impact

Earlier this year, we began to have intermittent problems with Step #2 above. When pushing library builds to the CDN origin, we would get upload errors, sluggish upload speeds, and other issues of this type.

These incidents did not affect client devices retrieving content from the edge but did negatively impact DTM and Launch’s connection to the CDN origin. During one of these outages, DTM and Launch builds would take an extremely long time or fail outright. Even when the upload succeeded, the content was often not replicated to other origins correctly.

We averaged one incident every six weeks or so. Our CDN provider told us these outages were caused by the origin region being unhealthy. They would point us to a different region, things would clear up, and we would move on.

When we began having these problems, our provider also recommended that we move onto a new/upgraded version of their infrastructure. With our approval, they began cloning all of our content onto the latest version in preparation for making that change.

In early June (2019), the number of incidents dramatically increased. Outages began occurring multiple times per week. With some urging from our executive team, our provider began working in earnest to identify root causes.

Root Causes

After much troubleshooting, trial, and error, we jointly determined that the origin servers were overloaded. The amount of traffic that Launch and DTM support is large and has grown steadily over time, and the way they were using the CDN’s capabilities was approaching its capacity.

More specifically, we identified four things that we needed to remedy.

1) High hits/sec to Origin

The number of requests from edge to origin was extremely high. This high load had a few causes:

  1. DTM staging libraries had a TTL of 0 so that they would be available for testing as quickly as possible. That also means every request for a DTM staging library hits the origin. Customers deploying their DTM staging environments on production systems made this much worse, and there were a few of those.
  2. We had low TTLs on all files so that newly published builds would show up quickly. Launch could support a higher TTL because it was built to invalidate the edge cache when a new build was pushed, but Launch shares TTL configuration with DTM, and DTM did not perform the same invalidation.
  3. 404 error multiplication. When an edge node doesn’t have a copy of a file, it requests the file from the closest (quickest) origin. If that origin doesn’t have the file, it performs an internal 302 redirect to the next origin, and so on. If no origin has the file, then a single 404 error is returned to the edge and passed on to the client device. So a single 404 request has a multiplicative impact on the amount of traffic to the origin servers (see the sketch after this list).
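To see why a missing file is so expensive, here is a toy model of that lookup chain. The origin list and helper names are made up; the point is just that one client miss fans out into a request against every origin.

```typescript
// Toy model of 404 multiplication (hypothetical helpers; not the CDN's real routing logic).
const origins = ["origin-a", "origin-b", "origin-c"]; // ordered closest-first

// Pretend origin lookup: returns the file body, or null if that origin doesn't have it.
function lookupAtOrigin(origin: string, path: string): string | null {
  return null; // simulate a file that no origin has
}

function resolveFromOrigins(path: string): { body: string | null; originHits: number } {
  let originHits = 0;
  for (const origin of origins) {
    originHits++;                                  // each hop is another request to an origin
    const body = lookupAtOrigin(origin, path);
    if (body !== null) return { body, originHits };
    // not here: internal 302 redirect to the next-closest origin
  }
  return { body: null, originHits };               // one 404 to the client, N hits at the origins
}

// One request for a non-existent library touches every origin before the client sees a 404.
console.log(resolveFromOrigins("missing-library.min.js")); // { body: null, originHits: 3 }
```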

2) High Number of SFTP Operations

Every time Launch or DTM connected to the origin SFTP server to upload a new build, they performed a lot of operations in parallel.

If you’ve ever downloaded an archived build and looked inside, you’ll see lots of different files and folders. Depending on your library, the number can get into the thousands (some of you will think that’s crazy, and some of you will just shrug).

During a build upload, the system will:

1) Check to see if three different folders exist — three slow operations each for a total of nine slow ops.

2) Write library files & symlinks — two main library files (minified and unminified) and 2–3 symlinks per file depending on your setup. The symlinks are there for backward compatibility.

3) Write custom code actions — totally dependent on how much custom code you’ve written. We have some customers where this number runs into the thousands.

So, writing a library file would:

  • trigger more than 9 slow folder check operations
  • write 6–8 files/symlinks
  • write n custom code actions

In summary, uploading a build would cause a spike in resource usage on an origin server. The bigger the build (especially more custom code), the bigger the spike.
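As a back-of-the-envelope check on those numbers (the per-build counts come from the list above; the function itself is only illustrative arithmetic, not Launch code):

```typescript
// Rough SFTP operation count for one build under the original upload scheme.
function sftpOpsPerBuild(customCodeActions: number, symlinksPerFile: 2 | 3 = 3): number {
  const folderChecks = 3 * 3;                        // three folders, three slow checks each
  const libraryFiles = 2;                            // minified + unminified
  const symlinks = libraryFiles * symlinksPerFile;   // 2-3 symlinks per library file
  return folderChecks + libraryFiles + symlinks + customCodeActions;
}

// A modest build with 50 custom code actions: 9 + 2 + 6 + 50 = 67 operations.
console.log(sftpOpsPerBuild(50));
// A heavy build with 2,000 custom code actions: 2,017 operations, most of them custom code writes.
console.log(sftpOpsPerBuild(2000));
```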

3) SFTP Connection Limits

As DTM and Launch usage increases, they make an increasing number of simultaneous SFTP connections so that they can upload more than one build at a time.

Unbeknownst to us, the older version of the CDN infrastructure that we were on had a limit to the number of simultaneous connections, and we were running into that limit.

4) Low Isolation Between Tenants

Also unbeknownst to us, the older infrastructure that we were using had a lower level of isolation between tenants. Low isolation means that higher than average traffic for one CDN customer could impact other customers that were using the same hardware.

Said differently, if we are on the same hardware, we are in the same boat. My outages are your outages, and yours are mine. We experienced some outages during our low traffic periods, so this also played a role in the number of incidents.

Remedies

It was clear that we could not allow things to continue as they were. No one on the development team was getting much sleep because they were spending nights and weekends in the war room. Customers were having slow and failed builds, which wreaked havoc on their daily work and releases.

For each root cause that we identified, we took immediate steps to resolve it. Some of these changes will remain long-term; others will be phased out over time as longer-term solutions are put in place.

1) High hits/sec to Origin

We made a few changes immediately to reduce the amount of traffic that the origins were dealing with. I’ll spend the most time in this section because these changes had the most significant impact, and also because they introduced a new — smaller and less critical — issue that needs to be handled with further enhancements.

Status: Done — Increased TTLs for Libraries
The first, most obvious step was to increase the TTL for the hosted library files. The TTL for Launch files and DTM Production files was one hour. We changed it to six hours. The TTL for all hosted libraries (Launch and DTM, all environment types) remains at six hours and will increase in the future.

This change was transparent for Launch because Launch was already invalidating the edge caches when it built new libraries.

The change was not transparent for DTM staging libraries because they were not previously cached at all (they had a 0 TTL). Introducing a TTL on DTM staging was necessary because it was causing a disproportionate amount of origin traffic. Some customers had deployed staging embed codes in a production setting. Without a TTL, this was wreaking havoc. Although these customers were sent strongly worded communications asking them to stop, the TTL was a more immediate and reliable fix.

The result of this change was very long wait times to see DTM updates in the staging environment, so we also made the following change.

Status: Done — Update DTM to use cache invalidation
We implemented cache invalidation in DTM. Launch and DTM now behave the same. This update is complete. Wait times in DTM have returned to normal.

Status: Done — Move to new infrastructure
As mentioned above, we were on an older version of the CDN’s infrastructure. They suggested we upgrade to the latest version. We spent significant time testing and validating this change to ensure that the transition would be seamless. It was.

However, hosting on the new infrastructure has a new side effect.

I mentioned above that in real-world applications, there is more than one origin server and that they replicate content amongst themselves. On the older infrastructure, we uploaded all libraries to a single origin region, and they replicated from there.

On the new infrastructure, we upload all libraries to a load balancer that intelligently chooses which origin to use based on conditions at the time. After Launch uploads a library to Origin A, Origin A will replicate it to Origin B, Origin C, and so on. The replication process happens asynchronously. Launch does not know about status or progress.

Since Launch doesn’t know the status of the replication job, it moves on to invalidate the edge cache. When you request this new library through the browser, the cache will be invalid, and the edge will forward your request to the closest origin. If any origin besides Origin A is closest, then the conditions are met for this potential issue to become a real one.

To illustrate, we’ll say that your browser request for the new library routes to Origin B — a different origin than the one where Launch sent your library.

If your request hits Origin B before Origin A syncs the latest version, your browser gets the old copy again. And to make matters worse, the library is now cached on that edge for six hours.

The issue can occur for any library but is most disruptive to the development workflow, where you are trying to test new changes quickly.
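Here is a tiny simulation of that race. The two origin maps, the file name, and the delay are invented, but they show how a request that lands on a not-yet-synced origin picks up the old copy, which the edge then caches for the full TTL.

```typescript
// Toy simulation of the replication race (hypothetical; storage and timings are made up).
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

const originA = new Map<string, string>();
const originB = new Map<string, string>();

async function replicateAtoB(delayMs: number): Promise<void> {
  await sleep(delayMs);                       // replication is asynchronous; Launch can't see its progress
  for (const [path, body] of originA) originB.set(path, body);
}

async function demo(): Promise<void> {
  const path = "example-library.min.js";
  originA.set(path, "v1");
  originB.set(path, "v1");

  // Launch uploads v2 to Origin A and immediately invalidates the edge cache.
  originA.set(path, "v2");
  void replicateAtoB(500);                    // background replication, status unknown to Launch

  // An edge miss routed to Origin B before replication finishes still returns v1,
  // and that stale copy then sits in the edge cache for the full TTL.
  console.log("early request sees:", originB.get(path)); // "v1" — stale
  await sleep(600);
  console.log("later request sees:", originB.get(path)); // "v2" — after replication
}

demo();
```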

We are implementing a fix for this new issue now. See “Multiple cache invalidations on publish” for details.

Status: Done — Introduced TTLs on error responses
We introduced a five-minute TTL on the error responses flowing through the system. This TTL dramatically reduces the traffic to origin but also contributes to the replication problem I just described.

Browser requests for a library in a brand new environment (never used before) can route to an origin that has not seen a file with that name yet. If this happens, you’ll receive a 404 error code that gets cached on that edge for five minutes.

Status: Done — Multiple cache invalidations on publish
The CDN does not provide insight or estimated timing around the replication of files between origin servers. The simplest way to mitigate the stale cache problem is to make multiple purge requests when a new library is published.

We are working on this now. When done, you won’t have to wait for 6 hours for the CDN edge cache to clear. We’ll invalidate the cache immediately as we do now but then do it again five minutes later. And again 60 minutes after that. If you miss on the first try, the subsequent cache invalidations will give you another try in a few minutes. We may tune these intervals over time as we collect more data.
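A minimal sketch of that schedule, assuming a generic `purgeEdgeCache` helper (the real purge API and the exact intervals may differ and will likely be tuned):

```typescript
// Sketch of repeated cache invalidations after a publish (hypothetical purge helper).
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function purgeEdgeCache(paths: string[]): Promise<void> {
  /* ask the CDN to invalidate these paths at the edge */
}

// Purge immediately, again 5 minutes later, and again 60 minutes after that,
// to catch edges that re-cached a stale copy before origin replication finished.
async function invalidateOnPublish(paths: string[]): Promise<void> {
  const delaysMs = [0, 5 * 60_000, 60 * 60_000];
  for (const delay of delaysMs) {
    await sleep(delay);
    await purgeEdgeCache(paths);
  }
}
```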

Status: Done — Retry queue for cache invalidations
Cache invalidations can also fail/timeout, so we’re implementing a retry queue with a backoff strategy to try again. This change will be released shortly, likely at the same time as multiple cache invalidations above.
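Retrying with backoff is a standard pattern; a minimal sketch, again with a hypothetical purge call, could look like this:

```typescript
// Sketch of a cache-invalidation retry with exponential backoff (not the production queue).
async function purgeWithRetry(
  purge: () => Promise<void>,
  maxAttempts = 5,
  baseDelayMs = 1_000
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await purge();
      return;                                            // success: stop retrying
    } catch (err) {
      if (attempt === maxAttempts) throw err;            // out of attempts: surface the failure
      const backoffMs = baseDelayMs * 2 ** (attempt - 1); // 1s, 2s, 4s, 8s, ...
      await new Promise((resolve) => setTimeout(resolve, backoffMs));
    }
  }
}
```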

Status: Done — Further increase in library TTL
After the above changes with multiple cache invalidations and a retry queue, we further increased the TTL to 24 hours.

Status: Planned — Publish empty library on environment creation
When Launch creates new environments, it will publish an empty library. That empty library will prevent you from getting a 404 error when you request the library through the browser. And if you don’t get a 404, then the edge can’t cache it.

This idea was already on the backlog, but we’ve boosted the priority and will work on it soon.

2) High Number of SFTP Operations

I mentioned above that we were performing a lot of SFTP operations when publishing a new library build, due mostly to the number of files in a build. Initially, we made two changes to optimize the process. When those changes did not move the needle as much as we wanted, we made two more changes.

Status: Done — Synchronous uploads
I mentioned above that publishing a new library involves uploading hundreds or thousands of files asynchronously in parallel. At our CDN provider’s request, we switched these to synchronous uploads in sequence to reduce the resource usage on the origin.

The change had the desired effect of reducing resource usage. It also had the undesired effect of dramatically slowing down the upload process. Large library uploads became a trial of endurance, uploading thousands of files one at a time in sequence. The change also exacerbated the problem with SFTP connection limits (more on that below).
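The difference between the two modes is essentially the difference between these two loops, where `uploadFile` stands in for the real SFTP call:

```typescript
// Parallel vs. sequential uploads (illustrative only; uploadFile stands in for the real SFTP client call).
async function uploadFile(path: string): Promise<void> {
  /* write one file to the origin over SFTP */
}

// Before: fire every upload at once. Fast, but every in-flight upload adds load on the origin.
async function uploadParallel(files: string[]): Promise<void> {
  await Promise.all(files.map((file) => uploadFile(file)));
}

// After: one file at a time. Gentle on the origin, but a 2,000-file build uploads 2,000 times serially.
async function uploadSequential(files: string[]): Promise<void> {
  for (const file of files) {
    await uploadFile(file);
  }
}
```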

Status: Done — Optimized SFTP commands
Above I described the file writing process as:

  • more than 9 slow folder check operations
  • writing 6–8 files/symlinks
  • writing n custom code actions

We were able to replace the three slow checks for each folder with a single fast one, so the new numbers are:

  • 3 fast folder check operations
  • writing 6–8 files/symlinks
  • writing n custom code actions

There isn’t much we can do to reduce the number of files/symlinks we write or the amount of custom code that customers use.

Status: Done — Serve from Archive
These two changes above had the effect of making things stable but much slower. To get the upload speed back, we made an additional change.

Our CDN origin servers can store .zip files and serve the edge caches from that zip. When you upload one of these zips, the CDN origin will index the contents and provide those contents to an edge server when requested.

Launch now uploads your libraries (and the CDN stores them at origin) as a zip file. There is no change to the cached contents stored on the edge nodes.

So the final file upload equation is:

  • 3 fast folder check operations
  • writing 6–8 files/symlinks
  • writing 1 zip file

That is a maximum of 12 operations per build, down from roughly 10 fixed operations plus a potentially unbounded number of custom code writes.

After some extensive testing, we made this change as well. In response, upload times and SFTP operations per second both dropped significantly.
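Conceptually, the upload path changed from walking the build and writing every file to writing a single archive, something like the sketch below. The `SftpClient` interface and `zipDirectory` helper are hypothetical, not the Launch implementation.

```typescript
// Conceptual before/after for the serve-from-archive change (hypothetical interfaces).
interface SftpClient {
  ensureFolder(path: string): Promise<void>;
  putFile(remotePath: string, localPath: string): Promise<void>;
}

// Stand-in for packaging the build directory into a single .zip on disk.
async function zipDirectory(buildDir: string): Promise<string> {
  return `${buildDir}.zip`;
}

// Before: one SFTP write per file in the build (library files, symlinks, every custom code action).
async function uploadUnpacked(sftp: SftpClient, buildDir: string, files: string[]): Promise<void> {
  for (const file of files) {
    await sftp.putFile(`/builds/${file}`, `${buildDir}/${file}`);
  }
}

// After: folder check plus one zip that the origin indexes and serves from.
// The edge caches still see exactly the same individual files as before.
async function uploadAsArchive(sftp: SftpClient, buildDir: string): Promise<void> {
  await sftp.ensureFolder("/builds");
  const archive = await zipDirectory(buildDir);
  await sftp.putFile("/builds/build.zip", archive);
}
```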

Status: Done — Move to new infrastructure
The latest CDN infrastructure handles more operations per second than the previous setup, and individual operations execute faster.

3) SFTP Connection Limits

When attempting to upload a new library, DTM and Launch were frequently waiting for an available SFTP connection to perform the upload. Waiting for an open slot was adding to build time.

Status: Done — Actively clean up SFTP connections
Previously, when we uploaded files, we did not actively close the SFTP connection. That left the SFTP connection open until it timed out. Launch now closes the SFTP connection immediately after upload. That connection goes back into the pool for another upload process to use.
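In practice this is the usual acquire/use/release pattern; here is a sketch with a hypothetical connection pool:

```typescript
// Sketch of explicitly releasing SFTP connections after an upload (hypothetical pool/connection types).
interface SftpConnection {
  upload(files: string[]): Promise<void>;
  close(): Promise<void>;
}
interface SftpPool {
  acquire(): Promise<SftpConnection>;
}

async function uploadBuild(pool: SftpPool, files: string[]): Promise<void> {
  const connection = await pool.acquire();
  try {
    await connection.upload(files);
  } finally {
    // Close immediately instead of letting the connection idle until it times out,
    // so the slot goes back into the pool for the next build.
    await connection.close();
  }
}
```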

Status: Done — Move to new infrastructure
The move to the latest CDN infrastructure came with a much larger pool of available SFTP connections.

4) Low Isolation Between Tenants

Status: Done — Move to new infrastructure

The latest CDN infrastructure has better isolation between tenants. We now have a much smaller impact on our neighbors and vice versa.

Potential Future Enhancements

The changes we have made so far have returned us to a stable condition. The in-progress and planned changes will buy us some extra time and breathing room.

But one thing we have learned from all this is that the system does have limits. And since we expect traffic to continue to grow, we need to look ahead and plan appropriately.

Status: Future — Simpler self-hosting

DTM and Launch customers who self-host were entirely unaffected by the outages we experienced through June and July. We want to make it easier for other customers to do this if they choose.

Launch can currently use an SFTP Host to deliver Launch builds directly to your servers. But setting up an SFTP Host is time-intensive and requires a bit of expertise. We can make this simpler.

We will also add new Host options so that you can connect directly to your CDN if that’s what you’d like to do. We’ve discussed Akamai, CloudFlare, CloudFront, and others. We may also open up this part of the stack so that customers can make contributions to the list of valid Hosts.

Status: Future — Sharding across multiple accounts

Another option we’ve discussed, and one that seems pretty likely, is to shard our customers and environments across multiple CDN accounts. Having a CDN account isolates us from our neighbors. Having multiple accounts would allow us to isolate development environments from production environments (as one example).

Status: Exploring — 1st-party solution for dev libraries

There are a few other ideas we’re throwing around that may come to fruition in the future.

Production libraries are accessed all the time but change very infrequently, and they are more tolerant of a time delay when published.

That scenario is significantly different from development libraries, which receive almost no traffic but change frequently. And waiting for a dev library to build and deploy is as exciting as watching paint dry. So we want that to be as fast as possible.

Using a CDN for the production use case is a slam dunk. The development use case doesn’t fit as well. Perhaps we could come up with an alternative hosting solution for development libraries that didn’t have to deal with the complexity of edge caches and origin regions.

Status: Exploring — Adobe-owned Origin

If we owned an origin, there are a few intriguing possibilities that open up. That move would allow us maximum flexibility between many CDN providers and reduce the number of failure points in the publishing path.

Wrapping Up

Launch delivers code for tens of thousands of sites and mobile applications. I hope this has given you some insight into the complexity behind that and the work that goes into making things more robust and reliable.

These outages also highlighted that our process for communicating with you during a disruption leaves something to be desired. We’ve put some new procedures in place so that during future incidents, we can keep you informed about what’s going on and what to expect. Expect to see communications through the forum with official details and messages in Slack workspaces to point you to the forums for more information.

Thank you for using Launch. We hope you like using it as much as we like making it for you. Please find us in our forums and Slack communities (general, developers) to tell us what you like and what you want us to work on next.

Happy tagging =)

Ben Robison
Launch, by Adobe

By day I work for Adobe as Principal Product Manager. I spend all my time on Adobe Experience Platform Launch.