CDN MIGRATION: AKAMAI TO CLOUDFRONT WITH ZERO DOWNTIME

hany mhajna · Published in payu-engineering · 11 min read · Jul 15, 2024

Introduction

In the bustling digital marketplace, where transactions flash faster than a wink, there’s PayU, your trusty financial sherpa, guiding billions of dollars safely up the e-commerce Everest.

Who is PayU?

PayU isn’t just a payment processor; it’s the backbone of countless businesses, big and small, ensuring that every click leads to a ka-ching. With a presence in 17 countries and a reach that spans the globe, we’re like the Swiss Army knife for online payments — reliable, versatile, and always ready for action.

Our Relentless Uptime

Imagine a world where PayU takes a coffee break. Chaos, right? No need to worry, though: our system has an impressive uptime of 99.999%, making sure that your transactions are as smooth as can be. We don’t do maintenance windows; we’re always up and running, because in the digital age, downtime is just not cool.

Our Traffic and Latency: The Nerdy Details

Now, let’s discuss figures, which are essential to any tech enthusiast. Each day, our system handles a volume of traffic that would dwarf rush hour. With requests moving swiftly in all directions, our response times are incredibly fast: the system averages 60–70 RPS with very low latency, though these figures climb during peak times like Black Friday or Christmas. It’s not just about speed; it’s about delivering a smooth and effortless experience that encourages customers to return for more.

The Migration: Why It’s a Big Deal

But even the best need to evolve. Our upcoming migration from Akamai to CloudFront isn’t just a technical maneuver; it’s a strategic leap towards even faster, more reliable service delivery.

Strategic Objectives: Enhancing Testability, Efficiency, and Performance in PayU’s CloudFront Migration

1. Enhanced Testability: Our migration strategy prioritizes the establishment of a testable system where automated load and unit tests are integral, ensuring that every element is rigorously vetted before deployment to production. This approach guarantees not just functionality, but also system robustness and reliability.

2. Code-Based System: We are committed to a code-based system approach, where every modification and update is managed through code. We use Terraform to manage our infrastructure; it comes more naturally than Akamai’s tooling and makes managing resources really easy. This method enhances transparency, traceability, and control, allowing us to manage changes more efficiently and reduce the risk of errors.

3. Real-Time Logs and Metrics: A key goal of our migration is to implement a system that provides real-time logs and metrics. This capability will empower us to monitor our system’s performance continuously, enabling quick identification and resolution of any issues, thereby maintaining our service quality at its peak.

4. Optimized Performance and Latency: Our objective is to ensure that the performance and latency post-migration are on par with, if not superior to, our current standards set by Akamai. This goal underscores our commitment to delivering an uninterrupted, high-speed experience for our users.

5. Significant Cost Reduction: A pivotal aim of transitioning to CloudFront is to achieve a notable decrease in costs. This financial efficiency will enable us to allocate resources more strategically, fostering innovation and enhancing our service offerings.

These goals underscore our commitment to maintaining the highest standards of service quality and reliability while also optimizing our operational efficiency and cost-effectiveness through this migration.

Migration process

Let’s dive into the migration process. Our system was fully managed by Akamai for both DNS and CDN, and the plan was to migrate both while keeping the ability to control our traffic.

[Figure: old architecture]
[Figure: new architecture]

Certificate Migration

First, in order to migrate secure traffic, we needed to prepare the relevant TLS certificates in AWS. Our TLD is managed in GoDaddy, which delegates NS records to Akamai in order to resolve the domain and subdomains. After creating the new certificate in AWS ACM, we chose to validate it using DNS validation. This required adding a resolvable validation record, at this point still in Akamai. The validation records were then migrated to Route53 with all the other records during the DNS migration phase. The validation records are used whenever the certificate is renewed, so we needed to keep them.
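
As a sketch of the calls involved, here is roughly what the certificate request looks like with the AWS SDK for JavaScript (v3). The domain names are placeholders, not our real ones:

// Sketch: request a certificate in ACM and read back the DNS validation
// record that must be published (at this stage, still in Akamai).
const {
  ACMClient,
  RequestCertificateCommand,
  DescribeCertificateCommand,
} = require('@aws-sdk/client-acm');

// Certificates used by CloudFront must live in us-east-1.
const acm = new ACMClient({ region: 'us-east-1' });

async function requestCert() {
  const { CertificateArn } = await acm.send(
    new RequestCertificateCommand({
      DomainName: 'pay.example.com', // placeholder
      ValidationMethod: 'DNS',
      SubjectAlternativeNames: ['*.pay.example.com'],
    })
  );

  // The validation CNAME appears on the certificate after creation
  // (it can take a few seconds for ACM to populate it).
  const { Certificate } = await acm.send(
    new DescribeCertificateCommand({ CertificateArn })
  );
  const record = Certificate.DomainValidationOptions[0].ResourceRecord;
  // Publish record.Name -> record.Value as a CNAME in the current DNS
  // (Akamai for now; it moves to Route53 with everything else later).
  console.log(record.Name, record.Type, record.Value);
}

requestCert().catch(console.error);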

DNS Migration

After creating and validating the certificates, we started picking subdomains and shifted their DNS resolution to Route53 using NS records. We first picked subdomains that are less critical to the business and do not affect the actual payments flow. The records in Route53 routed the traffic back to Akamai, since the CloudFront distribution wasn’t production-approved at this phase. As mentioned before, we aimed to shift traffic between Akamai and CloudFront gradually. To achieve this, we used Route53’s weighted routing policy, which lets us assign each record a weight between 0 and 255. A weight of 0 was assigned to the CloudFront endpoint, and a weight of 255 was assigned to Akamai. At this point traffic still flowed 100% through Akamai, but the gradual shifting mechanism was set up.
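
A minimal sketch of that weighted record pair through the Route53 API (the zone ID, hostnames, and record type are illustrative placeholders; in practice we manage these records as code with Terraform, but the shape is the same):

const {
  Route53Client,
  ChangeResourceRecordSetsCommand,
} = require('@aws-sdk/client-route-53');

const route53 = new Route53Client({});

// Build one half of the weighted pair: two records share the same name,
// and Route53 splits traffic in proportion to the weights.
const weightedRecord = (setIdentifier, weight, target) => ({
  Action: 'UPSERT',
  ResourceRecordSet: {
    Name: 'pay.example.com', // placeholder subdomain
    Type: 'CNAME',
    SetIdentifier: setIdentifier,
    Weight: weight,
    TTL: 60,
    ResourceRecords: [{ Value: target }],
  },
});

async function setWeights(akamaiWeight, cloudfrontWeight) {
  await route53.send(new ChangeResourceRecordSetsCommand({
    HostedZoneId: 'Z0PLACEHOLDER', // placeholder
    ChangeBatch: {
      Changes: [
        weightedRecord('akamai', akamaiWeight, 'pay.example.com.edgekey.net'),
        weightedRecord('cloudfront', cloudfrontWeight, 'd111111abcdef8.cloudfront.net'),
      ],
    },
  }));
}

// 100% through Akamai to start; later calls raise the CloudFront weight
// (and lower Akamai's) to shift traffic gradually, or revert instantly.
setWeights(255, 0).catch(console.error);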

CDN Migration

Now it’s time for CloudFront! But before delving into its details, let me explain how we used the Akamai CDN:

  1. For caching static websites
  2. For routing

Akamai offers easy configuration and a wide range of out-of-the-box features, such as forwarding to multiple endpoints and injecting the user’s IP into request headers. CloudFront, on the other hand, relies on coding Lambda functions for more complex configurations. This can be an advantage for flexibility or a disadvantage for ease of use; as a DevOps team with coding experience, we see it as an advantage.

Here’s how we shift the traffic to the relevant internal endpoint. Example:


const FIRST_BACKEND = 'first-backend.payu.com';
const SECOND_BACKEND = 'second-backend.payu.com';

// Swap the request's custom origin and Host header to the given domain.
const changeOriginDomain = function (request, originDomain) {
    request.origin.custom.domainName = originDomain;
    request.headers['host'] = [{ key: 'host', value: originDomain }];
    return request;
};

exports.handler = (event, context, callback) => {
    console.log('incoming request');
    console.log(event);
    let request = event.Records[0].cf.request;
    console.log('succeeded fetching request and headers');

    // Collapse repeated slashes so the route matching below is predictable.
    if (request.uri !== undefined && request.uri !== '') {
        request.uri = request.uri.replace(/\/+/g, '/');
    }
    console.log('method: ' + request.method + ', origin: ' + JSON.stringify(request.origin) + ', URI: ' + request.uri);

    // Answer CORS preflight requests directly from the edge.
    if (request.method === 'OPTIONS') {
        const response = {
            status: '200',
            statusDescription: 'OK',
            headers: {
                'content-type': [{
                    key: 'Content-Type',
                    value: 'text/html'
                }],
            },
            body: '<html></html>',
        };
        callback(null, response);
        return;
    } else if (/^\/+dummy-url.*$/.test(request.uri) || /^\/+second-dummy\/login.*$/.test(request.uri)) {
        // These paths belong to the second backend.
        request.uri = '/anything' + request.uri;
        request = changeOriginDomain(request, SECOND_BACKEND);
        console.log('return the response back to cloudfront');
        console.log('method: ' + request.method + ', origin: ' + JSON.stringify(request.origin) + ', URI: ' + request.uri);
        callback(null, request);
        return;
    }

    // Everything else falls through to the default (first) backend.
    request = changeOriginDomain(request, FIRST_BACKEND);
    callback(null, request);
};

The fun part: we can set it up as a pipeline with unit tests, so why not? After adding unit tests and setting up the pipeline, we were able to move changes into production faster while ensuring that nothing broke. To incorporate other Akamai features, such as forwarding the user IP, we simply needed to attach them at the relevant request level. Easy as that!
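
To give a flavor of those tests, here is a minimal Jest-style unit test for the handler above. The test runner and file layout are illustrative assumptions, not necessarily what our pipeline uses:

// handler.test.js - minimal tests for the origin-request handler above,
// assuming it is exported from handler.js.
const { handler } = require('./handler');

// Helper to build a CloudFront origin-request event for a given URI.
const buildEvent = (uri, method = 'GET') => ({
  Records: [{
    cf: {
      request: {
        uri,
        method,
        headers: {},
        origin: { custom: { domainName: 'first-backend.payu.com' } },
      },
    },
  }],
});

test('routes dummy-url paths to the second backend', (done) => {
  handler(buildEvent('/dummy-url/checkout'), {}, (err, request) => {
    expect(err).toBeNull();
    expect(request.origin.custom.domainName).toBe('second-backend.payu.com');
    expect(request.uri).toBe('/anything/dummy-url/checkout');
    done();
  });
});

test('answers OPTIONS preflight directly from the edge', (done) => {
  handler(buildEvent('/any-path', 'OPTIONS'), {}, (err, response) => {
    expect(err).toBeNull();
    expect(response.status).toBe('200');
    done();
  });
});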

We had all the necessary features, so perhaps it was time to start sending traffic through CloudFront? However, as we shifted 15% of our traffic to staging clusters for testing, SRE began reporting failures with errors from CloudFront, specifically a 403 error. The situation was difficult to diagnose because we still lacked a logging system and were still running on dev clusters. Therefore, we opted to redirect the traffic back to Akamai until the issue could be resolved.

After looking into it extensively, I discovered that the problem was actually more straightforward than you might think. CloudFront doesn’t allow GET requests with a body, and rejects them with a 403 error. Unfortunately, this behavior can’t be changed, and it presented a significant challenge for us, especially since we have customers all over the world who may encounter issues if they send us GET requests with a body. It was crucial to identify the affected customers and reach out to them before moving forward with the migration.
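
For reference, this is the shape of request that trips the 403. A minimal sketch using Node’s built-in https module, with a placeholder hostname:

// A GET request that carries a body: technically legal per the HTTP spec
// (the body just has no defined semantics), but CloudFront rejects it.
const https = require('https');

const body = JSON.stringify({ amount: 100, currency: 'USD' });
const req = https.request({
  hostname: 'pay.example.com', // placeholder
  path: '/transactions',
  method: 'GET',
  headers: {
    'Content-Type': 'application/json',
    'Content-Length': Buffer.byteLength(body),
  },
}, (res) => {
  // Through Akamai this reached the backend; through CloudFront it
  // comes back as a 403 before ever hitting an origin.
  console.log('status:', res.statusCode);
});
req.write(body);
req.end();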

Day One & continuous delivery

Maybe this was a good time to finish the logging task. For Day One, our goal was to create a real-time log system that could be tested before going into production. We also needed to ensure it wouldn’t disrupt our internal systems or other teams. To achieve this, we set up a staging CloudFront distribution in the developers’ environment and ran load tests on it before promoting it to the live CDN. It’s complex, but doable with automation in place.
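
As an illustration, a load test against such a staging distribution can be as small as this sketch using autocannon; the tool choice and URL are assumptions for illustration, not necessarily what our automation runs:

// Hammer the staging CloudFront distribution before promoting it.
const autocannon = require('autocannon');

autocannon({
  url: 'https://staging-cdn.pay.example.com/health', // placeholder
  connections: 100, // concurrent connections
  duration: 60,     // seconds
}, (err, result) => {
  if (err) throw err;
  // Compare latency percentiles and non-2xx counts against the live CDN
  // before promoting the staging distribution.
  console.log('p99 latency (ms):', result.latency.p99);
  console.log('non-2xx responses:', result.non2xx);
});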

Shifting production traffic:

Now we were ready to start shifting traffic to CloudFront. We had real-time logs, WAF, and everything we needed. We just had to start shifting the traffic, beginning with a small percentage (0.5%) to make sure our system was working fine.

With just a small percentage of traffic, everything looked fine. We saw successful requests and some blocked by WAF, as expected. After a few days, we scaled the traffic to 5%, then to 10%, until reaching 50%. Here came the real challenge: some customers started complaining about unstable behavior. Half of their requests were failing with certificate errors and the others were getting a 403 error. So we decided to shift all the traffic back to Akamai until we found the core issue. Despite all our tests, surprises can happen during such large migrations, but deciding to shift traffic slowly, and being able to revert quickly, was a good decision thanks to Route53’s weighted routing policy.

Cipher suites issue:

We work with a lot of customers, and they have different systems; some are new while others are old. We can’t control their systems, but we need to make sure their integration with us stays intact so they can keep processing transactions even after the migration.

Before we delve into this issue, I’d like to explain why the cipher suite matters to us and how it can impact end users. To use a cipher suite, the client and server need to agree on which specific one will be used for message exchange. Both parties must support the chosen cipher suite; if they share none, no connection is established. To address the cipher suites issue, PayU’s technical strategy involved thorough testing and coordination with customers to ensure seamless integration and transaction flow after the migration.

If you want to read more, there are detailed write-ups on Medium covering cipher suites in depth. In short, during the TLS handshake the client advertises the cipher suites it supports and the server picks one it also supports; if there is no overlap, the handshake fails.

The customer complained about a handshake issue, so we started debugging and checking our real-time logs. As expected, we didn’t see any requests in our logs: the connection was failing before a request ever reached us. We asked them for their cipher suites and compared them to ours, and that turned out to be the issue. Our side was configured with the newest TLS policy, so we resolved it by allowing an older TLS version for now, with a plan to tighten it again once the customers upgrade their systems.
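
One quick way to see the negotiation from the client side is to pin a client to an older protocol range and check what the server agrees to. A small sketch with Node’s built-in tls module; the hostname is a placeholder:

// Probe which TLS protocol and cipher suite a server will agree on.
// If the client and server share no cipher suite or protocol version,
// this fails with a handshake error instead of printing anything.
const tls = require('tls');

const socket = tls.connect({
  host: 'pay.example.com', // placeholder
  port: 443,
  servername: 'pay.example.com',
  // Pin the client to mimic a legacy customer system that can't do TLS 1.3:
  minVersion: 'TLSv1.2',
  maxVersion: 'TLSv1.2',
}, () => {
  console.log('protocol:', socket.getProtocol()); // e.g. 'TLSv1.2'
  console.log('cipher:', socket.getCipher());     // { name, version, ... }
  socket.end();
});

socket.on('error', (err) => console.error('handshake failed:', err.message));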

Realtime Log:

Real-time logging is essential during the migration process so that you can make quick, informed decisions. I also discussed our implementation of real-time logs in a recent blog post. After completing the migration, it might be worth considering a switch to a more cost-effective solution like standard logging.
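
As a sketch of what wiring this up looks like through the CloudFront API (names, ARNs, and the field list here are illustrative placeholders; the dedicated post covers the full design):

// Create a real-time log config that streams a subset of request fields
// into Kinesis; the resulting ARN is attached to a cache behavior.
const {
  CloudFrontClient,
  CreateRealtimeLogConfigCommand,
} = require('@aws-sdk/client-cloudfront');

const cloudfront = new CloudFrontClient({ region: 'us-east-1' });

async function createLogConfig() {
  const { RealtimeLogConfig } = await cloudfront.send(
    new CreateRealtimeLogConfigCommand({
      Name: 'cdn-migration-logs', // placeholder
      SamplingRate: 100, // log 100% of requests during the migration
      Fields: ['timestamp', 'c-ip', 'sc-status', 'cs-method', 'cs-uri-stem', 'x-edge-result-type'],
      EndPoints: [{
        StreamType: 'Kinesis',
        KinesisStreamConfig: {
          RoleARN: 'arn:aws:iam::123456789012:role/cf-realtime-logs',      // placeholder
          StreamARN: 'arn:aws:kinesis:us-east-1:123456789012:stream/cf-logs', // placeholder
        },
      }],
    })
  );
  // Reference this ARN from the distribution's cache behavior.
  console.log(RealtimeLogConfig.ARN);
}

createLogConfig().catch(console.error);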

Monitoring & Alarms using anomaly detection

CloudFront comes with a built-in alarm system that allows you to set up alerts for 4xx or 5xx error rates and other metrics. Sometimes, a falling number of 4xx errors doesn’t necessarily mean that things are improving, as these status codes can be expected in various application states such as bad requests, authentication issues, and authorization problems. However, continuous monitoring and anomaly detection techniques can surface abnormal patterns or trends in the error rates, allowing us to address issues proactively.

Let’s consider this situation together: in a system with a login page, the 4xx error rate was initially 5%, but it suddenly decreased to 1%. A static threshold would stay silent here, yet such a drop can signal a real problem, for example traffic no longer reaching the login flow at all. In this scenario, we need CloudFront alarms that trigger consistently on significant changes in either direction.

Anomaly detection lets us set off alerts when a metric goes beyond its expected range. Let’s work through a simple numeric example to show how this works. Remember that the actual calculation of the standard deviation and the behavior of the anomaly detection model are more intricate and rely on ongoing learning from your metric data; the numbers below are purely for explanation.

Average 4xx Error Rate: Let’s say your average 4xx error rate is 7%. This is the expected behavior based on past data.

Standard Deviation: Suppose, through its modeling, CloudWatch determines that the standard deviation for your 4xx error rate is 2%. This means that in the model’s view, normal fluctuations in your error rate are typically within 2% above or below the average.

Threshold Setting: You’ve set the threshold to 1. This means you want to be alerted when the error rate is more than 1 standard deviation away from the expected rate.

Now, let’s apply these numbers: the “normal” range for your 4xx error rate, given this standard deviation, would be from 5% (7% − 2%) to 9% (7% + 2%). This range represents fluctuations within one standard deviation of the average.

If you set an anomaly detection alarm with a threshold of 1, CloudWatch will alert you when the error rate goes below 5% or above 9%.

For example, if your 4xx error rate suddenly jumps to 10%, it’s more than one standard deviation (2%) away from your average (7%), and thus, your alarm would trigger. Similarly, if it drops to 4%, the alarm would trigger since it’s also more than one standard deviation away from the norm.

This setup helps you monitor for significant deviations from typical behavior, indicating potential issues or changes in your application or user traffic that require attention.

Why do we need to set up an alarm for 4xx errors? CloudFront automatically returns a 403 error for any unsupported request, such as a GET request with a body, or when the WAF blocks traffic. Human errors can also occur, especially since we use a Lambda function to handle and direct traffic between internal clusters and keep adding more WAF rules. Any mistake in these areas could lead to an increase in 4xx errors, so it’s risky to make changes in production clusters without monitoring them. Monitoring and alarming on 4xx errors is essential because it helps detect abnormalities and deviations from the expected error rates, and allows us to proactively address potential issues or changes in our application or user traffic.

[Figure: anomaly detection alarm on 5xxErrorRate]
[Figure: anomaly detection alarm on 4xxErrorRate]

In CloudWatch, you can pick built-in metrics from CloudFront. In our case, we chose 4xxErrorRate and 5xxErrorRate and set the anomaly detection threshold to 2. A higher number means a thicker band; a lower number means a thinner band. Setting the threshold to 2 allows a wider range of acceptable error rates, giving the system more tolerance for normal fluctuations.

One more important configuration to customize: the number of datapoints within the evaluation period that must be breaching the band before the alarm goes into the ALARM state.
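
Putting these pieces together, here is a sketch of such an alarm through the CloudWatch API. The distribution ID and SNS topic are placeholders, and the "3 out of 5 datapoints" choice is just an example of that setting:

// An anomaly-detection alarm on CloudFront's 4xxErrorRate, with the band
// width of 2 discussed above. It fires on breaches in either direction.
const {
  CloudWatchClient,
  PutMetricAlarmCommand,
} = require('@aws-sdk/client-cloudwatch');

// CloudFront metrics live in us-east-1 with the Region=Global dimension.
const cloudwatch = new CloudWatchClient({ region: 'us-east-1' });

async function createAlarm() {
  await cloudwatch.send(new PutMetricAlarmCommand({
    AlarmName: 'cloudfront-4xx-anomaly',
    ComparisonOperator: 'LessThanLowerOrGreaterThanUpperThreshold',
    ThresholdMetricId: 'band',
    EvaluationPeriods: 5,
    DatapointsToAlarm: 3, // example: 3 of 5 datapoints must breach the band
    Metrics: [
      {
        Id: 'm1',
        ReturnData: true,
        MetricStat: {
          Metric: {
            Namespace: 'AWS/CloudFront',
            MetricName: '4xxErrorRate',
            Dimensions: [
              { Name: 'DistributionId', Value: 'E1PLACEHOLDER' }, // placeholder
              { Name: 'Region', Value: 'Global' },
            ],
          },
          Period: 300,
          Stat: 'Average',
        },
      },
      {
        Id: 'band',
        ReturnData: true,
        // Band 2 standard deviations wide, as described above.
        Expression: 'ANOMALY_DETECTION_BAND(m1, 2)',
        Label: '4xxErrorRate (expected)',
      },
    ],
    AlarmActions: ['arn:aws:sns:us-east-1:123456789012:cdn-alerts'], // placeholder
  }));
}

createAlarm().catch(console.error);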

Cost

Now let’s talk about the interesting part. One of the main reasons for our huge migration was the Akamai cost, and the plan was to decrease that number as much as possible without losing any features. It was a worthy challenge: we created multiple CloudFront distributions across multiple accounts (dev, qa, mars, and production) with real-time logs and without giving up any features. We succeeded in cutting our yearly CDN cost by more than 90%!
