Testing with real traffic: How to use Edge Functions to ensure your new software is ready for production

Pablo Diehl · Published in aziontech
9 min read · Mar 8, 2024

There is a famous proverb that says “If it ain’t broke, don’t fix it.” However, this is not exactly true when we’re talking about software development. With new frameworks, tools, and processes popping up every day, the urge to change is high.

Whether you’re looking for a change because your codebase is so complex that every deployment is a nightmare, or because you want to build a new solution from scratch since your application’s performance is not what it used to be, there’s always a question that comes to mind:

Will the new software hold up under the amount of data carried by my application in the “real world”?

That’s why testing is really important in software development.

While there are some really nice tools for load testing, what about those very specific usage patterns that only real users can come up with? It would be really nice if we could use real data to test our new software, innit?

Well, using Azion Cells, we actually can!

Mirroring real traffic for testing a new origin with Edge Functions

One of the features of the 'Firewall' listener available in Azion Cells is that it can execute tasks without causing any impact on the user request and without affecting any characteristics of our Edge Application/Firewall (well, except that we need to create a rule to trigger the Edge Function, of course).

So, we can create a function that, for every request our Firewall receives, sends a “sub-request” with the production data to the new origin we want to test.

The simplest example possible would be something like this:

const TEST_URL = "https://www.my-new-domain.com";

async function firewallHandler(event) {
  // Fire-and-forget request to the test origin
  fetch(TEST_URL);
  // Let the original request continue untouched
  event.continue();
}

addEventListener("firewall", (event) => event.waitUntil(firewallHandler(event)));

And a rules engine rule that would look like this:
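In practice, that rule is just a catch-all in the Edge Firewall Rules Engine, along these lines (exact field names may vary in your console):

Criteria: if Request URI starts with "/"
Behavior: Run Function → the mirroring function above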

With this simple edge function, we would be able to transmit the same volume of requests from the production environment to our test origin.

However, we’d not be exactly simulating the real requests: what about the request method? The request URI? The request headers and body?

So, we’d have to write slightly more complex code.

const TEST_DOMAIN = "www.my-new-domain.com";

async function firewallHandler(event) {
  // Rebuild the original URL, swapping only the domain for the test origin
  const originalUrl = new URL(event.request.url);
  const testUrl = `${originalUrl.protocol}//${TEST_DOMAIN}${originalUrl.pathname}${originalUrl.search}`;

  // Copy the original method and headers
  let fetchOptions = {
    method: event.request.method,
    headers: Object.fromEntries(event.request.headers)
  }

  // Copy the original body, if there is one
  if (event.request.body) {
    fetchOptions["body"] = await event.request.text();
  }

  fetch(testUrl, fetchOptions);
  event.continue();
}

addEventListener("firewall", (event) => event.waitUntil(firewallHandler(event)));

With this new code, we’d be able to send a true duplicate of the original request to our test origin.

Note that, as we don’t “await” the fetch, the latency it adds to the request is minimal: for 100,000 (a hundred thousand) requests, the average latency for the function’s execution was lower than 0.001 seconds. The problem with this approach is that when we don’t “await” a fetch response, the Cells might close the request before the origin responds, so to guarantee that the request to our new origin is always completed, we must modify our code a little bit.

const TEST_DOMAIN = "www.my-new-domain.com";

async function firewallHandler(event) {
  const originalUrl = new URL(event.request.url);
  const testUrl = `${originalUrl.protocol}//${TEST_DOMAIN}${originalUrl.pathname}${originalUrl.search}`;

  let fetchOptions = {
    method: event.request.method,
    headers: Object.fromEntries(event.request.headers)
  }

  if (event.request.body) {
    fetchOptions["body"] = await event.request.text();
  }

  // Ask the runtime to keep the sub-request alive, without awaiting it here
  event.waitUntil(fetch(testUrl, fetchOptions));
  event.continue();
}

addEventListener("firewall", (event) => event.waitUntil(firewallHandler(event)));

By calling the fetch inside the “event.waitUntil” method, Azion Cells itself will wait for the request to be completed; meanwhile, our edge function won’t wait, as we immediately call the “event.continue” method. This way we guarantee the execution of the fetch to our new origin without adding latency for our end users.

So, I’ve mirrored the traffic, what do I do now?

Once everything is set up, all we need to do is wait for the duplicated requests to arrive at our new origin and then monitor their status. We can check if there’s any degradation in the new origin, whether there has been an increase in latency compared to our “original” origin, whether there has been an increase in the number of server errors compared to the “original” origin, among other factors that can help us understand if the new origin is ready to be used as the main origin in the near future.

Is there something else to do in the edge function?

If you cannot actively monitor what is going on in your new origin, we can add some actions to the edge function. At the cost of higher latency for the users, the function would no longer be “transparent” to the end user; still, it might be useful, as we can use Azion’s Real-Time Events to give us statuses about how the new origin is working.

By adding an actual “await” to the fetch to the new origin, we can, for example, add a log whenever the new origin returns a 5xx status code.

const TEST_DOMAIN = "www.my-new-domain.com";

async function firewallHandler(event) {
  const originalUrl = new URL(event.request.url);
  const testUrl = `${originalUrl.protocol}//${TEST_DOMAIN}${originalUrl.pathname}${originalUrl.search}`;

  let fetchOptions = {
    method: event.request.method,
    headers: Object.fromEntries(event.request.headers)
  }

  if (event.request.body) {
    fetchOptions["body"] = await event.request.text();
  }

  // Wait for the test origin's response so we can inspect its status code
  const testOriginResponse = await fetch(testUrl, fetchOptions);
  if (testOriginResponse.status > 499) {
    event.console.log(`New origin status code: ${testOriginResponse.status}`)
  }

  event.continue();
}

addEventListener("firewall", (event) => event.waitUntil(firewallHandler(event)));

With this new code, whenever our new origin returns a 5xx response, a log like the following would pop up in Real-Time Events:
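Something along these lines (the status code here is just an illustrative value):

New origin status code: 502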

Note that, with the “await”, the average latency for the function’s execution increased to 0.014 seconds. That said, as in the previous test the function’s average latency was below 0.001 seconds, we can deduce that the average latency of our new origin is around 0.0139 seconds. It is important to realize, though, that for the end user this new approach made the function 14 times slower than the original one (without the await), and that the slower your new origin is to respond, the more the end user will be affected by this “await”.

Latency is not that big of a deal; I want the function to extract more data about the sub-request to the new origin

If, for your testing, increasing the request time for the end user is not a deal-breaker, we can add more details to the logs written by the edge function. For example, for each request, even those that don’t end up resulting in a 5xx status code, we can log the latency of the sub-request.

const TEST_DOMAIN = "www.my-new-domain.com";

async function firewallHandler(event) {
  const originalUrl = new URL(event.request.url);
  const testUrl = `${originalUrl.protocol}//${TEST_DOMAIN}${originalUrl.pathname}${originalUrl.search}`;

  let fetchOptions = {
    method: event.request.method,
    headers: Object.fromEntries(event.request.headers)
  }

  if (event.request.body) {
    fetchOptions["body"] = await event.request.text();
  }

  const now = Date.now();
  const testOriginResponse = await fetch(testUrl, fetchOptions);
  const timeSpent = (Date.now() - now) / 1000; // Get the time spent in seconds
  event.console.log(`[${testOriginResponse.status}, ${timeSpent}]`);

  if (testOriginResponse.status > 499) {
    event.console.warn(`New origin status code: ${testOriginResponse.status}`)
  }

  event.continue();
}

addEventListener("firewall", (event) => event.waitUntil(firewallHandler(event)));

The resulting log would look like this:
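For instance, a hypothetical response with status 200 that took 12 milliseconds would produce:

[200, 0.012]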

But the possibilities do not end there. We might also want to know more about the requests to the new origin that were not successful, so we can add a more detailed “request x response” log for those situations:

const TEST_DOMAIN = "www.my-new-domain.com";

async function firewallHandler(event) {
  const originalUrl = new URL(event.request.url);
  const testUrl = `${originalUrl.protocol}//${TEST_DOMAIN}${originalUrl.pathname}${originalUrl.search}`;

  let fetchOptions = {
    method: event.request.method,
    headers: Object.fromEntries(event.request.headers)
  }

  if (event.request.body) {
    fetchOptions["body"] = await event.request.text();
  }

  const now = Date.now();
  const testOriginResponse = await fetch(testUrl, fetchOptions);
  const timeSpent = (Date.now() - now) / 1000; // Get the time spent in seconds

  event.console.log(`[${testOriginResponse.status}, ${timeSpent}]`);

  if (testOriginResponse.status > 399) {
    event.console.warn(JSON.stringify({
      request_method: event.request.method,
      request_body: fetchOptions["body"],
      request_headers: fetchOptions["headers"],
      response_body: await testOriginResponse.text(),
      response_status: testOriginResponse.status
    }));
  }

  event.continue();
}

addEventListener("firewall", (event) => event.waitUntil(firewallHandler(event)));

This way, we can analyze the logs to understand whether the user was blocked for a good reason or whether it’s a problem in our new origin, for example.
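As a rough illustration (every value below is made up), one of those entries could look like:

{"request_method": "POST", "request_body": "{\"user_id\": 42}", "request_headers": {"content-type": "application/json"}, "response_body": "Internal Server Error", "response_status": 500}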

Something that you might be thinking now is “What if the new origin is down, or takes too long to respond?” This is for sure a problem when using this approach of waiting for the response.

But we can prevent it by adding an “AbortSignal” to the fetch.

const TEST_DOMAIN = "www.my-new-domain.com";

async function firewallHandler(event) {
  try {
    const originalUrl = new URL(event.request.url);
    const testUrl = `${originalUrl.protocol}//${TEST_DOMAIN}${originalUrl.pathname}${originalUrl.search}`;

    let fetchOptions = {
      method: event.request.method,
      headers: Object.fromEntries(event.request.headers),
      signal: AbortSignal.timeout(1500) // The user should not wait for more than 1.5 seconds
    }

    if (event.request.body) {
      fetchOptions["body"] = await event.request.text();
    }

    const now = Date.now();
    const testOriginResponse = await fetch(testUrl, fetchOptions);
    const timeSpent = (Date.now() - now) / 1000; // Get the time spent in seconds

    event.console.log(`[${testOriginResponse.status}, ${timeSpent}]`);

    if (testOriginResponse.status > 399) {
      event.console.warn(JSON.stringify({
        request_method: event.request.method,
        request_body: fetchOptions["body"],
        request_headers: fetchOptions["headers"],
        response_body: await testOriginResponse.text(),
        response_status: testOriginResponse.status
      }));
    }
  } catch (err) {
    if (err.name === "TimeoutError") {
      event.console.warn("The request to the new origin took too long!");
    } else {
      // In case of any other sort of error
      event.console.warn(`Generic error handler. Error: ${err.message}`);
    }
  }

  event.continue();
}

addEventListener("firewall", (event) => event.waitUntil(firewallHandler(event)));

Now, whenever the new origin takes too long to respond, the fetch is aborted and the following log is written:
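That is, the warning string from the catch block:

The request to the new origin took too long!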

Building a reusable function

So far, we’ve built a function that solves our use case; however, it does not have the best “reusability”, so to speak, because it has hardcoded arguments, like the URL of the new domain or the timeout used in the abort signal.

We can improve it by adding arguments via environment variables and/or JSON Args. To do so, we have to modify our code just a little bit:

async function firewallHandler(event) {
  try {
    // Load settings from JSON Args, then environment variables, then hardcoded defaults
    const testDomain = event.args.url || Azion.env.get("TEST_URL") || "www.my-new-domain.com";
    const testTimeout = event.args.timeout || Azion.env.get("TEST_TIMEOUT") || 10000;

    const originalUrl = new URL(event.request.url);
    const testUrl = `${originalUrl.protocol}//${testDomain}${originalUrl.pathname}${originalUrl.search}`;

    let fetchOptions = {
      method: event.request.method,
      headers: Object.fromEntries(event.request.headers),
      signal: AbortSignal.timeout(testTimeout) // Abort the sub-request after the configured timeout
    }

    if (event.request.body) {
      fetchOptions["body"] = await event.request.text();
    }

    const now = Date.now();
    const testOriginResponse = await fetch(testUrl, fetchOptions);
    const timeSpent = (Date.now() - now) / 1000; // Get the time spent in seconds

    event.console.log(`[${testOriginResponse.status}, ${timeSpent}]`);

    if (testOriginResponse.status > 399) {
      event.console.warn(JSON.stringify({
        request_method: event.request.method,
        request_body: fetchOptions["body"],
        request_headers: fetchOptions["headers"],
        response_body: await testOriginResponse.text(),
        response_status: testOriginResponse.status
      }));
    }
  } catch (err) {
    if (err.name === "TimeoutError") {
      event.console.warn("The request to the new origin took too long!");
    } else {
      // In case of any other sort of error
      event.console.warn(`Generic error handler. Error: ${err.message}`);
    }
  }

  event.continue();
}

addEventListener("firewall", (event) => event.waitUntil(firewallHandler(event)));

Now, our function will first try to load the URL and the timeout from the JSON Args; if there are none, it’ll try to load them from the environment variables, and finally fall back to the hardcoded default values. With this change, it would be pretty simple to create multiple instances of the function running with different arguments.
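For example, one instance of the function could be configured with JSON Args like these (the values are illustrative):

{
  "url": "www.my-new-domain.com",
  "timeout": 1500
}

while another instance could rely on the TEST_URL and TEST_TIMEOUT environment variables instead.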

Final thoughts

Mirroring the traffic from the production environment provides a simple way to test a new piece of software and ensure it is ready for production. Not only does it help us check if the new software can handle the number of requests that our application usually receives, but it can also be used to verify how the new software will behave with real-world requests, with real user data. Of course, it demands a little bit of logging/data analysis to understand the outputs, but it can be an important tool for finding out if there’s any critical problem with the software we’re looking forward to deploying.

Interested in knowing more about it?
Check Azion's documentation and find us on Discord.
