How to Gradually Roll Out Software Updates to Cloudflare Workers
A Three-legged Design
In this article, I will describe how we managed to roll out software updates to Cloudflare edge workers gradually, with near real-time control over the rollout process, and illustrate the design of the rollout mechanism we decided to implement.
Update: Cloudflare has since launched gradual deployments and rollbacks. See here:
https://developers.cloudflare.com/workers/configuration/versions-and-deployments/gradual-deployments
There could still be cases where you would like your own implementation of gradual deployment.
We Wanted Gradual Rollout
Our website runs 24/7, allowing constant access to all the services it offers. The edge workers execute code that affects every request, and changes are deployed to all edge servers virtually simultaneously, so every update is a potential hazard.
I need a method to deploy updates progressively over time while still ensuring minimal user disturbance and maximum reliability. I also need a way to rapidly undo changes if something goes wrong. Depending on the application and the availability of the CI/CD system, a deployment can take several minutes, which I would like to avoid.
A Three-legged Design
To allow more than one version to be active on the Cloudflare Workers platform, I duplicate the app; I call the copies “Current” and “Canary”. I also added a lean “Switch” app that decides where to route each request.
The rollout itself is controlled by an eventually consistent key-value store, which is read by the Switch and updated by the rollout process as part of continuous deployment.
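For reference, the Switch worker’s bindings might be declared roughly like this in wrangler.toml; the worker names, script path, and namespace ID are illustrative placeholders rather than the actual configuration:
#########################################
# Illustrative wrangler.toml for Switch #
#########################################
name = "switch"
main = "src/switch.ts"

# Service bindings to the two copies of the application
[[services]]
binding = "CURRENT"
service = "app-current"

[[services]]
binding = "CANARY"
service = "app-canary"

# Eventually consistent store holding the rollout percentage
[[kv_namespaces]]
binding = "ROLLOUT"
id = "<kv-namespace-id>"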
Allocation Logic
Each request is routed to the Switch service, which executes the decision tree for version rollout allocation. It then forwards the request, through a service binding, to the appropriate service.
Fresh requests are assigned a number between 0 and 100, either deterministically or at random. The rollout itself has a value of 0–100, and requests whose allocation falls within the rollout percentage are served the updated version of the application (Canary); the rest are served the Current version.
The allocation is persistent and will last the duration of the visit.
Because I employ a deterministic allocation algorithm (redacted), visitors should receive a consistent allocation even after the stickiness expires.
/***************************************
 * Simplified Cloudflare worker switch *
 ***************************************/
import { parse, serialize } from "cookie";

/**
 * Services and KV namespace are bound via the wrangler.toml file
 */
interface Env {
  CURRENT: Fetcher;
  CANARY: Fetcher;
  ROLLOUT: KVNamespace;
}

/**
 * TODO: Implement deterministic logic here
 */
const allocateRequest = (request: Request): number => Math.random() * 100;

const ROLLOUT_KV_KEY = "percent-open";
const ROLLOUT_COOKIE_NAME = "edge_rollout";

/**
 * Forward the request to the appropriate worker
 */
const handler: ExportedHandler<Env> = {
  async fetch(
    request: Request,
    env: Env,
    ctx: ExecutionContext
  ): Promise<Response> {
    ctx.passThroughOnException();
    const rolloutPercentage =
      Number(await env.ROLLOUT.get(ROLLOUT_KV_KEY)) || 0;
    // Short circuit when completely closed
    if (rolloutPercentage === 0) {
      return env.CURRENT.fetch(request);
    }
    // Retrieve existing allocation or allocate request
    const allocationCookieValue = parse(request.headers.get("cookie") ?? "")[
      ROLLOUT_COOKIE_NAME
    ];
    const requestPercentAllocation = allocationCookieValue
      ? Number(allocationCookieValue)
      : allocateRequest(request);
    // Requests within the rollout go to canary, the rest to the current application
    const inRollout = requestPercentAllocation <= rolloutPercentage;
    const service = inRollout ? env.CANARY : env.CURRENT;
    let response = await service.fetch(request);
    response = new Response(response.body, response); // Make the response headers mutable
    // Cookie used for quick persistence with subsequent requests
    response.headers.append(
      "set-cookie",
      serialize(ROLLOUT_COOKIE_NAME, String(requestPercentAllocation), {
        domain: request.headers.get("host") ?? undefined,
        httpOnly: true,
        maxAge: 300, // 5 minutes
        path: "/",
        priority: "low",
        sameSite: "lax",
        secure: true,
      })
    );
    return response;
  },
};

export default handler;
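The production allocation logic is redacted, but purely as an illustration of how a deterministic allocation could work (an assumption on my part, not the actual algorithm), a stable request attribute can be hashed into the 0–100 range so the same visitor keeps landing in the same bucket:
/************************************************
 * Illustrative deterministic allocation sketch *
 ************************************************/
// The chosen attribute and hash are placeholders, not the redacted production logic
const allocateRequestDeterministically = (request: Request): number => {
  const stableKey =
    request.headers.get("cf-connecting-ip") ?? // client IP, set by Cloudflare
    request.headers.get("user-agent") ??
    "anonymous";
  // FNV-1a style hash folded into the 0-100 range
  let hash = 2166136261;
  for (let i = 0; i < stableKey.length; i++) {
    hash ^= stableKey.charCodeAt(i);
    hash = Math.imul(hash, 16777619);
  }
  return (hash >>> 0) % 101; // 0..100 inclusive
};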
Continuous Delivery
Now that I’ve established the framework, the delivery pipeline needs to handle the routine of exposing users to the new version.
The delivery pipeline will behave differently when pushing to side branches versus pushing to the main branch.
On side branches, unit and integration tests need to pass before a canary version is deployed. Each rollout increase is manually approved, and the percentage of customers exposed to the rollout is gradually increased. There is a simple rollback option at each step in case the new version produces unexpected results.
After we have gained some trust in the system, we may choose to automate the rollout plan so that it waits for an external signal, such as request success rate or any other measure, and then proceeds or rolls back accordingly.
Once the rollout has been exposed to 100% of visits, the developer can merge the side branch into our protected main branch, deploy, and then roll the percentage back to 0%. Since “Current” now contains the new code, 100% of users remain on the updated version.
The main branch’s pipeline is straightforward. It requires no approvals, because it is critical that the version management system accurately represents reality: the “Current” application is always what is on the main branch, and that happens automagically.
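As an illustration of what those pipeline steps might translate to, here is a rough sketch; the environment names, namespace ID, percentages, and exact KV subcommand syntax (which differs between wrangler versions) are assumptions, not the actual pipeline:
#############################################################
# Illustrative delivery steps only (names are placeholders) #
#############################################################
# Side branch: tests pass, then deploy the canary copy
wrangler deploy --env canary

# Approved step: expose 10% of visits to the canary
wrangler kv:key put "percent-open" "10" --namespace-id "<kv-namespace-id>"

# Rollback: close the rollout immediately if something looks wrong
wrangler kv:key put "percent-open" "0" --namespace-id "<kv-namespace-id>"

# Main branch: after merge, deploy "Current" automatically
wrangler deploy --env current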
Bonus points: Testing
Local development and integration testing introduced another challenge: the service bindings pointed to the real deployed services instead of the local code. For integration tests of a Cloudflare worker, I create a request, pipe it through the worker, and then assess the response and any side effects, mocking the native “fetch” to mimic the expected origin’s response. I need the entire orchestration to work together locally.
I need a way to pipe the request through the application code locally. This requires a specialized entry point that establishes an Env object covering both applications’ environment variable needs, then wraps the application’s “fetch” function so it receives the request context and environment. Local development uses this entry point instead of the “Switch” directly.
/*************************************
 * Entry point for local development *
 *************************************/
import switchHandler from "../switch";
import mainHandler from "../application";

/**
 * Env includes env vars for both the switch and the application
 */
interface Env {
  CURRENT: ExportedHandler; // for "switch" (replaces Fetcher)
  CANARY: ExportedHandler; // for "switch" (replaces Fetcher)
  ROLLOUT: KVNamespace; // for "switch"
  GATEWAY_TRAFFIC_ANALYTICS: AnalyticsEngine; // for "application"
}

const handler: ExportedHandler<Env> = {
  async fetch(
    request: Request,
    env: Env,
    ctx: ExecutionContext
  ): Promise<Response> {
    const { fetch: mainHandlerFetch } = mainHandler;
    // Override the Fetcher function to include env and ctx
    mainHandler.fetch = function (request) {
      return mainHandlerFetch(toHttps(request), env, ctx);
    };
    // Set local app as CANARY and CURRENT
    env.CANARY = mainHandler;
    env.CURRENT = mainHandler;
    return switchHandler.fetch(toHttps(request), env, ctx);
  },
};

/**
 * Convert a request to HTTPS in order to go to the actual origin
 */
function toHttps(request: Request): Request {
  const url = new URL(request.url);
  url.protocol = "https:";
  return new Request(url.toString(), request);
}

export default handler;
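To show how the pieces fit together, here is a rough sketch of an integration test against this entry point; the import path, the binding stubs, the mocked origin response, and the assertions are all illustrative assumptions rather than the project’s actual tests:
/**************************************
 * Illustrative integration test only *
 **************************************/
import localHandler from "../local"; // hypothetical path to the entry point above

// Minimal stand-ins for the bindings used by the switch and the application
const env = {
  ROLLOUT: { get: async () => "100" }, // rollout fully open, so the request hits the local app
  GATEWAY_TRAFFIC_ANALYTICS: { writeDataPoint() {} },
} as any;

const ctx = {
  passThroughOnException() {},
  waitUntil() {},
} as unknown as ExecutionContext;

// Mock the native fetch to mimic the expected origin's response
globalThis.fetch = async () => new Response("origin body", { status: 200 });

// Pipe a request through the switch and the application code
const response = await localHandler.fetch!(
  new Request("http://localhost/some/path"),
  env,
  ctx
);

// Assess the response and side effects (assertions depend on your application)
console.assert(response.headers.get("set-cookie")?.includes("edge_rollout"));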
Conclusion
The phased approach appears advantageous when releasing new code to our application’s gateway.
Despite certain challenges, the right strategy and practices help address issues before they become problems and keep deployments seamless, stable, and scalable.
Update: Performance improvement
In response to a conversation with @okh397, I switched from the KV store, whose reads added latency to every request, to secrets, which can still be changed without redeploying the worker.
echo "10" | wrangler secret put ROLLOUT --env canary
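With that change, the Switch can read the percentage directly from its environment. A minimal sketch, assuming the secret is surfaced to the worker as env.ROLLOUT in place of the KV binding:
/*************************************************
 * Sketch: rollout percentage read from a secret *
 *************************************************/
interface Env {
  CURRENT: Fetcher;
  CANARY: Fetcher;
  ROLLOUT: string; // secret value such as "10"; replaces the KVNamespace binding
}

// Inside the switch's fetch handler, the awaited KV read becomes a plain property access
const readRolloutPercentage = (env: Env): number => Number(env.ROLLOUT) || 0;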