Testing Performance Of Cloudflare Workers

Gaurav Shukla
Published in FarziEngineer · Apr 13, 2019

For some time I had been following the edge computing space, looking for solutions that would let me move my code, and not just my content, closer to end users at an affordable price. There weren't many offerings: AWS had Lambda and GCP came up with Cloud Functions, but each had its disadvantages, and each deserves an article of its own.

Then came Cloudflare with its own offering, Cloudflare Workers. The name is very appropriate, because they mimic the Service Worker API provided by browsers. I just fell in love with the offering. Finally there was something that was easy to set up, easy to master, and affordable.

From here on I'll refer to Cloudflare Workers as CF Workers or just workers.

The Performance Test Parameters

  • I wanted to test how fast reads from Cloudflare's cache were when accessed from inside workers
  • Was there a cold-start problem?

Reads from cache

Cloudflare claims to have multi-tier caching for workers as well as for their CDN. The various tiers are as follows -

  • Node-level cache
  • Regional data center cache
  • Global cache

In order to test the lookup speed, I created an endpoint on my origin node with cache headers specifying a max-age of 10 years and a response body containing the string “This is performance testing”.
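The post doesn't show the endpoint itself; as a minimal sketch, such an origin handler could look like this in TypeScript using Node's built-in http module (the port is arbitrary):

```ts
// Minimal origin endpoint with a ~10-year max-age so Cloudflare can cache
// the response indefinitely for the purposes of this test. Port is a placeholder.
import { createServer } from 'http';

createServer((_req, res) => {
  res.writeHead(200, {
    'Content-Type': 'text/plain',
    // 10 years ≈ 315360000 seconds
    'Cache-Control': 'public, max-age=315360000',
  });
  res.end('This is performance testing');
}).listen(8080);
```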

On the worker side, I created a script that fetches this endpoint from the origin and returns the time taken to fetch it from cache (processing_time) in the response headers.
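The worker script isn't included in the post either, but a sketch in the Service Worker style that CF Workers use might look like this (the origin URL is a placeholder, and the types assume Cloudflare's workers type definitions):

```ts
// Worker sketch: fetch the cached origin endpoint and report how long the
// lookup took via a processing_time response header.
addEventListener('fetch', (event: FetchEvent) => {
  event.respondWith(handle(event.request));
});

async function handle(_request: Request): Promise<Response> {
  const start = Date.now();
  // Origin URL is a placeholder; in Workers the clock advances across I/O,
  // so this measures the fetch (i.e., the cache lookup) duration.
  const originRes = await fetch('https://origin.example.com/perf-test');
  const elapsed = Date.now() - start;
  // Re-wrap the response so its headers become mutable.
  const res = new Response(originRes.body, originRes);
  res.headers.set('processing_time', String(elapsed));
  return res;
}
```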

I fetched the origin URL from my browser once before running the tests, to make sure it was already present in Cloudflare's cache when I started testing.

Expectations

I expected it to be <5 ms for the node-level cache, <10 ms for the regional cache, and <20 ms for the global cache.

How did I reach these expectations?

I have been running a network of edge nodes very similar in nature, and I achieve <1 ms latency for lookups on the same node.

My current infra has no regional-cache equivalent, so I created a VM in the same data center and ran Redis on it. The latency with this setup was 5–10 ms. For the global cache I simply assumed it should not be more than 20 ms.

Running the tests

I used curl running in an infinite loop to fetch the Cloudflare worker endpoint and recorded the latency in a file. Later I analysed that file to calculate the stats.
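The loop itself isn't in the post; the author used curl, but a rough Node/TypeScript equivalent of the same idea (URL and log file name are made up) could be:

```ts
// Hypothetical stand-in for the curl loop (Node 18+, which has global fetch).
// Records total round-trip time plus the worker-reported cache-lookup time.
import { appendFileSync } from 'fs';

const WORKER_URL = 'https://my-worker.example.workers.dev/perf-test'; // placeholder

async function main(): Promise<void> {
  for (;;) {
    const start = Date.now();
    const res = await fetch(WORKER_URL);
    const total = Date.now() - start;
    // processing_time is the cache-lookup duration reported by the worker.
    const lookup = res.headers.get('processing_time');
    appendFileSync('latency.log', `${total}\t${lookup}\n`);
  }
}

main();
```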

Here are the stats -

[Table: Cloudflare Workers stats for cache lookup (numbers in ms)]

So clearly the numbers are not in favour of workers. A peak of 962 ms? Really? I ran the tests several times and the peak varied a bit, but it was always in the range of 600–900 ms. Hmm, something is not right. Maybe the really high number of requests in a short duration was leading to more worker instances being started (cold start??).

Second attempt

This time I introduced a sleep of 1 s after every 5 requests. The numbers changed for the better, and it seemed my hypothesis was correct. Everything else remained the same, but the peak dropped to 500 ms and occurred only twice.
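In the hypothetical script above, the throttling would look something like this:

```ts
// Throttled variant: pause 1 s after every 5 requests so a burst of traffic
// doesn't spin up additional worker instances. URL is a placeholder.
const WORKER_URL = 'https://my-worker.example.workers.dev/perf-test';

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function main(): Promise<void> {
  for (let i = 1; ; i++) {
    const start = Date.now();
    await fetch(WORKER_URL); // Node 18+ global fetch
    console.log(Date.now() - start);
    if (i % 5 === 0) await sleep(1000);
  }
}

main();
```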

Good enough!

Now I was left wondering whether the peaks were due to my origin response getting evicted from the global or regional cache. I really did want to rule out cold starts, since the Cloudflare team claims to have solved the cold-start problem at the root by using V8 isolates instead of containers and VMs.

I also had to figure out why the response time varied so drastically when I was hitting the worker endpoint from a single region (Singapore). I needed some way to figure out which cache tier I was getting the response from.

Third attempt

I looked into the docs to find out whether my object was being served from cache. Cloudflare provides a nice response header, `cf-cache-status`, which indicates whether the request was a HIT or a MISS.

Unfortunately there was no way of telling which tier the object was found in, so I assumed that if the lookup took 0–5 ms it must have come from the node-level cache, 6–20 ms from the regional cache, and anything more than 20 ms from the global cache.
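To make that concrete, here is a small sketch of how responses could be bucketed using the `cf-cache-status` header plus the latency-based guess above (the thresholds are the author's assumptions, not anything Cloudflare documents):

```ts
// Classify a response by cache status and lookup latency.
// Tier thresholds are assumed, hence the question marks.
function classify(cacheStatus: string | null, lookupMs: number): string {
  if (cacheStatus !== 'HIT') return 'MISS';
  if (lookupMs <= 5) return 'HIT (node-level cache?)'; // 0–5 ms
  if (lookupMs <= 20) return 'HIT (regional cache?)';  // 6–20 ms
  return 'HIT (global cache?)';                        // >20 ms
}

console.log(classify('HIT', 3)); // → 'HIT (node-level cache?)'
```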

Let's find out the cache HIT and MISS ratio.

I ran the script once again, and the results were still the same: I still saw a couple of 500–550 ms peaks, and a 100% HIT rate.

Hmm, what if there was network flakiness, and what if the machine I was testing from was far from Cloudflare's edge node in terms of network hops?

Fourth attempt

I created VMs on AWS, SoftLayer, and GCP, and a droplet on DigitalOcean.

I ran my previous test on each VM, one by one, all in the same Singapore region.

Results were largely the same. The AWS and DigitalOcean machines were one additional network hop away from Cloudflare's edge node, and their peaks showed about 5% more latency.

Fifth time's the charm

I wanted to test in a more globalised manner, so I used Site24x7, an awesome tool by Zoho, and set up web service monitoring on the worker endpoint.

For a baseline metric I created a similar endpoint on my network of edge servers.

For test locations, I chose half in regions where I had my edge nodes and the remaining half in regions where I did not.

I ran the test for 1 month. The results -

  • In the regions where I had my edge nodes, they performed about 57% better than Cloudflare's version, which is understandable to some extent, since in my case the data was always in the node-level cache.
  • In regions where I didn't have servers, Cloudflare's average response time was 50–172 ms, varying from region to region, whereas mine was 300–900 ms.
  • In China, thanks to the Great Firewall, both Cloudflare and my edge nodes performed equally badly. There was one exceptional case where traffic from China was routed to a data center in Australia for some reason; my nodes then took 3 seconds to deliver the response, while Cloudflare showed no such peak.

Conclusion

  • Cloudflare Workers, positioned as a general-purpose solution, is good enough and easy to use.
  • There seems to be no “cold start” problem as such, but there is some latency in the first few requests.
  • Cloudflare has a lot of points of presence, so in general it should reduce latencies.

After this I tested AWS's offering, Lambda@Edge, and also explored Cloudflare Workers for more complex use cases. Stay tuned for more!
