Groupcache instrumented by OpenCensus

Groupcache · OpenCensus · Go

Distributed caching is a pervasive technique for scaling web services. Distributed caches help alleviate latency by allowing reuse of expensively produced data, because generating new data is usually more expensive than looking up data that has already been generated. Throughout the history of the internet, caching technologies like Memcached, Redis and Varnish have thrived and enabled web services to scale, since data like images and static pages can be distributed to caches geographically closer to users, and most of that data rarely changes. Each caching technology comes with its own merits and demerits, but most systems containing distributed caches suffer from a couple of interesting common problems, of which I’ll talk about two:

  • Hot keys: some items can become extremely popular, for example after a Super Bowl touchdown, 15 million requests hit your search engine asking for “Tom Brady”. Without proper sharding, if that key is sharded to a single machine/cluster and all 15 million requests concurrently ask that one entity for the result, you’ll most likely see massive and crippling CPU and RAM usage spikes, hence the name “hot keys”
  • Thundering herd problem: requests usually check the edge caches first and, if the requested data is not available, proceed to the origin/generating servers for a fresh response, which is then cached for later lookups. If numerous requests simultaneously ask a cache for an item that isn’t available, a naive cache will declare a miss and might let each of those requests proceed to the comparatively underpowered origin/generating service. This inadvertently defeats the purpose of replicated caching, which was to avoid massive load on the origin server. In turn, that massive load could take the origin server down, and while it tries to recover, or on retries (perhaps because the previous write-through to the cache never completed), the caches will still report misses and the cycle of overloading the origin server repeats, hence the name “thundering herd”. Picture a stampede of wildebeest running away from lions at Maasai Mara, but imagine the wildebeest being able to run infinitely around a safe space, and that safe space represents your web service
Depiction of a “thundering herd” all rushing towards your backend, having been let through as a consequence of an innocent simultaneous cache miss! These beasts will keep coming at your system, perhaps due to retries, and the cycle of problems just repeats, with your system unable to recover #DOS #DDOS

The “thundering herd” problem can be curtailed by ensuring that only one request makes it to the origin server while the other requests wait for the cache to fill https://en.wikipedia.org/wiki/Thundering_herd_problem
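In Go, that coalescing is often done with request deduplication, for example via the golang.org/x/sync/singleflight package (Groupcache bundles its own internal equivalent). Here is a minimal sketch; the key and the fetchFromOrigin helper are made up:

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/sync/singleflight"
)

// fetchFromOrigin stands in for an expensive origin lookup (illustrative only).
func fetchFromOrigin(key string) (string, error) {
	fmt.Println("origin hit for", key)
	return "value-for-" + key, nil
}

func main() {
	var group singleflight.Group
	var wg sync.WaitGroup

	// 100 concurrent "cache misses" for the same key: Do coalesces them so
	// that at most one fetchFromOrigin call is in flight at any moment, and
	// the waiting callers share its result.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			group.Do("tom-brady", func() (interface{}, error) {
				v, err := fetchFromOrigin("tom-brady")
				return v, err
			})
		}()
	}
	wg.Wait()
}
```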

For a tangible sense of the severity of the “thundering herd” problem, please take a look at how Facebook encountered and mitigated it while rolling out Facebook Live globally https://code.fb.com/ios/under-the-hood-broadcasting-live-video-to-millions/


Groupcache is a simple, modern-day distributed cache written in Go, also from the brilliant mind of Brad Fitzpatrick https://en.wikipedia.org/wiki/Brad_Fitzpatrick, the same person who created Memcached. Groupcache is entirely open source at https://github.com/golang/groupcache/

Groupcache is both a server and a client, allowing you to deploy a cache within or beside your process, while also allowing easy replication of your service. Its API only allows “Gets” and nothing else. If deployed with peers, Groupcache coordinates cache filling and deduplicates misses, consulting its peers to determine which machine owns a given group key. Its design inherently curtails the problems that I mentioned, namely the “thundering herd” and “hot keys” problems. It addresses the “thundering herd” problem by ensuring that only one cache miss evokes cache filling, and then it multiplexes the data amongst its connected peers so that subsequent queries return hits without having to repeat the process. It addresses the “hot keys” problem with consistent hashing, which distributes the load for a single key across multiple machines instead of always directing it to just one machine. With consistent hashing, more than one machine can be responsible for handling the load of a key, and if any machine within the hash ring goes down, it can be taken offline, or even more machines added online, without noticeable disturbance.
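To make the “Gets only” API concrete, here is a minimal sketch of defining a group and reading through it with upstream Groupcache as it stood at the time of writing; the group name, cache size and the slowlyGenerate helper are purely illustrative:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/golang/groupcache"
)

// slowlyGenerate stands in for an expensive origin computation (illustrative only).
func slowlyGenerate(key string) ([]byte, error) {
	return []byte(strings.ToUpper(key)), nil
}

// A group couples a name, a byte budget and a getter that only runs on misses.
var titles = groupcache.NewGroup("titles", 64<<20, groupcache.GetterFunc(
	func(ctx groupcache.Context, key string, dest groupcache.Sink) error {
		v, err := slowlyGenerate(key)
		if err != nil {
			return err
		}
		return dest.SetBytes(v)
	}))

func main() {
	var data []byte
	// The public API is effectively just this Get: a hit is served from the
	// local or a peer cache; a miss invokes the getter once and fills the cache.
	// (At the time of writing, upstream's ctx parameter is a blank interface, so nil is fine.)
	if err := titles.Get(nil, "tom brady", groupcache.AllocatingByteSliceSink(&data)); err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", data)
}
```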

Groupcache is currently used in production within:

and many more unlisted users…

OpenCensus is a vendor-agnostic single distribution of libraries for observability, providing distributed tracing and monitoring. It allows you to inspect the state of your distributed systems, both monoliths and microservices, at scale. OpenCensus is implemented in a variety of languages like Go, Java, Python, C++, Node.js, C#, Ruby and PHP. It exports traces and metrics to a plethora of backends like Datadog, AWS X-Ray, Google Stackdriver Tracing and Monitoring, Zipkin, Jaeger, Instana, Prometheus etc.


I am very proud to announce that Groupcache has been instrumented with OpenCensus to provide observability with distributed tracing and monitoring. The source code of the fork is at https://github.com/orijtech/groupcache

In this instrumentation, there are two changes to Groupcache’s API:

a) groupcache.Context uses are replaced with context.Context. When Groupcache was initially written in 2012, in the early days of Go, the now very useful interface context.Context was not yet a first-class citizen, nor had it been thought about concretely, hence signatures of groupcache functions took a custom-typed groupcache.Context, which was just a blank interface{}.

As per https://godoc.org/github.com/golang/groupcache#Context
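That definition was literally just a blank interface:

```go
// The original groupcache Context: an opaque, blank interface that callers
// thread through calls, carrying no behaviour of its own.
type Context interface{}
```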

For easy compatibility with the original code, I could have type asserted within each function to ensure that the Context being passed in was a context.Context https://golang.org/pkg/context/. However, I figured that it was due time to make a concrete change to context.Context, and it also spells out the modern-day intention of using Groupcache in distributed system deployments, in which context propagation is king. Therefore, you’ll now see:

Now using context.Context instead of the legacy custom type Context
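A minimal sketch of what that looks like from a caller’s perspective, mirroring the earlier upstream example but using the fork’s import path; the group name, size and data are purely illustrative:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/orijtech/groupcache"
)

// With the fork, getters receive the standard library's context instead of
// the old blank-interface groupcache.Context.
var places = groupcache.NewGroup("places", 16<<20, groupcache.GetterFunc(
	func(ctx context.Context, key string, dest groupcache.Sink) error {
		return dest.SetBytes([]byte("coordinates-for-" + key))
	}))

func main() {
	// Deadlines, cancellation and trace spans now flow through cache lookups.
	ctx, cancel := context.WithTimeout(context.Background(), 300*time.Millisecond)
	defer cancel()

	var data []byte
	if err := places.Get(ctx, "York", groupcache.AllocatingByteSliceSink(&data)); err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", data)
}
```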

Also, we now collect a number of metrics, such as:

  • Roundtrip latency
  • The number of Loads
  • The number of Gets
  • The number of cache hits
  • The number of cache misses
  • The number of peer loads
  • The number of peer errors
  • Key and value lengths, to help survey load and find common sizes to optimize for
  • The number of server requests

as per https://github.com/orijtech/groupcache/blob/70a28e00f12a64a6cdd63fbb77d6d0c03782c84d/observability.go#L40-L56
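Under the hood, these are recorded as OpenCensus measures and aggregated by views. A sketch of what registering such metrics typically looks like follows; the metric names and latency buckets here are illustrative rather than the exact ones in observability.go:

```go
package main

import (
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// Illustrative measures, loosely mirroring what the instrumented fork records.
var (
	mGets             = stats.Int64("groupcache/gets", "Number of Gets", stats.UnitDimensionless)
	mCacheHits        = stats.Int64("groupcache/cache_hits", "Number of cache hits", stats.UnitDimensionless)
	mRoundtripLatency = stats.Float64("groupcache/roundtrip_latency", "Roundtrip latency", stats.UnitMilliseconds)
)

func main() {
	// Views aggregate the raw measurements so that exporters (Prometheus,
	// Stackdriver, ...) can scrape or receive them.
	views := []*view.View{
		{Name: "groupcache/gets", Measure: mGets, Description: "Number of Gets", Aggregation: view.Count()},
		{Name: "groupcache/cache_hits", Measure: mCacheHits, Description: "Number of cache hits", Aggregation: view.Count()},
		{Name: "groupcache/roundtrip_latency", Measure: mRoundtripLatency, Description: "Roundtrip latency distribution",
			Aggregation: view.Distribution(0, 1, 5, 10, 50, 100, 500, 1000)},
	}
	if err := view.Register(views...); err != nil {
		log.Fatalf("Failed to register views: %v", err)
	}
	// Inside the library, a measurement is then a one-liner, e.g.:
	//   stats.Record(ctx, mGets.M(1))
}
```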


Demo time!

To demo the instrumentation, I have taken an excerpt out of one of my microservices, “mapbox-search”. It is part of an on-demand transport-leasing service that leverages transportation networks for clients at the push of a button. Registered clients can perform deliveries, request physical orders and merchandise etc, but also order transportation from point A to point B without having to have a credit card; this is useful for ride concierge services for hotel guests, interviewees, staff visiting new countries and cities etc. It uses Uber as the default transporter, but can also use Lyft and many others for which API clients exist. It is mostly still a research project that I am using to study the effect of transport networks, but also the effect of surges on supply, demand and scale. However, a small part of the service is used in production.

Anyway, the geolocation service looks up locations using Mapbox’s API https://www.mapbox.com/api-documentation/. Given a place’s name or searchable text, it returns coordinates; given coordinates, it returns symbolic landmarks or places.

Architecture diagram

It is important to keep the system monitored so that we know when our SLAs degrade and when errors occur, but above all, so that we can detect anomalies. mapbox-search uses Groupcache for caching, and since Groupcache allows its peers to talk to each other, we can horizontally scale the service: regardless of traffic spikes and however many simultaneous cache misses occur, only a single query ever makes it to Mapbox’s API. The caching is needed because querying Mapbox’s API is costly in both latency (network and throttling costs) and monetary dimensions. The aforementioned “thundering herd” problem doesn’t arise, thanks to Groupcache’s inherent design, which I believe comes from Brad’s years of experience managing web services at planet scale.
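Wiring up that peer awareness takes only a few lines. Here is a minimal sketch, assuming three replicas on localhost and that the fork keeps upstream’s HTTPPool API; the addresses and port are made up:

```go
package main

import (
	"log"
	"net/http"

	"github.com/orijtech/groupcache"
)

func main() {
	self := "http://localhost:9444"

	// Each replica registers itself and its peers; consistent hashing over
	// these addresses decides which replica owns which key.
	pool := groupcache.NewHTTPPoolOpts(self, nil)
	pool.Set(self, "http://localhost:9445", "http://localhost:9446")

	mux := http.NewServeMux()
	// Peer-to-peer cache traffic is served by the pool under /_groupcache/.
	mux.Handle("/_groupcache/", pool)

	log.Fatal(http.ListenAndServe(":9444", mux))
}
```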

Service server

The server allows clients to search through two different routes:

  • /names
  • /latlon

If a search term is requested from the server and it hasn’t yet been cached, a request will be made to Mapbox’s API, and per route we process those entries separately. Since locations on Earth almost never change latitude and longitude, it makes sense to cache the responses. For caching, Groupcache allows you to define how to process cache misses with a processing step in a simple API using “groups”, which take a name/key and return a value.

Here is the code for both the “names” and “latlon” groups
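A condensed sketch of the “names” group, assuming a hypothetical mapboxSearchByName helper and a MAPBOX_TOKEN environment variable; the “latlon” group mirrors it with reverse geocoding, and this file pairs with the server sketch a little further down:

```go
// groups.go (sketch): the cache group and its Mapbox-backed getter.
package main

import (
	"context"
	"fmt"
	"io/ioutil"
	"net/http"
	"net/url"
	"os"

	"github.com/orijtech/groupcache"
	"go.opencensus.io/plugin/ochttp"
)

// httpClient is instrumented with ochttp so outgoing Mapbox calls are traced.
var httpClient = &http.Client{Transport: &ochttp.Transport{}}

// mapboxSearchByName is a hypothetical helper that forward-geocodes a place
// name against Mapbox's Geocoding API and returns the raw JSON response.
func mapboxSearchByName(ctx context.Context, query string) ([]byte, error) {
	u := fmt.Sprintf("https://api.mapbox.com/geocoding/v5/mapbox.places/%s.json?access_token=%s",
		url.PathEscape(query), os.Getenv("MAPBOX_TOKEN"))
	req, err := http.NewRequest("GET", u, nil)
	if err != nil {
		return nil, err
	}
	res, err := httpClient.Do(req.WithContext(ctx))
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()
	return ioutil.ReadAll(res.Body)
}

// namesGroup caches geocoding results keyed by the searched text. The getter
// runs only on a cache miss; afterwards peers share the filled value. The
// "latlon" group mirrors this with reverse geocoding.
var namesGroup = groupcache.NewGroup("names", 32<<20, groupcache.GetterFunc(
	func(ctx context.Context, key string, dest groupcache.Sink) error {
		blob, err := mapboxSearchByName(ctx, key)
		if err != nil {
			return err
		}
		return dest.SetBytes(blob)
	}))
```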

These are basically the same calls that we would make without the caching; notice too that here those calls are also instrumented with OpenCensus. The entirety of the isolated search service server is:

Mapbox search service server code
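A condensed sketch of the server’s shape, in the same package as the groups sketch above (so namesGroup refers to that declaration); the /latlon handler is analogous and the port is made up:

```go
// main.go (sketch): HTTP routes, in the same package as groups.go above.
package main

import (
	"log"
	"net/http"

	"github.com/orijtech/groupcache"
	"go.opencensus.io/plugin/ochttp"
)

// searchByName serves /names: the lookup goes through the cache, so only a
// genuine miss ends up calling Mapbox.
func searchByName(w http.ResponseWriter, r *http.Request) {
	query := r.URL.Query().Get("q")
	var data []byte
	if err := namesGroup.Get(r.Context(), query, groupcache.AllocatingByteSliceSink(&data)); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.Write(data)
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/names", searchByName)
	// The /latlon route mirrors /names, backed by the "latlon" group.

	// Wrapping the mux in ochttp.Handler traces and measures every inbound request.
	log.Fatal(http.ListenAndServe(":9444", &ochttp.Handler{Handler: mux}))
}
```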

Client

The demo client is a command-line interface that, when fed a query per line, makes an HTTP call to the mapbox-search service and prints out the results.

search client
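A minimal sketch of such a client, assuming the service listens on localhost:9444 and exposes /names?q=<query>; the outgoing calls are traced via ochttp:

```go
package main

import (
	"bufio"
	"fmt"
	"io/ioutil"
	"net/http"
	"net/url"
	"os"

	"go.opencensus.io/plugin/ochttp"
)

func main() {
	// Outgoing requests are traced and measured by wrapping the transport.
	client := &http.Client{Transport: &ochttp.Transport{}}

	scanner := bufio.NewScanner(os.Stdin)
	fmt.Print("> ")
	for scanner.Scan() {
		query := scanner.Text()
		res, err := client.Get("http://localhost:9444/names?q=" + url.QueryEscape(query))
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		body, _ := ioutil.ReadAll(res.Body)
		res.Body.Close()
		fmt.Printf("%s\n> ", body)
	}
}
```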

And for the actual search demo

sample search

Demo results

Tracing

On cache miss searching for “York” as seen on Stackdriver Tracing
On cache hit after searching for “York” as seen on Stackdriver Tracing

Metrics

groupcache roundtrip_latency
groupcache Loads

As you can see from above, we now have observability into our system using OpenCensus, for a service that uses Groupcache as its caching layer.
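For completeness, shipping these traces and metrics to Stackdriver just requires registering an exporter. A sketch, assuming the contrib Stackdriver exporter and a made-up GCP project ID:

```go
package main

import (
	"log"

	"contrib.go.opencensus.io/exporter/stackdriver"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/trace"
)

func main() {
	// The exporter implements both trace.Exporter and view.Exporter, so one
	// registration covers distributed traces and metrics alike.
	sd, err := stackdriver.NewExporter(stackdriver.Options{ProjectID: "census-demos"})
	if err != nil {
		log.Fatalf("Failed to create the Stackdriver exporter: %v", err)
	}
	defer sd.Flush()

	trace.RegisterExporter(sd)
	view.RegisterExporter(sd)

	// Sample aggressively for the demo; production would sample probabilistically.
	trace.ApplyConfig(trace.Config{DefaultSampler: trace.AlwaysSample()})
}
```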


OpenCensus!

OpenCensus comprises a vibrant community of very welcoming individuals with diverse skills and backgrounds, hailing from various companies. The project is developed entirely in the open at https://github.com/census-instrumentation, and by being vendor-agnostic it enables anyone to trivially swap out backends without being tied to one provider. Users, vendors, providers, well-wishers and spectators are all welcome! We are making even more integrations with frameworks and projects, with a whole lot more to be announced soon. Please feel free to check out the project and share it with your friends and workmates https://opencensus.io/, follow it on social media https://twitter.com/opencensusio and check out the Gitter lobby https://gitter.im/census-instrumentation/Lobby

Thank you for your time and attention, in reading this far!

Kind regards,

Emmanuel T Odeke