Redis clients instrumented by OpenCensus in Java and Go

orijtech
Orijtech Developers
Jun 14, 2018
In this post we’ll examine Redis clients instrumented with OpenCensus in Java and Go, and apply them directly in a media search service, using Redis to alleviate cost, throttling, load and bandwidth for a product.

TL;DR: We are excited to announce that we’ve instrumented some popular Redis clients in Java and Go for both tracing and metrics. The instrumented Redis clients are:

Jedis in Java: https://github.com/orijtech/jedis/pull/1

Gomodule/redigo in Go: https://github.com/orijtech/redigo/pull/1

Go-redis/redis in Go: https://github.com/orijtech/redis/pull/1

You can see more of the integrations that we are rolling out at https://opencensus.orijtech.com. OpenCensus is a vibrant project that warmly welcomes you; please take a look at https://opencensus.io

The rest of the blog post introduces OpenCensus, discusses why observability matters for distributed microservices, and then uses the instrumented Redis clients to alleviate cost, load and latency in a live search app that searches for content on iTunes.

Redis is one of the world’s most popular server caching and scaling technologies. It is an in-memory data structure store with optional persistence; the server is single-threaded, yet it scales horizontally through clustering, which automatically shards your data. Redis was created by Salvatore Sanfilippo http://invece.org/ in 2009. It is an open source project that has powered a variety of companies and websites over the years. Among well-known technology and product companies, as per https://redis.io/topics/whos-using-redis, https://stackshare.io/redis/in-stacks and other sources, Redis users include: Twitter, Snapchat, Walmart, GitHub, Craigslist, Stack Overflow, Lyft, Pinterest, Instagram, Patreon, Airbnb, Uber, Heroku and many others.

Hosting Redis as a service is a revenue source for many providers, such as Redis Labs https://redislabs.com/, AWS https://aws.amazon.com/redis/, Google https://cloud.google.com/memorystore/, Compose https://www.compose.com/databases/redis, Microsoft https://docs.microsoft.com/en-us/azure/redis-cache/ and many others.

Redis helps distributed applications share data (whether persistent or ephemeral) by offering optional durability, and better still, it provides a variety of data structures like sorted sets, hashes, lists and HyperLogLogs, plus efficient algorithms for access and manipulation such as geospatial indexing and querying. Its parity with the data structures programmers already know makes it easy to articulate ideas and reason about algorithmic complexity, just as they would with data structures in their own code, without reinventing the wheel; this matters most in highly distributed applications, where data races are a nightmare to deal with.

Redis also offers essential services and abstractions like pubsub (publish-subscribe) https://redis.io/topics/pubsub that allow butterfly effect-like propagation of events and data manipulation. For example, in this era of social media (assuming, for simplicity, this is what Twitter does): all of the Dalai Lama’s 18.9 million Twitter followers could get a push notification sent to their devices and feeds whenever he tweets. Redis could accomplish this by “publishing” a message on the #dalai-lama-channel to our listening software agents, with pubsub transmitting it to any “subscribers”. In traditional distributed-system infrastructure, engineers would have to implement such tooling within each company or language stack, deal with very complex models, and figure out how to persist such data in a highly concurrent environment. With the pubsub abstraction, we can model the problem and prescribe the solution in a highly concurrent and distributed environment (obviously there is a lot more behind the scenes than the simplistic abstraction I claimed could be used at Twitter).
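As a hedged sketch of that pub/sub pattern with the gomodule/redigo client (the channel name and message are purely illustrative, not anything Twitter actually runs):

    package main

    import (
    	"fmt"
    	"log"

    	"github.com/gomodule/redigo/redis"
    )

    func main() {
    	// One connection subscribes to the channel of interest...
    	sub, err := redis.Dial("tcp", "localhost:6379")
    	if err != nil {
    		log.Fatal(err)
    	}
    	psc := redis.PubSubConn{Conn: sub}
    	if err := psc.Subscribe("dalai-lama-channel"); err != nil {
    		log.Fatal(err)
    	}

    	// ...while another publishes an event; in a real system the
    	// publisher and the subscribers are separate processes.
    	pub, err := redis.Dial("tcp", "localhost:6379")
    	if err != nil {
    		log.Fatal(err)
    	}
    	if _, err := pub.Do("PUBLISH", "dalai-lama-channel", "new tweet posted"); err != nil {
    		log.Fatal(err)
    	}

    	for {
    		switch v := psc.Receive().(type) {
    		case redis.Message:
    			fmt.Printf("push notification on %s: %s\n", v.Channel, v.Data)
    		case redis.Subscription:
    			// subscribe/unsubscribe acknowledgements arrive here
    		case error:
    			log.Fatal(v)
    		}
    	}
    }

Note that pub/sub is fire-and-forget: subscribers only see messages published while they are connected, which is why durable use cases pair it with a persistent structure.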

OpenCensus https://opencensus.io is a vendor-agnostic, single distribution of libraries for distributed tracing and monitoring of microservices and monoliths alike. OpenCensus is the entirely open-source rewrite of “Census”, the observability system that has powered Google’s infrastructure for the past 10 years. It is implemented in a variety of languages like Go, Java, Python, Erlang, Node.js, Scala, Ruby, PHP and C# (coming soon). OpenCensus allows you to collect metrics and traces once, easily, and then export them to a variety of backends like Prometheus, AWS X-Ray, Stackdriver Tracing and Monitoring, Zipkin, Jaeger, Instana, DataDog etc.

In this demo, we’ll export:

Metrics to Prometheus and Stackdriver Monitoring
Traces to AWS X-Ray and Stackdriver Tracing
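For reference, registering those exporters in OpenCensus-Go takes only a few lines each. A minimal sketch follows; import paths and option names vary by OpenCensus version, the project ID is a placeholder, and an AWS X-Ray trace exporter is registered the same way via trace.RegisterExporter:

    package main

    import (
    	"log"
    	"net/http"

    	"contrib.go.opencensus.io/exporter/prometheus"
    	"contrib.go.opencensus.io/exporter/stackdriver"
    	"go.opencensus.io/stats/view"
    	"go.opencensus.io/trace"
    )

    func main() {
    	// Prometheus: expose collected metrics on a scrape endpoint.
    	pe, err := prometheus.NewExporter(prometheus.Options{Namespace: "mediasearch"})
    	if err != nil {
    		log.Fatalf("Prometheus exporter: %v", err)
    	}
    	view.RegisterExporter(pe)
    	go func() { log.Fatal(http.ListenAndServe(":9888", pe)) }() // pe is an http.Handler

    	// Stackdriver: ship both metrics and traces to Google Cloud.
    	sd, err := stackdriver.NewExporter(stackdriver.Options{ProjectID: "your-gcp-project"})
    	if err != nil {
    		log.Fatalf("Stackdriver exporter: %v", err)
    	}
    	view.RegisterExporter(sd)
    	trace.RegisterExporter(sd)

    	// Sample every trace for the demo; dial this down in production.
    	trace.ApplyConfig(trace.Config{DefaultSampler: trace.AlwaysSample()})
    }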

The purpose of observability in distributed applications is to aid root cause analysis by monitoring your systems with minimal overhead or invasiveness, across complex systems for which traditional logging and monitoring solutions would crumble.

To illustrate the point above, picture just 4 microservices in your stack; they can be counted on one hand, trivially. If service 1 talks to service 2, and service 2 talks to service 1, we have 2 simultaneous interactions, which are trivial to count and reason about. If 3 services talk to each other, we can count:

s1→s2, s2→s1, s1→s3, s3→s1, s2→s3, s3→s2

and those are 6 simultaneous interactions. Still kind of trivial to reason about, but thinking about 6 interactions at the same time is a little tedious.

Now for the formula to calculate the number of bidirectional interactions: if we have n services in your stack and each one talks to the others, we have n * (n-1) interactions that could occur concurrently. If n is just 4, that’s 12 interactions. If n is 5, that’s 20 simultaneous interactions; if n is 6, that’s 30, and at this point good luck trying to reason about the interactions in your head without drawing a detailed connection graph. This is already a heavy cognitive load for anyone. Make n 7, 8, 9, 10… and you’ll see that this gets out of hand very quickly.

Try to reason about the interactions between those 6 services being invoked many times per second: such an exercise becomes a cognitive load for any human.

In many distributed applications, n can easily be larger than 10; see, for example, https://eng.uber.com/building-tincup/

If a service goes down and no one notices, the availability of the system worsens, and that makes for bad user experience, lost revenue, slow response times and catastrophes waiting to happen. The network might be clogged, your system’s RAM might be heavily consumed, it might run out of file descriptors, or response-time promises might degrade beyond an acceptable threshold. Within a decoupled architecture in which your applications run on different time horizons and in different environments, it becomes hard to unify your logging and monitoring. Observability, that is, tracing and monitoring, helps alleviate this problem.

Traditional monitoring and tracing have usually been vendor-centric. What I mean by this is that when a service is being built, the team, liking one Application Performance Management (APM) backend, might write observability code targeted at that single backend. This might work really well for one library and one backend. However, once you import another library that exports to a different backend, or if you’d like to change backends, that becomes an invasive cost: you have to re-edit your code, test it and deploy it, or worse, patch up the previous code to also accept the new backend. What happens when your company acquires another company that uses an entirely different vendor? Now you run the risk of maintaining code that exports to more than one backend; moreover, having been written in-house, it needs the local expert who has all the context for the bugs it harbors. Even worse, with every re-implementation of a library, the author could skim over some detail and let nasty bugs creep into your services.

With OpenCensus, you can collect metrics and trace your applications once, and switch the backends that consume the data as you please. Exporting to a backend of choice becomes a mere import and setup: no distributed-systems expertise, no need to burden your teams with maintenance, no need to hire the local expert. You can focus on building your product and team instead of maintaining tertiary code.

To examine the usage of Redis and the OpenCensus instrumentation, we’ll take the backbone of one of my apps: a media search app with a gRPC-powered backend and Java and Go frontends. The gRPC-powered backend searches for content on iTunes. One use of this app is giving relevant results for media searches, directing traffic from your search engine to those of your partners or affiliates.

In the beginning our architecture diagram looks like this:

System architecture and the bloodline of the product

Implementation:

Service/Server:

To begin, our gRPC backend/search service needs a protobuf definition

src/main/proto/defs.proto
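The embedded gist is elided here, but sketched out, defs.proto might look like this (the message, field and service names below are assumptions; the authoritative file lives in the repository):

    syntax = "proto3";

    package search;

    option java_package = "io.mediasearch.search";

    // The query to run against iTunes, e.g. "On a plain".
    message Request {
      string query = 1;
      string country = 2;
    }

    message Result {
      string artist_name = 1;
      string track_name = 2;
      string preview_url = 3;
    }

    message Response {
      repeated Result results = 1;
    }

    // The media search service, backed by the iTunes Search API.
    service Search {
      rpc ITunesSearch(Request) returns (Response);
    }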

and to implement the search service, this is what the Go implementation looks like

rpc/search.go implementing the proto service for searching for media from iTunes
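Sketched roughly, since the gist is elided (Request and Response are the generated types assumed from the proto sketch above, and decoding the iTunes JSON straight into Response is a simplification):

    package rpc

    import (
    	"context"
    	"encoding/json"
    	"net/http"
    	"net/url"
    )

    type SearchService struct{}

    // ITunesSearch proxies the query to the public iTunes Search API.
    func (s *SearchService) ITunesSearch(ctx context.Context, req *Request) (*Response, error) {
    	u := "https://itunes.apple.com/search?term=" + url.QueryEscape(req.Query) +
    		"&country=" + url.QueryEscape(req.Country)
    	hreq, err := http.NewRequest("GET", u, nil)
    	if err != nil {
    		return nil, err
    	}
    	// Passing ctx along keeps any in-flight trace span attached to this outbound call.
    	res, err := http.DefaultClient.Do(hreq.WithContext(ctx))
    	if err != nil {
    		return nil, err
    	}
    	defer res.Body.Close()

    	resp := new(Response)
    	if err := json.NewDecoder(res.Body).Decode(resp); err != nil {
    		return nil, err
    	}
    	return resp, nil
    }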

and then the Makefile that generates the Go code used by both the server and the client

Makefile
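Roughly, the relevant target looks like this (the plugin flags and output path are assumptions for a typical protoc-gen-go setup of that era):

    protos:
    	protoc -I src/main/proto src/main/proto/defs.proto --go_out=plugins=grpc:rpc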

and then the actual code to run the rpc Server

Code to run the main rpc Server, traced with OpenCensus for distributed tracing and monitoring
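A minimal sketch of that server (the listen address and the generated registration function are assumptions; exporter registration is the same as sketched earlier):

    package main

    import (
    	"log"
    	"net"

    	"go.opencensus.io/plugin/ocgrpc"
    	"google.golang.org/grpc"

    	"github.com/orijtech/itunes-search/rpc"
    )

    func main() {
    	ln, err := net.Listen("tcp", ":9449")
    	if err != nil {
    		log.Fatalf("listen: %v", err)
    	}
    	// The ocgrpc stats handler records RPC metrics and keeps trace
    	// spans flowing between client and server.
    	srv := grpc.NewServer(grpc.StatsHandler(&ocgrpc.ServerHandler{}))
    	// RegisterSearchServer is from the protoc-generated code; the name is an assumption.
    	rpc.RegisterSearchServer(srv, new(rpc.SearchService))
    	log.Fatal(srv.Serve(ln))
    }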

With the server defined and open for business, we have yet to implement the clients that make it possible for consumers to use the service.

Clients:

Java client:

src/main/java/io/mediasearch/search/MediasearchClient.java

Go client:

Go client
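Sketched minimally (same caveats: the generated client constructor, the request type and the server address are assumptions):

    package main

    import (
    	"context"
    	"fmt"
    	"log"

    	"go.opencensus.io/plugin/ocgrpc"
    	"google.golang.org/grpc"

    	"github.com/orijtech/itunes-search/rpc"
    )

    func main() {
    	// The ocgrpc client handler traces and measures every outbound RPC.
    	conn, err := grpc.Dial("localhost:9449",
    		grpc.WithInsecure(),
    		grpc.WithStatsHandler(&ocgrpc.ClientHandler{}))
    	if err != nil {
    		log.Fatalf("dial: %v", err)
    	}
    	defer conn.Close()

    	client := rpc.NewSearchClient(conn) // generated by protoc; name is an assumption
    	resp, err := client.ITunesSearch(context.Background(),
    		&rpc.Request{Query: "On a plain", Country: "us"})
    	if err != nil {
    		log.Fatalf("search: %v", err)
    	}
    	fmt.Printf("%+v\n", resp)
    }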

and on running them, we get back results like these

mvn exec:java -Dexec.mainClass=io.mediasearch.search.MediasearchClient

Running the Java client

go run client.go --entity=all --country=us

Running the Go client

With that we believe it is a mission well accomplished: we are in great shape, able to generate revenue from both APIs and a consumer tech/media product!! Congratulations are sent around, the entire team gets thanked, and we all go home happy! The sales teams get the go-ahead to start ringing up customers and giving them early access to the service. This is the moment we’ve all been waiting for!!

Once we are all home, a member of the team decides to write the PR blog post announcing the new product to the world, and while at it decides to post screenshots showing how things run at scale, partly as an internal memo to encourage a culture of building products by articulating technological prowess. For that blog post, let’s take a look at AWS X-Ray (for tracing) and Stackdriver Monitoring (for monitoring).

Overall view of traces
926ms as seen on AWS X-Ray to search for “On a plain”
Stackdriver Monitoring, latency examination

Wait, hold up! Most of the queries are coming back in the range of 100ms to 1000ms??? What happens if I send the same request again?

Searching thrice for “On a plain” gives me back results in 926ms, 441ms and 586ms respectively! We really shouldn’t have to perform that search again; how come we don’t have a cache?? Apparently we forgot the memo.

Oops we forgot the memo!

Now we have to incur the cost of latency, throttling, API calls and degraded service for our clients whenever there is load on our servers.

Where do we go from here?

How do we resolve this?

Luckily, some folks on the team have expertise with caching technologies, and on evaluating various options we come to the conclusion that Redis is the right choice for us: it is fast and easy to integrate with, and it has a powerful API with many of the features that we need. It allows expiry of keys to ensure that we don’t retain stale results over time, and it offers data structures that provide easy ranking and native data-structure mappings, e.g. sorted sets, lists and hash maps. Luckily we’ve got those experienced members on the team to help us bust those fears and scale the service!

Busted that fear, we can do this!!

With caching applied

With Redis added to our architecture, it now looks like this

With caching added

where each client gets its own data cache. A client here represents an entire application: for example, the iOS keyboard service, the Facebook Messenger bot service and the newsroom media service each now get their own Redis cache. Ideally we could place a single cache in front of the rpc.searchService, but that would mean that for 3 hours clients would either get back stale data or be forced to consume the cached data in a uniform way. The design choice here is that at any time a client can choose to get fresh data, so it should be able to bypass caching (obviously a monetary cost that it incurs). Also, the ephemerality of data and world events is unpredictable: at any time an artist might release an album that goes viral, or there might be breaking news, and those events radically change the results returned. Hence it is up to the application client’s discretion whether to fetch the latest results or use its cache.
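Concretely, each client’s lookup becomes a cache-aside read with a TTL. Here is a sketch using the instrumented gomodule/redigo client (the key scheme and the generated rpc types are assumptions, and the 3-hour expiry matches the window mentioned above):

    package client

    import (
    	"context"
    	"encoding/json"

    	"github.com/gomodule/redigo/redis"

    	"github.com/orijtech/itunes-search/rpc"
    )

    func searchWithCache(ctx context.Context, pool *redis.Pool, sc rpc.SearchClient, query string) ([]byte, error) {
    	conn := pool.Get()
    	defer conn.Close()

    	// 1. Cache hit: serve the stored response immediately.
    	if blob, err := redis.Bytes(conn.Do("GET", "search:"+query)); err == nil {
    		return blob, nil
    	}

    	// 2. Cache miss: ask the backend, then store the result with a
    	// 3-hour expiry so stale results age out on their own.
    	resp, err := sc.ITunesSearch(ctx, &rpc.Request{Query: query})
    	if err != nil {
    		return nil, err
    	}
    	blob, err := json.Marshal(resp)
    	if err != nil {
    		return nil, err
    	}
    	conn.Do("SETEX", "search:"+query, 3*60*60, blob)
    	return blob, nil
    }

A client that wants fresh results simply skips step 1 and overwrites the key, which is the bypass described above.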

That now makes the code look like this

Java client

src/main/java/io/mediasearch/search/MediasearchClient.java

pom.xml

The Maven pom.xml file

prometheus.yml
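The embedded config is elided; a minimal sketch (the job name and target port are assumptions, matching the Prometheus exporter sketch earlier):

    global:
      scrape_interval: 10s

    scrape_configs:
      - job_name: mediasearch
        static_configs:
          - targets: ['localhost:9888']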

Go server

server.go

Go client

Go client

Makefile

Makefile

Results

On cache miss:

obligatory cache miss with the Java client as seen on Stackdriver Tracing
obligatory cache miss with the Go client as seen on Stackdriver Tracing
obligatory cache miss from the Go client as viewed on AWS X-Ray
service map from AWS X-Ray

On cache hit:

Cache hit as seen on AWS X-Ray
Cache hit as seen on Stackdriver Tracing

And now that we’ve got caching, as you can see above, the response returns in 7.0ms!!! Recall that it was about 926ms: this is a 99.24% reduction in overall latency. A huge win; our infrastructure costs go down, and our customers will be excited to get back speedy results.

Monitoring

Prometheus metrics dropdown
Stackdriver Monitoring metrics dropdown
redis_client dialLatency as seen on Stackdriver
redis_client dialLatency as seen on Stackdriver Monitoring
redis_client roundtripLatency as seen on Prometheus
redis_client roundtripLatency as seen on Stackdriver Monitoring

Server latency

itunessearch_server latency as seen on Prometheus
itunessearch_server latency as seen on Stackdriver Monitoring

The code for this post has been open sourced at https://github.com/orijtech/itunes-search. There is also a Facebook Messenger bot that I built using this app’s backend some years ago, and it still runs to this day: https://www.facebook.com/searchyt/

I hope that with this post you were able to see the value of instrumenting your services for distributed tracing and monitoring to gain insight into your systems, as well as how to use the instrumented Redis clients in Java and Go. We have instrumented, and are still instrumenting, a bunch of clients as well as popular frameworks, to give observability into your systems using OpenCensus. With tracing and monitoring you can even set up automatic alerting for your services with Prometheus https://prometheus.io/docs/alerting/overview/ or Stackdriver Monitoring https://cloud.google.com/monitoring/alerts/, and add any other APM vendor: OpenCensus is vendor-agnostic, so integration with any APM vendor is possible; only an exporter in your language of use is needed.

Some of the integrations work is viewable at https://opencensus.orijtech.com, but most importantly, your presence would be very much welcomed by OpenCensus’ vibrant community. Please feel free to check out https://opencensus.io, the Gitter channel https://gitter.im/census-instrumentation/Lobby and the Twitter account https://twitter.com/opencensusio to keep in touch.

As previously mentioned, OpenCensus is developed entirely in the open and the code is available on GitHub at https://github.com/census-instrumentation

Thank you very much for your time and attention, for reading this far!

Kind regards,

Emmanuel T Odeke

