Setup a distributed tracing infrastructure with Zipkin, Kafka and Cassandra
For a master table-of-contents for blog posts on microservice topics, please refer — https://medium.com/oracledevs/bunch-of-microservices-related-blogs-57b5f1f062e5
This blog will demonstrate how you can set up a scalable distributed tracing infrastructure on Oracle Cloud
One of my earlier blogs showcased a simple example of distributed tracing for microservices (based on Spring Boot) using an example revolving around Spring Cloud Sleuth and Zipkin
There are couple of key points which need to be highlighted with regards to the default behavior of the system
- Sending data to Zipkin — Spring Cloud Sleuth sends spans/trace data to Zipkin over HTTP
- Persisting the trace data — Zipkin stores span data in-memory
This is obviously not a bullet-proof solution
- Single point of failure — Since Zipkin is core component of our distributed tracing solution, it also has the potential of becoming a bottleneck both from an availability and performance standpoint
- Ephemeral storage — since Zipkin stores span data in-memory, it will be lost if the Zipkin instance restarts or crashes
Although nothing is perfect, this situation can be improved. Here is an overview of what this blog will demonstrate in terms of the solution
- Zipkin runtime — we will continue to use Oracle Application Container Cloud as the runtime for the Zipkin server
- Cassandra for persistent storage — Oracle Data Hub Cloud is a managed Cassandra offering which we will use as the NoSQL persistent store for Zipkin
- Kafka as the message bus — Oracle Event Hub Cloud is a managed Kafka service which will be leveraged in order to decouple Spring Cloud Sleuth from Zipkin
Time to dive deeper… !
Use case overview
The use case is still the same as the earlier blog i.e. distributed tracing. But this particular blog is focused on the infrastructure side of things, hence the sample itself is very simple — so all we have is a single Spring Boot app (named inventory) to test things out (as opposed to a couple of microservices which were used as an example to illustrate the concepts in the earlier blog post)
Here is the high level flow
- the inventory app sends span data to Kafka..
- … which is (asynchronously) consumed by Zipkin and persisted to Cassandra
The key focus is on
- revamping the Zipin server to communicate to Kafka and Cassandra on Oracle Cloud
- updating the inventory service to talk to Kafka rather than interface directly with Zipkin
Architecture components
Here is a diagram to give you a high level idea of what the architecture is..
Kafka (on Oracle Event Hub Cloud)
As mentioned above, since Kafka sits between your application (which is using Spring Cloud Sleuth) and Zipkin, trace data will be sent to Kafka which Zipkin can consume asynchronously and persist to Cassandra
Although it is another piece of infrastructure to deal with, the obvious benefits are
- It acts as buffer and helps manage the back pressure — think multiple microservices pumping trace data
- It also helps with availability — if the Zipkin cluster is unavailable, applications still continue enqueue data into Kafka which can be picked up by Zipkin when its back up
Both Zipkin server as well as the individual applications talk to Kafka
The Zipkin server component…
- … uses
@EnableZipkinStreamServer
(thanks to thespring-cloud-sleuth-zipkin-stream
dependency) annotation to signal that it will be receiving span/trace messages from Kafka - the Spring Cloud Stream Kafka binder is pulled in via
spring-cloud-starter-stream-kafka
and this takes care of the Kafka consumer part - the
application.properties
usespring.cloud.stream.kafka.binder.brokers
andspring.cloud.stream.kafka.binder.zkNodes
to specify the co-ordinates of Kafka and Zookeeper respectively
Our Sleuth enabled Spring Boot microservice ..
- .. also pushes trace/span data to Kafka
spring-cloud-sleuth-stream
and (the usual)spring-cloud-starter-stream-kafka
modules - it also uses the same
application.properties
as the Zipkin server to specify Kafka and Zookeeper co-ordinates
Cassandra (on Oracle Data Hub Cloud)
Only the Zipkin server (not the applications) interacts with Cassandra to persist the trace data/messages it receives from Kafka. It uses a Java driver for Cassandra and zipkin-autoconfigure-storage-cassandra3
module
The key highlights are the configuration parameters in application.properties
zipkin.storage.type
— value used is cassandra3zipkin.storage.cassandra3.contact-points
— where is the Cassandra clusterzipkin.storage.cassandra3.username
andzipkin.storage.cassandra3.password
— quite obviouszipkin.storage.cassandra3.keyspace
— which keyspace should Zipkin create objects inzipkin.storage.cassandra3.ensure-schema
— if set to true, Zipkin automatically seeds the schema with the required objects (UDTs, tables etc.).. isn’t that sweet !
More details on the Cassandra objects in the next section
Service Bindings in Oracle Application Container Cloud
Its important to reiterate as to how important Service Bindings are and they how they make things much simpler in terms of secure and hassle-free communication between applications (Zipkin and other microservices) and downstream infrastructure components like Kafka (Eventh Hub) and Cassandra (Data Hub)
All you really need to do is declare a dependency (binding) and you’ll get a internal communication channel set up for free without punching holes in the firewalls/access rules of the respective services
This concept will be highlighted in the upcoming sections and you can always refer to the official documentation for more info
Let’s quickly cover the setup for our infrastructure components
- Cassandra, and
- Kafka
Infrastructure setup
Oracle Data Hub Cloud
Provision Cassandra cluster
Start by bootstrapping a new cluster — detailed documentation here
It is also possible to do this using a CLI
Below snippet shows a basic single node cluster running Cassandra 3.10.0
Check Zipkin related schema objects
As mentioned above, Zipkin related Cassandra schema objects are auto created — these include user defined types and tables
To check these,
- SSH into the Cassandra cluster node — documentation available here
- Fire up cqlsh — execute
sudo su oracle
(relevant information here) and thencqlsh -u admin `hostname`
- get the details of your zipkin specific keyspace (which you configured in application.properties) — e.g.
desc zipkin
.. if don’t remember it, just usedesc keyspaces
and you should see yours there
Here is a screenshot of some of the objects
Oracle Event Hub Cloud (Kafka broker)
The Kafka cluster topology used in this case is relatively simple i.e. a single broker with co-located with Zookeeper). You can opt for a topology specific to your needs e.g. HA deployment with 5-node Kafka cluster and 3 Zookeeper nodes
Please refer to the documentation for further details on topology and the detailed installation process (hint: its straightforward!)
Creating custom access rule
You would need to create a custom Access Rule to open port 2181 on the Kafka Server VM on Oracle Event Hub Cloud — details here
Oracle Application Container Cloud does not need port 6667 (Kafka broker) to be opened since the secure connectivity is taken care of by the service binding
Build, configure & deploy
Start by fetching the project from Github — git clone https://github.com/abhirockzz/accs-zipkin-tracing-infra-with-kafka-cassandra.git
Build
Zipkin server
cd zipkin
mvn clean install
The build process will create zipkin-dist.zip
in the target
directory
Inventory service
cd inventory
mvn clean install
The build process will create inventory-dist.zip
in the target
directory
Configuration
Before you launch these into the cloud, please tweak the configuration parameters as per your environment
deployment.json
and manifest.json
for Zipkin server
deployment.json
and manifest.json
for inventory service
Now that you have configured your apps, it’s time to deploy them
Deployment a.k.a push to cloud
With Oracle Application Container Cloud, you have multiple options in terms of deploying your applications. This blog will leverage PSM CLI which is a powerful command line interface for managing Oracle Cloud services
other deployment options include REST API, Oracle Developer Cloud and of course the console/UI
You can download and setup PSM CLI on your machine (using psm setup
) — details here
Here are the CLI commands to deploy the apps
Zipkin — psm accs push -n zipkin -r java -s hourly -m manifest.json -d deployment.json -p target/zipkin-dist.zip
Inventory — psm accs push -n inventory -r java -s hourly -m manifest.json -d deployment.json -p target/inventory-dist.zip
Sanity checks
- check service bindings, and,
- access the Zipkin server to confirm its functional
After the Zipkin server has been successfully deployed, you can check Service Binding its details by navigating to the details screen -> Deployments section — you will see both Data Hub (Cassandra) and Event Hub (Kafka) bindings along with their respective environment variables (cropped image)
Same applies for the inventory service (it binds only to Kafka)
Finally, access the Zipkin server — note the app URL e.g. https://zipkin-<mydomain>.apaas.us2.oraclecloud.com
Test drive…
Alright, everything is setup for us to test things out
Invoke inventory service
To start with, check the URL of the inventory app and invoke it a few times
curl https://inventory-<mydomain>.apaas.us2.oraclecloud.com/inventory/iPhone4
curl https://inventory-<mydomain>.apaas.us2.oraclecloud.com/inventory/iPhone5
curl https://inventory-<mydomain>.apaas.us2.oraclecloud.com/inventory/iPhoneX
Access Zipkin dashboard to see the span data
If you dig down further into the span, you’ll see more details
Above is the span from our invocation of iphoneX
(one of the three invocations listed above). This is relatively simple since you just have single service, but the same would apply if you had a chain of invocations with different (micro) services involved
If you dig in even further, there is more to unearth — focus on the highlighted info
Span data in Cassandra
you can double check Cassandra as well. Using cqlsh
, you execute select * from zipkin.traces;
(assuming zipkin
is the keyspace name)
You can also query other related tables
What about Kafka ?
As you might understood, the span data is sent by Sleuth to a Kafka topic (unsurprisingly named sleuth
) which is then consumed by Zipkin and persisted to Cassandra.. try this out as well
Use the kafka CLI to set up a consumer — kafka-console-consumer.bat --bootstrap-server <event_hub_kakfa_IP>:6667 --topic sleuth
Invoke the inventory app again — curl https://inventory-<mydomain>.apaas.us2.oraclecloud.com/inventory/AppleWatch
You will see a rather cryptic message whose contents are pasted below for your easy reference — recall that you saw the same trace data in Zipkin before
♣♂contentType ♀”text/plain”‼originalContentType “application/json;charset=UTF-8”♠spanId ↕”e7bd0aabba1fa09c”♂spanTraceId ↕”e7bd0aabba1fa09c”♂spanSampled ♥”0"{“host”:{“serviceName”:”inventory”,”address”:”172.17.0.2",”port”:8080},”spans”:[{“begin”:1516776180246,”end”:1516776180258,”name”:”http:/inventory/AppleWatch”,”traceId”:8394174692565835560,”spanId”:8394174692565835560,”exportable”:true,”tags”:{“mvc.controller.class”:”InventoryApplication”,”http.status_code”:”200",”mvc.controller.method”:”getInventory”,”spring.instance_id”:”inventory.oracle.local:inventory:8080",”http.path”:”/inventory/AppleWatch”,”http.url”:”http://inventory-<mydomain>.uscom-central-1.oraclecloud.com:443/inventory/AppleWatch","http.method":"GET","http.host":"inventory-<mydomain>.uscom-central-1.oraclecloud.com"},"logs":[{"timestamp":1516776180246,"event":"sr"},{"timestamp":1516776180258,"event":"ss"}],"durationMicros":12550}]}
Summary
Well, that’s all there is to this blog post which was the second (and final) installment in the distributed monitoring series..
A quick recap never harms!
- we covered the drawbacks of the default monitoring setup and introduced Kafka and Cassandra as core infrastructure components to overcome some of the issues
- went through the details of the architecture
- provisioned the infrastructure on Oracle Cloud as ready-to-use managed services — Data Hub Cloud and Event Hub Cloud
- built and deployed our revamped apps to Oracle Application Container Cloud
- and finally, tested everything end-to-end !
Don’t forget to…
- go through the Oracle Data Hub Cloud documentation for a deep dive
- check out the tutorials for Oracle Application Container Cloud — there is something for every runtime!
- other blogs on Application Container Cloud
Cheers!
The views expressed in this post are my own and do not necessarily reflect the views of Oracle.