photo by Yung Chang.

Hystrix: from creating resilient applications to monitoring metrics

This post is about creating and configuring Hystrix commands that use fallbacks to avoid catastrophes and make our apps more resilient. We will also monitor our service with Prometheus and Grafana, to see the benefits of monitoring our apps.

I decided to write this post because, in my attempts to use Hystrix for managing failures and monitoring applications, some steps and concepts were cloudy and the documentation was not clear.

So, having managed to use this very cool lib and being more aware of the importance of resilience and monitoring, I would like to share with you the main steps for trying Hystrix yourself!

We are always connecting our services

If you are designing a software system today, you will find many services providing APIs and features over some communication protocol. So you have a range of services available to connect and build your platform. You no longer have to reinvent the wheel for most purposes, and that's awesome!

You may not have connections to third-party services within your applications, but you are probably building microservices (I strongly recommend diving into them if you haven't heard of them [1] [2] [3]) that make remote calls to your own systems.

One way or another, you are making remote calls.

By delegating work to external services through remote calls you gain isolation, modularization, maintainability, scalability and many other benefits. However, you also take on more responsibility: you have to guarantee that your system still behaves correctly.

If your remote service does not respond, will your application still work? Or will it crash? Do you have an alternative path if the remote service times out or responds with an error? Will your application fail in cascade, propagating the failure to other systems?

Avoiding catastrophes

These points of connection to external services turn out to be points of failure in your application, which means that if these external services do not respond properly your system can go through a major outage! The more critical the service, the more impact its failure will have and the further it will propagate.

Oh noes!

Fortunately, for the majority of failure scenarios we can work around the problem and offer an alternative solution.

In addition to that, we do not want our application to keep calling a failing service if we already know that it is unstable and unreliable at the moment.

Circuit Breaker is a software design pattern that monitors the failures of remote calls; when they reach a certain threshold, it forwards all calls to an alternative flow, handling the error gracefully and/or providing a default behaviour for the feature in case of failure.

And when our services are failing we would like to know when something went wrong, where the failure point was, and whether it was a cascading failure.
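
To make the idea concrete before we get to Hystrix, here is a minimal plain-Java sketch of the pattern. It is not how Hystrix implements it: the class name SimpleCircuitBreaker and the "N consecutive failures" rule are my own simplifications for illustration, and this toy version never closes the circuit again.

```java
import java.util.function.Supplier;

// Illustrative only: a toy circuit breaker, NOT the Hystrix implementation.
// The name and the consecutive-failure rule are assumptions for this sketch.
class SimpleCircuitBreaker<T> {
    private final int failureThreshold;
    private int consecutiveFailures = 0;

    SimpleCircuitBreaker(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    // The circuit is open once the failure threshold has been reached.
    boolean isOpen() {
        return consecutiveFailures >= failureThreshold;
    }

    T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (isOpen()) {
            // Short-circuit: do not even try the remote call.
            return fallback.get();
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0; // a success resets the counter
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;   // count the failure, answer gracefully
            return fallback.get();
        }
    }
}
```

With a threshold of 2, the first two failing calls still reach the remote service and return the fallback; from the third call on, the remote service is not called at all.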

Hystrix

So here we come to Hystrix. Created by Netflix, it is an implementation of Circuit Breaker and defines itself as:

A latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.

Hystrix is a complete Java library that provides an awesome implementation of Circuit Breaker. It lets you encapsulate your remote calls, or any other job, inside a Command, and it provides fallbacks, short-circuiting, concurrency handling, metrics for monitoring and more.

I will go into the details of some Hystrix features while we implement our ResilientPingCommand, because isn't it better to learn by coding? Clone or download the source code, which is our starting point here: https://github.com/peaonunes/hystrix-implementation-example

🙌🏽 Hands on!

I strongly recommend using IntelliJ IDEA since it makes a lot of steps easier for us! Take your time to download the Community Edition if you don't already have it installed.

Open IDEA and open the project you cloned in the last section. You will see that I have already set up some classes, but most of them are stubs. Go to ApplicationMain and run it, and you will see our awesome app printing a message every second.

ping — — — pong

Our awesome application will ping a Node service. This service comes along in our repository, and from its folder you can run it in your terminal with:

$ npm install
$ npm start

Find HTTPRequesterWorker under hystrix -> workers. You do not really need to bother with the class implementation: this class dispatches a GET to any URL you pass to it, in our case the Node service URL. It will serve well for simulating our call to a remote critical service.

Now that you are set, let's start by adding Hystrix to our project. We are using Maven to manage our Java packages, so open the pom.xml file in the project's root folder and add the dependency to the dependencies section.

<dependency>
    <groupId>com.netflix.hystrix</groupId>
    <artifactId>hystrix-core</artifactId>
    <version>1.5.11</version>
</dependency>

Refresh the dependencies and go to the ResilientPingCommand class. Make it extend HystrixCommand<String>; then we should update the constructor of the command to configure it properly.

Through the Setter we pass to super, we define the command's config. Here I'm saying that this command will be part of the "PingGroup". We also have two methods to define: run() is the one that performs the action, and getFallback() is the alternative behaviour in case of failure.

Let's now add some behaviour to run(). Naively, I am calling the worker.healthCheck() method, which pings a URL. If the content under the exception key is not empty then we make the command fail. I am also printing out some debug messages.

Since it is a Hystrix command we can check whether the circuit breaker is open through the this.isCircuitBreakerOpen() method, so I am printing this info for debugging.

In ApplicationMain we create a new command, passing our URL, and call the execute() method. So every second it will perform the command.

Now launch the Node service: navigate to the simple-node-service folder and run npm start. After that, go to your project in IDEA and run the main of ApplicationMain!

If we did it right, you should see the success status code 200 from our service every second.

Let’s now turn off the node service with ctrl+c …

Success! I mean… Something is going wrong and our fallback is being used… That's what we expected!

However, if you leave the application running you will notice that the Circuit Breaker message never logs that the circuit is open (true).

That means the command tries to execute every time and fails every time, but we can improve that. We want the command to short-circuit to the fallback when we know that the service is not working…

Improving our fallback logic and the command configuration.

The Configuration section of the Hystrix documentation describes all the settings you can change in your command configuration.

So we are going to define the CommandProperties now, which will allow us to configure the circuit breaker as well as other settings.

  • withCircuitBreakerEnabled determines whether the circuit breaker is used to short-circuit your command (Hystrix enables it by default).
  • withCircuitBreakerRequestVolumeThreshold sets the minimum number of requests in the rolling window before the circuit can trip. Once that volume is reached, the circuit opens if enough of those run() calls failed (the error threshold is 50% by default).
  • withCircuitBreakerSleepWindowInMilliseconds sets how long Hystrix keeps short-circuiting after the circuit opens before it lets a run() attempt through again. If that attempt fails, the circuit breaker remains open.
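
If the interplay of the volume threshold and the sleep window feels abstract, here is a small plain-Java model of it. This is only a sketch under my own assumptions, not Hystrix code: the class SleepWindowBreaker is hypothetical, it counts all requests since the circuit last closed instead of using a rolling window, and it hard-codes a 50% error rule. The clock is injected so the behaviour can be checked without waiting.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Illustrative model only, NOT Hystrix: no rolling window, fixed 50% error rule.
class SleepWindowBreaker {
    private final int requestVolumeThreshold;
    private final long sleepWindowMillis;
    private final LongSupplier clock; // injected so time can be simulated
    private int requests = 0;
    private int failures = 0;
    private boolean open = false;
    private long openedAt = 0L;

    SleepWindowBreaker(int requestVolumeThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    boolean isOpen() {
        return open;
    }

    String execute(Supplier<String> run, Supplier<String> fallback) {
        long now = clock.getAsLong();
        if (open && now - openedAt < sleepWindowMillis) {
            return fallback.get(); // still inside the sleep window: short-circuit
        }
        boolean trial = open; // sleep window elapsed: let one attempt through
        try {
            String result = run.get();
            if (trial) {
                open = false; // the trial succeeded: close the circuit
                requests = 0;
                failures = 0;
            } else {
                requests++;
            }
            return result;
        } catch (RuntimeException e) {
            if (trial) {
                openedAt = now; // the trial failed: stay open for another window
            } else {
                requests++;
                failures++;
                if (requests >= requestVolumeThreshold && failures * 2 >= requests) {
                    open = true;
                    openedAt = now;
                }
            }
            return fallback.get();
        }
    }
}
```

With a threshold of 4 and a sleep window of 5000 ms (values picked to match the behaviour described in this post), four failing calls open the circuit, calls during the next five seconds never reach run(), and the first call after that is a trial that either closes the circuit or keeps it open for another window.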
Hell yeah, we short-circuited!

If you repeat the test we did before, starting the Node service and then stopping it, you should see that our commands are now short-circuited by Hystrix.

After the fourth failure the circuit breaker opens and we only get the message from the getFallback() method.

Nope, we are not back live yet =/

5 seconds after the circuit breaker opened, Hystrix retries the command, but as it fails again the circuit breaker remains open and keeps short-circuiting the command, as you can see in the image.

Now, try restarting the Node service so that when the circuit breaker retries the command execution it succeeds and the short-circuit is over! Now our success messages are back again!

There is another configuration I want to show you. We might want to set a timeout for the command execution if our application cannot wait too long for some work to be done. With withExecutionTimeoutInMilliseconds you can set the maximum time the command may run before it is treated as failed.

You can now try running the Node service and the Java application. As the Node app has hot reload, try changing the delay in the setTimeout function from 500 to 1500 milliseconds! After that, your command should start failing because of our timeout configuration!
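
The general idea of an execution timeout with a fallback can be sketched with nothing but the JDK. This is not the mechanism Hystrix uses internally; the TimeoutSketch class and its callWithTimeout helper are hypothetical names I made up for the illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative only: run some work on another thread and give up after a timeout.
class TimeoutSketch {
    static String callWithTimeout(Callable<String> work, long timeoutMillis, String fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(work);
        try {
            // Wait at most timeoutMillis for the result.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // too slow: interrupt the work and fall back
            return fallback;
        } catch (ExecutionException e) {
            return fallback;     // the work itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A call that answers immediately returns its own result, while a call that sleeps for 1500 ms under a shorter timeout returns the fallback instead, which is the same effect we are about to provoke by slowing down the Node service.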

Hystrix flow.

This image is here to challenge you to read the How it Works documentation, which properly explains this Hystrix workflow! It's very helpful, and you can explore more of this lib's configuration and features!

Monitoring

photo by Lorenzo Cafaro

Being failure tolerant is really important. However, if something is going wrong and our applications are short-circuiting calls, then we should do something about it!

We should monitor the health of our applications and track the signs of catastrophes so we are ready to work around them quickly.

On the side of our Hystrix commands, if you want to do something at the application level when the circuit opens, for example, you can go for HystrixEventNotifier.

I will leave the simplest stub code for it in the final code. The event notifier captures the events dispatched from the Hystrix core so you can track them and perform some action.

At the monitoring level we probably would like to watch metrics of our applications. In Java applications people normally expose metrics through JMX, which is the standard technology for supplying monitoring.
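
If you have never touched JMX, this is roughly what "exposing a metric" means. The sketch below uses only the JDK; the names JmxMetricsSketch, PingStats and the object name example.metrics:type=PingStats are hypothetical and have nothing to do with what Hystrix registers on its own.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Illustrative only: publish one counter as a standard MBean and read it back.
class JmxMetricsSketch {
    // Standard MBean convention: the interface is named <ClassName>MBean.
    public interface PingStatsMBean {
        long getShortCircuitedCount();
    }

    public static class PingStats implements PingStatsMBean {
        private final AtomicLong shortCircuited = new AtomicLong();

        public void markShortCircuited() {
            shortCircuited.incrementAndGet();
        }

        @Override
        public long getShortCircuitedCount() {
            return shortCircuited.get();
        }
    }

    // Registers the bean on the platform MBean server, where JMX tools can see it.
    static ObjectName register(PingStats stats, String name) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName objectName = new ObjectName(name);
            server.registerMBean(stats, objectName);
            return objectName;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Reads an attribute back the same way a JMX client would.
    static Object read(ObjectName name, String attribute) {
        try {
            return ManagementFactory.getPlatformMBeanServer().getAttribute(name, attribute);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

After registering a PingStats instance under a name such as example.metrics:type=PingStats, the getter shows up as the attribute ShortCircuitedCount. A metrics publisher does this kind of registration for us, with many more attributes.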

Register Collector

Fortunately, for Hystrix we can count on HystrixMetricsPublisher, and just a few lines of code are enough to enable and publish metrics from our application. Let's go to our pom.xml file again and add the following:

<dependency>
    <groupId>com.netflix.hystrix</groupId>
    <artifactId>hystrix-servo-metrics-publisher</artifactId>
    <version>1.5.11</version>
</dependency>

We are adding the Servo metrics publisher, one of the metrics publishers offered for the Hystrix lib, to the dependencies.

Look for the HystrixConfiguration file that I left empty in the project. Here we can now define the event notifier I mentioned before, and we will register the ServoMetricsPublisher through HystrixPlugins.

— There are a lot of interesting plugins that you might want to use, check them out here.

And now we can simply call our configuration class in ApplicationMain, somewhere before our magic while(true) code.

HystrixConfiguration.configureHystrixPlugins();

Try starting your application now. Everything runs like before, but how can I know whether my application is publishing the Hystrix metrics? We can check by using jconsole! Jump into your terminal and run jconsole.

I confess that I had never heard of it before I had to check my Hystrix metrics.

Choose our application and connect to it! We are going to see some charts and metrics of the running application, and in the MBeans tab there will be folders of metric definitions. One of them is called com.netflix.servo, which has all the definitions for the metrics published by Hystrix.

So, yeah! It is working for real! Since we have our metrics exposed, it is time to capture them now 🔴.

Prometheus

There are some solutions for capturing the metrics exposed by our Java service. One of these is Prometheus: it lets you configure targets, collects from them at given intervals, evaluates rule expressions, displays the results, and can trigger alerts. Check it out on GitHub.

Note: Prometheus has a Docker image ready to pull from Docker Hub (check my later post on Docker if you do not know what it is ⚠️). However, as it runs in a Docker container, it won't be able to capture our metrics without some extra configuration. To speed things up, download and install the .gz from here.

In Prometheus some agents are publishers of metrics and others are the clients. So we need to configure our Prometheus client to collect metrics from a specific target at a certain interval.

We should create a prometheus.yml file in the root folder of the downloaded Prometheus project. The initial config will be the following:

# my global config
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'codelab-monitor'
rule_files:
scrape_configs:
  - job_name: "hystrix"
    static_configs:
      - targets: ['localhost:9092']
  - job_name: "prometheus"
    static_configs:
      - targets: ['localhost:9090']

Do not worry too much about that configuration. We are setting a target for our Hystrix metrics and, by default, a target for Prometheus itself. Check out the Getting Started guide for more details.

Now that we have it set up, let's go to the Prometheus folder in our terminal and run the following command to start it (Prometheus 2.x and later spell the flag --config.file):

$ ./prometheus -config.file=prometheus.yml

Go to http://localhost:9090/ and you will be on the Prometheus client! Here you can inspect metrics, execute queries and create charts.

For now let's go to the Targets tab. As you can see, the target that we defined for Hystrix is down, and the reason is simple: we are not yet publishing our metrics at the target endpoint.

Java Agent producer

We have Prometheus running and watching our target, so our Java application has to publish the metrics at the target URL we specified. We are going to use the Prometheus JMX Java Agent for that. Download the .jar from the following link:

https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.9/jmx_prometheus_javaagent-0.9.jar

Create a folder called prometheus for the configuration and the jar file. Move the .jar to this folder and create the file jmx_prom.yaml containing:

---
lowercaseOutputName: false
lowercaseOutputLabelNames: false

Now we should change the VM options in IntelliJ IDEA. Go to Run -> Edit Configurations -> Your application -> Configuration, as described on this link, and change the VM options to:

-javaagent:./[JARAGENT_PATH]/jmx_prometheus_javaagent-0.9.jar=9092:[PATH_TO_CONFIGURATION]/jmx_prom.yaml

After applying this configuration, if you run our Java project the .jar agent will be loaded and will publish the metrics at our target URL. Run the Java application now.

If we go back to the Targets section on the Prometheus client we will see that the Hystrix target is now up, and we can explore the data. Let's create a graph and execute a query for com_netflix_servo_countShortCircuited_value!

While the command is succeeding, the line for this metric should stay at zero. But try turning off the Node service and wait for the circuit to open; then we will see the number of short-circuited commands increase.

— Prometheus is really doing its job, but we can do better on monitoring and analysing!

Grafana

Prometheus is a good platform to collect metrics from a source, but when it comes to analysing the metrics Grafana is a powerful new option. This tool is an open source project that eases the creation of dashboards for monitoring your applications.

Grafana is trusted by thousands of companies.

This tool can retrieve data from different data sources like Graphite, Elasticsearch, InfluxDB and OpenTSDB. You will find very helpful docs, and they have free/paid hosted plans if you don't want to administer the entire infra.

Let's try Grafana locally! Look at the installation guide for installing it properly in your environment. I'll be using Docker for convenience; if you have Docker installed then simply run:

$ docker run -d -p 3000:3000 grafana/grafana

Now go to http://localhost:3000 and you will see the Grafana dashboard. Log in using admin and admin as the credentials and let's build our dashboard.

Once logged in we will see some steps; the first is to choose a data source.

All done! It's reaching our source!

I called our source Hystrix and set the type to Prometheus. For the URL, just set the one where Prometheus is running (localhost:9090/).

I chose direct access, but there is also an option for proxying. Then click on Save & Test to check whether Grafana can connect to the source!

We just need to create our dashboard now. Go to the Dashboard menu and create a new dashboard. Try adding a simple Graph, right-click the name to edit it and run any query for our metrics! I made two charts; in the first I chose to observe the value of the circuit breaker for our command.

com_netflix_servo_isCircuitBreakerOpen_value{exported_instance="ResilientPingCommand"}

I set an alert so that whenever the value goes above a threshold it will alert me; you can also define users and alert specific users or a group of users. When I stopped the Node service the circuit breaker opened, and Grafana reflected it and dispatched an alert. 🚨

In the right chart I am checking the latency of the command execution and the success count as well, and I set another alert for the latency value. As I did in the first chart, I used the exported_instance attribute to retrieve just the data of our command.

This way you can set up charts for different commands and analyse them separately. Then you can analyse the performance of the different services that you call remotely and create dedicated visualizations for their communication.

In the Metrics and Monitoring section of the Hystrix documentation you get the entire explanation and motivation of which metrics are exposed.

Summing up

This post is getting bigger than I expected so I'm going to wrap it up here. All the Java code we have done is available here!

I hope I shared the benefits of using Hystrix, which is indeed very easy to use. And I hope that you enjoyed reading and playing around with the monitoring frameworks.

Failing gracefully and monitoring applications are big topics in themselves that deserve a lot of discussion, patterns and study. However, I hope that this stimulated you to think about these topics and that you got excited to explore more!

If this article helped you, please like it 👍🏼; if it did not fulfill your expectations, please send your feedback ✉️; and always share your thoughts in the comment section below 📃!

That's all for now =]