Spring Cloud Gateway and Its Resilience

Tegar Budi Septian
Blibli.com Tech Blog
9 min read · Apr 25, 2022

Have you watched the trailer of Makoto Shinkai’s latest movie, Suzume no Tojimari? The story begins with a door being opened, bringing calamities across Japan.

Another one is The Adam Project. Adam Reed, a pilot from 2050, flies a stolen time jet and opens a black hole, aiming to find his wife in 2018: time travel. He accidentally crash-lands in 2022 instead.

From both stories, we can see at least one similarity: failure happens when entering an entrance. In The Adam Project especially, the failure detours the traveler from the requested destination.

In this article, we will talk about Spring Cloud Gateway and its resilience when facing failures, just like in the stories!

Spring Cloud Gateway

Spring Cloud Gateway is a Java API gateway library built on top of Spring WebFlux. It is used to route requests to APIs and to provide cross-cutting concerns (e.g., resiliency, monitoring/metrics, security).

Features

Built on Spring Framework 5, Project Reactor and Spring Boot 2.0.

Routing requests based on a number of criteria (predicates), and modifying them with filters.

Fallback.

Retry.

Circuit breaker integration with Resilience4J.

Monitoring via Spring Boot Actuator (/actuator/gateway).

DiscoveryClient support (Consul/Eureka).

Security, e.g., user authentication.

etc.

Case Architecture

Here, we have a microservices architecture where router-service acts as an API gateway and there are three downstream services: member, sampah (an Indonesian word for trash/garbage/waste), and notification. Between the router and the downstream services sits a resilience layer, the red box in the diagram.

Every request a client sends to the APIs passes through the router service (or let’s just call it Router) as a gateway before reaching the services. Router filters the requests and performs actions as configured, then forwards them to the downstream services. In this process, failures occasionally occur.

Failures vary. They may be caused by a network issue or a downed service. Router must handle all of these situations in order to provide a proper response to clients. For example, a downed service returns 500 Internal Server Error to the gateway. With Spring Cloud Gateway (SCG), though, we can return a tailored response to the client/user, like “Hmm, something is wrong. Try again in a few seconds.”, instead of just an error code. This is what we call resilience.

As listed in the Features section above, resilience includes Fallback, Retry, Circuit Breaker, and so on. We will take these up one by one as we read from top to bottom.

Routing

The main function of an API gateway is routing. Requests are routed to a specific destination. If the destination is available and everything works as intended, the response is returned to the client as desired. If the destination is not available, the gateway replies to the client that it was not found.

For our test cases, we use the router and sampah services (built with Spring Boot) as examples.

Sampah Service

sampah is a service with a pretty simple API to get the list of sampah. Note that we set 8001 as its port.
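A minimal sketch of what such a controller might look like (the hard-coded sampah list here is an illustrative assumption):

```java
// SampahController.java (sketch) in the sampah service, which runs on port 8001
import java.util.List;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class SampahController {

    private static final Logger log = LoggerFactory.getLogger(SampahController.class);

    // GET /sampahs returns the list of sampah; the log message below is
    // referenced later when tracing requests with Zipkin.
    @GetMapping("/sampahs")
    public List<String> getSampahs() {
        log.info("Returning all sampah");
        return List.of("plastic bottle", "cardboard", "glass jar");
    }
}
```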

Router Service

A Spring Cloud Gateway service. We need to add a dependency to router so we can use the SCG features:
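With Maven, for example (a sketch; the second starter is the Resilience4J integration used later in this article):

```xml
<!-- Spring Cloud Gateway itself -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-gateway</artifactId>
</dependency>

<!-- Circuit breaker integration with Resilience4J (reactive variant) -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-circuitbreaker-reactor-resilience4j</artifactId>
</dependency>
```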

The configuration is stored in the router’s application.yml, specifically under spring.cloud.gateway.
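A sketch of how that configuration might look, assuming sampah runs on localhost:8001 and using the values discussed throughout this article:

```yaml
server:
  port: 8000

spring:
  cloud:
    gateway:
      routes:
        - id: sampah-route
          uri: http://localhost:8001
          predicates:
            - Path=/sampahs/**
          filters:
            - AddRequestHeader=X-Client,web
            - AddResponseHeader=X-Type,inorganic
            - name: Retry
              args:
                retries: 3
                methods: GET
                backoff:
                  firstBackoff: 50ms
                  maxBackoff: 500ms
            - name: CircuitBreaker
              args:
                name: sampahService
                fallbackUri: forward:/sampahs-fallback

resilience4j:
  circuitbreaker:
    instances:
      sampahService:
        slidingWindowSize: 10
        failureRateThreshold: 50
        waitDurationInOpenState: 10000
        permittedNumberOfCallsInHalfOpenState: 5
  timelimiter:
    instances:
      sampahService:
        timeoutDuration: 2s
```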

Let’s focus on the spring.cloud.gateway.routes properties.

There we have the property id: sampah-route, an id for the underlying route. It is combined with uri, which holds the host and port of the sampah service. In other words, sampah is registered to the API gateway.

Then we have a predicate, Path=/sampahs/**. Every endpoint that matches this path will be routed.

And a set of filters. A filter is a standard Spring WebFilter. It can make modifications to the HTTP request as well as the HTTP response. Look at the image below to see how routing works.

The client makes a request (e.g., HTTP GET localhost:8000/sampahs). SCG forwards the request to the Gateway Handler Mapping, which determines what should be done with a request matching a route. Then it sends the request to the Gateway Web Handler to execute the filters specific to this route. With a WebFilter, we can modify the request.

With AddRequestHeader=X-Client,web, before the request is forwarded to the sampah service, it is injected with a new header, X-Client, with the value ‘web’. This modification indicates that the request comes from a web client; downstream services can have logic that utilizes this parameter.

After the request is forwarded to sampah, the router receives sampah’s response. With AddResponseHeader=X-Type,inorganic, we add extra information to the response: the data is inorganic waste.

Testing

This is what happens when we hit the http://localhost:8000/sampahs API.

8000 is the router’s port. The router successfully routes the request to sampah and returns the list of sampah. There is an additional response header as well, X-Type.
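A session against the gateway might look something like this (output abbreviated, body values illustrative):

```
$ curl -i http://localhost:8000/sampahs
HTTP/1.1 200 OK
X-Type: inorganic
Content-Type: application/json

["plastic bottle","cardboard","glass jar"]
```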

I have installed Zipkin as a tracing tool and Grafana for visualizing data on my machine. Zipkin adds a trace id to each log output. In SampahController.java, every time the /sampahs API is called, it logs the message “Returning all sampah”. We can follow the trace id of that log to see the chain of routing.

Zipkin provides routing chain details.

We are able to see the routing steps. The request comes to the router service as a GET request, then it is forwarded to the sampah service, and so on. There is also information on duration, start time, HTTP method, IP address, endpoint type, traceId and spanId, etc.

Retry

A network issue or a downed service is sometimes only temporary; it may last just a few seconds. We can retry by making consecutive request attempts, assuming the service will become available again.

Back to the router properties in application.yml: we define a Retry filter with some arguments (shown in place in the sketch after this list):

retries: 3 sets a maximum of three retry attempts.

methods: GET means only requests using the GET method are retried.

backoff.firstBackoff: 50ms makes the first retry wait 50 milliseconds; subsequent waits grow exponentially.

backoff.maxBackoff: 500ms caps the backoff at 500 milliseconds.
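In route form, the filter might look like this (the waits in the comments assume the filter’s default exponential factor of 2):

```yaml
- name: Retry
  args:
    retries: 3           # at most three retry attempts
    methods: GET         # only GET requests are retried (safe to repeat)
    backoff:
      firstBackoff: 50ms # waits grow roughly 50ms, 100ms, 200ms, ...
      maxBackoff: 500ms  # ... but never longer than 500ms
```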

Fallback

Adam landed in the wrong year. The movie does not explain the exact cause; it could be a system failure or human error. But we know that the jet still gives Adam a destination. Even though it is not the one he wants, it is better than being lost in the void, or even dying from the failure.

That is a fallback. If the downstream service is in trouble, say it is down or for some reason cannot return an appropriate response, we can forward the request to another endpoint. The user then never faces a raw system error. Instead, we can return a better message, or in the /sampahs API case, even a cached result.

In the router’s application.yml above, there is a filter called CircuitBreaker, which takes a fallbackUri argument.

name: sampahService is the name of the circuit breaker.

fallbackUri: forward:/sampahs-fallback is the endpoint the request is forwarded to when falling back.

Here we have a controller in the router to handle the fallback.
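A minimal sketch of such a controller (the class and method names are assumptions):

```java
// FallbackController.java (sketch) inside the router service
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class FallbackController {

    // The CircuitBreaker filter forwards here when sampah is unavailable.
    @GetMapping("/sampahs-fallback")
    public String sampahsFallback() {
        // A friendlier message than a raw 500. A cached sampah list from a
        // previous successful response could be returned here instead.
        return "Kayaknya lagi ada yang nggak beres nih.";
    }
}
```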

Thus, it returns the message “Kayaknya lagi ada yang nggak beres nih.” In English: “Hmm, something is wrong. You have landed in 2022…”.

Or we could also return cached sampah data from a previous request.

Testing

Here we have the sampah service turned off. Without a fallback, the gateway returns 500 Internal Server Error. With the fallback, a better message is shown to the user/client.

Circuit Breaker with Resilience4J

Resilience4J is a Java library for implementing different resilience patterns, such as the circuit breaker.

We can make our microservices more robust by implementing a circuit breaker. The concept is just like an electrical circuit: electricity flows while the circuit is closed; once it is tripped open, it stops flowing.

The same goes for request traffic. While the circuit is Closed, requests flow from the router to sampah. Once the failure rate surpasses a certain threshold, the circuit turns Open, and no requests reach the sampah service.

Then, to check whether the situation has changed, after a wait duration (for example, 10 seconds) the circuit becomes Half Open and allows a few requests through for examination. If the failure rate is below the threshold, the circuit becomes Closed; otherwise, it turns back to Open.
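The same state machine can be watched outside the gateway with the Resilience4J API directly. A standalone sketch, using the same values we are about to configure (minimumNumberOfCalls is added here so that ten calls are enough to trip the breaker):

```java
import java.time.Duration;
import java.util.concurrent.TimeUnit;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

public class CircuitBreakerDemo {
    public static void main(String[] args) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .slidingWindowSize(10)             // examine the last 10 calls
                .minimumNumberOfCalls(10)          // rate is calculated after 10 calls
                .failureRateThreshold(50)          // open at a 50% failure rate
                .waitDurationInOpenState(Duration.ofSeconds(10))
                .permittedNumberOfCallsInHalfOpenState(5)
                .build();

        CircuitBreaker cb = CircuitBreaker.of("sampahService", config);
        System.out.println(cb.getState()); // CLOSED

        // Record ten failed calls; the failure rate hits 100% and trips the breaker.
        for (int i = 0; i < 10; i++) {
            cb.onError(0, TimeUnit.MILLISECONDS, new RuntimeException("sampah is down"));
        }
        System.out.println(cb.getState()); // OPEN
    }
}
```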

Here are the relevant router properties again, so you don’t need to scroll up:
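These are the same pieces from the configuration sketch shown in the Routing section:

```yaml
# On the route:
- name: CircuitBreaker
  args:
    name: sampahService
    fallbackUri: forward:/sampahs-fallback

# The Resilience4J attributes:
resilience4j:
  circuitbreaker:
    instances:
      sampahService:
        slidingWindowSize: 10
        failureRateThreshold: 50
        waitDurationInOpenState: 10000
        permittedNumberOfCallsInHalfOpenState: 5
  timelimiter:
    instances:
      sampahService:
        timeoutDuration: 2s
```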

It starts with the CircuitBreaker filter on the route; then we configure the circuit breaker attributes from Resilience4J under the resilience4j properties.

slidingWindowSize: 10 means the last 10 requests are examined in the Closed state.

failureRateThreshold: 50 means that if 50 percent of those requests fail, the circuit changes to Open.

waitDurationInOpenState: 10000 makes the circuit become Half Open after 10 seconds (10,000 ms).

permittedNumberOfCallsInHalfOpenState: 5 allows 5 requests through in the Half Open state, to be examined to see whether the situation has changed.

timeoutDuration: 2s sets a timeout, because we want users to experience a fast system. If a request takes more than 2 seconds, it is cut short. No long waits.

Testing

For this demo, I have installed Prometheus (metrics monitoring), Grafana, and ApacheBench (which can fire off a batch of requests) to simulate the circuit breaker concept.
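For example, step 1 below could be fired with a single ApacheBench command:

```
$ ab -n 10 -c 1 http://localhost:8000/sampahs
```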

1. 10 requests, sampah up: Closed

Pay attention to the RouterApplication and SampahApplication boxes. A green value indicates that the service is ON; no green means it is down. I made 10 requests to the /sampahs API. All of them were successful because the sampah service was up. When we confirm the Prometheus metrics in Grafana, the circuit is indicated as Closed.

2. 6 requests, sampah down: Open

Now the sampah service is not green; it is down. 6 requests failed, and the number of failures exceeded the threshold (50% of 10).

However, why does the snapshot show Failed requests as 0 and Complete requests as 6? It’s because we implemented the fallback, so the requests still read as successful at the router. In the Prometheus metrics, the failures of sampah are still detected.

As expected, it turned to Open.

3. After 10 seconds, it becomes Half Open.

4. 10 requests, sampah up: Closed

In Half Open, only the last 5 requests are examined. If the failure rate stays below the threshold, the circuit becomes Closed; otherwise it turns back to Open.

We sent 10 requests; the failure rate was below the threshold, hence the circuit turned back to Closed.

The time jet is self-healing. Thus, Adam is able to travel to his initial destination, 2018.
