Linkerd Retries

gaurav agnihotri
May 23, 2023

Achieve 100% Success Rate with Linkerd Retries for Intermittent Issues!

Why are Retries Required?

Linkerd retries enhance the resiliency of service-to-service communication by automatically retrying failed requests, leading to improved reliability, better user experience, and increased fault tolerance within the service mesh.

Here are a few key points about Linkerd retries:

  1. Error handling: Retries help in handling transient failures that can occur due to network issues, temporary service unavailability, or high load situations.
  2. Seamless recovery: By retrying failed requests, Linkerd helps in achieving seamless recovery without burdening the client application with complex error handling and retry logic.
  3. Improved user experience: Retries can prevent errors or failures from being directly exposed to end-users. Instead of experiencing a failed request, users may see a slightly delayed response as Linkerd retries and eventually succeeds in delivering the requested data.
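To make the second point concrete, here is a minimal sketch of the kind of retry wrapper a client would otherwise have to carry itself. The `retry` function is a hypothetical helper written for illustration; it is not part of Linkerd, and real client-side logic would also need backoff and a limit on total retry load.

```shell
# retry MAX_ATTEMPTS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  local max_attempts="$1"
  shift
  local attempt=1
  # Re-run the command for as long as it keeps failing.
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
  done
}

# Example (not executed here): retry a request up to 3 times.
# retry 3 curl -sf http://localhost:7000/
```

With Linkerd retries, this loop moves out of the application and into the proxy, so every client of the service gets the same behavior without code changes.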

Let’s take an example application and explore the benefits of retries.

To get started, let’s install the books app onto your cluster.

kubectl create ns booksapp && \
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/booksapp.yml \
| kubectl -n booksapp apply -f -

namespace/booksapp created
service/webapp created
serviceaccount/webapp created
deployment.apps/webapp created
service/authors created
serviceaccount/authors created
deployment.apps/authors created
service/books created
serviceaccount/books created
deployment.apps/books created
serviceaccount/traffic created
deployment.apps/traffic created

You can access the app by port-forwarding webapp locally:

kubectl -n booksapp port-forward svc/webapp 7000 &

Now open http://localhost:7000 and try to add a book. You will get intermittent errors like the one below.

Try to Add a book

As soon as you click the [Add book] button, you may see the error page below.

(If the book is added successfully on the first attempt, try adding one more, since this is an intermittent issue :P )

Intermittent error page

Add Linkerd to the service

It's time to inject the Linkerd annotation and make these services part of the Linkerd mesh.

kubectl get -n booksapp deploy -o yaml \
| linkerd inject - \
| kubectl apply -f -

Check the pods and look at the READY column: 2/2 means the Linkerd proxy is running next to the application container, so the services are now meshed.

The booksapp applications are now meshed
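If you prefer to check this from the terminal rather than by eye, the sketch below filters the pod list for anything that is not 2/2. The `unmeshed_pods` helper is hypothetical, and it assumes each pod has exactly one application container, which is the case for booksapp.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin and
# prints the name of every pod whose READY column is not 2/2
# (one app container plus the linkerd-proxy sidecar).
unmeshed_pods() {
  awk '$2 != "2/2" { print $1 }'
}

# Example (not executed here):
# kubectl -n booksapp get pods --no-headers | unmeshed_pods
```

Empty output means every pod reports both containers ready. A pod that is still starting will also be listed until its proxy becomes ready.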

Now check the success rate of your application in the Linkerd dashboard (you can open it with `linkerd viz dashboard`):

Select booksapp from the namespace dropdown and click on the Deployments workload. You should see all the deployments in the booksapp namespace show up. There will be success rates, requests per second, and latency percentiles.

Webapp Details

You'll notice that the success rate of webapp is not 100%.

We can see that the books service is also failing. If we scroll a little further down the page, we'll see a live list of all the traffic endpoints that webapp is receiving.

Check the book's success rate

Let’s click on the tap (🔬) icon and then on the Start button to look at the actual request and response stream.

Multiple 500, Intermittent issue

It was surprisingly easy to diagnose an intermittent issue that affected only a single route.

Now pass this information to the books service owner, who can take charge of fixing the code.

Retries

Since updating and deploying a new version of the code can be time-consuming, we can tell Linkerd to retry requests to the failing endpoint instead. Although this may increase request latencies because of the extra attempts, it removes the need for an immediate rollout of a new version.

To Use Retries, We Need Service Profiles

One of the easiest ways to get service profiles set up is by using existing OpenAPI (Swagger) specs.

To get profiles for authors and books, you can run:

Authors:

curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/booksapp/authors.swagger \
| linkerd -n booksapp profile --open-api - authors \
| kubectl -n booksapp apply -f -

serviceprofile.linkerd.io/authors.booksapp.svc.cluster.local created

Books:

curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/booksapp/books.swagger \
| linkerd -n booksapp profile --open-api - books \
| kubectl -n booksapp apply -f -

serviceprofile.linkerd.io/books.booksapp.svc.cluster.local created

These profiles can now be used to observe outgoing requests as well as incoming ones.

In this application, the success rate of requests from the books deployment to the authors service is poor. To see these metrics, run:

linkerd viz -n booksapp routes deploy/books --to svc/authors

The output shows that all requests from books to authors go to the HEAD /authors/{id}.json route, and that those requests fail about 50% of the time.

To correct this, let’s edit the authors service profile and make those requests retryable by running:

kubectl -n booksapp edit sp/authors.booksapp.svc.cluster.local

Then add `isRetryable: true` to that route:

spec:
  routes:
  - condition:
      method: HEAD
      pathRegex: /authors/[^/]*\.json
    name: HEAD /authors/{id}.json
    isRetryable: true ### ADD THIS LINE ###

Editing done
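To confirm that the change took effect, you can ask for the wide output of the same routes command, which separates the success rate the client sees (EFFECTIVE_SUCCESS) from the success rate of the individual attempts (ACTUAL_SUCCESS). The `route_stats` wrapper below is a hypothetical convenience function; the `linkerd viz routes` command inside it is the same one used earlier, with `-o wide` added.

```shell
# Hypothetical wrapper: route_stats NAMESPACE FROM_DEPLOYMENT TO_SERVICE
# Prints per-route stats, including effective vs. actual success rate.
route_stats() {
  local ns="$1" from="$2" to="$3"
  linkerd viz -n "$ns" routes "deploy/$from" --to "svc/$to" -o wide
}

# Example (not executed here):
# route_stats booksapp books authors
```

If retries are working, the effective success rate for HEAD /authors/{id}.json should climb while the actual success rate stays around the original value, because the underlying service is still failing just as often.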

See the Retries Magic

The success rate is now 100%, but as discussed, latency has gone up, which is to be expected because retries take time.
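A rough back-of-the-envelope calculation shows why retries push the success rate toward 100% on a route that fails about 50% of the time. The sketch below assumes each attempt fails independently with the same probability, which is a simplification. Note also that Linkerd limits retries with a retry budget rather than a fixed attempt count, so the fixed counts here are only illustrative. `success_after` is a hypothetical helper.

```shell
# success_after FAILURE_RATE MAX_ATTEMPTS
# Hypothetical helper: probability that at least one of MAX_ATTEMPTS attempts
# succeeds, assuming independent attempts with the same failure rate.
success_after() {
  LC_ALL=C awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", 1 - p ^ n }'
}

success_after 0.5 3   # prints 0.8750

# How the client-visible success rate grows with more attempts:
for n in 1 2 3 5 10; do
  echo "$n attempt(s): $(success_after 0.5 "$n")"
done
```

Each extra attempt halves the remaining failure probability, but every retried request also waits for its earlier attempts to fail first, which is where the added latency comes from.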

I hope this post is informative and useful for you :)