The Backend Reliability Engineer’s Toolkit

Andrey
Published in Wrike TechClub · 6 min read · Jan 18, 2024

Hello! I’m Andrey, and I lead the team responsible for the reliability and stable operation of the entire backend here at Wrike. If you’ve ever wondered what exactly “backend reliability” is, rest assured you’re not the only one! As I often explain, the role is similar to Site Reliability Engineering (SRE), but with an exclusive focus on the backend. In this article, I’ll talk about the areas of responsibility that lie with our BRE team and the tools we use to keep backend operation stable at the highest level.

Backend Reliability Engineering team responsibilities

The Backend Reliability Engineering (BRE) team is responsible for the resilience, stability, scalability, and performance of the server side of the application. This is the team that comes to the rescue first when there are problems with the availability or stable operation of the backend. We do not add new features that users directly interact with. However, if we fail to handle our tasks and the application cannot withstand the load, users will definitely feel the impact.

For backend engineers, we provide optimized and reliable solutions that they can incorporate into their code. Examples of such universal solutions are libraries for working with databases, AMQP, caches, and more.

We develop and enhance universal system components that form the foundation of our application. We work extensively with the basic building blocks of the system, which simplifies development for every team in the company. We also help resolve technical problems that arise during development and deal with incidents such as a spiking error rate on a given endpoint or a sudden increase in memory consumption under high system load.

Logging, metrics, alerts

We have a large number of tools in our arsenal that assist us in our work, and I will briefly discuss the purpose of each of them.

Logging: We want the general components of the system to be transparent and provide useful information that we can use for analyzing incidents.

It’s best to keep logs in two storages:

- A short-term storage that keeps all messages for a couple of weeks, which is sufficient for investigating incidents or monitoring system operation in real time.

- A long-term storage that keeps only the most important logs, which are used for analyzing the system’s behavior over long periods of time or for calculating trends (a routing sketch follows this list).
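As a rough illustration of this split (assuming an SLF4J-style API; the marker name and the idea of routing marked events to an extra appender are purely illustrative, not our exact configuration), important messages can be tagged so that the logging configuration ships them to long-term storage as well:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.Marker;
import org.slf4j.MarkerFactory;

public class ReportService {

    private static final Logger log = LoggerFactory.getLogger(ReportService.class);

    // Illustrative marker: the logging config (e.g., Logback) could route
    // marked events to an extra appender backed by long-term storage.
    private static final Marker LONG_TERM = MarkerFactory.getMarker("LONG_TERM");

    public void generateReport(String accountId) {
        // Regular message: goes only to the short-term storage.
        log.debug("Starting report generation for account {}", accountId);

        long startedAt = System.currentTimeMillis();
        // ... report generation ...
        long tookMs = System.currentTimeMillis() - startedAt;

        // Important message: also kept long-term for trend analysis.
        log.info(LONG_TERM, "Report generated for account {} in {} ms", accountId, tookMs);
    }
}
```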

Metrics: While logging is beneficial, it’s not enough on its own. For transparent system operation and incident analysis, metrics are essential. Metrics make it easy to observe trends in system load, user count, queries, and so on.

We incorporate general metrics: system load, the number and size of requests, the count of successful and unsuccessful requests, server response time, and others. It proves particularly useful to inject such metrics into shared libraries. This way, the metrics propagate to all services, and individual teams don’t have to report them manually.
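Here’s a minimal sketch of what baking such metrics into a shared library might look like, assuming a Micrometer-style registry (the class, metric, and tag names are illustrative, not our exact implementation):

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.util.concurrent.Callable;

// Illustrative wrapper that a shared library could use so that every service
// gets request metrics without writing any extra code.
public class InstrumentedCall {

    private final MeterRegistry registry;

    public InstrumentedCall(MeterRegistry registry) {
        this.registry = registry;
    }

    public <T> T run(String handler, Callable<T> call) throws Exception {
        Timer.Sample sample = Timer.start(registry);
        String outcome = "success";
        try {
            return call.call();
        } catch (Exception e) {
            outcome = "error";
            throw e;
        } finally {
            // Response time plus success/error counts, tagged by handler.
            sample.stop(Timer.builder("backend.request.duration")
                    .tag("handler", handler)
                    .tag("outcome", outcome)
                    .register(registry));
        }
    }
}
```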

For metrics, we mainly use short-term storage, but we strive to retain the most important metrics in long-term storage as well. This helps us track trends over long periods of time.

Naturally, we also teach teams to create their own metrics and instruct them on where to send these metrics and how to visualize them on graphs. Visualization is the best tool for identifying the cause of an incident. Plus, we can generate alerts and other notifications based on metrics.

Here’s an example of visualizing 4xx service responses broken down by handlers:

The reasons for such bursts can be either 429 responses (the result of the rate limiter) or 400 responses caused by users’ mistakes in the request parameters.

Alerts: While the SRE team does have their own basic alerts for service availability, server load, and so on, they don’t cover all the needs of the backend team. Therefore, we create our own alerts for scenarios where it’s critical for us to know if something has gone wrong within the application.

An example of such an alert could be a significant increase in the response time of an important handler. This might not directly lead to a service failure, but it highlights a performance issue. We create alerts not only for when things go awry but also to proactively prevent such situations from happening in the first place.

An example of an alert notification in Slack.

We use both general alerts that apply to everyone (e.g., the rate of 500 responses on an endpoint exceeds a threshold) and specific ones (e.g., the time to generate a report exceeded the acceptable limit). Each team receives its specific alerts in its team chat, while the general ones are sent to a channel where we handle them.

How to prevent incident recurrence

Logging, metrics, and alerts help us respond rapidly, understand the causes of incidents, and resolve problems. Our next step is to prevent the incident from recurring. For this, we employ the following tools: a rate limiter, a circuit breaker, and caches.

Rate limiter: This tool controls the number of requests a user can make. We use it to reduce the load on the system: it helps distribute server resources among users more evenly and keeps heavy requests or broken integrations from clogging all the pools.

We can configure the rate limiter flexibly, which allows us to give more resources to those who really need them. We use two types of controls: one limits the number of requests over a period of time, and the other limits the amount of server time a user’s requests consume. This helps prevent both an excessive number of small requests and cases where a user stays within the request limit but their requests are exceptionally slow or resource-intensive.
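Here’s a heavily simplified sketch of that idea: a fixed one-minute window per user with illustrative limits, not our production implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified per-user rate limiter combining two controls:
// a cap on requests per window and a cap on consumed server time per window.
// Fixed one-minute windows; the limits are illustrative only.
public class DualRateLimiter {

    private static final long WINDOW_MS = 60_000;
    private static final int MAX_REQUESTS_PER_WINDOW = 100;
    private static final long MAX_SERVER_TIME_MS_PER_WINDOW = 10_000;

    private static final class Usage {
        long windowStart;
        int requests;
        long serverTimeMs;
    }

    private final Map<String, Usage> usageByUser = new ConcurrentHashMap<>();

    /** Returns true if the user is allowed to make another request right now. */
    public boolean tryAcquire(String userId) {
        Usage usage = usageByUser.computeIfAbsent(userId, id -> new Usage());
        synchronized (usage) {
            resetIfWindowExpired(usage);
            if (usage.requests >= MAX_REQUESTS_PER_WINDOW
                    || usage.serverTimeMs >= MAX_SERVER_TIME_MS_PER_WINDOW) {
                return false; // the caller should respond with 429
            }
            usage.requests++;
            return true;
        }
    }

    /** Called after the request is handled to account for the server time it consumed. */
    public void recordServerTime(String userId, long elapsedMs) {
        Usage usage = usageByUser.computeIfAbsent(userId, id -> new Usage());
        synchronized (usage) {
            resetIfWindowExpired(usage);
            usage.serverTimeMs += elapsedMs;
        }
    }

    private void resetIfWindowExpired(Usage usage) {
        long now = System.currentTimeMillis();
        if (now - usage.windowStart >= WINDOW_MS) {
            usage.windowStart = now;
            usage.requests = 0;
            usage.serverTimeMs = 0;
        }
    }
}
```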

This graph visualizes the number of users who used the app during non-working hours instead of enjoying their leisure time.

Circuit breaker (CB): This is a pattern that stops sending requests to a service experiencing availability issues until that service recovers. It helps us recover faster when problems occur with any part of our system or with an external component.

This tool is particularly useful when dealing with network connectivity issues between components. While the CB is open, requests to the affected system no longer hang until they time out; they fail immediately with an error. This prevents pool clogging and returns an error to the caller faster. We believe that degradation is better than complete inoperability. In some cases, when the CB is triggered, we also cut off all open connections to the problem system and attempt to re-establish them.
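The pattern maps directly onto libraries like Resilience4j; here’s a minimal sketch with illustrative thresholds, service names, and fallback logic (not our exact setup):

```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;
import java.util.function.Supplier;

public class ExportServiceClient {

    // Open the breaker when more than half of recent calls fail,
    // and probe the downstream service again after 30 seconds.
    private final CircuitBreaker breaker = CircuitBreaker.of("export-service",
            CircuitBreakerConfig.custom()
                    .failureRateThreshold(50)
                    .waitDurationInOpenState(Duration.ofSeconds(30))
                    .build());

    public String export(String documentId) {
        Supplier<String> call = CircuitBreaker.decorateSupplier(breaker,
                () -> callExportService(documentId));
        try {
            return call.get();
        } catch (CallNotPermittedException e) {
            // Breaker is open: fail fast instead of waiting for a timeout.
            return fallbackResponse(documentId);
        }
    }

    private String callExportService(String documentId) {
        // ... the real call to the export service would go here ...
        return "ok";
    }

    private String fallbackResponse(String documentId) {
        return "export temporarily unavailable";
    }
}
```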

Caches: We actively use caches when interacting with databases to offload them and retrieve data faster. We create both local caches (using Caffeine) within each service instance and distributed ones (Redis-based). This approach can cause problems with keeping the local caches of a distributed system up to date, but we tackle them by using a pub/sub mechanism to distribute invalidation events among services.
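Here’s a condensed sketch of that scheme, assuming Caffeine for the local cache and Jedis for Redis pub/sub; the channel name, key types, and database helpers are illustrative:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;

import java.time.Duration;

public class UserSettingsCache {

    private static final String INVALIDATION_CHANNEL = "cache.user-settings.invalidate";

    // Local, per-instance cache.
    private final Cache<String, String> local = Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(Duration.ofMinutes(10))
            .build();

    private final Jedis publisher;

    public UserSettingsCache(Jedis publisher, Jedis subscriber) {
        this.publisher = publisher;
        // Listen for invalidation events published by other instances.
        Thread listener = new Thread(() -> subscriber.subscribe(new JedisPubSub() {
            @Override
            public void onMessage(String channel, String key) {
                local.invalidate(key);
            }
        }, INVALIDATION_CHANNEL));
        listener.setDaemon(true);
        listener.start();
    }

    public String get(String key) {
        // Load from the database on a local cache miss.
        return local.get(key, this::loadFromDatabase);
    }

    public void update(String key, String value) {
        saveToDatabase(key, value);
        local.invalidate(key);
        // Tell every other instance to drop its local copy.
        publisher.publish(INVALIDATION_CHANNEL, key);
    }

    private String loadFromDatabase(String key) { /* ... */ return ""; }

    private void saveToDatabase(String key, String value) { /* ... */ }
}
```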

Conclusion

This basic toolkit simplifies our work, enables us to solve emerging problems swiftly and effectively, and helps prevent those issues from recurring in the future.

Ultimately, it is worth emphasizing that behind all these tools and technologies are people.

It’s indeed thanks to the team of backend reliability experts — who proficiently utilize these tools and are always prepared to find solutions to any issue — that we’re able to ensure the high quality of our product.

We are constantly searching for new approaches and technologies that allow us to make our service more reliable and equip us to handle any load. We strive to proactively identify which parts of the system need improvement to meet the demands of our growing business. I will be happy to answer any questions and comments regarding the work of our team.
