Redis ReConnection Resiliency

Published in Think Special by Gopi K Kancharla, Mar 10, 2019

Cross-region automatic connection rehydration.

👉 Background:

It is a world of microservices. These services often need to store data temporarily and access it frequently and very quickly, so they use in-memory databases such as Redis to avoid disk I/O. They typically run multiple in-memory database clusters to handle large volumes of traffic and to avoid request failures. To reach this data quickly, an application needs a preconfigured pool of established connections that is ready to serve requests.

👉 Problem Statement:

Applications built for resiliency have backup options in case of application or infrastructure failures. In-Memory database clusters that exist in different data centers on different servers allow for backup connectivity in case of data center or server issues.

A multi-region resilient application that uses an in-memory database should be able to reconnect to its cluster after a disconnect and, if the primary cluster is unavailable, connect to a backup cluster so that requests do not fail.

A truly region-agnostic solution dynamically selects the in-memory database cluster for persisting or retrieving data, so the transaction stays seamless to the client even when errors occur while the primary cluster's connections are being rehydrated. This dynamic rehydration is hard to achieve when the application relies on modern dependency injection and auto-wiring frameworks, which typically create such objects once at startup.

Once an incoming request has reached the application, it should be handled correctly even if the underlying infrastructure in that region experiences failures.

While serving an incoming request, the application needs to persist or fetch the data dynamically from a backup in-memory database cluster when the primary cluster has issues. Meanwhile, the connections to the failed cluster are rehydrated in its connection pool, so that subsequent requests can be served without depending on the backup cluster.

👉 Solution Summary:

A simple solution to the above problem is to recreate all the connections that were pooled against the in-memory database that has connectivity issues.

The application then uses the newly established connection pool and makes sure that all remaining transactions take their connections from this newly cached pool. The key technical aspect is managing these pools dynamically at runtime.

Singleton:

The service uses a custom extension of the framework that restricts each object to a single instance, following the singleton design pattern. For in-memory databases, the framework lets the application instantiate one template (like Spring RestTemplate) to handle all database transactions. The underlying template, which is built on the connection factory and pool configuration, establishes a connection to execute each transaction. When this template becomes unusable for establishing connections, it self-heals by rehydrating the pool of connections from its factory: the framework discards the old template and creates a new one. This single new template then serves all incoming requests.
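
The discard-and-recreate behavior of a singleton template can be sketched as follows. This is a minimal illustration, not the framework's actual code: `Template`, `TemplateHolder`, and the `usable` flag are hypothetical names, and no real database is contacted.

```python
import threading


class Template:
    """Hypothetical stand-in for a database template backed by a connection pool."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.usable = True

    def execute(self, command):
        if not self.usable:
            raise ConnectionError(f"template for {self.cluster} is unusable")
        return f"{command} on {self.cluster}"


class TemplateHolder:
    """Keeps exactly one live template and swaps it when it becomes unusable."""

    def __init__(self, factory):
        self._factory = factory          # callable that builds a fresh Template
        self._lock = threading.Lock()
        self._template = factory()

    def get(self):
        return self._template

    def replace(self, broken):
        # Only the first caller that reports a given broken template rebuilds it;
        # later callers see that it was already swapped and reuse the new one.
        with self._lock:
            if self._template is broken:
                self._template = self._factory()
            return self._template


holder = TemplateHolder(lambda: Template("primary"))
first = holder.get()
first.usable = False                      # simulate a template rendered unusable
try:
    first.execute("GET key")
except ConnectionError:
    second = holder.replace(first)        # discard the old template, build a new one
```

Comparing against the broken instance inside the lock is a design choice assumed here so that several requests failing at once trigger only one rebuild.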

Active & Backup Connections:

The service creates and uses multiple in-memory database templates: one connects to the primary cluster and another connects to a backup cluster. The primary cluster is determined by the “region” of the server the application is running on. A region can represent a geographical location, a data center, or both. The same application running in multiple regions connects to the corresponding database through the template configuration mechanism.
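
As a rough sketch of how the primary and backup templates might be chosen from the server's region, consider the following. The region names, endpoints, and the `select_templates` helper are assumptions made for illustration; a real service would read them from its own configuration.

```python
from dataclasses import dataclass

# Hypothetical region-to-cluster mapping; real values would come from configuration.
CLUSTERS = {
    "east": "redis-east.example.internal:6379",
    "west": "redis-west.example.internal:6379",
}


@dataclass(frozen=True)
class TemplateConfig:
    role: str       # "primary" or "backup"
    region: str
    endpoint: str


def select_templates(local_region, clusters=CLUSTERS):
    """Return (primary, backup) configs for the region this instance runs in."""
    if local_region not in clusters:
        raise ValueError(f"unknown region: {local_region}")
    # The first other region in the mapping acts as the backup.
    backup_region = next((r for r in clusters if r != local_region), None)
    if backup_region is None:
        raise ValueError("no backup region configured")
    primary = TemplateConfig("primary", local_region, clusters[local_region])
    backup = TemplateConfig("backup", backup_region, clusters[backup_region])
    return primary, backup


# The same code deployed in "west" treats the west cluster as primary.
primary, backup = select_templates("west")
```

Running this selection once at startup keeps the per-request path free of region lookups.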

Cross Region Resiliency:

Every service is “cross-region” resilient, meaning that the application is deployed on multiple servers in multiple regions and the underlying in-memory database cluster infrastructure is similarly deployed on different servers in different regions.

If an entire “region” experiences issues, an incoming client request is handled by the application running against the automatically replicated cluster in another region. However, once a request has reached the application, if the underlying database cluster experiences issues, the service dynamically forwards the operations to the backup cluster so that the request still completes successfully. At the same time, the service discards all the pre-created connections for the failed region, recreates a new set of connections according to the supplied pool configuration, and establishes a new template so that subsequent operations succeed. It keeps retrying until the reconnection is successful.
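
The forward-to-backup behavior described above could look roughly like this. It is a simplified sketch under several assumptions: `FakeCluster`, `Template`, and `FailoverClient` are hypothetical names, the clusters are plain in-memory objects, and the primary template is rebuilt synchronously rather than in parallel with the backup call.

```python
class FakeCluster:
    """Hypothetical in-memory stand-in for a database cluster."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class Template:
    """Hypothetical template bound to one cluster."""

    def __init__(self, cluster):
        self.cluster = cluster

    def set(self, key, value):
        if not self.cluster.up:
            raise ConnectionError(f"{self.cluster.name} unavailable")
        self.cluster.data[key] = value
        return self.cluster.name          # report which cluster served the write


class FailoverClient:
    def __init__(self, primary, backup):
        self._primary_cluster = primary
        self.primary = Template(primary)
        self.backup = Template(backup)
        self.rebuilds = 0

    def _rehydrate_primary(self):
        # Discard the old template (and its pooled connections) and build a new one.
        self.primary = Template(self._primary_cluster)
        self.rebuilds += 1

    def set(self, key, value):
        try:
            return self.primary.set(key, value)
        except ConnectionError:
            self._rehydrate_primary()                 # prepare for later requests
            return self.backup.set(key, value)        # complete this request now


east, west = FakeCluster("east"), FakeCluster("west")
client = FailoverClient(east, west)
served = [client.set("k1", "v1")]
east.up = False                                       # primary cluster goes down
served.append(client.set("k2", "v2"))
east.up = True                                        # primary cluster recovers
served.append(client.set("k3", "v3"))
```

In this run the second write lands on the backup cluster, and the third goes back to the primary through the rebuilt template.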

Connection Rehydration:

A broken connection to the in-memory database cluster must be removed, because the error is unrecoverable if no action is taken. The framework, when implemented with a singleton connection, holds on to broken connections and does not easily support removing them. An incoming request will fail if the connection provided by the template was broken previously. The service must therefore support runtime connection rehydration by removing all the connections from the pool and creating new ones. When a connection is broken, the application recreates the templates and connections to provide a clean reconnection to the in-memory database cluster for the next transaction.
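
A minimal sketch of runtime rehydration follows, assuming that one broken connection is treated as evidence that the whole pool is stale. The `Connection`, `ConnectionFactory`, and `PooledTemplate` classes are hypothetical and only simulate broken sockets.

```python
class Connection:
    def __init__(self, conn_id):
        self.conn_id = conn_id
        self.broken = False

    def send(self, command):
        if self.broken:
            raise BrokenPipeError(f"connection {self.conn_id} is broken")
        return f"OK {command}"


class ConnectionFactory:
    def __init__(self):
        self.created = 0

    def new_pool(self, size):
        pool = [Connection(self.created + i) for i in range(size)]
        self.created += size
        return pool


class PooledTemplate:
    def __init__(self, factory, size=3):
        self._factory = factory
        self._size = size
        self._pool = factory.new_pool(size)

    def execute(self, command):
        conn = self._pool.pop()
        try:
            result = conn.send(command)
        except BrokenPipeError:
            # Drop every pooled connection and start over with a fresh pool,
            # so the next transaction gets a clean connection.
            self._pool = self._factory.new_pool(self._size)
            raise
        self._pool.append(conn)
        return result


factory = ConnectionFactory()
template = PooledTemplate(factory)
for conn in template._pool:               # simulate a database restart
    conn.broken = True
try:
    template.execute("PING")
    first_failed = False
except BrokenPipeError:
    first_failed = True                   # the request that hit the broken pool fails
reply = template.execute("PING")          # the next one uses the rehydrated pool
```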

Runtime and Performance:

Every service should determine its primary and backup regions at startup, so that the cost of deciding which database cluster handles the default operations is paid once rather than on every request. When an operation fails on the active cluster, the service should recreate the underlying templates and connections and forward the operation to the backup cluster. This implementation allows quick, dynamic reconnection and multi-region resiliency across database clusters. Because the newly created templates are singletons that are fetched at runtime, the performance impact is avoided except for the one request that failed first.

Detailed Description:

Overview:

Connecting to in-memory databases using existing technologies is pretty simple. The problems that most applications overlook arise when the databases are rehydrated or when a network connection is temporarily lost.

The application has a pooled set of preconfigured connections, which are produced by a factory and cached. An eviction routine runs at configured intervals to verify that the connections are still valid. The diagram below shows multiple templates that connect to multiple region-based in-memory databases. Each template is configured with a factory, and the connections the factory creates are assigned to a pool. The factory then retrieves a connection from the pool as needed when a transaction is requested.
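
To make the factory, pool, and eviction check more concrete, here is a small sketch of a single eviction run. The class names and the `ping` check are assumptions for illustration; a real pool would schedule this run at the configured interval instead of calling it by hand.

```python
import itertools


class Connection:
    def __init__(self, conn_id):
        self.conn_id = conn_id
        self.alive = True

    def ping(self):
        # Stand-in for a real validity check against the database.
        return self.alive


class ValidatingPool:
    def __init__(self, factory, size):
        self._factory = factory                     # callable creating a Connection
        self._idle = [factory() for _ in range(size)]

    def borrow(self):
        return self._idle.pop()

    def give_back(self, conn):
        self._idle.append(conn)

    def idle_count(self):
        return len(self._idle)

    def evict_invalid(self):
        """One eviction run: drop idle connections that fail validation, then refill."""
        healthy = [c for c in self._idle if c.ping()]
        evicted = len(self._idle) - len(healthy)
        healthy.extend(self._factory() for _ in range(evicted))
        self._idle = healthy
        return evicted


ids = itertools.count()
pool = ValidatingPool(lambda: Connection(next(ids)), size=3)
stale = pool.borrow()
stale.alive = False                                 # this connection lost its socket
pool.give_back(stale)
evicted = pool.evict_invalid()                      # replaces only the stale one
```

An eviction run of this kind only repairs connections sitting idle in the pool, which is one reason a full rehydration is still needed for connections held elsewhere.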

Lost Connections:

The diagram below shows what happens when a connection is lost because of a database rehydration, a database restart, a network issue, a firewall issue, or any other reason. The entire pool of connections becomes invalid because the connections have lost their sockets, and requested transactions fail with "connection refused" error messages.

Invalid Connections Pool

Even after the database comes back up, the entire pool of connections is still invalid. The pooled connections have lost track of what happened to the database in the meantime, so they report broken pipe errors instead of re-establishing a connection.

Rehydrated Connections:

All the connections inside a pool require rehydration before a transaction can execute successfully. This requires the factory to create new connections, held in a pool, for the corresponding templates, which also need to be recreated. When a failed connection is detected, the system recreates all the templates connected to the corresponding region-based databases.

Connections are held in a thread context and are not closed or returned to the pool, so that the next transaction can execute quickly. These are called cached connections, taken from the pool into a thread context. When a random connection fails to execute a transaction, it is difficult to find out which connection was used and how many such broken connections exist inside the pool. It takes less time to recreate the pool of connections than to go through the connections one by one and close them. Closing these connections individually also eventually leads to a pool-exhausted state, with no connections available for upcoming transactions.
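
One possible way to invalidate thread-cached connections without tracking each of them is a generation counter. This technique is an assumption of the sketch, not something the article prescribes, and `RehydratablePool` is a hypothetical name.

```python
import threading


class Connection:
    def __init__(self, generation):
        self.generation = generation      # pool generation this connection belongs to


class RehydratablePool:
    def __init__(self):
        self._lock = threading.Lock()
        self._generation = 0
        self._local = threading.local()   # holds one cached connection per thread

    def rehydrate(self):
        # Bump the generation instead of hunting down and closing every
        # connection cached in some thread's context.
        with self._lock:
            self._generation += 1

    def connection(self):
        cached = getattr(self._local, "conn", None)
        if cached is None or cached.generation != self._generation:
            # The cached connection predates the last rehydration: replace it.
            cached = Connection(self._generation)
            self._local.conn = cached
        return cached


pool = RehydratablePool()
before = pool.connection()
same = pool.connection()                  # reused from the thread context
pool.rehydrate()                          # e.g. after a broken pipe error
after = pool.connection()                 # stale cache detected, new connection
```

Each thread discovers on its next call that its cached connection is stale, so no thread has to close connections owned by another.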

After the in-memory database server restarts, operations that hold a connection in a thread start throwing broken pipe errors. To avoid this, rehydrating the connections is the best solution, as represented below.