Production Replay System for Cloud Software Applications

Published in The Socure Technology Blog, Dec 14, 2022

By Venkatesh Prabu Narayanan, Senior Software Engineer; Sathya Srinivasan, Senior Staff Software Engineer; Swami Subbarathnam, Senior Director of Engineering

Introduction

Cloud software applications are often expected to support many new features in a short time with high quality and availability, and their scope and complexity have been growing.

Problems

These are the problems we’re addressing:

The set of use cases a cloud software application supports is already substantial, and each new feature adds more. The N x N dependencies between existing and new features affect accuracy and functionality, which in turn complicates development and testing. The expectation that all these features interoperate has therefore increased the scope and complexity of testing.
A gap always exists between the test coverage in pre-production environments and the many use cases served by the production environment. The real-time traffic and variety of use cases in production cannot be fully simulated in pre-production.
While developing and releasing any new feature, ensuring high quality and high availability in a short span of time, with so many interconnected features and use cases, is challenging.

A system that can replay production use cases and traffic patterns in pre-production (in real time as well as in offline pipelines) would help bridge these gaps. Another critical use case for the replay system is to “shift left” during development: new features consume the replayed transactions early, so that even the first cut of a feature is closer to production quality and development life cycles are shorter.

However, developing such a “replay system” end to end poses many challenges.

Challenges

These are some of the important challenges (not in order of priority):

Compliance or contract restrictions

Dealing with potential restrictions in sharing the production traffic/data to a pre-production (replay) environment because of sensitivity/privacy concerns

Creation and isolation

Creating a short lived pre-production environment and isolating it to avoid causing disruption for the production environment

Live replay

Keeping an exact copy of the state of the whole environment to run live replay simultaneously with the production environment and handling several scenarios

Offline replay

Keeping an audit of transactions executed in the production environment with all the required state information to be able to replay them in an offline replay pipeline

On-demand scaling

Overhead of ramping the replay environment up and down on demand

Cost

Huge operational cost incurred in running the whole replay environment

Filter criteria

Developing a rich set of criteria to filter and replay a subset of production traffic, both to ensure an apples-to-apples comparison and to enable the “what if” scenarios that are critical during feature development for assessing impact and a positive feature flip

Replay use cases

Implementing any new feature or bug fix with the replay use cases accounted for in the architecture and design

Infrastructure

Upgrading any cloud infrastructure resource in the production environment while accounting for the replay environment as well

High availability

Ensuring high availability of production during code deployments and other situations while keeping the replay environment fully in sync

Solution

The offline or live replay system would be set up as a “parallel-production” environment. It would work like an active clone of the production environment. It wouldn’t serve the production traffic directly but would work as a parallel-live path and/or an offline pipeline. Because it is treated as a production-grade environment rather than a conventional pre-production one, this setup would address the potential restrictions on sharing production traffic and data.

Nonetheless, the parallel-production setup adds to the operational cost. Whether it’s parallel-live or an offline replay, we must keep it running only when needed. Auto scaling of the infrastructure for the live replay can help balance the cost, performance and availability. For the on-demand setup, automation to ramp up and down the system will be useful. The particular cloud technology being used for the system governs the effectiveness of the automation and influences the operational cost.

The operational cost could be reduced by sharing the downstream dependencies (database and other storage clusters, and other service instances in the cloud) between the production and replay environments, but doing so creates a single point of failure. If traffic on the replay path is heavy, it would strain the shared dependencies and disrupt production traffic. We can solve this problem with a rate-limit mechanism on the replay path, which would address the reliability and isolation concerns too.
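One way to picture such a rate limit is a token bucket that only replay calls have to pass through. The sketch below is a minimal illustration, not our implementation: the names `TokenBucket` and `call_dependency`, the rate and capacity parameters, and the choice to drop (rather than queue) over-budget replay calls are all assumptions made for the example.

```python
import time


class TokenBucket:
    """Hypothetical limiter that caps replay traffic to a shared dependency."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec        # tokens added per second
        self.capacity = capacity        # maximum burst size
        self.tokens = float(capacity)   # start with a full bucket
        self.clock = clock              # injectable so tests can control time
        self.last = clock()

    def allow(self) -> bool:
        """Refill based on elapsed time, then try to spend one token."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def call_dependency(is_replay: bool, limiter: TokenBucket, do_call):
    """Production calls always go through; replay calls are skipped when over budget."""
    if is_replay and not limiter.allow():
        return None  # replay call dropped so production capacity is preserved
    return do_call()
```

Dropping an over-budget replay call is the simplest policy because the replay path never serves a real caller. A queue or a retry with backoff would be reasonable alternatives if replay coverage matters more than timeliness.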

The cloud software stack that runs on the production and replay systems must always be identical; only then can we minimize false positives when analyzing discrepancies in execution results across these environments. The exception is when a new feature or bug fix is being tested for regressions against production use cases, in which case the replay system runs a release-candidate version of the software. This helps build confidence in the code changes before they go live in production.
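To make the discrepancy analysis concrete, here is a minimal sketch of comparing one production result with its replayed counterpart. The function name `diff_results`, the flat-dictionary result shape, and the field names in the usage example are hypothetical. The `ignore` set is an assumed way to exclude fields, such as latency, that legitimately differ between environments and would otherwise show up as false positives.

```python
_MISSING = object()  # marks a field that is absent from one side


def diff_results(production: dict, replay: dict, ignore=frozenset()) -> dict:
    """Return {field: (production_value, replay_value)} for every differing field."""
    diffs = {}
    for key in sorted(set(production) | set(replay)):
        if key in ignore:
            continue  # expected to differ, so not a real discrepancy
        prod_value = production.get(key, _MISSING)
        replay_value = replay.get(key, _MISSING)
        if prod_value != replay_value:
            diffs[key] = (
                None if prod_value is _MISSING else prod_value,
                None if replay_value is _MISSING else replay_value,
            )
    return diffs
```

For example, with hypothetical results `{"decision": "accept", "score": 0.91, "latency_ms": 120}` from production and `{"decision": "accept", "score": 0.87, "latency_ms": 340}` from replay, calling `diff_results(prod, rep, ignore={"latency_ms"})` reports only the `score` field.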

On the offline replay pipeline, we would usually replay only a subset of the production traffic. The ability to choose such a subset using a rich set of filter criteria is essential for executing a focused set of production use cases. This helps analyze the software stack running in the replay environment against typical production traffic patterns and also helps minimize false positives. A subsystem that offers a wide variety of filter criteria for selecting from the full set of production traffic patterns will need an audit store (with a detailed breakdown of the various parameters) and an offline and/or live segregation layer.
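As an illustration of composable filter criteria over audited transactions, consider the sketch below. The `AuditRecord` fields, the criterion helpers, and `select_for_replay` are hypothetical names chosen for the example; a real audit store would carry whatever parameters the application actually records.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical audit entry for one production transaction."""
    transaction_id: str
    timestamp: datetime
    endpoint: str
    status_code: int


Criterion = Callable[[AuditRecord], bool]


def by_endpoint(*endpoints: str) -> Criterion:
    """Keep only transactions that hit one of the given endpoints."""
    allowed = set(endpoints)
    return lambda rec: rec.endpoint in allowed


def in_window(start: datetime, end: datetime) -> Criterion:
    """Keep only transactions with start <= timestamp < end."""
    return lambda rec: start <= rec.timestamp < end


def succeeded() -> Criterion:
    """Keep only transactions that returned a 2xx status."""
    return lambda rec: 200 <= rec.status_code < 300


def select_for_replay(records: Iterable[AuditRecord], *criteria: Criterion) -> Iterator[AuditRecord]:
    """Yield the records that satisfy every criterion (logical AND)."""
    for rec in records:
        if all(criterion(rec) for criterion in criteria):
            yield rec
```

Keeping each criterion as a small predicate lets a “what if” run be described as a list of criteria, for instance `select_for_replay(records, by_endpoint("/score"), succeeded())`, without changing the selection loop.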

For the live replay use case, both pipelines (production and replay) must start executing each call from an “exact copy” of the state of all the downstream dependencies, such as database storage and other services in the pipeline. This can be achieved by taking a snapshot of the state when the call starts and reusing it for both parallel executions. The execution times of the parallel pipelines may differ, but only the production pipeline’s response is awaited and returned to the caller.
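The parallel execution described above can be sketched as follows. This is a simplified, in-process illustration under stated assumptions: the state is a plain Python object that `copy.deepcopy` can snapshot, the handlers are ordinary functions, and `handle_call`, `on_replay_done` and the shared thread pool are hypothetical names. In a real deployment the snapshot would come from the downstream stores and services themselves.

```python
import copy
from concurrent.futures import Future, ThreadPoolExecutor

# Assumed shared pool for replay work; the size is arbitrary for this sketch.
_replay_pool = ThreadPoolExecutor(max_workers=4)


def handle_call(request, state, production_handler, replay_handler, on_replay_done):
    """Serve the call from production while replaying it from the same starting state."""
    # Snapshot the state before either pipeline runs, so both start identically.
    snapshot = copy.deepcopy(state)

    def run_replay():
        try:
            result = replay_handler(request, snapshot)
        except Exception as exc:  # a replay failure must never reach the caller
            result = exc
        on_replay_done(request, result)

    replay_future: Future = _replay_pool.submit(run_replay)

    # Only the production response is awaited and returned to the caller.
    response = production_handler(request, state)
    return response, replay_future
```

The future is returned only so that tooling or tests can wait for the replay to finish; the caller-facing path depends on `response` alone. The replay result is handed to `on_replay_done`, where it could be stored for later comparison with the production result.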

Lastly, the architecture and design changes for a new feature implementation will have to keep replay use cases in mind as well. Each service would implement the replay pipeline as part of its production use cases so that the parallel-execution aspects are accounted for. Code deployments to the production environment for each pair of upstream and downstream services would have to consider the replay environment as well.

To conclude, although developing a replay system poses several challenges, it is feasible with the right solutions.