Traffic Simulation — The practical way to test systems
Traffic simulation (also known as shadowing) replays production traffic patterns against a test cluster or a new cluster running the newly “deployed” version of a service. The replay can be real-time, where traffic is sent to the “deployed” and “released” versions simultaneously, or asynchronous, where traffic is recorded first and replayed later.
The Principles of Chaos Engineering state:
Systems behave differently depending on environment and traffic patterns. Since the behaviour of utilisation can change at any time, sampling real traffic is the only way to reliably capture the request path.
No quality engineering strategy, whether better test coverage, corner cases, security tests or load tests, will suffice if we do not understand how the product or service is actually used. How a system behaves under real production traffic is always something of a mystery, and it becomes more complex in distributed systems and containerised environments. Adopting traffic simulation can be a great way to build confidence in new and existing code bases against production traffic.
Since we cannot simply move fast and break things, more and more fast-paced teams are adopting it in one form or another.
The two primary use cases of traffic simulation (there are more) are:
- Functional: analysing and replaying the functional paths of the services against the new code base on a test cluster or any replica cluster.
- Non-functional: replaying production traffic at 2x throughput to reveal how the system would scale under twice the load.
As Doris Lessing says, “Things are not quite so simple always as black and white”. Adoption comes with its own complexity, which depends on the complexity of the target system. Systems with little backend computation are simpler to simulate than systems with complex business configurations, logic and processing. Some open questions to address when adopting traffic shadowing:
- How do we handle the security concerns of processing production data such as customer information? Should we filter the production data?
- If a service updates data, how do we isolate those changes so that they do not affect the state of the system?
- How easy is it to create a production-like environment with DATA?
- How do we contain the test cluster so that it does not interfere with live collaborator services?
These are real challenges that teams face, and they can be the reason NOT to try traffic shadowing.
At Capillary Tech, we invested time and effort in writing a custom traffic shadowing tool that caters to the above challenges and moves us in the direction of “Testing without Writing Tests”.
Note: the idea is NOT to get into the code or implementation of that tool but to discuss the concept.
The goal is to serve both functional and non-functional use cases. Since adding a pre-production environment with data is quite heavy in our scenario, we use a test cluster as the target environment for traffic simulation. This brings even more complexity, which is why we developed a custom rule engine that manipulates and evaluates the URL data, request content, headers, method and params and, last but not least, the configuration of the target environment. The flow diagram is discussed below.
The flow is as follows:
- Puller pulls the server logs from the intended production cluster. We have different production clusters for different brands in a region, so the traffic patterns differ widely between clusters. Note: all production data is encrypted and secure in that sense.
- Log parser extracts the meaningful information from the pulled logs and creates intermediate server logs that are ready to be replayed. It is a multi-threaded component and can be initialised with the intended concurrency.
- Rule engine works on the parsed production logs. It manipulates and evaluates the data captured from the logs without changing the structure and pattern of the calls. An inner component of the rule engine also pre-sets the configuration and data on the target system.
- Simulator replays the traffic against the target system with the changed data and passes each response to the response parser, which parses it. It is a multi-threaded component and can be initialised with the intended concurrency.
- Logger writes the data to ES and the file system for further analysis.
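To make the parser and rule engine steps more concrete, here is a minimal sketch in Python. It is an illustration only and not the Capillary tool: the log line format, the `TARGET_HOST` value, the field names in `SENSITIVE_FIELDS` and the helper names `parse_line` and `apply_rules` are all assumptions made up for this example.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Hypothetical log format: "<epoch_seconds> <METHOD> <URL> <JSON body or ->"
LOG_LINE = re.compile(
    r"^(?P<ts>\d+(?:\.\d+)?) (?P<method>[A-Z]+) (?P<url>\S+) (?P<body>.*)$"
)

TARGET_HOST = "test-cluster.example.internal"    # assumed test cluster host
SENSITIVE_FIELDS = {"customer_email", "mobile"}  # assumed field names to mask


@dataclass
class ReplayRequest:
    timestamp: float
    method: str
    url: str
    body: dict = field(default_factory=dict)


def parse_line(line: str) -> Optional[ReplayRequest]:
    """Turn one raw log line into a ReplayRequest, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    raw_body = match.group("body")
    body = json.loads(raw_body) if raw_body != "-" else {}
    return ReplayRequest(
        float(match.group("ts")), match.group("method"), match.group("url"), body
    )


def apply_rules(request: ReplayRequest) -> ReplayRequest:
    """Point the call at the test cluster and mask sensitive values.

    Path, query string, method and body structure stay exactly as captured.
    """
    parts = urlsplit(request.url)
    url = urlunsplit(
        (parts.scheme, TARGET_HOST, parts.path, parts.query, parts.fragment)
    )
    body = {
        key: ("***" if key in SENSITIVE_FIELDS else value)
        for key, value in request.body.items()
    }
    return ReplayRequest(request.timestamp, request.method, url, body)
```

A real rule engine would carry many more rules (headers, auth tokens, environment-specific configuration), but the shape is the same: parse once into a neutral record, then rewrite only what must differ on the target.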
Below is a glimpse of the ES data for the replay of one of the production logs. You can analyse how the new codebase behaves against production traffic. The different colours represent the different statuses of the simulated traffic calls (internal to the Capillary system).
The traffic can be replayed at 2x or 3x throughput on the target clusters to analyse the capacity of the servers.
The simulator is intended to replay the traffic with the same delays and pauses as in the production logs, but with added concurrency. If required, to serve functional goals, the delays can be ignored and the traffic replayed at a higher pace.
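One way to picture the replay timing is the sketch below. It assumes each recorded call is a `(timestamp, payload)` pair and that throughput is raised by compressing the recorded gaps with a `speedup` factor. That is only one possible approach, and the function names and parameters are hypothetical. The `send` callable stands in for whatever actually issues the request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def schedule_offsets(timestamps, speedup=1.0, keep_delays=True):
    """Return, for each call, the delay in seconds from the start of the replay."""
    if not timestamps:
        return []
    if not keep_delays:
        # Functional runs: ignore the recorded pauses and fire as fast as possible.
        return [0.0] * len(timestamps)
    start = timestamps[0]
    # speedup=2.0 halves every recorded gap, which doubles the request rate.
    return [(ts - start) / speedup for ts in timestamps]


def replay(requests, send, pool_size=10, speedup=1.0, keep_delays=True,
           sleep=time.sleep):
    """Call send(payload) for each (timestamp, payload) pair; results keep log order."""
    requests = sorted(requests, key=lambda item: item[0])
    offsets = schedule_offsets([ts for ts, _ in requests], speedup, keep_delays)
    began = time.monotonic()
    futures = []
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        for offset, (_, payload) in zip(offsets, requests):
            wait = offset - (time.monotonic() - began)
            if wait > 0:
                sleep(wait)  # hold until this call's scheduled moment
            futures.append(pool.submit(send, payload))
        return [future.result() for future in futures]
```

Calling `replay(calls, send, speedup=2.0)` would fire the same sequence in half the recorded time, while `keep_delays=False` drops the pauses entirely and lets the pool size alone bound the pace.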
For instance, with the current optimisations we can parse 20 GB of production logs in ~10 minutes and replay them in 45 minutes with a pool size of 10.
The deployment typically spawns a container from a custom Docker image (Python based), and the execution is handled by a CI job that pulls the logs at run time, then processes and replays the traffic. The CI job is controlled via config params such as log file, pool size, log to ES, monitor downstream services and so on.
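As a rough idea of how such config params could reach the replay script, here is a small `argparse` sketch. The flag names are invented for illustration and mirror the params listed above. The actual CI job may pass its configuration in a completely different way.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line options a CI job could pass to the replay container."""
    parser = argparse.ArgumentParser(
        description="Replay recorded production traffic (illustrative sketch)"
    )
    # All flag names below are hypothetical.
    parser.add_argument("--log-file", required=True,
                        help="path or key of the production log to pull")
    parser.add_argument("--pool-size", type=int, default=10,
                        help="number of concurrent replay workers")
    parser.add_argument("--log-to-es", action="store_true",
                        help="also write replay results to ES")
    parser.add_argument("--monitor-downstream", action="store_true",
                        help="watch downstream services during the run")
    return parser
```

A CI step could then call `build_parser().parse_args()` at start-up and hand the resulting values to the parser and simulator.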
Types of bugs the QE team is generally able to catch via simulations
- Bugs tied to specific configs that are not part of any regression suite.
- Bugs where backend services error out while processing special combinations of payloads. For example, the amount in a transaction resolves to a value that breaks the service for that call, or point redemption fails when programs are passed to it.
- The list goes on, but the point is that reproducing production patterns in your test environment will reveal a lot of hidden bugs in the system.
Next on the QE team's roadmap is to add a “tap compare” flavour, which can find potential bugs in a service by running instances of the new and old code side by side. It behaves as a proxy and multicasts every request it receives to each of the running instances. It then compares the responses and reports any regressions that surface from these comparisons. The premise of “tap compare” is that if two implementations of the service return “similar” responses for a sufficiently large and diverse set of requests, the two implementations can be treated as equivalent and the newer implementation is regression-free. The Scala-based “diffy” is a good example of a tool built on this idea.
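The comparison step at the heart of tap compare can be sketched in a few lines of Python. This is a simplified illustration, not how diffy or any planned tool works internally: `call_old` and `call_new` are hypothetical callables that send a request to the old and new versions and return parsed JSON, and the `ignore` set is an assumed, hand-maintained list of noisy fields such as timestamps.

```python
def diff_responses(old, new, ignore=frozenset(), path=""):
    """Return a list of readable differences between two JSON-like values."""
    if isinstance(old, dict) and isinstance(new, dict):
        diffs = []
        for key in sorted(set(old) | set(new)):
            if key in ignore:
                continue  # skip fields expected to differ on every call
            child = f"{path}.{key}" if path else str(key)
            if key not in old:
                diffs.append(f"{child}: only in new")
            elif key not in new:
                diffs.append(f"{child}: only in old")
            else:
                diffs.extend(diff_responses(old[key], new[key], ignore, child))
        return diffs
    if isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        diffs = []
        for index, (left, right) in enumerate(zip(old, new)):
            diffs.extend(diff_responses(left, right, ignore, f"{path}[{index}]"))
        return diffs
    # Scalars, mismatched types and lists of different length are compared whole.
    return [] if old == new else [f"{path or '<root>'}: {old!r} != {new!r}"]


def tap_compare(request, call_old, call_new, ignore=frozenset()):
    """Send the same request to both versions and report the differences."""
    return diff_responses(call_old(request), call_new(request), ignore)
```

An empty result for a large and varied set of replayed requests is the signal that the new version behaves like the old one. Any non-empty result is a candidate regression to inspect.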
Various tools and solutions support traffic shadowing, but based on my analysis none can be used directly as is. You have to plan for customisation to handle the complexities of production systems. A few examples and references are:
I hope the above discussion adds some value and encourages teams to adopt traffic shadowing in spite of the challenges discussed.