CQRS Pattern Success Story

How Technology Makes an Impact on Business

D. Husni Fahri Rizal
The Legend
8 min read · May 11, 2021


Image source: https://vladikk.com/2017/03/20/tackling-complexity-in-cqrs/

We will not explain what CQRS means or the best practices for implementing it this time, because that has been discussed extensively, both on The Legend channel and in other articles. In this article, I would like to tell the story of how CQRS was successfully implemented in a production environment, backed up by reliable tests and facts.

We will not go into detail about the company or the platform, because I have not worked there in a long time and believe this is confidential information that should not be written down. I will present the information anonymized but with real data: we will attempt to present actual numbers from the performance tests of a search service.

Platform Explanation

We developed and tested a search platform for a specific business domain. The system was built around 2019, and we are talking about the customer-facing part, also known as the query API or search API. This API is heavily accessed by the public, with a number of users per second that cannot be accurately predicted, especially when there is a promotion. It is just as important as the product search on an e-commerce marketplace site.

The query API implements the CQRS pattern, and the system as a whole follows a microservice architecture with the following technology stack on the backend.

  1. Java JDK 8
  2. Spring Cloud Microservice (service discovery — Eureka, API gateway — Zuul 2, client-side load balancing — Ribbon, and circuit breaker — Hystrix)
  3. Spring WebFlux (reactive, i.e. asynchronous and non-blocking, processing; see the sketch after this list)
  4. Spring Reactive Kafka
  5. Elasticsearch
  6. Apache Kafka
  7. Spring Reactive Redis
  8. Redis
  9. Spring Reactive Elasticsearch (customized)
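
To make the reactive stack concrete, here is a minimal sketch of what a non-blocking query endpoint on Spring WebFlux can look like. The controller and the SearchService and SearchResult types are hypothetical stand-ins, not the actual production code.

    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RequestParam;
    import org.springframework.web.bind.annotation.RestController;
    import reactor.core.publisher.Mono;

    @RestController
    public class SearchController {

        // Hypothetical service wrapping the Redis/Elasticsearch read path.
        private final SearchService searchService;

        public SearchController(SearchService searchService) {
            this.searchService = searchService;
        }

        // Returning a Mono keeps the handler non-blocking: the event-loop
        // thread is released while Redis or Elasticsearch is being queried.
        @GetMapping("/search")
        public Mono<SearchResult> search(@RequestParam String location,
                                         @RequestParam String date) {
            return searchService.search(location, date);
        }

        // Minimal stand-ins so the sketch compiles; the real types are not
        // shown in this article.
        interface SearchService {
            Mono<SearchResult> search(String location, String date);
        }

        static class SearchResult { }
    }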

At a high level, the following is the software architecture used on the platform.

Software architecture

Most communication between services is asynchronous and non-blocking. Meanwhile, the architecture of the query API and its process flow are detailed below.

Service communication between the query API and other services

According to the diagram above, the query API reads data from Redis or Elasticsearch (ES). The data in ES is supplied by the main (core) services through an event-based mechanism via Kafka.
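
As an illustration of that event-based flow, here is a rough sketch using reactor-kafka. The topic name and the EsWriter wrapper (standing in for the customized reactive Elasticsearch client) are assumptions; the real pipeline is certainly more elaborate.

    import java.util.Collections;
    import reactor.core.publisher.Flux;
    import reactor.core.publisher.Mono;
    import reactor.kafka.receiver.KafkaReceiver;
    import reactor.kafka.receiver.ReceiverOptions;

    public class ProductEventIndexer {

        private final KafkaReceiver<String, String> receiver;
        private final EsWriter esWriter; // hypothetical wrapper over the customized reactive ES client

        public ProductEventIndexer(ReceiverOptions<String, String> options, EsWriter esWriter) {
            // "core-service-events" is an assumed topic name.
            this.receiver = KafkaReceiver.create(
                    options.subscription(Collections.singleton("core-service-events")));
            this.esWriter = esWriter;
        }

        public Flux<Void> run() {
            return receiver.receive()
                    // Index each event into ES, then acknowledge the offset only
                    // after the write succeeds, so no event is silently dropped.
                    .concatMap(record -> esWriter.index(record.value())
                            .doOnSuccess(v -> record.receiverOffset().acknowledge()));
        }

        interface EsWriter {
            Mono<Void> index(String json);
        }
    }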

The process flow in the query API is as follows.

Flow diagram and back-off policy for when data in Elasticsearch is not found or an error occurs

This deserves very serious attention. The back-off policy is something that is frequently overlooked when applying the CQRS pattern, even though almost every CQRS implementation uses a secondary database whose data depends on the primary database.

How the databases communicate and which types of database are used have a significant impact on how well the system can keep serving data even when there is a delay between the primary database and the CQRS read database.
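
As a sketch of what such a read path with a back-off policy can look like with Reactor, assuming hypothetical Cache and EsReader wrappers and an illustrative retry budget: try Redis first, fall back to ES, and retry transient ES failures with exponential back-off.

    import java.time.Duration;
    import reactor.core.publisher.Mono;
    import reactor.util.retry.Retry;

    public class SearchReader {

        private final Cache redis;       // hypothetical reactive Redis wrapper
        private final EsReader elastic;  // hypothetical reactive Elasticsearch wrapper

        public SearchReader(Cache redis, EsReader elastic) {
            this.redis = redis;
            this.elastic = elastic;
        }

        public Mono<SearchResult> search(String key) {
            return redis.get(key)                          // 1. cheap cache lookup first
                    .switchIfEmpty(elastic.query(key)      // 2. fall back to ES on a cache miss
                            // 3. retry transient ES errors with exponential back-off
                            //    (newer Reactor API; older versions used retryBackoff)
                            .retryWhen(Retry.backoff(3, Duration.ofMillis(100)))
                            // 4. repopulate the cache so the next read is cheap
                            .flatMap(result -> redis.put(key, result).thenReturn(result)));
        }

        interface Cache {
            Mono<SearchResult> get(String key);
            Mono<Void> put(String key, SearchResult value);
        }

        interface EsReader {
            Mono<SearchResult> query(String key);
        }

        static class SearchResult { }
    }

If the retries are exhausted, the error propagates to the caller, which is usually better than hammering a struggling cluster with immediate retries.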

Data in the System

Testing was done on the three main APIs that customers use to search for data in our system, referred to as first-page search, second-page search, and location search.

The load-test tool is JMeter, with the search script written so that every search uses a random location in Indonesia and a random date. This is done so that searches do not always return results from Redis.

The Elasticsearch data set is made up of eight interconnected indices. The data is spread across every location in Indonesia, every date of the year, and all of our partners. As a result, the data in ES is quite large and requires complex search queries.

Server Sizing and Specifications

The following are the specifications and sizing of the cloud servers used.

  1. The Elasticsearch cluster consists of three nodes, each with eight CPU cores and 30 GB of memory (decreased from 60 GB). Why only three nodes? It turns out that this is the bare minimum required for data availability and safety, as described in the Elasticsearch split-brain problem (see the configuration sketch after this list).
  2. The Redis cluster consists of two nodes, each with one CPU core and 3.75 GB of memory. This is the minimum for availability; a better setup would be at least three nodes.
  3. The service cluster consists of two nodes, each with four cores and 8 GB of memory. To test performance together with availability and horizontal scaling, we ran the performance tests twice: once against one service node and once against two. Don't forget that the API gateway and service discovery sit in front of them.
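
For context, on Elasticsearch 6.x (the likely version at the time) the split-brain protection behind that three-node minimum is configured in elasticsearch.yml with a quorum of (master-eligible nodes / 2) + 1. This snippet is a generic illustration, not the platform's actual configuration.

    # elasticsearch.yml on each node (Elasticsearch 6.x; ES 7+ removed this
    # setting and manages the quorum automatically)
    discovery.zen.minimum_master_nodes: 2   # quorum for 3 master-eligible nodes: (3 / 2) + 1 = 2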

Load Test Results

We used the following Java options for JDK 8 on the query API server.

These Java options are based on our analysis for a server with 8 GB of memory running a reactive (asynchronous and non-blocking) system, starting from Cassandra's default Java options.
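
The exact flags are not reproduced here. Purely as an illustration of the kind of tuning meant, JDK 8 options for an 8 GB server in the spirit of Cassandra's defaults might look like the following; these are assumed values, not the production settings.

    # Illustrative JDK 8 flags only; not the actual production options.
    -Xms4G -Xmx4G                    # fixed heap of roughly half the 8 GB of server memory
    -XX:+UseG1GC                     # low-pause garbage collector
    -XX:MaxGCPauseMillis=200         # GC pause-time target
    -XX:+AlwaysPreTouch              # commit heap pages up front for steadier latency
    -XX:+HeapDumpOnOutOfMemoryError  # capture evidence if the service dies of OOM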

  1. The first experiment consisted of 15-minute tests with user counts ranging from 10 to 5,000, followed by a 2-hour test to assess stability and endurance. Location data was taken from the location endpoint, and the search dates for the first-page and second-page searches ranged from August 22, 2019 to one year later. Here are the results of the tests.
Real data of test results in the production system

According to the table of test results above, with only one instance, the seventh test reaches the maximum capacity of our service: 5,000 users, with CPU at around 90%. We can also see from here that service availability and horizontal scaling are working properly.

The code is good, as evidenced by response times that stay consistent while the number of users and the service's CPU usage increase. If a system's response time increases while its CPU usage does not, the code is not good: some waiting or blocking process is preventing the service from utilizing the available resources, as the sketch below illustrates.
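
A minimal sketch of this failure mode in Reactor terms (hypothetical code): a blocking call hidden inside a reactive chain parks an event-loop thread, so latency climbs while CPU stays flat. If blocking work cannot be avoided, it should at least be isolated on a scheduler intended for it.

    import reactor.core.publisher.Mono;
    import reactor.core.scheduler.Schedulers;

    public class BlockingExample {

        // Stand-in for any synchronous client (JDBC, a blocking HTTP call, ...).
        interface BlockingClient {
            String fetch(String id);
        }

        // BAD: the blocking call runs on whatever thread subscribes, typically
        // a WebFlux event-loop thread. Under load, latency grows while CPU stays low.
        Mono<String> badLookup(BlockingClient client, String id) {
            return Mono.fromCallable(() -> client.fetch(id));
        }

        // BETTER: isolate unavoidable blocking work on a scheduler meant for it,
        // so the event-loop threads stay free to serve other requests.
        Mono<String> betterLookup(BlockingClient client, String id) {
            return Mono.fromCallable(() -> client.fetch(id))
                    .subscribeOn(Schedulers.boundedElastic());
        }
    }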

  2. In the second experiment, each test lasted 15 minutes and involved 500 or 1,000 users. Location data was taken from the location endpoint, and the search dates for the first-page and second-page searches ranged from August 22, 2019 to one year later. We also wrote to ES during the read test to verify that reads and writes do not wait on or block each other; reads and writes ran in parallel, at the same time. The data was ingested through the Kafka listener and generated using mocks. Here are the results of the tests.

Real data from the test results in the production system while reading from and writing to ES simultaneously

Even with the read and write tests running concurrently, the system remained safe and performance stayed constant. In other words, the read and write processes can safely run in parallel. The ES CPU was also still very healthy, peaking at only 9% or 10%. As a result, an ES cluster with the above specifications can handle approximately 10 × 1,000 = 10,000 users.

One thing needs to be explained here: even though the write process runs smoothly, we must remember that ES is a secondary database used mostly for reading. The data itself is supplied from the main database and is eventually consistent, meaning there will be a time lag between the main database and this secondary database. To keep that lag very short, even under one second, we must tune how the data is indexed and how the index is refreshed in ES. We will try to explain this in a separate article.
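
As a rough sketch of the refresh tuning meant here, assuming the Elasticsearch high-level REST client and a placeholder index name: lowering index.refresh_interval (the ES default is 1s) makes newly indexed documents searchable sooner, at the cost of extra refresh work.

    import java.io.IOException;
    import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.common.settings.Settings;

    public class RefreshTuning {

        // "search-index" is a placeholder, not the platform's real index name.
        public static void shortenRefresh(RestHighLevelClient client) throws IOException {
            UpdateSettingsRequest request = new UpdateSettingsRequest("search-index")
                    .settings(Settings.builder()
                            // ES refreshes every 1s by default; lowering it makes new
                            // documents searchable sooner at the cost of extra refresh work.
                            .put("index.refresh_interval", "500ms")
                            .build());
            client.indices().putSettings(request, RequestOptions.DEFAULT);
        }
    }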

Furthermore, the way tables are modeled in ES, or more precisely the index modeling and mapping, must be properly designed. In general, best practices for NoSQL modeling can be studied in the two articles below: Design Skema Database MongoDB-Bagian 1 and Design Skema Database MongoDB-Bagian 2.
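
As a small illustration of explicit mapping, again with the high-level REST client and invented index and field names: exact-match filters belong on keyword fields, full-text search on text fields, and dates on date fields so range queries work as expected.

    import java.io.IOException;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.client.indices.CreateIndexRequest;
    import org.elasticsearch.common.xcontent.XContentType;

    public class IndexModel {

        // Index and field names are invented for illustration.
        public static void createIndex(RestHighLevelClient client) throws IOException {
            CreateIndexRequest request = new CreateIndexRequest("search-index")
                    .mapping("{"
                            + "  \"properties\": {"
                            + "    \"locationCode\":  { \"type\": \"keyword\" },"  // exact-match filters
                            + "    \"name\":          { \"type\": \"text\" },"     // full-text search
                            + "    \"availableDate\": { \"type\": \"date\" }"      // range queries
                            + "  }"
                            + "}", XContentType.JSON);
            client.indices().create(request, RequestOptions.DEFAULT);
        }
    }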

Conclusion

Perform a performance test every time you deploy a system to production, especially for public-facing services. Determine the upper limit you are aiming for, so that you can improve on the system capabilities you have established.

The performance test tells us how the system behaves when running at high concurrency. Bugs typically appear when the system is being accessed by large numbers of people. The fast lane of an expressway and the slow lane, for example, each demand a distinct driving style.

Several metrics, such as memory usage, Redis CPU usage, Kafka lag, Kafka CPU, and Kafka memory, were not captured, but that does not mean they were not checked. All of these metrics ultimately contribute to the API response times we measured, and this round of test results is quite good.

Remember that data search is the front door of our system. If this part is unreliable and very slow, you can bet that users or customers will be too reluctant to make transactions and may even stay away from our platform.

References

  1. Elasticsearch Split-Brain Problem
  2. CQRS Pattern
  3. MongoDB Design Scheme — Part 1
  4. MongoDB Design Scheme — Part 2

Sponsor

Need a t-shirt with a programming theme?

Kafka T-shirt

Elastic T-shirt

Contact The Legend.
