The Tokopedia ecosystem is powered by a microservices architecture built on cutting-edge technologies. A plethora of web services are developed and deployed on a daily basis to enhance the user's experience.
Service reloading is therefore an integral part of the deployment process, and the smoother it is, the easier life is for engineers. Over the years, multiple CI/CD tools have been developed to automate this deployment process, and they are adopted according to each team's requirements and expertise.
Achieving a graceful reload is relatively easy for HTTP services, but the same is not true for NSQ consumers. An NSQ consumer stops as soon as it is reloaded, and a new consumer is spawned that then starts consuming the messages.
Now, the real question is:
Why do we need to perform a graceful shutdown of consumers? Aren't the messages going to be handled by the newly spawned consumers?
The short answer is yes, but an abrupt stop is harmless only if the consumers perform tasks that can safely be re-processed according to business needs. That is not the case in most real-life situations: in some cases the actions performed by the consumers need to be atomic, and in others real-time processing is intended.
What to do?
System signals come to our aid in achieving a graceful shutdown.
When the service reload command is triggered, the system sends a SIGHUP signal to the workers/consumers. The consumers listen for this signal and wait until the handlers have processed the in-flight messages. Meanwhile, the daemon also starts another consumer with the reloaded configuration.
When the signal is received, consumers perform the following actions:
1. Ask the handlers not to process new messages
2. Wait until all concurrent handlers have finished processing
3. Stop the consumer gracefully
The concurrent handlers are configured to maintain a global count of the handlers currently processing messages. DoProcess maintains the state that decides whether a received message should be processed or re-queued.
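A minimal sketch of that shared state, assuming Go 1.19+ for `atomic.Int64` and `atomic.Bool`. Only the name DoProcess comes from the text; the `gate` type, its `reloading` flag, `Done` and the `handle` wrapper are hypothetical. `DoProcess` increments the global count before re-checking the flag, so a handler cannot slip past the check after the shutdown path has already seen the count at zero.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// gate is a hypothetical helper modelling the shared state: a global count
// of busy handlers and a flag that DoProcess consults.
type gate struct {
	inFlight  atomic.Int64 // handlers currently processing a message
	reloading atomic.Bool  // set once SIGHUP has been received
}

// DoProcess reports whether a newly received message should be processed
// (true) or re-queued (false). On true, the caller holds a slot in the
// in-flight count and must call Done when finished.
func (g *gate) DoProcess() bool {
	if g.reloading.Load() {
		return false
	}
	g.inFlight.Add(1)
	// Re-check: the flag may have been set between Load and Add.
	if g.reloading.Load() {
		g.inFlight.Add(-1)
		return false
	}
	return true
}

// Done releases the slot taken by a successful DoProcess call.
func (g *gate) Done() { g.inFlight.Add(-1) }

// handle is what each concurrent handler would run for one message.
func (g *gate) handle(msg string, process func(string)) string {
	if !g.DoProcess() {
		return "requeue"
	}
	defer g.Done()
	process(msg)
	return "processed"
}

func main() {
	g := &gate{}
	fmt.Println(g.handle("order-1", func(string) {})) // processed
	g.reloading.Store(true)                           // SIGHUP received
	fmt.Println(g.handle("order-2", func(string) {})) // requeue
}
```

In a go-nsq handler, the re-queue branch could correspond to calling `Requeue` on the `*nsq.Message`; that mapping is an assumption, not something the text states.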
When all the handlers are done processing their messages, a notification is sent to stop the consumer.
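One way to deliver that notification is a mutex-guarded count plus a channel that is closed exactly once. `drainNotifier`, `Begin`, `End`, `RequestReload` and `Drained` are hypothetical names, not part of go-nsq. Closing a channel lets any number of goroutines wait on `<-Drained`, and `sync.Once` guards against a double close.

```go
package main

import (
	"fmt"
	"sync"
)

// drainNotifier counts busy handlers and closes Drained exactly once,
// when a reload has been requested and the count has dropped to zero.
type drainNotifier struct {
	mu        sync.Mutex
	busy      int
	reloading bool
	once      sync.Once
	Drained   chan struct{}
}

func newDrainNotifier() *drainNotifier {
	return &drainNotifier{Drained: make(chan struct{})}
}

// Begin is called before a handler processes a message; it returns false
// if a reload is in progress and the message should be re-queued.
func (d *drainNotifier) Begin() bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.reloading {
		return false
	}
	d.busy++
	return true
}

// End is called after a handler finishes a message.
func (d *drainNotifier) End() {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.busy--
	d.notifyIfDrainedLocked()
}

// RequestReload marks the reload; if no handler is busy, Drained closes now.
func (d *drainNotifier) RequestReload() {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.reloading = true
	d.notifyIfDrainedLocked()
}

// notifyIfDrainedLocked must be called with d.mu held.
func (d *drainNotifier) notifyIfDrainedLocked() {
	if d.reloading && d.busy == 0 {
		d.once.Do(func() { close(d.Drained) })
	}
}

func main() {
	d := newDrainNotifier()
	d.Begin()         // a handler picks up a message
	d.RequestReload() // SIGHUP arrives while it is still busy
	go d.End()        // the handler finishes later
	<-d.Drained       // the notification: safe to stop the consumer
	fmt.Println("all handlers done; stopping consumer")
}
```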
The above implementation achieves a graceful shutdown of the consumers but introduces another complication. When messages are re-queued to NSQd, they are moved to the deferred state and wait until the consumer is stopped; they are not instantly picked up by another consumer.
To resolve this sticky behavior, MaxInFlight for the current consumer is set to 0. This informs NSQd that the consumer can no longer receive new messages, so the messages are consumed by other consumers instead.
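The ordering can be sketched against two methods that go-nsq's `*nsq.Consumer` provides, `ChangeMaxInFlight(int)` and `Stop()`, declared here as a small interface so a fake can stand in for a real consumer. `shutdownOnReload` and `fakeConsumer` are hypothetical, and the `drained` channel is assumed to be closed once all handlers are idle. Lowering the limit to 0 comes first: NSQ consumers advertise their capacity with the RDY command, and a capacity of 0 tells NSQd to route new messages to other consumers.

```go
package main

import "fmt"

// consumerControl lists the two go-nsq *nsq.Consumer methods this sketch
// relies on, so a fake can replace a real consumer.
type consumerControl interface {
	ChangeMaxInFlight(maxInFlight int)
	Stop()
}

// shutdownOnReload advertises zero capacity, waits for in-flight work to
// drain, then stops the consumer.
func shutdownOnReload(c consumerControl, drained <-chan struct{}) {
	c.ChangeMaxInFlight(0) // NSQd stops pushing new messages here
	<-drained              // handlers finish the messages they already hold
	c.Stop()
}

// fakeConsumer records the calls it receives, in order.
type fakeConsumer struct{ log []string }

func (f *fakeConsumer) ChangeMaxInFlight(n int) {
	f.log = append(f.log, fmt.Sprintf("max-in-flight=%d", n))
}

func (f *fakeConsumer) Stop() { f.log = append(f.log, "stop") }

func main() {
	f := &fakeConsumer{}
	drained := make(chan struct{})
	close(drained) // pretend all handlers are already idle
	shutdownOnReload(f, drained)
	fmt.Println(f.log) // [max-in-flight=0 stop]
}
```

With a real go-nsq consumer, `Stop()` is asynchronous; waiting on the consumer's `StopChan` afterwards is the usual way to know it has fully stopped.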
This implementation helped us reload the consumers and process client requests in real time without impacting the user experience.
NSQ consumers are background workers that perform a wide variety of jobs. This implementation preserves the integrity of already running processes while the new binaries are loaded.