NATS, Resilient Systems and Drain Mode
I believe in building easy, fast, secure and resilient systems. NATS has always been considered easy for developers, and easy for operators. It’s fast and secure to boot.
NATS has been used to build some incredibly resilient systems, like CloudFoundry. Even though core NATS is a messaging platform with multiple patterns such as Pub/Sub, Request/Reply, and Load-Balanced Queues, at its heart NATS is a fire-and-forget system: if no consumer is running at the time a message is sent, it will be like a tree falling in the woods. NATS also protects itself at all costs for the greater good, so if a consumer is too slow, messages will be dropped and the client connection closed. However, as long as NATS is operational, the system should be resilient to change and should not drop messages.
When NATS is used for Request/Reply, or RPC, it is always beneficial to have subscribers be queue subscribers, even if there is only one to start. In NATS, queue groups are formed at runtime with no changes needed to any of the NATS servers, and the system automatically load balances between all members of the queue group. If the group has only one member, it functions like a normal subscriber. Many architectures require additional components to provide load-balanced functionality. NATS lets you just start a new subscriber instance, configured exactly the same as all the others.
This provides instant and extremely flexible scale out functionality. For instance, the subscribers, since they are all the same, could easily live in something like an AWS autoscale group. This can be increased and decreased at will using the operational tooling and cloud provider semantics for easily adding instances.
However, when you wanted to scale back, taking instances offline was a blunt operation, even with NATS. Requests that were in flight could be dropped. Of course, any sound architecture would have the requestors retry after waiting a specified time for the response, but it felt like we could do better. Scaling down is just as normal an operation as scaling up, and we wanted it to be just as easy and as helpful as it could be.
Many technologies have the concept of lame duck mode (via Google) or drain states. So with the latest release of the Go NATS client, we introduced a drain state for subscribers and connections as well. Other clients will quickly follow as the team rolls out updates. Instead of simply closing the NATS connection, or just exiting, developers can now place the connection in a drain state by calling Drain() and waiting for the connection to close on its own or for the drain to time out. This way all in-flight messages will be handled and no requests will be dropped.
Here is a small example of a queue worker. You can run as many of these as you want and signal them to terminate at will. No requests will be dropped even with a high ingestion rate using Drain Mode.
I will continue to post more things about NATS and how NATS can be a critical part of any secure and resilient distributed system.