A Lesson from History
Have you ever felt like you just needed another “You” to get everything done in time? That’s how we at Carnot felt when our customer base grew at a rate far beyond our expectations. It was like taking on a sea swimming challenge just after finishing your first lessons with floaters.
For the backend team, it meant having to support 20 times more traffic without impacting customer experience in any way. It was a challenge, and it did not help that our entire backend team consisted of only three people. But you know what they say: “No pressure, no diamonds!”
We learned a lot of empowering lessons from those demanding times. One such lesson was auto scaling, which is arguably the most important cloud management tool.
In the early days, when we knew little about handling traffic peaks, we had a monolithic architecture. A socket server allowed our IoT devices to connect and send data, which was then queued in a Redis cache until it was processed by our main server.
As the number of clients increased, the entire cloud pipeline was held back by its slowest component. The Redis cache, which was used only for queuing, soon built up a large queue and ran out of memory.
At that time, not upgrading the cache to a plan with more memory would have meant losing valuable customer data. Hence, we selected a Redis plan large enough to retain the queued data even during peak traffic.
The system was stable with this. The only problem was that we were paying for a lot more than we were using. On deeper analysis, we found that the so-called “peak traffic” occurred only during certain hours of the day. 75% of the time, the extra memory we had opted for sat unused, but we were still paying for it.
Frankly, “auto-scaling” was a buzzword for us at that time. We had read a lot about it and understood that it automatically detects your traffic and scales your cloud components to match. We knew that we absolutely needed this. But we could not find an easy plug-and-play solution for our scenario: Redis scaling on the Heroku platform.
So, we decided to do what we do best at Carnot: start from the basics and build our own in-house system.
What we needed was a mechanism to:
- Monitor the key metrics for Redis health check
- Detect incoming traffic (or load) on servers
- Create a cost-function to get the cheapest possible plan for current traffic
- Change the plan as detected by the cost-function
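To make the cost-function step concrete, here is a minimal sketch of how the cheapest adequate plan could be picked. It is an illustration under stated assumptions, not the code of the open-source system: the `Plan` type, the `PLANS` table with its names, limits and prices, and the `headroom` parameter are all hypothetical and do not reflect real Heroku pricing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    name: str             # hypothetical plan name, not a real Heroku plan
    max_memory_mb: int    # memory limit of the plan
    max_clients: int      # connection limit of the plan
    monthly_cost: float   # illustrative price only


# Made-up plan table for illustration; replace with the real plan catalogue.
PLANS = [
    Plan("small", 50, 40, 15.0),
    Plan("medium", 100, 80, 30.0),
    Plan("large", 250, 200, 60.0),
    Plan("xlarge", 500, 400, 120.0),
]


def cheapest_plan(memory_mb: float, clients: int,
                  headroom: float = 0.2) -> Optional[Plan]:
    """Return the cheapest plan that covers the current load plus headroom.

    Returns None when no plan in the table is large enough.
    """
    need_memory = memory_mb * (1 + headroom)
    need_clients = clients * (1 + headroom)
    candidates = [
        p for p in PLANS
        if p.max_memory_mb >= need_memory and p.max_clients >= need_clients
    ]
    return min(candidates, key=lambda p: p.monthly_cost, default=None)
```

With this table, a load of 90 MB and 10 clients maps to the "large" plan, because the 20% headroom pushes the memory requirement past the 100 MB limit of "medium". The headroom is a design choice: it trades a slightly higher bill for a safety margin between two metric collections.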
Well, that’s what we built for one Redis, then another, and then another, until we had created a standalone plug-and-play solution of our own for any Redis on Heroku.
We have recently made this entire system open source. If you are facing a similar issue, feel free to set it up in your account and share your experience. Any issues/suggestions to improve the system are most welcome.
Broadly, this is how it works. We selected the two most important metrics for monitoring Redis health: memory consumed and clients connected. A cron job, scheduled at a pre-defined frequency, collects and stores these metrics for all enrolled Redis caches. The traffic detector indicates how close we are to exhausting the limits of our Redis. The cost function then takes the key metrics along with the traffic indication and returns the cheapest Redis plan that can support the traffic. Finally, we use the all-powerful Heroku Platform API to change the Redis plan dynamically, whenever required.
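The metric collection and plan change steps can be sketched roughly as follows. This is a hedged illustration, not the code of the open-source system. It assumes the metrics are read from the text reply of the Redis `INFO` command (fields `used_memory` and `connected_clients`) and that the plan is changed through the Heroku Platform API add-on update endpoint (`PATCH /apps/{app}/addons/{addon}`), which should be checked against the current API reference. The function names and the definition of the traffic indicator are our own assumptions. The request is only built here, not sent.

```python
import json
import urllib.request


def parse_redis_info(info_text: str) -> dict:
    """Extract the two monitored metrics from the text reply of Redis INFO."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return {
        "used_memory": int(fields["used_memory"]),            # bytes
        "connected_clients": int(fields["connected_clients"]),
    }


def traffic_indicator(metrics: dict, max_memory_bytes: int,
                      max_clients: int) -> float:
    """Assumed indicator: usage fraction of the tighter of the two limits.

    A value of 1.0 means one of the plan limits is exhausted.
    """
    return max(
        metrics["used_memory"] / max_memory_bytes,
        metrics["connected_clients"] / max_clients,
    )


def build_plan_change_request(app: str, addon: str, plan: str,
                              api_key: str) -> urllib.request.Request:
    """Build (but do not send) the request that switches the add-on plan."""
    body = json.dumps({"plan": plan}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://api.heroku.com/apps/{app}/addons/{addon}",
        data=body,
        method="PATCH",
        headers={
            "Accept": "application/vnd.heroku+json; version=3",
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
```

A scheduled job would call `parse_redis_info` on each enrolled cache, store the result, and pass the returned request to `urllib.request.urlopen` only when the cost function picks a plan different from the current one.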
After sugar, spice & everything nice and an accidental chemical X, we are now capable of supporting over a million IoT nodes. We have definitely come a long way from those early times and learned a lot along the way.
As we grew, most of our systems moved away from Heroku (to AWS) in order to save costs. But for any start-ups or developers out there who are dealing with around a thousand client nodes and use Heroku as their preferred platform (for its obvious ease of set-up), we hope this blog will be helpful.