Behind the scenes — scaling issues

Rob H
Altiwi Blog
Apr 24, 2020

It has been a while since the last post. Honestly, the main reason was that I needed to take care of some operational issues in our core business, so I haven't touched a line of code for some four weeks. Things seem to be stabilizing now, we are all healthy so far, so I can return to development.

Anyway, I was thinking about what to share here and decided to let you look under the hood, AKA what I had been working on before that COVID-19 matter. Don't expect any new features; this is mostly a technical description of how we handle things. And the topic I'd been working on was console scaling. Maybe some will find it interesting; for me it is a fascinating mechanism and I am learning a lot every day.

If you are a seasoned cloud professional, you will probably laugh at the naivety of how things are done here, but if you are a regular IT guy who's curious what it's like to go from a prototype web application to something that scales at least to some extent, read on.

The monolith

First: the Altiwi cloud console has been developed as a monolith. Although I have some clue about all these fancy microservices architectures, "build for cloud first" and other buzzwords, I found that these are better suited for VC-backed money burners than for a single programmer. Our team is five people, but when it comes to real development, I am doing it alone, as most of you have probably noticed. The ops team is myself plus one. There's nothing bad about it, I personally like the way it is, but inevitably it dictates some design decisions.

So the core console is being developed in Ruby on Rails (the monolith) and runs on an EC2 instance at Amazon Web Services. From day one I decided to use a managed database. Amazon calls it Aurora, and in the end it is plain MySQL that is resilient and scalable.

The reason why I simply did not run sudo apt-get install mysql on the EC2 instance was that I anticipated there would be a need for some scaling, and I also wanted to keep my operations workload as low as possible. I can administer a database, I can run my own server farm, I can even bake bread and make wooden furniture, but I enjoyed the luxury of the cloud: I do not have to. The database, with all the performance tuning and backups, is just a given.

Do you think there are better cloud database solutions? Like all this NoSQL stuff, Mongo for example? Yes, maybe, but on the other hand, if we outgrow our MySQL cluster, there will probably be enough resources to handle that. And as you will see soon, the main workload will not be put on this database.

Generally, people do not administer their WiFi that often. Sure, the configuration of your networks resides in the database, but it is uploaded to the devices only upon change, so I can pretty much live with the db bottleneck, as I do not see an issue anytime soon. What concerned me more shortly after we went live (and even before we went live) was the number of nodes constantly polling the console, which, in turn, fries the database. And performant relational databases are a pretty expensive endeavour. I wanted a relatively fast feedback loop, so the nodes were polling the console every ten seconds at that time.

Welcome monolith with outpost

So I decided to dedicate a special single-purpose service just to handling the node checkins. As a frontend we use Amazon API Gateway, which is proxied to a Lambda function that handles the checkins. And as a persistent data backend we do not use the relational database but Amazon DynamoDB, a NoSQL database. What's nice is that this solution scales almost to infinity, as there are no servers to run. AWS Lambda, for those unaware, employs a simple concept: give me a piece of code, tell me the maximum amount of time it needs to handle one request, and we (Amazon) will do the rest. We will run it occasionally on one server, or we will spin up a thousand servers handling millions of requests if needed. What you will pay for are gigabyte-milliseconds :-) Microsoft calls it Azure Functions, Google also has its own flavour of this; it is the way massively scalable cloud computing should be done…
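To make the flow concrete, here is a minimal sketch of what such a checkin handler can look like in Ruby (Lambda has a native Ruby runtime). The names and payload shape are illustrative, not Altiwi's actual code, and an in-memory hash stands in for the DynamoDB table so the logic is visible without AWS credentials; the real handler would call put_item via the aws-sdk-dynamodb gem instead.

```ruby
require 'json'
require 'time'

# Stand-in for the DynamoDB "checkins" table (hypothetical name).
TABLE = {}

# API Gateway invokes the handler with the proxied HTTP request as `event`.
def handle_checkin(event:, context: nil)
  body    = JSON.parse(event['body'])
  node_id = body.fetch('node_id')

  # Upsert the node's last-seen timestamp, as a put_item call would.
  TABLE[node_id] = { 'node_id' => node_id, 'last_seen' => Time.now.utc.iso8601 }

  # API Gateway expects a statusCode/body pair in the proxy response.
  { statusCode: 200, body: JSON.generate('status' => 'ok') }
end
```

The appeal of this shape is that there is no process to keep alive: each poll is one stateless invocation, billed only for its own runtime.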

That solved my scaling concern for quite a long time, but in the short term it brought another concern: costs. Honestly, it is quite difficult to estimate your costs upfront with no hands-on experience. I was mostly afraid of the DynamoDB costs, but those turned out not to be that hot. The real pain was the API Gateway, which is billed per request, and if you do the math, only 100 devices polling the checkin endpoint make almost a million requests a day, over 25 million requests a month. And, that was the funny part, until then we had handled that just with the monolith at no additional cost!
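The math from the text is worth writing out, because the numbers grow faster than intuition suggests:

```ruby
# Back-of-the-envelope request count: 100 devices, one checkin every
# 10 seconds (the figures from the text).
devices   = 100
interval  = 10                                 # seconds between polls
per_day   = devices * (24 * 60 * 60 / interval)
per_month = per_day * 30

puts per_day    # 864000 -- almost a million a day
puts per_month  # 25920000 -- about 26 million a month
```

With per-request billing, that polling cadence gets multiplied straight into the invoice, which is exactly why the polling model had to go.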

But in the long run this had to be decoupled from the main console, and I knew I was on the right track. A fine feature of API Gateway is that it supports WebSockets. WebSockets are the modern way of building real-time interactive applications; they are typically used for chat or other real-time updates to web or mobile applications.

In the ancient times of the web, if you wanted updates in your app, you had to poll the server every couple of seconds, which mostly returned "no new updates for you". But the damage had been done: the request was sent, the code had run and the database had been queried. A WebSocket, on the other hand, is a persistent connection to the server, mostly idle, but available for bi-directional communication when needed.

So, another iteration, and I have implemented the WebSocket-based checkin and the costs dropped drastically. The connection is held for a maximum of 10 minutes and re-established upon termination. Those ten minutes are also the emergency fallback: if anything goes wrong with the connection, the device reconnects at least every 10 minutes.
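The client-side policy described above can be sketched as a small loop. This is a simplified model, not the device firmware: `connect` is a placeholder that yields how long a session actually lasted, where a real device would open a WebSocket to API Gateway and block until it closes.

```ruby
# Maximum session length from the text: 10 minutes.
MAX_SESSION = 10 * 60  # seconds

# Run `attempts` connection cycles; each session is capped at MAX_SESSION,
# after which the client reconnects. An early disconnect (a shorter
# lifetime) also just falls through to the next reconnect.
def run_checkin_loop(attempts, &connect)
  sessions = []
  attempts.times do
    lifetime = connect.call
    sessions << [lifetime, MAX_SESSION].min
  end
  sessions
end
```

For example, `run_checkin_loop(3) { 700 }` models a server that would hold the connection for 700 seconds: each session still ends at the 600-second cap, guaranteeing the device checks in at least every ten minutes.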

Scaling the monolith

So far so good. Now it is time to think ahead. What if I need to scale the console app? Sure, the hot spot will probably be the database, but there will inevitably come a time when more horsepower for the Rails application is needed. Traditionally, this could be handled by changing the instance type and choosing one with more vCPUs and more RAM, perhaps with a more performant disk subsystem. That scales up, but not so much down. Basically, with this approach you are always aiming for the highest possible utilization, which might occur only occasionally, while the rest of the time you are just overpaying for underutilized resources.

So the key to this issue is using the smallest possible instance (in terms of vCPU, memory, disks) that costs as little as possible. And in case of a traffic spike, just add other instances that run in parallel and offload the traffic, and terminate them when they are no longer needed.
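In practice an AWS Auto Scaling policy makes this add/remove decision for you, but the rule it applies is simple enough to sketch. The thresholds below are illustrative assumptions, not Altiwi's actual configuration:

```ruby
# Hypothetical scale-out/scale-in rule based on average CPU utilization.
SCALE_OUT_ABOVE = 0.70  # add an instance above 70% average CPU
SCALE_IN_BELOW  = 0.30  # remove one below 30% average CPU
MIN_INSTANCES   = 1     # never scale to zero

def desired_instances(current, avg_cpu)
  return current + 1 if avg_cpu > SCALE_OUT_ABOVE
  return [current - 1, MIN_INSTANCES].max if avg_cpu < SCALE_IN_BELOW
  current  # within the band: leave the fleet alone
end
```

The band between the two thresholds prevents flapping: a fleet sitting at, say, 50% utilization is left as it is.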

This image tries to demonstrate the savings. If you have only one rigid instance with a defined performance, you have to pay for your maximum target load, and that cost is the blue rectangle (partially hidden behind the red area). On the contrary, the red area represents costs that more accurately track the real demand over time. And as always: it is relatively difficult to scale an app from 1 to 2 servers (instances), but once you have solved the two-server problem, you have basically solved the "n"-server problem. I am not saying that "n" is infinite, but it can certainly grow significantly without any further intervention.
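A quick numeric version of the rectangle-versus-area comparison, with invented prices and an invented demand profile purely for illustration:

```ruby
# Hypothetical hourly prices: one large instance sized for peak vs.
# small instances added and removed with demand.
hours_per_month = 730
large_cost = 0.16 * hours_per_month  # the fixed "blue rectangle"

# Suppose demand needs 1 small instance 80% of the time and 4 at peak
# (20% of the time) -- the demand-shaped "red area".
small_cost = 0.02 * hours_per_month * (1 * 0.8 + 4 * 0.2)

puts large_cost.round(2)  # 116.8
puts small_cost.round(2)  # 23.36
```

The exact figures do not matter; the point is that the gap widens the spikier the demand is.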

The nice side effect is that it forces you to create the instances in such a way that they can be built, run and disposed of automatically, which is the job I have been working on recently. Build and dispose are the key words here; perhaps dispose is the more pronounced one. If you embrace the concept that upon termination your "virtual server" is deleted, it makes other things easier. First, in case anything goes wrong with a server, just delete it and build a new one. Second, it makes rolling updates easy: just add a new instance built from the new codebase and then delete the old one.

All of this is possible because there is no persistent user data on the application servers. Most of the data is stored in the managed database, outside the server, so server disposal is no issue. The other part of the data is the connection-related metadata in the DynamoDB checkin database. Those are in sync with the main DB, but in case of any disaster they could be recreated. Anyway, Lambda is disposable by design; it is in the service specs that you can temporarily store data locally, but on another invocation it will not be there. (That is not exactly right; those familiar with AWS Lambda might argue that the data MIGHT be there under certain circumstances, but for the sake of simplicity, you cannot count on it.)

There are two more types of data that need to be handled in our case. The first, and easier to handle, are the firmware image files for your devices. From day one I decided to use Amazon S3 for file storage, so it is not a concern. If you build anything that is supposed to run in a cloud and you are storing files on the VM's filesystem, you are probably doing something wrong. The filesystem is usually an expensive and non-scalable way of doing things. Object storage (in this case S3) is the way to go.

The other data, not visible to Altiwi's users, are the background job definitions and statuses. Anytime the console needs to do something asynchronously (be it sending a notification e-mail, or looking up the nodes to be upgraded at the right time), the job metadata is stored in a Redis database. When I implemented the async jobs for the first time, I just did the sudo apt-get install redis thing, and now I have to offload it to a managed Redis instance. This is the particular part of the job I am working on right now; I already have a prototype.
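The enqueue/poll cycle that makes this work is simple. Here is a minimal sketch of the idea; in a real Rails app a library such as Sidekiq implements this on top of Redis lists, and here a plain Hash stands in for the Redis server so the flow is visible without one running (all names are illustrative):

```ruby
require 'json'

# Stand-in for the managed Redis: queue name => list of payloads.
REDIS = Hash.new { |h, k| h[k] = [] }

# Enqueue: push a JSON payload describing the job onto the queue's list.
def enqueue(queue, job_class, *args)
  REDIS[queue] << JSON.generate('class' => job_class, 'args' => args)
end

# Poll: pop the oldest payload off the list, or nil if the queue is empty.
def poll(queue)
  payload = REDIS[queue].shift or return nil
  JSON.parse(payload)
end
```

Because the queue lives outside the application servers, any worker instance can pick a job up, and an instance can be terminated without losing work that has not started yet.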

The background jobs run at scheduled times (the same on all instances), but their definitions are polled from the Redis that lives outside the application. This allows me to kill, destroy and delete any of the running instances almost anytime. I say "almost" because right now the Altiwi console routinely runs on one instance, so I need to add one before killing the active one, but the point is that the process has to be (and is) automated.

Conclusion

The road from a simple Linux server with Apache and MySQL to a production-ready application is anything but simple. I hope this post might help some "old-school" IT guys understand the concepts of cloud computing, although I am by no means an expert in the field. Instead, I tried to pick some topics that I wish I had known earlier in the process. I am not saying I was so naive that I did not know what was ahead; it is more that it would have removed some unnecessary fear of doing certain things (like, for example, the Lambda checkin) from day one. It's not as hard as it looks. Ruby forever!

Stay healthy!
