How we scale using AWS
I have heard many rumours about hosting a website or platform on AWS, with many people saying things like ‘AWS is not as efficient as “old-school” hosting and/or a vertically scaled server somewhere in your server room…’ I’m not going to argue this point because it would be akin to arguing with MAGA voters. So, let’s ignore these haters and take a direct look at why AWS saves you money and offers more power, performance, scalability, reliability etc. than anything you could put together yourself. It’s the reason why everybody is moving to AWS.
To understand why this is the case, let’s take a direct look at some examples using our platform, AUTOPROP.
Requirements: we have a number of pages in HTML (<100) and we would like to convert and provide these to users in PDF format in a reasonable time.
Today, we are on our third iteration of this concept. The first leveraged the client-side — Chrome PDF print and print CSS stylesheets — and it was unreliable and, to be completely honest, a complete fail.
We are now on our third implementation, which works very well. We use AWS Lambda (serverless) with Headless Chrome, because it has all the advantages we were missing in both earlier implementations, and something extra…
Let’s break it down. Again, we have 1–100 HTML pages as input and we need to return a PDF containing all of them, properly formatted and styled, in a reasonable time.
The naive solution (which does work) was: use Headless Chrome to render all of them on one page and then save it as a PDF. The trouble here is that Lambda has a time limit, and the conversion takes a (pretty much) fixed time per page.
So if a user submits 1 page, they get results in 2 seconds, but for 100 pages we are talking about 200 seconds, more than 3 minutes, which is both slow and costly.
The solution we picked has a magic moment of thinking differently. When our function receives a set of pages, it lets other functions do the conversion in parallel, grabs the results, combines them, and returns the combined PDF to the user. Because of the parallelization, the conversion time is almost fixed no matter how many pages the user submits. We also found that it takes far less time, resources and $ to get one big PDF using this approach.
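To make the fan-out idea a bit more concrete, here is a minimal local sketch. It is not our production code: `render_page` and `merge_pdfs` are hypothetical stand-ins (the real versions would call a worker Lambda running Headless Chrome and a proper PDF library), and a thread pool plays the role of the parallel functions.

```python
from concurrent.futures import ThreadPoolExecutor


def render_page(html: str) -> bytes:
    """Hypothetical stand-in for one worker call.

    The real version would invoke a worker function running Headless Chrome;
    here we return fake bytes so the sketch runs locally.
    """
    return f"[pdf:{html}]".encode("utf-8")


def merge_pdfs(parts: list) -> bytes:
    """Hypothetical merge step; real code would use a PDF library."""
    return b"".join(parts)


def convert_all(pages: list, max_workers: int = 100) -> bytes:
    """Fan out one render per page, then combine the results in input order."""
    if not pages:
        return b""
    with ThreadPoolExecutor(max_workers=min(max_workers, len(pages))) as pool:
        # map() returns results in the same order as the inputs,
        # even when the workers finish out of order.
        parts = list(pool.map(render_page, pages))
    return merge_pdfs(parts)


result = convert_all(["cover", "listing", "map"])
```

The point of the shape is that the wall-clock time is roughly the time of the slowest single page plus the merge, instead of the sum of all pages.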
Without the ability to use these functions, you would need a dedicated machine for this: a print server of sorts. That machine would cost $/mo, need to be running all the time, and you would be limited in the number of processes you could run simultaneously on it, which for a Chrome browser is rather low. The machine would also be mostly idle, doing literally nothing.
The second example is actually our core product’s value proposition. We have workers which compute all sorts of things for our users, on demand and in real time. Unfortunately, these workers often need the network. That makes their behaviour rather unpredictable, which in the end means it’s hard to have the right number of workers, and we could easily get flooded with requests, especially when some unexpected behaviour appears. We already do our maximum to minimize those cases, but frankly, every day is different, especially when you rely on 3rd party providers.
To solve that, we need to understand a bit more:
First we have an AMI, an image of the machine. The easiest way to create an AMI is to create a new EC2 instance (machine), install all the libraries we need and then, in the AWS web console, right-click the instance and select Create Image; AWS does all the work for us. You can also automate this process easily.
Once we have the image, we can create a LaunchConfiguration (LC). This is a recipe for AWS, telling it which machine we want, under which conditions, and with what security groups or name. We can also tell AWS what script to run once the machine is on (e.g. update my code :)). A LaunchConfiguration can be created via the API or the web console. Let’s say we picked the console and created a new LC using our AMI, with all the settings a new server needs to run.
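If you prefer the API route, the same recipe can be written down as the keyword arguments for boto3's `create_launch_configuration`. This is only a sketch: the LC name, instance type, IDs and the boot script are made-up placeholders, not our real settings.

```python
def launch_configuration_params(ami_id, security_group_ids, boot_script):
    """Build keyword arguments for boto3's create_launch_configuration."""
    return {
        "LaunchConfigurationName": "worker-lc-v1",  # hypothetical name
        "ImageId": ami_id,                          # the AMI we created
        "InstanceType": "t3.medium",                # assumed size, pick your own
        "SecurityGroups": list(security_group_ids),
        "UserData": boot_script,                    # script run when the machine starts
    }


# A made-up boot script standing in for "update my code".
UPDATE_SCRIPT = "#!/bin/bash\ncd /opt/worker && git pull\n"

params = launch_configuration_params(
    "ami-0123456789abcdef0",       # placeholder AMI ID
    ["sg-0123456789abcdef0"],      # placeholder security group ID
    UPDATE_SCRIPT,
)

# With boto3 installed and credentials configured, this would be sent as:
#   boto3.client("autoscaling").create_launch_configuration(**params)
```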
Next, we create an Auto Scaling Group (ASG): a virtual group which holds a number of machines (the LaunchConfiguration tells AWS what kind of machine, with which settings, we need) and which can react to events to expand or reduce how many servers are up. Let’s create an empty group with our LC.
But first, before we set any scaling rules, we need to decide if we would like to scale by time or by metric. We could easily set (see scheduled actions in our group) that every day at 9 am the group should have 2 servers and at 5 pm it should go to zero again. Imho that’s the first step, and we had that one for a while too, to get familiar with this. We saved a lot of money by turning on a few “spot” instances to cover peaks in our clients’ demand. Actually we still have this, and it is a great place to start.
But why not have this react dynamically to actual user needs, instead of to an alarm clock?
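For the time-based variant, the two rules could look roughly like this as arguments for boto3's `put_scheduled_update_group_action`. The group and action names are hypothetical, and note that the cron expression is evaluated in UTC unless you also pass a time zone.

```python
def scheduled_action(group, name, cron, desired):
    """Build keyword arguments for put_scheduled_update_group_action."""
    return {
        "AutoScalingGroupName": group,
        "ScheduledActionName": name,
        "Recurrence": cron,           # cron format, UTC by default
        "DesiredCapacity": desired,
    }


GROUP = "worker-asg"  # hypothetical group name

actions = [
    scheduled_action(GROUP, "workday-start", "0 9 * * *", 2),   # 9 am: two servers
    scheduled_action(GROUP, "workday-end", "0 17 * * *", 0),    # 5 pm: back to zero
]

# With boto3 installed and credentials configured:
#   client = boto3.client("autoscaling")
#   for action in actions:
#       client.put_scheduled_update_group_action(**action)
```

The group's minimum and maximum size have to allow these values, so the minimum needs to be 0 and the maximum at least 2.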
Time to go deeper, right? :)
Our ASG can react to events. That means we can tell it when we need a new instance, or when we should get rid of one. There are more options, but the most useful is a CloudWatch (CW) alarm. CW helps us keep track of all our metrics. We have CPU usage, memory usage and, most importantly, our custom metrics. There is a script which tells CW every minute how many “slots” are available (on all worker machines) and how many slots are occupied (worker busy). From that we created a nice dashboard but, more importantly, we have a dynamic alarm we can react to.
Let’s say we have a CPU usage metric and we set an alarm which triggers when usage exceeds 80%. We can wire an action on the ASG to that alarm.
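A once-a-minute reporting script could build its payload along these lines. This is a sketch, not our actual script: the namespace and metric names are invented for the example, and the result is shaped for boto3's CloudWatch `put_metric_data` call.

```python
def slot_metrics(total_slots, busy_slots, namespace="Workers"):
    """Build keyword arguments for CloudWatch put_metric_data.

    The namespace and metric names are hypothetical.
    """
    if not 0 <= busy_slots <= total_slots:
        raise ValueError("busy_slots must be between 0 and total_slots")
    utilization = 100.0 * busy_slots / total_slots if total_slots else 0.0
    return {
        "Namespace": namespace,
        "MetricData": [
            {"MetricName": "SlotsAvailable", "Value": total_slots - busy_slots, "Unit": "Count"},
            {"MetricName": "SlotsOccupied", "Value": busy_slots, "Unit": "Count"},
            {"MetricName": "SlotUtilization", "Value": utilization, "Unit": "Percent"},
        ],
    }


# Example: 40 slots across all workers, 34 of them busy.
payload = slot_metrics(total_slots=40, busy_slots=34)

# With boto3 installed and credentials configured:
#   boto3.client("cloudwatch").put_metric_data(**payload)
```

Publishing the utilization as a percentage is a design choice for the example: it lets one alarm threshold keep working while the number of machines changes.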
When the alarm is triggered, we want to expand our ASG. A very important rule here is: scale up early. Our rule is that when we reach the given limit, the alarm immediately adds two new machines. I completely agree it’s quite aggressive, but it is better to have spare capacity than to have none. We also scale down much more slowly than we scale up. For that we use an opposite alarm (unfortunately, as far as we know, AWS needs a separate, opposite alarm for scale-down actions) which triggers when 15–30 checks in a row are ok. That means when everything is back to normal (below our limit), we wait a few minutes to make sure it really is, and only after that do we remove one or more machines from the group.
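As a rough illustration, the pair of rules could be described with arguments for boto3's `put_scaling_policy` (Auto Scaling) and `put_metric_alarm` (CloudWatch). Everything named here is an assumption for the example: the group, namespace, metric and alarm names are hypothetical, and the policy ARNs are placeholders for the values AWS returns when the policies are created.

```python
GROUP = "worker-asg"          # hypothetical group name
NAMESPACE = "Workers"         # hypothetical custom namespace
METRIC = "SlotUtilization"    # hypothetical custom metric, in percent
LIMIT = 80.0                  # the 80% limit


def scaling_policy(name, adjustment):
    """Build keyword arguments for put_scaling_policy."""
    return {
        "AutoScalingGroupName": GROUP,
        "PolicyName": name,
        "PolicyType": "SimpleScaling",
        "AdjustmentType": "ChangeInCapacity",
        "ScalingAdjustment": adjustment,   # +2 adds two machines, -1 removes one
    }


def utilization_alarm(name, comparison, periods, policy_arn):
    """Build keyword arguments for put_metric_alarm."""
    return {
        "AlarmName": name,
        "Namespace": NAMESPACE,
        "MetricName": METRIC,
        "Statistic": "Average",
        "Period": 60,                    # one check per minute
        "EvaluationPeriods": periods,    # how many checks in a row must match
        "Threshold": LIMIT,
        "ComparisonOperator": comparison,
        "AlarmActions": [policy_arn],
    }


scale_out = scaling_policy("add-two", 2)
scale_in = scaling_policy("remove-one", -1)

# put_scaling_policy returns a PolicyARN; the strings below are placeholders for it.
busy_alarm = utilization_alarm("workers-busy", "GreaterThanThreshold", 1, "<scale-out policy ARN>")
calm_alarm = utilization_alarm("workers-calm", "LessThanOrEqualToThreshold", 20, "<scale-in policy ARN>")
```

The asymmetry is in `EvaluationPeriods`: one bad minute is enough to add machines, while twenty calm minutes in a row are needed before one is removed.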
In the end that means: when the first alarm fires (the number of tasks actually running is more than 80% of our capacity), we immediately add new machines. When everything is ok (<80%) for 20 minutes, we remove a machine.
And that’s it. If you are starting with AWS, I can only recommend that you start, experiment and mess with it for a while (AWS has a free tier). If you do something similar, feel free to share your approach in the comments :)
Also, if you are interested in more stories like this, follow me on Twitter :)
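To get a feel for how "up fast, down slow" behaves, here is a tiny simulation of the rule in plain Python. It is a simplification under stated assumptions: the load trace is given up front (adding machines does not change it), there is no cooldown, and the start, minimum and maximum sizes are invented for the example.

```python
def simulate(utilization_per_minute, start=2, minimum=1, maximum=10,
             limit=80.0, calm_minutes=20):
    """Return the machine count after each minute of the given load trace."""
    machines = start
    calm = 0          # consecutive minutes at or below the limit
    history = []
    for utilization in utilization_per_minute:
        if utilization > limit:
            # Scale up early: add two machines at once (capped at maximum).
            machines = min(machines + 2, maximum)
            calm = 0
        else:
            calm += 1
            if calm >= calm_minutes and machines > minimum:
                # Scale down slowly: one machine per calm window.
                machines -= 1
                calm = 0
        history.append(machines)
    return history


# One busy minute followed by 40 calm minutes.
trace = [90.0] + [50.0] * 40
history = simulate(trace)
```

With this trace the group jumps from 2 to 4 machines in the first minute and then needs two full calm windows of 20 minutes to get back to 2.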