Capacity Planning For Your IT System and Cloud Service

Be Better Prepared (Now) Than Sorry (Later)

Continuing my previous note about Capacity Planning for your IT System, in this post I would like to share some basic principles of Capacity Planning and explain why Capacity Planning is important.

What is IT System Capacity Planning?

Before climbing the wall of the building block called Capacity Planning, we should know what capacity planning actually is. Simply said, Capacity Planning is the activity of calculating the resources of your IT System so that the system can handle a given amount of data and survive the highest peak of concurrent transactions, mostly measured in tps (transactions per second) over a certain time span. Capacity planning certainly covers the calculation of computing power (number and type of CPUs, clock speed in GHz), data storage capacity (terabytes of disk), memory capacity (GB of RAM), network bandwidth (Mbps, Gbps), and, if you're quite lucky, disk IOPS.

How to do IT System Capacity Planning?

There are several ways of doing it. In short, it should be based on facts (data) as well as a certain amount of assumptions. To be more scientific, it also requires mathematical calculation; it is not rocket science, but neither is it guessing or picking fancy numbers out of the sky. Assumptions? Yes, of course. In most cases assumptions will be required, and they are obviously different from random guesses. What are they? If you are patient enough to keep reading, you can find them in the following sections.

Traffic Projection

This is the biggest portion of the assumptions I mentioned before, especially during your first step to the moon: the very first time span in which you launch your IT System. Who provides it? Usually our friendly Product Manager is the one who provides it. In order to hand it to the IT team, a good Product Manager will need to enter the kitchen and cook all the required ingredients: their knowledge of and experience with similar products on the market, enriched by the business target (how many users are expected to enjoy the newly (re)launched service), how active those users are expected to be, and the assumed growth rate (usually monthly). Here's one example showing how such a projection can be put together.
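To make that concrete, here is a minimal sketch of the arithmetic behind such a projection, in Python. Every number below is a hypothetical assumption for illustration; a real projection would use your Product Manager's figures:

```python
# Minimal traffic-projection sketch. All inputs are hypothetical
# assumptions; replace them with your Product Manager's figures.
MONTHS = 12                 # projection horizon
initial_users = 5_000       # assumed subscribers at launch
monthly_growth = 0.10       # assumed 10% month-over-month growth
active_ratio = 0.30         # assumed share of subscribers active on a given day
calls_per_active_user = 50  # assumed API calls per active user per day
peak_factor = 8             # assumed peak-hour load vs. a flat 24h average

users = initial_users * (1 + monthly_growth) ** MONTHS
daily_calls = users * active_ratio * calls_per_active_user
avg_tps = daily_calls / 86_400        # 86,400 seconds in a day
peak_tps = avg_tps * peak_factor

print(f"Subscribers after {MONTHS} months: {users:,.0f}")
print(f"Projected peak API load: {peak_tps:.1f} tps")
```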

Data and Facts

How about data and facts, then? In parallel, a wise man from IT will spread his wisdom by asking you to do several things, such as: conduct load / performance / stress tests on your APIs, web pages, and static content (you will need to record the results somewhere, as they will be very important); then collect the specifications of your test server(s) and also of your (soon to be) production server(s). To be more accurate, another wise man will tell you to grab benchmark results from some tools, or, to make it easier, to take them from the spec.org site.
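For the load test itself you would normally reach for a dedicated tool (JMeter, Gatling, wrk, and the like). Purely to illustrate what "measuring tps" means, here is a minimal sketch using only the Python standard library; the endpoint URL is a placeholder:

```python
# Minimal throughput-measurement sketch (not a substitute for a real
# load-testing tool). The URL below is a placeholder.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://test-server.example.com/api/ping"  # hypothetical endpoint
CONCURRENCY = 50
TOTAL_REQUESTS = 1_000

def call_once(_):
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            return resp.status
    except OSError:
        return None  # count timeouts/connection errors as failures

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(call_once, range(TOTAL_REQUESTS)))
elapsed = time.time() - start

ok = sum(1 for status in results if status == 200)
print(f"{ok}/{TOTAL_REQUESTS} OK in {elapsed:.1f}s "
      f"-> {TOTAL_REQUESTS / elapsed:.1f} tps")
```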

Here’s one load / performance / stress test sample :

Here’s server configuration sample :

And Here’s the data grabbed from spec.org site

Technical Assumptions

Patience; the exciting part comes in the next step, not this one. A seasoned IT Analyst or Architect will also add some assumptions to be considered, such as: what percentage to use as the constraint on CPU utilization, the average content size, and the average request and response packet sizes across your dozens of content items and APIs. Why a constraint on CPU? Because CPU utilization vs. load is not linear, and most people will need a buffer of computing power so that the system can handle a spike of requests during unexpected circumstances.

Calculation

Now we come to the exciting step, the "Calculation". Let's restate the information from above:

By end of year:
- Subscriber target is 18,634
- Peak API tps will be 187
- Peak CDN tps will be 466
Assumption made: average size of static content in the CDN is 250 KB

From the load test data, your test server can handle:
- API Server: up to 40 tps per node, at which point CPU utilization reaches 70%
- CDN Server: up to 200 tps per node
- DB Server: up to 100 tps per node

By simple math you can calculate the number of servers required:
1. Number of API Servers: roundup(187/40, 0) → 5 servers
2. Number of CDN Servers: roundup(466/200, 0) → 3 servers
3. Number of DB Servers: roundup(187/100, 0) → 2 servers
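The same arithmetic in Python, with math.ceil playing the role of the spreadsheet's roundup:

```python
import math

# Server counts = ceil(projected peak tps / tested tps per node),
# using the figures above.
api_servers = math.ceil(187 / 40)    # -> 5
cdn_servers = math.ceil(466 / 200)   # -> 3
db_servers  = math.ceil(187 / 100)   # -> 2
print(api_servers, cdn_servers, db_servers)  # 5 3 2
```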

Hey, what about network bandwidth? Well, in most cases network bandwidth matters mainly for the CDN.
So the bandwidth consumption will be 466 × 0.251 MBps ≈ 117 MBps, which is equivalent to roughly 936 Mbps.
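Again in Python, the same back-of-the-envelope conversion from transactions to bandwidth:

```python
# CDN bandwidth = peak tps x average object size; bytes -> bits is a factor of 8.
cdn_peak_tps = 466
avg_object_mbytes = 0.251                         # ~250 KB average static content
mbytes_per_s = cdn_peak_tps * avg_object_mbytes   # ~117 MBps
mbits_per_s = mbytes_per_s * 8                    # ~936 Mbps
print(f"{mbytes_per_s:.0f} MBps = {mbits_per_s:.0f} Mbps")
```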

So, is it safe to use only 3 CDN servers? Not quite yet. Why is that? Here's the story. If you're deploying your system in the cloud, you may want to check your (local) cloud provider's bandwidth. Some of those mini-sized cloud providers only give you 100 Mbps. In this case, you should request additional bandwidth, or, in the worst case, migrate to another cloud provider who can provide sufficient internet bandwidth, with acceptable TCO. Otherwise, if you insist on staying on the existing cloud, then, referring to the previous traffic projection, your new IT System (or cloud-based service) will most likely hit the rock of infrastructure limitation within the first month after you launch it.

Hey, how about the difference in CPU specification between the testbed and the production server? Well, you may use simple math to factor the performance capability of one server against the other, using their benchmark scores. Let's say:
- the testbed is only able to serve 40 tps of API calls,
- so the production server will theoretically be able to handle: 40 × (52.3 / 39.1) ≈ 53 tps.
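In Python, the same factorization, reusing the numbers above (52.3 and 39.1 being the production and testbed benchmark scores from the spec.org data):

```python
import math

# Scale tested throughput by the ratio of the two servers' benchmark scores.
testbed_tps = 40
prod_score, testbed_score = 52.3, 39.1
prod_tps = testbed_tps * (prod_score / testbed_score)  # ~53.5 tps per node

api_servers_prod = math.ceil(187 / prod_tps)           # -> 4 servers
print(f"{prod_tps:.1f} tps per node -> {api_servers_prod} API servers")
```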

So, if the production hardware uses the above specification, you can reduce the number of API servers required from 5 to 4.

How about the other factors, such as DB storage and log storage? Well, it would be much more fun if you're willing to give it a try: do the simple math and run the calculation yourself. Bon appétit, mate!
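If you'd like a head start on that homework, here is one possible shape of the math; every number below is a made-up assumption, so plug in your own measurements:

```python
# Rough DB-storage growth sketch. All inputs are hypothetical assumptions.
peak_api_tps = 187
avg_load_ratio = 0.2    # assume average load is ~20% of peak
avg_record_kb = 2       # assumed data written per transaction
retention_days = 365

daily_tx = peak_api_tps * avg_load_ratio * 86_400
growth_gb = daily_tx * avg_record_kb * retention_days / 1_048_576  # KB -> GB
print(f"Estimated first-year DB growth: {growth_gb:,.0f} GB")
```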

Did you find it simple? If you're still confused, contact me on Twitter: @arydewo