Amazon AWS Cloud

Lessons Learned So That You Don’t Make the Same Mistakes I Made


I launched my first instance on the Amazon AWS cloud back in 2009. And today (July 7, 2014) I manage close to 400 instances of all types. We are 100% AWS cloud deployed. Our entire infrastructure (load balancing, DNS, databases, web servers, app servers, Hadoop, Vertica, auto scaling, monitoring, etc.) relies on it. Here’s a comprehensive list of lessons, gotchas, and rules of thumb I’ve learned over the years. I’m sharing this with you, so that you don’t make the same mistakes I’ve made!

The purpose of this document is to share with you what you need to do, and less about how to do it. There are plenty of great online resources to help you with the how. I hope this document provides a good reference for you to know what to Google.

This is going to be more of a random and perpetual list. I’ll continually add to this list as I think of more things to share.


Table of Lessons

  1. Know your limits!
  2. You’ve been profiled!
  3. Pre-warm your ELBs (Elastic Load Balancer) unless you are cooking casseroles.
  4. Don’t forget to assign an IAM role to your instance when launching it!
  5. Stop stealing my CPU!
  6. Don’t declare too many rules to your EC2 Security Group! Mysterious things can happen.
  7. Your logs are gone when you need them the most. Centralize Them!
  8. Create IAM user accounts for your users.
  9. Use the free storage space your instance comes with. It’s free! And it’s fast!
  10. Tag, tag, tag! Tag them instances!
  11. Beware. They’re not the same.

Know your limits!

Your default AWS account comes with a set of limits and restrictions that you never knew you had. And almost always, you’ll find out about it when you’ve actually hit your limit. This is bad news if you have urgent need to launch additional web servers or have extra storage requirement and you can’t because you’ve reached your limit. This is especially bad if your environment is automated and your automation fails. This almost always happens in the middle of the night on a weekend after you’ve had a drink or two. And if you don’t have premium support, then it can be a long wait until your limit gets increased. So plan ahead and make that request to increase your limit in advance.

Here are some important ones to be aware of:

  • Limit on running on-demand EC2 instances: 20. Once you’ve hit this limit, you will not be able to launch any more instances.
  • Limit on EBS volume storage: 20TB. Once you’ve hit this limit, you will not be able to create and attach any more EBS volumes. Even worse, you’ll not be able to launch any more EBS-backed instances.
  • Limit on the number of S3 storage buckets: 100. This is a hard limit and can’t be increased. So don’t be too trigger happy with creating buckets. Come up with a good strategy. A bad strategy would be creating a bucket per date. Remember that you can always use key prefixes (“folders”) inside your primary bucket; there is no such thing as a sub-bucket.
  • Limit on the number of Elastic IPs: 5. If you are going to have more than 5 public-facing machines that need static IPs, then make the request to increase this limit. But this is less critical. Although it’s a hassle, you can always update your DNS to point to the non-static IP address assigned to your instance.

As of July 2014, Amazon has added a “Limits” page, accessible from the EC2 dashboard console screen, that shows all of your active limits. And they made it easy to make the request to increase your limit right from that page. Thank you Amazon!
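If your environment is automated, it’s worth automating the limit check too. Here’s a minimal Python sketch of the idea (the limit values are the 2014 defaults quoted above, and the function name is my own invention; a real check would pull current usage from the AWS API):

```python
# Sketch: warn before hitting account limits, instead of at 3am on a weekend.
# These are the 2014 default limits quoted above; check the Limits page for
# your account's actual values.
DEFAULT_LIMITS = {
    "on_demand_instances": 20,
    "ebs_storage_tb": 20,
    "s3_buckets": 100,   # hard limit, cannot be raised
    "elastic_ips": 5,
}

def limit_warnings(usage, limits=DEFAULT_LIMITS, threshold=0.8):
    """Return the resources whose usage is at or above `threshold` of the limit."""
    return sorted(
        name for name, used in usage.items()
        if used >= limits[name] * threshold
    )

# Example: 17 running instances is 85% of the default 20-instance limit.
print(limit_warnings({"on_demand_instances": 17, "elastic_ips": 2}))
# → ['on_demand_instances']
```

Run something like this on a schedule and you get a request for a limit increase filed weeks before the limit actually bites.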


You’ve been profiled!

Amazon’s capacity team is always monitoring resource usage across all accounts in order to ensure resources are available for their customers. And sometimes this means taking extreme measures. In other words, they screw you so that others don’t get screwed. What they do is secretly profile you based on your usage pattern. They categorize their customers as new (noobs they couldn’t care less about), steady (well-behaved customers), and spiky (bad customers). You really don’t want to be on their bad side. Spiky customers are those that occasionally launch and terminate many (100+) instances in a short period of time. And if you get categorized as spiky during what they call their “safety capacity period”, then you’ll be prevented from launching instances. This safety capacity period is completely internal, and they don’t publicly announce when they enter it.

So how do you get on their good side? Reserve your instances. If you plan to have spiky moments, reserve that many instances. But this isn’t always practical. You don’t want to pay the upfront fee if you don’t plan on fully utilizing that many instances for a long enough duration to recover your cost. A more practical approach is to contact AWS support and let them know in advance. Then they may be able to increase your buffer/threshold level.


Pre-warm your ELBs (Elastic Load Balancer) unless you are cooking casseroles.

You’ve deployed your web servers behind an ELB and configured everything to auto scale using Amazon’s Auto Scaling and CloudWatch. Now nothing can stop you. You are ready for millions of people coming to your website, right? Wrong! Amazon’s load balancers aren’t designed to handle a sudden traffic increase. If your traffic growth is steady and gradual, then no problem. But if you are running a new Super Bowl commercial and expect a sudden spike in traffic, then your ELB will most likely choke and your customers’ browser sessions will time out. The solution is to contact AWS support in advance and have them pre-warm your ELB. When you make this request, AWS support will assign additional ELB endpoints to your ELB so it can handle the anticipated spike. They will need some information from you to determine how much to pre-warm your ELB. Be prepared to give them details like:

  • # of anticipated requests per second
  • average response size in bytes
  • SSL / Non SSL traffic ratio
  • Start and end time of your anticipated spike

If your spiky activity is frequent and unpredictable, they may be able to permanently pre-warm your ELB.
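It helps to work these numbers out before you open the ticket. Here’s a back-of-envelope sketch in Python (the function and formula are my own; the inputs map to the bullet points above):

```python
def prewarm_estimate(requests_per_sec, avg_response_bytes, ssl_ratio):
    """Back-of-envelope numbers to hand to AWS support for an ELB pre-warm request."""
    # Egress bandwidth implied by the request rate and average response size.
    egress_mbps = requests_per_sec * avg_response_bytes * 8 / 1_000_000
    return {
        "requests_per_sec": requests_per_sec,
        "egress_mbps": round(egress_mbps, 1),
        "ssl_requests_per_sec": round(requests_per_sec * ssl_ratio),
    }

# Hypothetical spike: 20,000 req/s, 30 KB average responses, half over SSL.
print(prewarm_estimate(20_000, 30_000, 0.5))
```

Having the bandwidth number handy, not just the raw request rate, makes the conversation with support go faster.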

Don’t forget to assign an IAM role to your instance when launching it!

This is so very important, because once your instance is launched, there’s no way to assign an IAM role to it. It’s too late! It can only be assigned as part of your instance launch. So what is this all about? Why do I need to assign an IAM role to my instance?

Most likely you’ll end up needing to make AWS API requests from within your running instance: from your applications, cron jobs, scripts, ad-hoc command line requests, etc. You need to provide AWS credentials (i.e. AWS access key id and secret access key) to do that. If you do not assign an IAM role to your instance, then the only option you have is to hard-code this key pair somewhere on your instance (i.e. in a config file, or within your script/application). Imagine your instance getting hacked into. Or a more likely scenario would be a disgruntled employee who has access to the instance.

Do NOT store your AWS credentials on your instance! By assigning an IAM role to your instance, you and your applications will seamlessly inherit your instance’s IAM role and be able to make the API calls without having to provide the AWS credentials.

Even if you don’t have a need to make API requests now, just create a new IAM role (leave its policies empty for now) and assign it to your instance. As needs change, you can add new policies to your IAM role.

Don’t create a default IAM role and use it for all your instances. That would be the second dumbest thing to do (first being not assigning an IAM role to your instance at all). Assign a unique IAM role per instance (or per cluster of instances with the same role). You don’t ever want to assign more privileges than your instance will ever need. Always practice the principle of least privilege.
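As a sketch of what “least privilege per role” looks like, here’s a tiny Python helper that builds an IAM policy document granting only the actions you list (the S3 action names are real IAM actions; the bucket ARN and helper name are hypothetical):

```python
import json

def least_privilege_policy(actions, resource="*"):
    """Build an IAM policy document granting only the listed actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": sorted(actions), "Resource": resource}
        ],
    }

# A web-server role that only needs to read one S3 bucket (bucket name made up):
policy = least_privilege_policy(
    ["s3:GetObject", "s3:ListBucket"],
    resource="arn:aws:s3:::my-app-assets/*",
)
print(json.dumps(policy, indent=2))
```

Attach a document like this to the web cluster’s role, a different one to the Hadoop cluster’s role, and so on; no role ever carries permissions another cluster needed.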


Stop stealing my CPU!

Instances you launch come with varying combinations of CPU, memory, and storage capacity depending on the instance type. When you launch an m1.large instance, for example, it comes with 2 virtual CPUs. However, the actual physical machine where your instance is hosted may have the capacity of 10 vCPUs. That means the machine may be capable of hosting 5 m1.large instances.

When you run “top” in a VM environment, you get what’s called the STEAL (sometimes called STOLEN) percentage, represented as “%st”. When you have a positive STEAL value, it means that your instance is in fact maxing out its CPU and is over-requesting by that much (the STEAL %). But because the physical machine is also maxed out, it’s not able to give you any more CPU than what your instance is assigned to have.

If a physical host is under-utilized (i.e. maybe the machine has 10 vCPUs with only 2 actively running m1.large instances), then it can actually give you more CPU than what your instance was contracted to have. Yay, free CPU! So next time you do a “top” and consistently see a high STEAL value, it’s indicating two things:

  • your instance is working very hard (at least in terms of cpu usage)
  • the physical machine is also fully maxed out.

It may actually make sense to stop and start this instance and hope to have it launched on a different physical machine that’s under-utilized. Then we can “STEAL” more CPU from the physical machine.

But do note that if you are consistently seeing high CPU STEAL value no matter how many times you relaunch, or if all of your instances in the same cluster are consistently showing high CPU STEAL value, then it’s likely that you are, in fact, the cause of the physical machine maxing out the CPU resource. Then it’s time to start thinking about upgrading your instance to an instance with more CPUs.
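You can pull the steal number straight from /proc/stat rather than eyeballing top. Here’s a minimal Python sketch (the counter values are made up; note that these counters are cumulative since boot, so a real monitor would diff two samples taken a few seconds apart):

```python
def steal_percent(proc_stat_cpu_line):
    """Steal share from an aggregate /proc/stat 'cpu' line.

    Field order: user nice system idle iowait irq softirq steal [guest ...].
    These counters are cumulative since boot; a real monitor diffs two
    samples to get the current steal rate.
    """
    fields = [int(x) for x in proc_stat_cpu_line.split()[1:]]
    steal = fields[7] if len(fields) > 7 else 0
    return 100.0 * steal / sum(fields[:8])

# Made-up sample: 1400 of 10000 jiffies stolen by the hypervisor.
sample = "cpu 4000 0 1000 3000 500 0 100 1400"
print(f"{steal_percent(sample):.1f}% steal")  # → 14.0% steal
```

A cron job computing this across the fleet makes it easy to spot which instances are worth the stop/start roulette described above.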


Don’t declare too many rules to your EC2 Security Group! Mysterious things can happen.

Amazon’s Security Group acts as a firewall that controls inbound connections (and outbound for VPCs) to your instances. You assign one or more security groups to your instance when you launch it. You can add one or more firewall rules to each security group. But don’t add too many. The rule of thumb is to not add more than 75 (or 100, as some AWS support engineers will tell you). When you have too many rules, you may start experiencing random and unexplainable network connectivity issues. This will almost always result in hours and hours of troubleshooting, with and without AWS support, that never comes to a conclusion. Just take it from me. Do not declare more than 75 rules in your security group!
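To keep yourself honest, a check like this can flag groups creeping past the threshold (the data structure is hypothetical; in practice you’d pull the rules from the AWS API):

```python
def oversized_groups(security_groups, max_rules=75):
    """Flag security groups whose inbound rule count exceeds the safe threshold."""
    return [name for name, rules in security_groups.items() if len(rules) > max_rules]

# Hypothetical groups: one (protocol, port, cidr) tuple per rule.
groups = {
    "web": [("tcp", 80, "0.0.0.0/0"), ("tcp", 443, "0.0.0.0/0")],
    "legacy": [("tcp", 10000 + i, "10.0.0.0/8") for i in range(120)],
}
print(oversized_groups(groups))  # → ['legacy']
```

When a group trips the check, split its rules across multiple security groups (remember you can attach several to one instance) instead of letting one group keep growing.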


Your logs are gone when you need them the most. Centralize Them!

Log centralization becomes particularly important in the cloud. Typically when you want to look at system logs or application logs, it’s to investigate some sort of failure. But more often than not, when there is a failure on a particular instance, that instance becomes unavailable. It can get terminated by the auto scaler, or the machine just goes dead. These are commodity machines and you should expect them to be short-lived; you need to design your infrastructure to deal with failures. Or maybe you still have access to your instance. But if it’s part of a big cluster of 100 instances, identifying the source instance can be a big hassle in itself. Centralize your logs! Stream your logs into a centralized location. But you are not done yet. You also need to provide easy means for your developers to search through those logs.

This is an important subject. So let me dig deeper and provide you with some resources to get you started.

  • rsyslog / syslog-ng — It’s possible to have your apps write logs to syslog. And you can set up syslog to forward logs to a remote syslog daemon. Here’s a nice article that talks about that: http://www.linuxjournal.com/content/creating-centralized-syslog-server
  • Splunk — Splunk is enterprise software that you install and run on your own premises. It is an extremely sophisticated and feature-rich log analytics product. But if you generate lots of logs (multiple gigabytes per day), then be forewarned that Splunk can get pretty darn expensive very quickly.
  • Loggly — Loggly is a cloud-based hosted solution started by former Splunk employees. The setup is easy. You can have your applications streaming logs into your Loggly account in just minutes. It’s cheaper than Splunk, but it can still get pretty expensive if you stream multiple gigabytes per day.
  • Logstash / Elasticsearch / Kibana— This is my personal favorite solution that I’ve deployed into our environment. It is open-source, robust, and feature-rich. The data flow looks something like this: Logs -> Logstash Shipper -> Kafka -> Logstash Indexer -> Elasticsearch -> Kibana. So we have logstash shipping agent running on all of our ec2 instances tailing all of the system logs and application logs and streaming them over to Kafka (our message broker). In the eyes of Kafka, these are the publishers. On the other side of Kafka, we have logstash indexers that consume the log data from Kafka. And they’re responsible for doing all log filtering and indexing into Elasticsearch, as their final destination. And we use Kibana (web user interface) for running searches. Today we centralize about 2 terabytes of log data per day from 300+ instances in near-real-time. This is already a lot more than what I wanted to cover in this article and deserves to be a standalone article itself. If/When I do write a new article on Log Centralization, I’ll update this article with a link.
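To give a feel for what the Logstash indexers in the pipeline above actually do, here’s a stripped-down Python sketch of parsing one classic BSD-format syslog line into the structured document that would get indexed into Elasticsearch (the regex and field names are my own simplification, not Logstash’s actual grok patterns):

```python
import re

# Classic BSD syslog format: "MMM dd HH:MM:SS host program[pid]: message"
SYSLOG_RE = re.compile(
    r"(?P<timestamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<program>[\w./-]+)(?:\[(?P<pid>\d+)\])?: (?P<message>.*)"
)

def parse_syslog(line):
    """Turn one raw syslog line into the structured document an indexer would store."""
    m = SYSLOG_RE.match(line)
    return m.groupdict() if m else None

doc = parse_syslog("Jul  7 03:12:45 web-01 nginx[1234]: upstream timed out")
print(doc)
```

Once every line is a structured document like this, searching “all nginx errors from web-01 in the last hour” in Kibana is a filter, not a grep across 300 machines.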

Create IAM user accounts for your users.

Don’t share your master login account. In fact, as the master account holder, even you shouldn’t use your master account to log in. What if your master account password gets stolen? Look at what happened to Code Spaces. They recently went out of business after their AWS account was hacked. The perpetrator deleted all of the data they had stored in their AWS account. For any user that needs to access the AWS Management Console, do the following two things:

  1. Create an IAM user account for that user. And grant the least amount of privilege the user needs. Do not enable an access key unless the user needs to make API calls. But technically, they shouldn’t need one as long as you’ve assigned IAM roles to your instances. They should be able to SSH into an instance (that has the right set of IAM privileges) and make the necessary API requests.
  2. Enable MFA (multi-factor authentication). It’s extremely easy and costs you nothing. Users can install a virtual MFA device on their smart device. Authenticating the device takes less than a minute. You can go to this page for documentation and a video tutorial. With MFA enabled, your users will now need to enter both their password and the authentication code the MFA device generates.

Use the free storage space your instance comes with. It’s free! And it’s fast!

Every EC2 instance (except for t1.micro) comes with a varying amount of ephemeral/temporary disk storage that is physically attached to the host computer. Amazon refers to this storage as the instance store. An instance store is considered ephemeral or temporary because once the instance gets terminated or stopped (if you are running an EBS-backed instance), all of the data in the instance store gets deleted and recycled. So it isn’t ideal for storing any permanent data. You should use an EBS volume for that. But it’s plenty of storage space for your /tmp folder or your swap partition. At this point, I should run a quick comparison between an instance store and an EBS volume.

Data Persistence

  • EBS volume: permanent. persists even after the instance termination (unless you’ve enabled the volume to be deleted when instance is terminated)
  • Instance store: data is lost when instance is terminated or stopped

I/O Performance

  • EBS volume (magnetic): typically slower than the instance store, and inconsistent at times due to the extra network hop required to reach your storage. For optimal performance, go with an EBS-optimized instance and Provisioned IOPS volumes.
  • Instance store: faster than a magnetic EBS volume, and consistent.

You should also consider using the instance store as database storage if you have a cluster with a high enough replication factor (minimum 3) to ensure fault tolerance. This way if you lose an instance or two, your cluster as a whole can still function with no data loss (to date, we’ve never had more than 2 nodes fail at any given time). For example, you can set up a Hadoop cluster of 100 nodes with RF=3 running on the m1.xlarge instance type, which comes with 4 x 420 GB (1680 GB) of ephemeral storage. That’s over 164TB of free storage! If you had to purchase that much storage on EBS, it would cost you over $9,000/month. And it would be slower (unless you spent the extra money on EBS-optimized instances with Provisioned IOPS).
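For the curious, the arithmetic works out roughly like this. The $0.05/GB-month figure is my assumption of the 2014 magnetic EBS price; the $9,000 quoted above likely also includes per-request I/O charges, which this sketch ignores:

```python
# Back-of-envelope math: 100 m1.xlarge nodes, each with 4 x 420 GB of
# ephemeral storage, versus buying the same space on EBS.
nodes = 100
ephemeral_gb_per_node = 4 * 420            # 1680 GB per m1.xlarge
total_gb = nodes * ephemeral_gb_per_node   # 168,000 GB
total_tb = total_gb / 1024                 # ~164 TB

# Assumed 2014 magnetic EBS price; excludes per-request I/O charges.
price_per_gb_month = 0.05
ebs_monthly_cost = total_gb * price_per_gb_month

print(f"{total_tb:.0f} TB of free storage; ~${ebs_monthly_cost:,.0f}/month on EBS")
```

That’s where the 164TB figure comes from, and why the storage comes out as “free” once you’re paying for the instances anyway.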


Tag, tag, tag! Tag them instances!

You should tag relentlessly. Tag all your instances. Tag all your EBS volumes. Tag all your RDS instances. Tag all that you can until you can tag no more. Most AWS resources allow you to provide your own metadata as a tag in the form of a key and a value. This is useful for properly identifying your resources and categorizing them in different ways. Still don’t get it? How is this useful? Let me count the ways.

Be able to humanly identify your resources

Would you rather look at a list of cryptic instance IDs, or a list of clearly named resources?

By using the “Name” tag key, you can name your individual AWS resources. By default, the AWS Management Console will display the values of your “Name” tag. With those values, you can clearly identify your resources. Without them, the only thing you have to go by is the cryptic instance ID.

Better integration with 3rd party software and services

Most third-party AWS-integrated services (i.e. cloud management solutions, monitoring solutions, etc.) will take advantage of your tags to intelligently label and categorize your resources. Without tags, you have to do this manually to produce any type of meaningful reports and analytics.

Produce meaningful cost allocation reports

Tags become extremely useful for allocating your AWS costs. You can create custom key-value tags to associate your resources with owners, departments, products, purposes, etc. For example, I can create the following two keys: “Product Name” and “Department”. And I can assign values for these two keys across all of my AWS resources. Then you can use the AWS billing tool and configure it to use the tags to organize your AWS bill to reflect your own cost structure. This way you know how much is being spent to support your Product A, and how much is being spent by Department B. You can find out more about cost allocation and tagging by going here.

Tagging doesn’t necessarily have to be a manual process. There are plenty of ways to semi-automate this.
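As a sketch of why the billing tags pay off, here’s what a cost rollup by tag looks like in Python (the resource IDs, costs, and the “(untagged)” bucket are all made up for illustration):

```python
from collections import defaultdict

def cost_by_tag(resources, tag_key):
    """Roll up per-resource monthly cost by the value of one tag key."""
    totals = defaultdict(float)
    for r in resources:
        totals[r["tags"].get(tag_key, "(untagged)")] += r["monthly_cost"]
    return dict(totals)

# Hypothetical resources carrying the "Product Name" / "Department" tags above:
resources = [
    {"id": "i-1a2b3c4d", "monthly_cost": 180.0,
     "tags": {"Product Name": "Product A", "Department": "Engineering"}},
    {"id": "i-5e6f7a8b", "monthly_cost": 90.0,
     "tags": {"Product Name": "Product A", "Department": "Marketing"}},
    {"id": "vol-9c0d1e2f", "monthly_cost": 30.0, "tags": {}},
]
print(cost_by_tag(resources, "Product Name"))
# → {'Product A': 270.0, '(untagged)': 30.0}
```

The “(untagged)” bucket is exactly the line item you want to drive to zero; anything landing there is spend nobody is accountable for.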

Beware. They’re not the same.

Let’s say you have 10 m1.large instances. Are they all the same? Will they perform the same? You would think so. But chances are that they are not the same and will perform differently. The difference is the CPU. Keep in mind that Amazon’s capacity team is constantly making changes: purchasing new machines, and replacing and upgrading old ones. Chances are that the host computers will have many different types of CPUs, ranging from older to newer models.

Here’s an output showing the ‘model name’ line of /proc/cpuinfo using Capistrano (psst! Capistrano is a great tool for running ad-hoc tasks when you have many instances).

cap> with web_cluster
scoping with web_cluster
cap> grep ‘model name’ /proc/cpuinfo
** [out :: 10.117.58.42] model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
** [out :: 10.22.121.34] model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
** [out :: 10.22.121.34] model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
** [out :: 10.20.124.32] model name : Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
** [out :: 10.40.124.32] model name : Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
** [out :: 10.54.33.110] model name : Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
** [out :: 10.214.119.35] model name : Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
** [out :: 10.214.119.35] model name : Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
** [out :: 10.7.6.62] model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
** [out :: 10.7.1.62] model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz

All 10 instances are running on m1.large. But you can see the wide range of CPU types:

  • E5430 @ 2.66GHz
  • E5645 @ 2.4 GHz
  • E5507 @ 2.27 GHz
  • E5-2650 0 @ 2.00 GHz

With a little Googling, you can find out which CPUs perform better than others. What I sometimes do is launch a bunch of instances, check their CPUs, pick the ones I want, then terminate the rest. It does take some dedication to do this. Yes, you can automate the whole process. If you are really tight on budget and want maximum utility out of your instances, then this is one option you can exercise.
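The launch-many-pick-the-best trick is easy to script. Here’s a sketch of the grouping step in Python (the instance IDs and cpuinfo lines are made up; fetching them is left to your Capistrano run or the AWS API):

```python
from collections import defaultdict

def group_by_cpu(cpuinfo_by_instance):
    """Group instance IDs by the CPU model reported in /proc/cpuinfo."""
    groups = defaultdict(list)
    for instance_id, model_line in cpuinfo_by_instance.items():
        # "model name : Intel(R) Xeon(R) CPU ..." -> keep everything after the colon
        model = model_line.split(":", 1)[1].strip()
        groups[model].append(instance_id)
    return dict(groups)

# Hypothetical fleet: keep the instances on the CPU you want, terminate the rest.
fleet = {
    "i-aaa": "model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz",
    "i-bbb": "model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz",
    "i-ccc": "model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz",
}
groups = group_by_cpu(fleet)
print(groups)
```

From here, the script picks the group with the CPU you want and terminates everything else, turning the “launch a bunch and cherry-pick” dedication into one command.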
