Using AWS’s (Ruby) SDK
No Pain, No Gain
In my previous blog post, I provided a vision for what we want to achieve at ACL for our ACL GRC SaaS platform. This post will start going into the details of how we have implemented our vision. In particular, in this post I will share with you the hurdles associated with using AWS’s (Ruby) SDK and how we overcame them.
AWS’s SDK — Programmatic Access to AWS
To make our vision come true, we ultimately chose to use AWS’s SDK among the myriad of options available. It was too risky to lock ourselves into any 3rd-party community library or tool that might stop receiving support, or worse, become obsolete. We needed to be future proof and have unrestricted access to AWS’s resources; thus we settled on AWS’s (Ruby) SDK. Unfortunately, what we expected from the SDK turned out to be far different from what we got.
More Low-Level than Expected
When you first evaluate AWS, you’ll probably start by using AWS’s Management Console (i.e. the web interface) and recognize how easy it is to set up everything. A click here, a tweak there, and voila, boxes are up and running. How awesome, right? Surely the Management Console is a thin wrapper around an SDK they must be using underneath, and this SDK must be what they release publicly, right? That’s a natural train of thought to have, but — surprise surprise — that’s not the case.
Unfortunately, AWS’s SDK is implemented at a lower level of abstraction than we anticipated. It exposes a lot of internal AWS details that we did not want to see. And guess what, we don’t like seeing what we don’t need to see.
Here is what you’ll quickly learn from using the SDK:
- Simple single-button actions on the Management Console require significantly more steps via the SDK. On the console, you select your VPC, select “Delete” and *poof*, it’s gone. In the SDK, you need to delete everything in your VPC, one-by-one, in the appropriate sequence, with the appropriate wait-until-complete logic, before you can — finally — remove your VPC. Thus your expectation of deleting a VPC with a single line of code is in reality 100+ lines of code with lots of orchestration logic built around it. In order to delete a VPC, you need to do the following in this (rough) order.
1. Terminate all instances in your VPC
2. Delete all ENIs associated with subnets within your VPC
3. Detach all Internet and Virtual Private Gateways (you can then delete them and any VPN connections, but that’s not required to delete the VPC object)
4. Disassociate all route tables from all the subnets in your VPC
5. Delete all route tables other than the “Main” table
6. Disassociate all Network ACLs from all the subnets in your VPC
7. Delete all Network ACLs other than the Default one
8. Delete all Security Groups other than the Default one (note: if one group has a rule that references another, you have to delete that rule before you can delete the other security group)
9. Delete all subnets
10. Delete your VPC
11. Cry (if done manually)
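Compressed into code, the teardown above looks roughly like this. This is a sketch, not a complete or production-ready implementation: it assumes the modular aws-sdk-ec2 gem, uses illustrative region and VPC identifiers, omits retries, and only shows a few of the steps.

```ruby
require "aws-sdk-ec2" # sketch only: no retries, several steps elided

ec2    = Aws::EC2::Client.new(region: "us-west-2") # illustrative region
vpc_id = "vpc-0123abcd"                            # illustrative id
in_vpc = [{ name: "vpc-id", values: [vpc_id] }]

# 1. Terminate all instances in the VPC and wait for them to die.
ids = ec2.describe_instances(filters: in_vpc)
         .reservations.flat_map { |r| r.instances.map(&:instance_id) }
unless ids.empty?
  ec2.terminate_instances(instance_ids: ids)
  ec2.wait_until(:instance_terminated, instance_ids: ids)
end

# 3. Detach internet gateways (steps 2 and 4-8 follow the same
#    describe / iterate / delete pattern and are omitted here).
ec2.describe_internet_gateways(
  filters: [{ name: "attachment.vpc-id", values: [vpc_id] }]
).internet_gateways.each do |igw|
  ec2.detach_internet_gateway(internet_gateway_id: igw.internet_gateway_id,
                              vpc_id: vpc_id)
end

# 9. Delete all subnets.
ec2.describe_subnets(filters: in_vpc).subnets.each do |subnet|
  ec2.delete_subnet(subnet_id: subnet.subnet_id)
end

# 10. Finally, the VPC itself.
ec2.delete_vpc(vpc_id: vpc_id)
```

Even this condensed version hides the hard part: each delete can be rejected while a dependent resource is still winding down, so real code wraps every call in retry logic.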
- On the Management Console, failures are infrequent; while in the SDK, they are frequent. These failures are not ones you’d typically expect either: calls to provision resources can fail despite successful responses; a resource is created but cannot be found immediately; the SDK may mention three availability zones are available, but in reality, only two are available to provision new resources. To be fair, such behaviour is not uncommon for a highly distributed system; but it will still surprise you — especially if you come from an application development background.
- On the Management Console, you’ll rarely hit rate limits; your code, however, will hit many. If you perform too many operations against AWS within a short time period, it will start rejecting your calls. This problem is amplified when multiple developers use the SDK at the same time and in the same region. Simple scripts will not expose this problem at first, but a complete automated test suite built around the SDK will. In fact, once you start making numerous back-to-back calls to AWS’s APIs, you’ll notice every API endpoint can fail, and each fails with a uniquely named exception, e.g. Aws::EC2::Errors::RequestLimitExceeded or Aws::AutoScaling::Errors::Throttling.
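In practice, every SDK call ends up wrapped in retry-with-exponential-backoff logic. Here is a minimal, self-contained sketch; the helper name and the pattern-based error matching are ours for illustration, and real code would rescue each service’s concrete error classes instead.

```ruby
# Retry a block when AWS throttles us, backing off exponentially.
# Matching on class name / message is illustrative; production code
# would rescue Aws::EC2::Errors::RequestLimitExceeded,
# Aws::AutoScaling::Errors::Throttling, etc. per service.
THROTTLE_PATTERNS = [/RequestLimitExceeded/, /Throttling/].freeze

def with_backoff(max_tries: 5, base_delay: 1.0)
  tries = 0
  begin
    yield
  rescue StandardError => e
    tries += 1
    throttled = THROTTLE_PATTERNS.any? do |pattern|
      pattern.match?(e.class.name) || pattern.match?(e.message)
    end
    raise unless throttled && tries < max_tries
    sleep(base_delay * (2**(tries - 1))) # 1s, 2s, 4s, 8s, ...
    retry
  end
end
```

With a wrapper like this in place, throttling becomes an invisible delay rather than a scattered family of exceptions.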
- Once you get past the rate-limiting issue, you start having timing issues. AWS expects a certain state and order for operations to succeed. When a step involves provisioning resources, subsequent steps must carefully time their executions, otherwise they will fail. For example, if you want to create an ElastiCache instance and associate an AWS tag to it, the SDK can start creating an instance, but you have to wait until it is created — and in the right state — before adding a tag to it. Yet — even then — adding a tag may fail because the instance was too recently created.
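One mitigation is to lean on the SDK’s built-in waiters and still retry the follow-up call, since even a “ready” resource can briefly reject operations. The sketch below assumes the modular aws-sdk-elasticache gem; the region, cluster id, and account number are illustrative.

```ruby
require "aws-sdk-elasticache" # illustrative identifiers throughout

ec = Aws::ElastiCache::Client.new(region: "us-west-2")

# Block until AWS reports the cluster as "available"; the built-in
# waiter polls describe_cache_clusters under the hood.
ec.wait_until(:cache_cluster_available, cache_cluster_id: "session-store")

# Even after the waiter returns, tagging can fail for a short window,
# so retry a few times before giving up.
arn = "arn:aws:elasticache:us-west-2:123456789012:cluster:session-store"
attempts = 0
begin
  ec.add_tags_to_resource(
    resource_name: arn,
    tags: [{ key: "Environment", value: "staging" }]
  )
rescue Aws::ElastiCache::Errors::ServiceError
  attempts += 1
  if attempts < 5
    sleep 2
    retry
  end
  raise
end
```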
- Lastly, to top it all off, AWS’s SDK has inconsistent interfaces. You may utilize Relational Database Service’s (RDS) API and believe that switching over to ElastiCache’s API will be trivial due to how similar these resources are; yet, you’ll realize they have many nuanced differences. For example, RDS resources require parameter groups, while ElastiCache resources do not. You’ll find that while your own domain classes have a similar interface, the required AWS API calls underneath are different.
These are only the obvious pain points you’ll encounter when you first start using AWS’s SDK. You’ll unfortunately hit other unexpected pain points that can confuse you and catch you off guard. For example, what do you do when the SDK returns a success response but in reality the AWS operation failed internally? ¯\_(ツ)_/¯
Overcoming the Pain
To overcome these hurdles, we decided to do what software engineers do best: abstract away the complexity. In this case, we decided to create three layers of abstraction in order to implement our vision.
First Layer of Abstraction — AWS SDK Simplification
The first layer was an Object-Oriented CRUD interface over AWS’s services. Rather than having to use AWS’s SDK API to create an ElastiCache instance like so:
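Roughly, creating a cluster with the raw client looks like this (a sketch assuming the modular aws-sdk-elasticache gem; the identifiers and parameter values are illustrative, not our actual configuration):

```ruby
require "aws-sdk-elasticache"

client = Aws::ElastiCache::Client.new(region: "us-west-2")
client.create_cache_cluster(
  cache_cluster_id:        "session-store",
  engine:                  "redis",
  cache_node_type:         "cache.m3.medium",
  num_cache_nodes:         1,
  cache_subnet_group_name: "private-subnets",
  security_group_ids:      ["sg-0123abcd"]
)
# ...plus waiting for availability, retrying on throttles, and
# verifying the cluster really exists before moving on.
```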
We could instead create an ElastiCache instance like so:
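A sketch of the intended usage; ElastiCache here is our own wrapper class, not the SDK’s, and the field names are simplified for illustration:

```ruby
# Hypothetical usage of our wrapper class. persist hides the retry and
# wait logic and blocks until the cluster is truly available.
cache = ElastiCache.new(name: "session-store", subnets: private_subnets)
cache.persist

# Later, the same interface everywhere:
cache = ElastiCache.find("session-store")
cache.remove
```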
We used the same new, persist, find, remove interface across all our AWS resource abstractions. Our interface hid the complexity of rate limiting and other orchestration details (such as waiting for a resource to truly persist) that the SDK leaves to you. This step alone drastically increased the legibility and usability of AWS’s SDK.
Additionally, rather than exposing every configuration parameter possible for these resources, we created a two-tiered system for such parameters. The first tier exposed the key configurations we cared about as individual fields in the class. The second tier exposed all the remaining AWS configurations via an aws_options field that was merged into the underlying SDK API calls. If any second-tier configuration became important to us, it would be promoted to the first tier. This two-tier system allowed a more intuitive abstraction without limiting the underlying power of AWS’s SDK.
A (simplified) example showcasing our abstraction and use of two-tiered configurations is shown below. Note how name and subnets parameters are first-tier configurations, while aws_options is a bucket for all second-tier configurations. Additionally, note our use of our helper class Reliably which alleviates rate-limiting issues and waits for the ElastiCache resource to fully persist in AWS before returning from its persist method.
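Here is a hedged reconstruction of the shape described above; the field names, the Reliably interface, and the subnet-group handling are all illustrative rather than our verbatim implementation.

```ruby
require "aws-sdk-elasticache" # assumes the modular v3 gem

# Minimal stand-in for the Reliably helper described in the text.
module Reliably
  def self.call(tries: 5)
    yield
  rescue Aws::Errors::ServiceError => e
    raise unless (tries -= 1) > 0 && e.message =~ /Throttling|RequestLimit/
    sleep 1
    retry
  end
end

class ElastiCache
  attr_reader :name, :subnets, :aws_options

  # First-tier configurations (name, subnets) are explicit fields;
  # every other parameter the SDK accepts rides along in aws_options.
  def initialize(name:, subnets:, aws_options: {})
    @name        = name
    @subnets     = subnets
    @aws_options = aws_options
  end

  def persist
    Reliably.call do # retries on throttling errors
      client.create_cache_cluster(
        { cache_cluster_id:        name,
          cache_subnet_group_name: subnet_group_name }.merge(aws_options)
      )
    end
    # Don't return until AWS agrees the cluster really exists.
    client.wait_until(:cache_cluster_available, cache_cluster_id: name)
    self
  end

  private

  def client
    @client ||= Aws::ElastiCache::Client.new
  end

  def subnet_group_name
    # Simplification: real code would create a cache subnet group from
    # the given subnets; here we assume one already exists per cluster.
    "subnet-group-#{name}"
  end
end
```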
To support our abstractions, we added numerous tests around each component to ensure ongoing stability. Each abstracted service would have its own unit tests and integration tests.
We also took the time to make sure we could test from a completely clean slate every time; meaning, we would create a VPC, perform our tests, then delete the VPC entirely before considering a test a success.
Test isolation is an easy concept to understand, but with AWS there are two gotchas to keep in mind. First, AWS accounts are limited to 5 VPCs per region by default. With parallel tests, we quickly hit our VPC limit; we had to request a bump to 10 VPCs, and so far that has been enough. Second, we designated a separate AWS region for each infrastructure developer. This prevented test failures from AWS resource naming collisions, which could occur when two developers ran their test suites in the same region. So far, it’s been working well.
Second Layer of Abstraction — Grouping Common Services
With our first layer of abstraction in place, we noticed patterns emerging that could further be abstracted away. One pattern was the common case of setting up the following resources together in order to have elastic EC2 instances:
- Launch Configuration — this holds the configuration required to boot an EC2 instance.
- Auto Scaling Group — this allows the number of EC2 instances to be automatically scaled up or down based on load. In addition, it automatically replaces unhealthy instances with new ones.
- Load Balancer — this (ELB instance) directs web traffic to the EC2 instances.
In fact, all of our EC2 instances in AWS have a Launch Configuration and an Auto Scaling Group associated with them. It did not make sense to keep repeating the logic to create these resources individually, so we created an InstanceGroup class to abstract the complexity away.
Now, we primarily interface with our InstanceGroup class in our code base rather than the underlying resources.
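A stripped-down sketch of the idea follows. The three collaborators stand in for the first-layer abstractions (each responds to persist and remove); this is illustrative, not our actual class.

```ruby
# One object that owns the launch configuration + load balancer +
# auto scaling group trio and persists/removes them in dependency order.
class InstanceGroup
  def initialize(launch_configuration:, auto_scaling_group:, load_balancer:)
    @launch_configuration = launch_configuration
    @auto_scaling_group   = auto_scaling_group
    @load_balancer        = load_balancer
  end

  # The launch configuration must exist before the auto scaling group
  # that references it, and the load balancer must exist before the
  # auto scaling group can attach to it.
  def persist
    [@launch_configuration, @load_balancer, @auto_scaling_group].each(&:persist)
    self
  end

  # Teardown happens in reverse dependency order.
  def remove
    [@auto_scaling_group, @load_balancer, @launch_configuration].each(&:remove)
    self
  end
end
```

The payoff is that creation-order and teardown-order knowledge lives in exactly one place instead of being repeated at every call site.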
Third Layer of Abstraction — ACL GRC’s Infrastructure
The third layer of abstraction is where our business’s products, goals, policies, and objectives are represented. This means creating classes and objects that represent our infrastructure, which uses numerous AWS services. For example, in ACL we refer to the infrastructure associated with an application as its “cluster” and thus we have a cluster class per product. These clusters are placed within “ACL optimized” VPCs, so we have a class representing these VPCs as well. And we configure each class and object for our particular requirements (e.g. default EC2 instance sizes to use, ELB ciphers to enable, etc.). Policies and settings are embedded directly in the code! They won’t be lost!
I must admit: when we first decided to use AWS’s SDK, I did not expect it to have such a steep learning curve. My experience with other Ruby libraries had been quite smooth, so I simply expected the same. Unfortunately, we were caught off guard and had to create various abstractions in order to use AWS’s SDK the way we wanted to.
Now, it can be argued that I’m simply too demanding of AWS’s SDK. After all, with such a highly distributed system, I should not expect synchronous behaviour from an inherently asynchronous system that (probably) relies on eventual consistency in its persistence layer. And it’s not as though AWS is a small service provider; they are one of the biggest. Having said that, I do believe AWS’s Ruby SDK can benefit from the Ruby community’s culture and approach to software development. The Ruby community is all about programmer happiness and legibility. Ruby itself has the “unless” keyword, rather than relying on the negation of “if”, just to increase legibility! This community cares a lot about ease of use. If AWS’s Ruby SDK had the same aspiration, I believe few people would be using any alternative 3rd-party AWS library or DSL; instead, we would all be using the SDK directly. Hopefully one day, this will be a reality.
On the plus side, having gone over this hump, we had learnt lots — perhaps too much — and were ready to start using our new abstractions on AWS!
In my next blog post, I will explain our approach to deploying applications onto our infrastructure and how our abstractions allowed us to perform Blue-Green Deployments. In particular, I will explain the concept of an Immutable Infrastructure and how it will significantly improve your ability to maintain and reason about your infrastructure.
You can read part one of this series here.