Cliff Notes: AWS Best Practices (Pt. 1 of 6)
My audio-summary of one of AWS’s most important whitepapers
What follows are my notes for my audio-whitepaper, so it sounds like I’m talking to you. Here’s the audio if you’d prefer that.
Anyways, I wrote a lightweight, cliff-notes style summary of a super boring yet very important whitepaper, the legendary “AWS Cloud Best Practices” whitepaper that is kind of canon in cloud literature, because it summarizes why cloud is important and why AWS is awesome. It’s 44 pages and a boring read (no offense, AWS), so I read it like 5 times, summarized it, peer reviewed my summary, and turned it into like 10 minutes of audio. Call it an audiowhitepaper. Maybe I should trademark that, but if you’re like me and don’t necessarily love reading 44 pages of whitepapers in your abundant free time, this should be helpful. Before I dive in, a quick preface: if you don’t know anything about cloud computing, this probably isn’t for you. Google “About AWS” to start, then come back to this audio. Otherwise, lock and load and enjoy this audiopaper? I might need to work on that.
OK, so there’s really 2 core sections, and these are preceded by a short intro and followed by a brief conclusion. What are these 2 sections, you ask? OK, so the first is called The Cloud Computing Difference, and that’s a short little pitch on why cloud computing is objectively awesome. And then the second Core section is called Design Principles, and that goes into some detail around how AWS works, and I’m just going to list off the chapters in that section; OK here they are:
Optimize for Cost
OK so let’s dive right in. Remember I’m paraphrasing these sentences here, so if you don’t like the way I worded something or suspect a leaky abstraction in my analogies, then check out the primary source. Read it for yourself. It’s really not that bad; it’s just 44 pages, and who has time for 44 pages of white papering? Not me. Well, I guess I do, but you know, I do it all for you guys. I do. What can I say, you’re welcome? Alrighty, here we go.
Oh! And one last thing: this paper is for people who know a little bit about AWS and cloud computing in general. If you don’t know anything about AWS, proceed with caution, and if you don’t know anything about cloud computing at all I would just stop now and check out the About AWS page (just google About AWS).
This is a lot of fluff, but the gist is this: migrating your apps into the cloud just as they are (lift and shift) is going to save you a crap ton of money and improve security, but you’ve gotta change the way you think about architecture — you’ve gotta start architecting stuff differently — for it to really work. That’s pretty much it, so let’s move on.
The Cloud Computing Difference
This section has 4 little paragraph-long chapters in it. They are titled “IT Assets Become Programmable Resources”, “Global, Available, and Unlimited Capacity”, “Higher Level Managed Services”, and “Security Built In”. That is, Programmability, Reliability, Manageability, and Security. I remember this with a mnemonic, “Mildly Surfing Pull Requests”.
First, Programmability, or “IT Assets Become Programmable Resources”
Before, in a non-cloud environment, you’d have to think ahead of time, pretty much guess at a theoretical peak, and provision all your infrastructure based on that. Needless to say, this has 2 big downsides: inevitably, you’re either going to (a) overshoot and waste money on servers that are sitting there idly (I like to think of a guy sitting with a briefcase next to a campfire with a stack of hundred-dollar bills, flicking them 1 by 1 into the flames), or (b), which could in theory happen within the same day or week, you UNDER-shoot, you UNDER-provision, and all of a sudden a Super Bowl ad gives your website the hug of death. In the cloud, you instead access only what you need, and better yet, your systems know how to figure out what you need and dynamically scale your resources up and down without you having to turn the dial.
And, in case it needed to be said, the actual spinning up and tearing down of these resources is all done in a matter of seconds, and this applies to database instances, storage volumes, and servers of all kinds. OK, if you’re a student of business you might ask, “So what?” Well, all of this freedom and flexibility and low cost changes the way businesses think about tech stuff like testing, reliability, and capacity planning, but most of all it changes the way you think about change. And that’s the part that Jeff Bezos really envisioned when he launched AWS way back in like 2006: he realized that not having to worry about the tech so much and being able to focus more on how to grow the idea or the enterprise, that would totally change the business landscape.
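If you want to see the campfire math, here’s a tiny Python sketch of it. Everything in it is my own assumption for illustration: the hourly demand numbers are made up, and the function names `fixed_fleet` and `elastic_fleet` are mine, not anything from the whitepaper or from AWS.

```python
# Made-up demand curve: servers needed in each hour (an assumption, not real data).
HOURLY_DEMAND = [2, 2, 3, 5, 9, 4, 2, 2]


def fixed_fleet(demand, provisioned):
    """Cost a fleet whose size is guessed once and never changes."""
    return {
        "paid_server_hours": provisioned * len(demand),
        "idle_server_hours": sum(max(provisioned - d, 0) for d in demand),
        "unmet_server_hours": sum(max(d - provisioned, 0) for d in demand),
    }


def elastic_fleet(demand):
    """Cost an idealized fleet that resizes to match demand every hour."""
    return {
        "paid_server_hours": sum(demand),
        "idle_server_hours": 0,
        "unmet_server_hours": 0,
    }


if __name__ == "__main__":
    print("overshoot :", fixed_fleet(HOURLY_DEMAND, provisioned=10))
    print("undershoot:", fixed_fleet(HOURLY_DEMAND, provisioned=4))
    print("elastic   :", elastic_fleet(HOURLY_DEMAND))
```

With these made-up numbers, the overshoot fleet pays for 80 server-hours and leaves 51 of them idle, the undershoot fleet drops 6 server-hours of demand on the floor, and the idealized elastic fleet pays for exactly the 29 it uses. Real auto scaling is never this perfect, but that’s the shape of the argument.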
Second, Reliability, or “Global, Available, and Unlimited Capacity”
This is pretty straightforward. AWS is, like, all over the place, and it turns out that where your infrastructure is located actually kinda matters. App experiences can be delivered to end users with a lot less latency if the underlying servers are, say, down the street instead of in Taiwan. Then there’s the legal stuff: say, a huge American insurance company that doesn’t want to store sensitive customer data overseas where US privacy laws don’t apply. And lastly, unsurprisingly, there are cost savings, probably in a dozen different nuanced ways that honestly I don’t really care about. All of these things align super well with cloud computing, because you can just deploy your app to whatever AWS Region meets whatever requirements you’re working with. And then there’s also the concept of a global app, where users are all over the world, where you can use AWS’s well-refined high-speed content delivery network, CloudFront, to make their experiences faster without you having to manually boot cloud infrastructure within each Region. On top of all this, AWS’s console and toolsets make it a hell of a lot easier to manage and operate production apps and databases across multiple data centers (which before was a nightmare), and this ultimately makes those apps and databases more elegantly orchestrated, which means more availability and more fault tolerance. And then, somewhat obviously, I should add that AWS has unlimited capacity, which, if you think about it, is absurd, outlandish, immodest. I mean, of course you would, AWS. Of course you would. So that’s the insane reliability piece.
Third, manageability, or “Higher Level Managed Services”
This title is a nice way of saying, now you don’t have to hire random people off the street to manage an extremely complicated, even complex, infrastructure your business relies upon; you can just plug into us and we’ll do it for you, because we’re not random people, we’re AWS. And what exactly do they have to offer? Pretty much everything: computing, storage, networking, database, analytics, application services, deployment, security and permissions, management services to keep track of all your other services, services on services on services. I think the current count is 90. And all of them are dirt cheap and most likely better than whatever you’re working with in your server closet. And think about that for a second: you don’t manage this stuff, you don’t have to build it and configure it yourself, you don’t have to hire a bunch of overpaid Oracle consultants to set it up for you, you don’t have all of that commitment and budgeting and capital expense on the line. I could go on, but I think you get the point. Just log into the console, tell it what you want, and you’ve got it. Now that is super manageable.
Fourth, security, or “Security Built In”
Funny how security is, like, always last. But think on the bright side, at least it made it on the marketing/white-paper list! The reason it’s always last is because security is always a buzzkill, which brings me to my next point: AWS builds in security for you. Before, you had this security auditing process, every year or so, that was tedious and expensive and, after it was over, left you with an inevitable list of gotchas, or even reasonable suggestions, that made sense but were a pain in the ass to fix and, let’s be honest, never actually got done. Now, with the AWS ecosystem, you have governance capabilities that allow you to programmatically monitor, and even control with permissions, any changes to your IT resources. And then there’s AWS’s robust encryption service. All of these permissions policies saying who can do what, along with the encryption features, can be baked into the design of your infrastructure, so there are no “one-offs” that could become a vulnerability later. And if you want to get super fancy, you can use AWS to spin up a temporary environment to perform security testing as part of your CI/CD pipeline.
Alright, guys and gals. This is where things get a little more technical. This section is going to dig into each of the core architectural patterns at the core of cloud computing, specifically AWS, and one of the ways it does this is by focusing on common use cases and scenarios.
First up, scalability. At a high level this is a pretty straightforward concept. You want your systems to have economies of scale, just like a larger factory, where you’re getting better bang for your buck the bigger stuff gets — and in this case stuff is your IT infrastructure. But, just like a factory, it’s got to be built, gotta be architected, in a certain way (and we’ll get into what way this is exactly); otherwise scale just means more entropy.
OK so there are 2 ways of scaling: you can scale up vertically or fan out horizontally. Let’s take these one at a time.
Scaling Vertically. The way this works in a cloud environment is pretty simple. You’ve got an individual resource, a web server, let’s say, which is running up in the cloud and traffic’s expected to increase tomorrow because it’s Black Friday. If you want it to be better, faster, stronger, you just take it offline for a hot second, change the type of instance to one with more RAM, CPU, I/O, or networking capabilities, and spin it back up. It’s really that simple, but, on the other hand, well, lemme just read AWS’s sentence on this: “This way of scaling can eventually hit a limit and is not always a cost efficient or highly available approach. However, it is very easy to implement and can be sufficient for many use cases especially in the short term.” There you have it: vertical scaling.
Scaling Horizontally. This is basically: increase the number of individual resources and then focus on distributing the load across them. In this scenario, instead of upgrading a fleet of behemoth servers with a larger hard drive or a faster CPU, you’re instead adding more hard drives to a storage array or adding more servers to support an application. AWS says that this is a “great way to build Internet-scale applications that leverage the elasticity of cloud computing”. Let’s dig into some scenarios on where this might work really well.
Stateless Applications. When a client starts interacting with a server, it’ll form what’s called a session. “Session knowledge” is when they meet again and the server says, “Hey I remember you! Let’s pick up where we left off”, but in a stateless application that app server? It doesn’t remember anything about the client (it doesn’t need to); it just does its job and fetches stuff from a database if it needs to, maybe stores something in the database, too. But after the session is over, it’s wiped clean. No “Session knowledge”. As it turns out, this type of architecture works really well with horizontal scaling, because any incoming requests can go to any of the available resources, which are lightweight and can start doing work the moment they come online, and also can be terminated whenever their tasks terminate.
So the million dollar question at this point is, like, this sounds really cool but how do you distribute all this load? AWS highlights two ways this is done: the push model and the pull model.
Push model: One of the most popular ways to distribute all this traffic is through the use of something called “load balancing” (AWS’s solution is called the Elastic Load Balancing service, or just ELB), which, in ELB’s case, is really good at routing incoming requests across a fleet of EC2 instances.
As a side note, instead of using load balancing, some people just implement a DNS round robin (with a DNS/traffic management service like Amazon Route 53). In this case, each DNS response serves up an IP address from a list of valid hosts in a round robin fashion. While easy to implement, this approach does not always work well with the elasticity of cloud computing, because even if you set low time to live (TTL) values for your DNS records, caching DNS resolvers are outside of your control and might not respect those settings, so if you make changes, clients might keep hitting the old IPs for a while. OK, that’s the push model.
Pull model: This is where you hear words like “asynchronous” and “event-driven”. In this scenario, instead of balancing the load directly onto the servers, whether or not they’re ready for it, traffic is instead just dumped onto a queue in the form of tasks to do or data to process, and then the resources pluck out jobs from that queue whenever they’re ready. The AWS service for this would be SQS (Simple Queue Service), or Kinesis if you’re working specifically with streaming data.
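To make the push model a bit more concrete, here’s a minimal Python sketch of round-robin routing, the same rotation idea behind both a simple load balancer and a DNS round robin. It’s a toy under my own assumptions: the `RoundRobinBalancer` class and the host names are hypothetical, and this is not how ELB or Route 53 is actually implemented.

```python
import itertools


class RoundRobinBalancer:
    """Toy push-model balancer: hands each request to the next host in line."""

    def __init__(self, hosts):
        self._hosts = list(hosts)
        if not self._hosts:
            raise ValueError("at least one host is required")
        # itertools.cycle loops over the host list forever.
        self._rotation = itertools.cycle(self._hosts)

    def route(self, request):
        """Return (host, request): the balancer decides, the host just receives."""
        return next(self._rotation), request


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])  # hypothetical hosts
    for request_id in range(5):
        host, _ = balancer.route(request_id)
        print(f"request {request_id} -> {host}")
```

Notice that the balancer pushes work at a host whether or not that host is ready for it, which is exactly the trade-off the pull model avoids.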
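Here’s a small sketch of the pull model using nothing but Python’s standard library. The in-memory `queue.Queue` is my stand-in for something like SQS, and the “work” (squaring a number) plus the `run_workers` name are made up for illustration.

```python
import queue
import threading


def run_workers(tasks, worker_count=3):
    """Put every task on a shared queue and let workers pull when they are free."""
    jobs = queue.Queue()
    for task in tasks:
        jobs.put(task)

    results = []
    results_lock = threading.Lock()

    def worker(name):
        while True:
            try:
                task = jobs.get_nowait()  # pull the next job, if any is left
            except queue.Empty:
                return  # queue drained, so this worker can shut down
            outcome = task * task  # placeholder for real processing
            with results_lock:
                results.append((name, task, outcome))

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    for name, task, outcome in run_workers(range(6)):
        print(f"{name} processed {task} -> {outcome}")
```

Which worker grabs which task will vary from run to run, and that’s the point: nobody assigns work up front, so you can add or remove workers without touching whoever produces the tasks.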
So to recap: we just covered stateless applications, which are one of the biggest use cases for horizontal scaling — which is all about distributing workload across resources — and we discussed the two most popular models for distributing workload: pulling and pushing.
Stateless Components: So that was stateless application architecture. Now let’s talk for a minute about architecting stateless components for horizontal scaling. Again, like before, this is about removing as much state from the underlying resources as possible so that the client server itself is less tightly coupled with it.
User sessions in a website are a pretty good example. If you’ve got a shopping website you might be very curious about what’s in the customer’s shopping cart in real-time, so you want to store HTTP cookies. While it would be nice to store them client side, in the customer’s browser, this is insecure and causes latency. Instead of then deciding, OK it doesn’t work on the browser let’s just store it on the server’s local file system!, to make this component stateless, you might instead route this data into a database like DynamoDB. In the same way, you can forward hefty items like uploaded video or some interim result of a batch process directly into Amazon S3 or Elastic File System (EFS) to avoid storing them on the server. Another scenario is a workflow of some kind, where tracking each step is important; here, you might store all steps taken via Simple Workflow Service (SWF). Google any of those for more information.
Stateful Components: By default, some layers of your architecture will not become stateless components. Databases, for example. Statefulness is kinda the point. Or, for example, a lot of legacy apps were designed to run on a single server by relying on its local compute resources, so there’s that reality to deal with. Also, when you’re dealing with devices like an online gaming console you may have to maintain a connection to a specific server for a prolonged period of time. In fact, if you’re hosting a real-time multiplayer game you must offer multiple players the same exact view of the game world with near-zero latency. This is much simpler to achieve in a non-distributed implementation where participants are connected to the same server, the same specific resource.
In spite of all this, horizontal scalability is still on the table with components like these thanks to something called “session affinity”. This has some serious limitations but it’s cool so I’ll include it here. So for the multi-player gaming example, instead of having all the gamers connecting to the same exact server, you’ll instead send them all through the same load balancer (and that load balancer will create that single view of the gaming world), but they’ll actually be connecting to 1 of the, say, 300 resources that are running beneath that balancer, and a specific user will still bind to a specific server for the life of the session (this is called a “sticky session”). You can even put a little API in the console to discover available nodes, to quickly find and reconnect to a node in case the one supporting a user’s “sticky session” terminates for some reason.
Distributed Processing: One last thing on scalability before we move on, specifically about this horizontal distribution concept. So far we’ve been talking about super manageable jobs, like kilobytes skipping over the wire containing this or that message, but what about if you’ve got a massive data set you need to process in a single job? The load balancing might work, but whichever poor server has the bad luck of getting this job is going to be busy with it for days. For this kind of situation, AWS recommends a distributed data engine like Apache Hadoop for offline batch jobs, which chops the job and its data into lots of small fragments that can run on any available machine. AWS actually has something called Elastic MapReduce (Amazon EMR) where you can run Hadoop workloads on top of a fleet of EC2 instances. In case the processing needs to occur in real-time, Amazon’s Kinesis service uses a lot of the same sharding-style technology but with extra streaming magic.
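Here’s roughly what “move the session off the server” looks like, as a hedged Python sketch. The `SessionStore` class is a plain in-memory dictionary standing in for an external database like DynamoDB, and `WebServer`, `add_to_cart`, and the cart items are all hypothetical names of mine.

```python
import uuid


class SessionStore:
    """In-memory stand-in for an external session database (an assumption)."""

    def __init__(self):
        self._items = {}

    def get(self, session_id):
        return dict(self._items.get(session_id, {}))

    def put(self, session_id, data):
        self._items[session_id] = dict(data)


class WebServer:
    """A stateless server: it keeps no session data on itself."""

    def __init__(self, name, store):
        self.name = name
        self._store = store

    def add_to_cart(self, session_id, item):
        session = self._store.get(session_id)
        cart = session.get("cart", []) + [item]  # build a new list, no shared state
        session["cart"] = cart
        self._store.put(session_id, session)
        return cart


if __name__ == "__main__":
    store = SessionStore()
    server_a = WebServer("web-a", store)
    server_b = WebServer("web-b", store)
    session_id = str(uuid.uuid4())
    server_a.add_to_cart(session_id, "brake pads")
    # A different server handles the next request and still sees the full cart.
    print(server_b.add_to_cart(session_id, "wiper blades"))
```

Because the cart lives in the shared store, it doesn’t matter which server picks up the next request, and either server can be terminated without losing anyone’s cart.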
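A sticky session is easier to picture in code, so here’s a minimal Python sketch. It’s my own toy: the `StickyBalancer` class and the server names are hypothetical, and a real load balancer typically tracks affinity with cookies rather than a dictionary like this.

```python
class StickyBalancer:
    """Toy session affinity: a session keeps hitting the same server."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._assignments = {}  # session_id -> server name
        self._next_index = 0

    def route(self, session_id):
        if not self._servers:
            raise RuntimeError("no servers available")
        server = self._assignments.get(session_id)
        if server not in self._servers:
            # First visit, or the old server terminated: bind to a live one.
            server = self._servers[self._next_index % len(self._servers)]
            self._next_index += 1
            self._assignments[session_id] = server
        return server

    def remove_server(self, server):
        """Simulate a node terminating."""
        self._servers.remove(server)


if __name__ == "__main__":
    balancer = StickyBalancer(["game-1", "game-2"])
    print(balancer.route("alice"))  # binds alice to a server
    print(balancer.route("alice"))  # same server again
    balancer.remove_server(balancer.route("alice"))
    print(balancer.route("alice"))  # rebinds to a surviving server
```

The rebinding step is where the “serious limitations” show up: the connection survives, but whatever state lived only on the terminated server is gone unless you stored it somewhere else.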
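For a feel of how splitting one big job into fragments works, here’s a tiny map-and-reduce style word count in Python. To be clear about the assumptions: this runs on threads inside one process just so it’s runnable anywhere, whereas engines like Hadoop spread the chunks across many machines, and the function names are mine.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, chunk_count):
    """Divide the data set into roughly equal fragments of work."""
    size = max(1, -(-len(lines) // chunk_count))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_chunk(lines):
    """Map step: count words in one fragment, independent of all the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def word_count(lines, workers=4):
    chunks = split_into_chunks(lines, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_chunk, chunks))
    total = Counter()
    for partial in partial_counts:  # reduce step: merge the partial results
        total.update(partial)
    return total


if __name__ == "__main__":
    sample = ["the cloud is elastic", "the cloud is global", "elastic is good"]
    print(word_count(sample, workers=2).most_common(3))
```

The key property is that `map_chunk` never needs to know about any other chunk, which is what lets a distributed engine hand the fragments to whatever compute happens to be available.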
So a quick summary on scalability. Vertical scaling in the cloud is not ideal but a good start if you’re, say, lifting and shifting. Horizontal scaling is way cooler, but you have to focus your designs around statelessness to really see its full glory, and you can achieve this in your apps and components, even your batch processing, but all of these with some caveats.
Disposable Resources Instead of Fixed Servers
This one is a big mindshift for anyone moving from hardware-based infrastructure architecture, especially if you’re an ops guy or gal. Individual servers used to be manually-crafted, carefully-guarded assets: think of an artisan-like sys admin driving to the data center to SSH into each of his eight servers to meticulously apply the latest patch or config change or hardcode some IPs, the list goes on. Now they’re disposable resources. In fact, they’re meant to be disposed. Like the little paper cups on race day. One inherent issue with physical infrastructure, even long-standing cloud infrastructure, is “configuration drift”, where the natural force of entropy and human error will inevitably (every time) result in slight, undocumented discrepancies between each server. Now, with disposable infrastructure we can build our resources to be immutable, meaning you can’t alter them, and have to instead alter the code that created them, then cycle in a new generation of servers created from the updated code. So for me, whereas I’m calling conspiracy on the planned obsolescence of my child’s toys, I’m pretty happy with the stateless volatility of our application layer because I know that our app is running on servers with the latest config, always in a consistent (and tested) state, with next to zero chance of human error, and with the option of a quick and easy rollback…or roll-forward, if you will.
Instantiating Compute Resources: So before I step off the soapbox of disposable resources and immutable infrastructure, let me just quickly give you a peek into what this might look like, like, the ways people are implementing this, with a little more detail. The AWS whitepaper has 2 ways: “bootstrapping” and “golden images”, and then what they call a “hybrid” approach which uses a little bit of both.
Bootstrapping: When you go to boot an AWS resource like an EC2 or even a Relational Database (RDS) instance, you get the standard, default configuration — a clean slate, if you will — over which you can run your own universal, versioned bootstrapping script. In this scenario, this little, or big, shell script is where you put your focus. If you want to get fancy, you can use Amazon’s OpsWorks lifecycle events to apply Chef recipes or Amazon’s CloudFormation service to apply scripts via Lambda function after healthchecks pass.
Golden Images: If you want to skip the “default config + my scripts” process, you can just snapshot a particular state of your resource and then launch all future instances from this “golden image” instead of the AWS default. This only works for some core AWS services like EC2 instances, RDS instances, and EBS volumes, but it can save you a ton of time and remove dependencies on some of the extra technologies I mentioned in the bootstrapping section. This can actually be important if you’re in mission critical auto-scaling environments where you need to be able to quickly and reliably launch new resources when traffic spikes. Remember, though, you’re still going to need a versioning process to remember how you got from the AWS default to your golden image. AWS actually recommends you go and script this out, just so it’s easier and less prone to human error.
Oh and two little notes: first, AWS has a vibrant community and a robust marketplace specifically for AMIs (Amazon Machine Images, the images that EC2 instances launch from). Vendors are selling them and open-source people are sharing them, so this might help if you want a head start writing your own AMI. If you think about it, this is super cool: in this new world where servers are commoditized and extra custom code on top of them is where the value lies, there’s a community and affordable marketplace for that extra code. Anyways, I just think that’s cool. Just google AWS ami marketplace or aws ami community catalog to check this out.
And the second little note: for those of you who are on an on-prem virtualized environment, running not on the public cloud but at your own datacenter or whatever, AWS actually has a tool called VM Import/Export which can convert pretty much any Linux virtual environment, whether that be VMware or Hyper-V or plain old Xen, into AMI format.
Most people who are using “golden images” are using them for EC2, but don’t forget that you can also use them for database instances or storage volumes. Like, pretend you’re spinning up a new test environment: instead of using the default, clean-slate RDS image or storage volume and then running some humongous SQL script on it, you can do all that just once, snapshot that instance or volume as your “golden image”, and for all future tests you just spin up that image and have all that data prepopulated, ready to use.
And while we’re on the subject of golden image, the golden-est of golden images is going to be containers. Whether it’s Kubernetes or Docker, the new container workflow technologies make it super simple to build and deploy apps within containers. If you’re not familiar with them, containers are basically.
In the case of Docker, long story short is that you can just package a piece of software in a Docker image, and this image will contain everything the software needs to run: code, runtime, system tools, system libraries, and so on. When you run these Docker images they become Docker containers, and they can live on AWS services like AWS Elastic Beanstalk and AWS’s container service, Elastic Container Service (or ECS), which basically allow you to deploy your apps across any number of these Docker containers across a cluster of Amazon EC2 instances. I will NOT digress on Docker though I’d love to, but instead I’ll give you a pretty diagram or two:
Hybrid: Needless to say you can mix and match both the bootstrapping approach, where you’re booting the default and running versioned scripts, and the “golden image” approach, where you’re maintaining a single snapshot of the image you want. Let me kinda elaborate with some examples. If the server environment (or the database instance or storage volume) either (a) doesn’t change often, or (b) introduces lots of external dependencies, say from third parties, like if you’re apt-getting out to some repo each time to get some open-source library, these are good candidates for having “golden images”, because golden images are kind of annoying to update (thus the “doesn’t change often” requirement) and you don’t want to be dependent on a third party library which may not be 100% reliable or secure. Store it in a golden image and life is easier in this scenario.
Now for bootstrapping, if you’ve got resources that change super frequently that’s an obvious use case; high-velocity, greenfield development is a great example. Here, you could use a base AMI as a stable “golden image” then just modify the boot script to pull in the latest version of the app, so you’re using both. See how a hybrid approach might be nice here? Another example would be if you’re deploying apps dynamically depending on the environment: let’s say UAT needs to point to the UAT database and the Test environment needs to point to Test. You probably wouldn’t want to hard-code that database hostname config into your AMI, because it differs between environments (test versus production, for instance). Instead, it makes way more sense to bootstrap that part.
As far as AWS services go, AWS Elastic Beanstalk follows this kind of flexible, hybrid model. It provides preconfigured run time environments (each initiated from its own previously snapshot’d AMI) but allows you to run bootstrap actions and configure environment variables to parameterize some of the environment differences. OK, so that’s the hybrid model.
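To show the hybrid idea in miniature, here’s a Python sketch that pairs one fixed “golden image” ID with a boot script generated per environment. All the specifics are placeholders I made up: the image ID, the database hostnames, the `/opt/app/start.sh` path, and the `launch_spec` function are hypothetical and not part of any AWS API.

```python
# Hypothetical golden image: the slow-changing base that every environment shares.
GOLDEN_IMAGE_ID = "ami-EXAMPLE"

# Hypothetical per-environment settings that should NOT be baked into the image.
ENVIRONMENTS = {
    "test": {"db_host": "test-db.example.internal"},
    "uat": {"db_host": "uat-db.example.internal"},
}


def launch_spec(environment, app_version):
    """Combine the shared image with a bootstrap script for one environment."""
    settings = ENVIRONMENTS[environment]
    bootstrap_script = "\n".join([
        "#!/bin/bash",
        f"export DB_HOST={settings['db_host']}",
        f"export APP_VERSION={app_version}",
        "/opt/app/start.sh",  # placeholder start command
    ])
    return {"image_id": GOLDEN_IMAGE_ID, "user_data": bootstrap_script}


if __name__ == "__main__":
    spec = launch_spec("uat", "1.4.2")
    print(spec["image_id"])
    print(spec["user_data"])
```

The image stays identical across Test and UAT, and only the few lines of bootstrap script change, which is the split the hybrid approach is going for.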
As a quick summary of the “Instantiating Compute Resources” chapter, we covered how in a cloud environment you’re going to be spending a lot of time spinning up new servers and tearing down old ones, and spinning up new ones again, and so on. Needless to say you’ve gotta automate this (if not to prevent human error, then because it would take forever to do this manually and slow down the time it takes to fix stuff or deploy new stuff), and there are 2 ways to do this: bootstrapping, which is all about the script, and “golden images”, which is all about the snapshot, plus a mix of the two, the “hybrid approach”. And, of course, AWS has some awesome tools to make this easier. Surprise surprise.
Infrastructure As Code: At the end of this Resources As Disposable Paper Cups section, AWS added a little paragraph that makes it clear (perhaps somewhat redundantly) that, and I quote, “The application of the principles we have discussed does not have to be limited to the individual resource level”, going on to say that you can apply this to your whole infrastructure. If this is a new, profound point apart from what we’ve already covered, it’s evading me, but I thought I’d include it anyway. Oh, and to close, the white-paper has one of those little gray digression boxes about AWS CloudFormation, which, according to the box, gives devs and sysadmins an easy way to create, manage, and tear down a collection of related AWS resources using templates, and then provision and update those templates (and the resources that are based on them) in an orderly and predictable fashion. So like you can drop a CloudFormation template in your app’s repository, and it’ll describe the kinds of AWS resources it needs to run, as in, the whole stack from DBs to web servers to microservices, and all the dependencies and run time parameters it needs, and you just launch it and there you have it: the full stack deployed to the cloud. So, again, nothing new here: this is infrastructure declared in code, so it can be scaled up and down and mirrored in lower environments for testing.
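If you’ve never seen one, here’s a bare-bones sketch of what a CloudFormation template might contain, built as a Python dictionary and printed as JSON. It’s deliberately tiny (one EC2 instance, nothing like a full stack), the logical name `WebServer` and the `t3.micro` default are my own picks, and a real template for your app would need a lot more than this.

```python
import json


def build_template():
    """Return a minimal CloudFormation-style template as a plain dictionary."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative single-instance stack (sketch only)",
        "Parameters": {
            # Which image to launch is passed in, so each environment can differ.
            "ImageId": {"Type": "AWS::EC2::Image::Id"},
            "InstanceType": {"Type": "String", "Default": "t3.micro"},
        },
        "Resources": {
            "WebServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": {"Ref": "ImageId"},
                    "InstanceType": {"Ref": "InstanceType"},
                },
            }
        },
    }


if __name__ == "__main__":
    # This JSON is the kind of file you would commit alongside your app code.
    print(json.dumps(build_template(), indent=2))
```

Since the template is just data in your repository, it gets versioned, reviewed, and rolled back like any other code, which is really the whole pitch.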
This is a one-page section, and it literally just bullet-points seven (7) key automation services that AWS provides. And, needless to say, automation in the cloud is all about not having to manually react to events, right? OK, so here they are:
- AWS Elastic Beanstalk: This is touted as the fastest and the simplest way to get an app up and running on AWS. All you’ve gotta do is upload your application code and the service, quote unquote, “automatically handles all the details”, like resource provisioning, load balancing, auto scaling, and monitoring.
- Auto Scaling: As the name implies, this tool is all about automatically scaling the infrastructure that powers your applications up and down, so you don’t have to worry about maintaining application availability. Just define all the parameters and thresholds (when this metric gets to that level, scale to this degree). You can auto-scale in a way that spreads instances across Availability Zones to focus on availability, or you can auto-scale on aggregate CPU across all your EC2 instances to focus on capacity.
- Amazon CloudWatch Alarms: So there’s a service called CloudWatch that basically sits there polling your services and keeping all kinds of metrics on what they’re doing and how they’re doing, are they feeling a little cranky, etc., and on top of CloudWatch you can build little alarms that ring when x metric goes beyond y threshold within z time frame…but instead of dinging an alarm bell it sends an Amazon Simple Notification Service (Amazon SNS) message to a topic. That message can then kick off the execution of a subscribed AWS Lambda function that’s listening to that topic, enqueue a notification message to an Amazon SQS queue, or do an HTTP POST to an endpoint like a ChatOps client so people know what’s going on.
- Amazon CloudWatch Events: The CloudWatch service delivers a near real-time stream of system events that describe stuff that’s going on with your AWS resources, and routes little messages associated with this event to AWS Lambda functions, Amazon Kinesis streams, Amazon SNS topics, etc., or any combination of those.
- AWS OpsWorks Lifecycle events: OK so remember how I mentioned back in the bootstrapping section about using an AWS tool called AWS OpsWorks to trigger some scripts to be run — or Chef recipes to be applied — to fresh instances when they come online? Well AWS obviously includes this in this Automation section, so let me tack on 1 more example for you: in the case of an extra MySQL instance being created on the data layer, the OpsWorks configure event could trigger a Chef recipe that updates the Application server layer configuration to point to the new database instance.
- AWS Lambda Scheduled events: These events allow you to create a Lambda function and direct AWS Lambda to execute it on a regular schedule. For example, this could shut down all EC2 instances with a certain tag type when, say, the office closes down over the weekend, and then start them up again on Monday morning.
- Amazon EC2 Auto recovery: This one has some caveats so don’t get too excited about a one-click solution, but AWS has an EC2 auto-recovery feature, where you can create a CloudWatch alarm that keeps an eye on any given EC2 instance and automatically “recovers” it if it becomes impaired. I use air quotes because it’s actually just cycled out gracefully with a fresh version: same AMI, same instance ID, private IP addresses, Elastic IP addresses, same instance metadata, but without any in-memory data, which will be lost. So if this seems like it might solve one of your problems, check out the docs because there’s quite a bit to this one.
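The alarm idea from the list above (x metric beyond y threshold within z time frame, then notify whoever is subscribed) fits in a few lines of Python. This is a sketch under my own assumptions: `alarm_breached`, `Topic`, and `evaluate_and_notify` are hypothetical names, and the real CloudWatch and SNS services have many more options than this.

```python
def alarm_breached(datapoints, threshold, periods):
    """True when the most recent `periods` datapoints are all above `threshold`."""
    if periods < 1:
        raise ValueError("periods must be at least 1")
    if len(datapoints) < periods:
        return False  # not enough data yet to decide
    return all(value > threshold for value in datapoints[-periods:])


class Topic:
    """Toy publish/subscribe topic: every subscriber gets every message."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, message):
        for handler in self._subscribers:
            handler(message)


def evaluate_and_notify(datapoints, threshold, periods, topic):
    """Publish a message to the topic only if the alarm condition holds."""
    if alarm_breached(datapoints, threshold, periods):
        topic.publish(f"metric above {threshold} for {periods} periods")
        return True
    return False


if __name__ == "__main__":
    cpu_topic = Topic()
    cpu_topic.subscribe(lambda message: print("chat notification:", message))
    cpu_topic.subscribe(lambda message: print("scale-out hook:", message))
    evaluate_and_notify([40, 85, 90, 95], threshold=80, periods=3, topic=cpu_topic)
```

The nice part of routing alarms through a topic is that the alarm doesn’t care who is listening, so you can bolt on another reaction later without touching the alarm itself.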
As far as design patterns go, this is one of the most important ones. I like to think of a bike with shocks. The shocks are an interface, between that front tire — which is always smashing on rocks — and rest of the bike, so you could say the tire is loosely-coupled with the bike, they’re part of the same entity and they’re both super important and mission critical, but there’s a shock in between them so impact suffered by the tire doesn’t directly transfer to the bike frame (and for that matter the rider on the bike). The two parts are still coupled, but they’re loosely coupled. That’s it at a high level. But let’s dig into some specific examples of what this means in the world of cloud infrastructure.
Well-Defined Interfaces: Let’s talk a little more about the interface thing — the shock itself — but let’s drop the mountain bike analogy for a minute. Let’s now think about computer systems. Let’s say you’re a company that sells auto-parts online. You’ve got a Sales System, a Distribution System, and a Customer Accounts System. These systems all talk to each other to go from customer registration to online purchase to order fulfillment. As these systems grow and change and scale in their own somewhat unique ways, they need well-defined interfaces for communicating with each other — they need, as AWS words it, “specific, technology-agnostic interfaces”. To use a super common example, this company might adopt RESTful APIs as their interface, and they say, “Look, design your system however you think is best but you’ve gotta do 3 things: first, it’s gotta be backwards compatible, meaning none of your new stuff will break the old stuff; second, its functionality’s gotta be exposed in the form of REST APIs so the other systems know how to talk to it; and third, not only does it have to be exposed in a certain way, it’s gotta talk to — and listen to — 1 and only 1 single, unique entry point, a ‘gateway’.” And so what you have is all your services talking to each other in the same way, through the same well-defined Gateway interface. Amazon has a pretty popular service for this, aptly named “Amazon API Gateway”.
Tell AWS that there’s a grammatical error, “Difference”, on page 15.
Service Discovery: This concept is oriented around the idea that we’ve now got EC2 instances changing every day. Remember, in this immutable environment we’re taking them down and spinning them back up again with fresh updates, and even if you’re in a containerized microservices environment those underlying EC2 instances are changing all the time. So gone are the days where you’d just grab the IP of whatever compute resource or resources you needed and hardcode it into the service that needed to use its CPU. If you want to start loosely-coupling stuff like this — so that the front-end service and the back-end service are loosely-coupled — they shouldn’t have to know about each other’s details. Instead, they should have some way of, at a moment’s notice, discovering the available services that are suitable for connection — they can do this on their own, or they can hit an API that does this for them. But the point is that a service shouldn’t need to know the network topology of all the services it relies upon. And if you’re going to make any progress on loosely-coupled systems in an elastic cloud environment you’re going to need some form of service discovery.
AWS mentions their Elastic Load Balancing (ELB) service as a simple way to achieve service discovery. Because each load balancer is abstracting away the underlying resources and instead offering up 1 single hostname, you now have the ability to consume a service through a stable endpoint. You can also combine this with DNS and private Amazon Route 53 zones, so that even the particular load balancer’s endpoint can be abstracted and modified at any point in time.
The whitepaper also mentions some other options for registering new services and discovering existing ones. Of course you can write custom solutions using a combination of tags, a highly available database and custom scripts that call the AWS APIs, but you can also use tools like Netflix’s open-sourced Eureka, which last time I checked had over 5,000 stars and 1,000 forks on GitHub, Airbnb’s open-sourced Synapse — which is a little less popular — or HashiCorp’s open-sourced Consul, which has 11,000 stars and over 2,000 forks. One big caveat to all of this: because this stuff is so critical to any cloud infrastructure, take this seriously; build this correctly, so it’s highly-reliable and super available. Because if this goes down it’s like the lighthouse going out or something, or maybe like someone cutting the power cord to the air traffic control tower.
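If the register-and-discover idea feels abstract, here's a toy, in-memory sketch of it. This is not how Eureka, Synapse, or Consul actually work under the hood; the class name, the 30-second TTL, and the addresses are all invented for illustration. The only point is the shape of it: instances announce themselves, keep sending heartbeats, and callers ask the registry for whoever is still alive instead of hardcoding IPs.

```python
import time


class ServiceRegistry:
    """Toy in-memory registry: instances register, heartbeat, and get discovered."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tried without waiting
        self._last_seen = {}  # (service name, address) -> time of last heartbeat

    def register(self, service, address):
        """Called by an instance at boot, and again as a periodic heartbeat."""
        self._last_seen[(service, address)] = self.clock()

    def deregister(self, service, address):
        """Called by an instance that is shutting down cleanly."""
        self._last_seen.pop((service, address), None)

    def discover(self, service):
        """Return addresses for a service whose last heartbeat is still fresh."""
        now = self.clock()
        return sorted(
            address
            for (name, address), seen in self._last_seen.items()
            if name == service and now - seen <= self.ttl_seconds
        )
```

Notice the registry itself is now the thing that can't go down, which is exactly the caveat above: a real one needs to be replicated and highly available, and this single-process sketch is neither.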
Asynchronous Integration: Remember way back when we talked about the horizontal scaling design pattern, and how the stateless application architecture works really well with horizontal scaling — that is, your apps are just running on a fleet of disposable resources that are essentially nameless — and then remember how the 2 ways to approach stateless app architecture are the push model (using a load balancer to essentially round-robin the requests across the underlying instances) and the pull model (allowing the instances to pull from a queue when they’re ready to do work)? OK, so this next section, “asynchronous integration” is pretty much the same thing as the pull model, so I’ll keep it brief. In fact, I’ll just use an example: a Sales System takes the customer’s order info, puts it into a message, and dumps that message onto a Warehouse Queue. Meanwhile, the Warehouse System is grabbing messages from that Warehouse Queue whenever it’s ready to process them, and that processing results in a product unshelved and packaged, some stuff stored in a database and another message that gets dumped onto the Fulfillment Queue. And then, you guessed it, there’s a Fulfillment System that’s grabbing messages from its Fulfillment Queue whenever it’s ready to ship the item to the customer’s house. So, for example, let’s say the Warehouse System starts moving slower than usual, or the Fulfillment System goes down for a day. In this scenario, the whole sequence gets backed up, yes, but it doesn’t come crashing down. The Sales System doesn’t sit there, locked up, confused, and eventually crash while it’s waiting to hand off some transmission to the Warehouse System. Instead, it just throws stuff on a queue when it’s done without any knowledge of what a Warehouse System is. This also allows for some flexibility with scaling.
For example, if a Super Bowl ad results in some spike in orders, your highly-scaled Sales System can ramp up to handle the traffic, but the Warehouse System can just keep working day and night from its queue at a steady clip and, yes, there’s a backup, but you’re still meeting your 14 day shipping commitment so no worries, no need to spend money scaling up that Warehouse System.
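Here's the Sales, Warehouse, and Fulfillment flow boiled down to a few lines, just to show the shape of it. I'm using plain in-process deques as stand-ins for real queues like SQS, and the function names, message fields, and `capacity` knob are my own inventions, so read it as a sketch of the pattern and nothing more.

```python
from collections import deque


def take_order(order, warehouse_queue):
    """Sales System: drop a message on the queue and move on, no waiting around."""
    warehouse_queue.append({"order_id": order["id"], "sku": order["sku"]})


def run_warehouse(warehouse_queue, fulfillment_queue, capacity):
    """Warehouse System: pull at its own pace, then hand off to the next queue."""
    processed = 0
    while warehouse_queue and processed < capacity:
        message = warehouse_queue.popleft()
        fulfillment_queue.append({"order_id": message["order_id"], "status": "packaged"})
        processed += 1
    return processed


def run_fulfillment(fulfillment_queue, capacity):
    """Fulfillment System: ship whatever is ready, up to its own capacity."""
    shipped = []
    while fulfillment_queue and len(shipped) < capacity:
        shipped.append(fulfillment_queue.popleft()["order_id"])
    return shipped


if __name__ == "__main__":
    warehouse_q, fulfillment_q = deque(), deque()
    for order_id in range(5):  # a small "Super Bowl" spike of orders
        take_order({"id": order_id, "sku": "brake-pad"}, warehouse_q)
    run_warehouse(warehouse_q, fulfillment_q, capacity=2)  # warehouse keeps its steady clip
    print(len(warehouse_q), "orders still waiting in the warehouse queue")
```

The thing to notice is that `take_order` never calls the warehouse. It only knows about a queue, so a slow or dead warehouse just means a longer queue.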
AWS gives us some more interesting examples of this type of architecture (and I’m just going to read these verbatim because they’re pretty short):
- A front end application inserts jobs in a queue system like Amazon SQS. A back-end system retrieves those jobs and processes them at its own pace.
- An API generates events and pushes them into Amazon Kinesis streams. A back-end application processes these events in batches to create aggregated time-series data stored in a database.
- Multiple heterogeneous systems — ones that don’t necessarily play nicely together — use Amazon SWF to communicate the flow of work between them without directly interacting with each other.
- AWS Lambda functions can consume events from a variety of AWS sources (e.g., Amazon DynamoDB update streams, Amazon S3 event notifications, etc.). In this case, you don’t even need to worry about implementing a queuing or other asynchronous integration method because the service handles this for you.
Graceful Failure: Last but not least, one of the patterns within the Loose Coupling doctrine is graceful failing, that is, building apps so that stuff doesn’t blue screen when their underlying components fail. AWS has one of those little gray digression boxes to explain how this works in practice, and those little boxes are always cool so I’m going to digress a bit into the wonderful world of graceful failure. Just hang tight, I won’t stay here long.
OK so when a database goes to write to some record it first pulls the record, without applying any locks on it, and then just before it writes it checks to make sure no one has modified that record in the millisecond between when it pulled the record and when it went to write. Every now and then, a competing process will have done just that. There’s a thing called pessimistic concurrency control, where the client will literally lock down the record during a write transaction, so this whole competition thing is eliminated. That makes stuff super slow and sucks for a lot of reasons so instead, we’re going to talk about OPTIMISTIC concurrency control, which basically is a way of just telling the client, if this kind of conflict happens, to abandon the write, pause for some short period of time, say 10 milliseconds, re-pull the record and try again. But OCC is just that — optimistic — and when you get into what nerds call a “high contention” environment, like where, for example, you have a zillion people commenting on a single viral Trump tweet, the performance of your otherwise-awesome database is going to really suffer, because for every 1 “round” of the database choosing which client’s write “wins” and gets to succeed, you’ve got all the clients that didn’t succeed still pulling the record and attempting to write, which still does cost the database in terms of performance.
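One common way to do that "check nobody changed it" step is a version number on each record, so here's a small sketch along those lines. The `VersionedStore` class and `update` helper are hypothetical and live entirely in memory; real databases do this with conditional writes or row versions, and I've left out the pause between retries to keep it short.

```python
class VersionConflict(Exception):
    """Raised when the record changed between our read and our write."""


class VersionedStore:
    """Hypothetical in-memory store where every record carries a version number."""

    def __init__(self):
        self._data = {}  # key -> (value, version)

    def read(self, key):
        """Read without locking; missing keys come back as (None, 0)."""
        return self._data.get(key, (None, 0))

    def write(self, key, value, expected_version):
        """Write only if nobody else has written since we read."""
        _, current_version = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(key)
        self._data[key] = (value, current_version + 1)


def update(store, key, change, max_attempts=5):
    """Optimistic loop: read, compute, try to write, and re-read on conflict."""
    for attempt in range(1, max_attempts + 1):
        value, version = store.read(key)
        try:
            store.write(key, change(value), expected_version=version)
            return attempt  # how many tries it took
        except VersionConflict:
            continue  # someone beat us to it, so pull the record again
    raise VersionConflict(f"gave up on {key!r} after {max_attempts} attempts")
```

Every lost round costs the store another read plus a rejected write, which is where the contention cost described above comes from.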
And if you think about it, while it may take x amount of time to complete x amount of transactions, the higher that number x gets, the higher the per-transaction cost to the database. Let’s say x is 100, and it takes 100 seconds to get all 100 transactions done. That’s fine and dandy — it’ll take 100 seconds — but for the system each transaction becomes more and more costly. This is a little confusing if you’re not a math person, but let’s say you’re the database and you’ve got 100 competing clients that are going to try to win that transaction write — that means the database has to serve up the record to 100 clients only to have 1 of them win. So for the 100 number, the database has to serve up 100 + 99 + 98 + 97 and so on total records as it works its way down from 100, which adds up to 5,050 reads for just 100 writes. So in that way, work done by the system grows quadratically, roughly with the square of the number of clients, which is no bueno for the overall system, and no bueno for the budget. OK, so that’s OCC.
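If you want to sanity-check that 100 + 99 + 98 + ... sum yourself, here's a tiny sketch that counts the reads round by round. It assumes the simplified model from above, where every waiting client re-reads the record each round and exactly one of them wins; the function name is mine.

```python
def total_reads(clients):
    """Count record reads when exactly one waiting client wins each round."""
    reads = 0
    remaining = clients
    while remaining > 0:
        reads += remaining  # everyone still waiting pulls the record again
        remaining -= 1      # exactly one of them gets its write through
    return reads


if __name__ == "__main__":
    for n in (10, 100, 1000):
        # The loop matches the closed form n * (n + 1) / 2.
        print(n, "clients ->", total_reads(n), "reads")
```

The closed form is n(n + 1)/2, so ten times the clients means roughly a hundred times the reads.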
To account for the flaws of OCC, there’s something called “exponential backoff and jitter” which sounds like things a fighter jet would do mid-air but is actually just a way for the database to talk to clients where the database says, OK, you gave me a transaction that failed so back off for 2 seconds and try again. OK, you failed a second time? Make that 4 seconds. A third time? Make it 8. Fourth? SIXTEEN. Typically you’d cap this but you get the point. You’re telling the clients to chill out and come back in a little while. This exponential backoff kind of works, but it has a big problem. For the 100 competing clients example, you’re crowning 1 client victor but sending 99 of them off with a 2 second wait. Next round, you’re sending home 98 of them with a 4 second wait. Next round, you’re sending 97 of them home with an 8 second wait, and then the next you’re sending 96 of them with a 16 second wait. So then what you’ve got is clusters of calls all coming back at the same moment, which isn’t good because there’s still only going to be 1 victor at any given moment. What you want to do is realize there are a ton of calls and say, OK, instead of all of you clients coming back at the end of the exponential backoff amount — let’s say it’s 16 seconds — instead, come back at a random time between now and 16 seconds. So this little jitter makes it so you’re spreading out the load in a randomized way — albeit guided by this exponential backoff value, which is going to increase with contention — and making it so you’ve got fewer clients competing at the same time. You could call it “intelligent load scattering” but programmers are calling it “exponential backoff and jitter” so we’ll stick with that. Pretty cool, huh? If your brain hurts, read this.
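Here's roughly what backoff plus jitter looks like in code, on the client side. I'm assuming plain doubling from a 2 second base with a 60 second cap, and the "pick a random time between now and the ceiling" flavor of jitter described above; the function names, the `Conflict` exception, and the defaults are all mine, so tune them to taste.

```python
import random
import time


class Conflict(Exception):
    """Hypothetical error a client gets when its write loses the race."""


def backoff_delay(attempt, base=2.0, cap=60.0):
    """Plain exponential backoff: 2, 4, 8, 16... seconds, never above the cap."""
    return min(cap, base * (2 ** attempt))


def backoff_with_jitter(attempt, base=2.0, cap=60.0, rng=random):
    """Jittered version: a random delay between 0 and the exponential ceiling."""
    return rng.uniform(0, backoff_delay(attempt, base, cap))


def call_with_retries(operation, max_attempts=5, sleep=time.sleep, rng=random):
    """Run operation(), retrying on Conflict with a jittered, growing wait."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Conflict:
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the caller deal with it
            sleep(backoff_with_jitter(attempt, rng=rng))
```

Without the `uniform` call, all the losing clients would wake up at the same instant and pile on again; with it, they trickle back in across the whole window.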
OK, back to your scheduled programming. And on that note, we’re done with Graceful Failure.
Services, Not Servers
Now onto our next AWS Cloud Best Practice design pattern: what I’ll call “services are better than servers”. The intro sentence is basically like, “If you’re going to use EC2, you might as well use all of the compatible sister services if you want your developers to be more productive and your operations to be more efficient.” Then they kinda expound on two different categories of these services: managed services — where you’re creating and maintaining AWS infrastructure — and serverless — where you’re paying for specific functions to be executed, not the underlying infrastructure used by those functions.
Managed Services: OK, first let’s cover what AWS means when they say “managed services”. Honestly there’s not much to say here that hasn’t already been said. AWS has databases, machine learning, analytics, queuing, search, email, notifications. And the idea is like, OK, instead of building and operating and updating and hotfixing a highly-available automatically-scalable world-class RabbitMQ messaging cluster, why not just plug into Amazon’s version of RabbitMQ, SQS? And the same logic applies to something like S3 — it’s a one-stop shop for storing pretty much any type of data — you just click your configurations and requirements from a drop-down, there’s no limit to how much you can store, it’s automatically replicated, I mean you can even use it to serve static content to web and mobile apps that are seeing millions of users a minute. That’s pretty rad, I’ll admit. Here’s the last little paragraph, which I’ll read verbatim because you can kinda sense the author getting kinda excited, like he’s hyped on the AWS Kool-Aid, here it is: “There are many other examples such as Amazon CloudFront for content delivery, ELB for load balancing, Amazon DynamoDB for NoSQL databases, Amazon CloudSearch for search workloads, Amazon Elastic Transcoder for video encoding, Amazon Simple Email Service (Amazon SES) for sending and receiving emails, and more” with a footnote after more, as if I’m just dying to click it. OK we get it, let’s move on to Serverless Architectures.
Serverless Architectures: A kind of next-level way to reduce — to abstract — operational complexity is some magic fairy dust called Serverless. I call it magic dust because, I mean, it’s kinda crazy that you can build services for mobile, for web, event-driven or synchronous, services that do analytics, services that handle streaming of Internet of Things (IoT) device data, all without managing any server infrastructure. For the old mainframe guard I’m sure this sounds heretical, but AWS is investing a ton in this serverless thing. They have AWS Lambda, which is where you can upload your code and it’ll run given x or y trigger, and AWS just charges you for every 100ms your code takes to run plus a tiered fee given how many times this function is running. You can host an entire website experience using Lambda for the logic, S3 to serve the content, API Gateway for opening the door to the Lambda function and, if you need it, Cognito for the authentication, all without a single EC2 instance — and lots of companies have done this. In fact AWS has an entire whitepaper devoted to this exact thing, you should check it out if you’re getting fired up — or offended — by this heretical serverless architecture stuff. It’s called “AWS Serverless Multi-Tier Architectures”.
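For a feel of how little code the Lambda piece of that setup can be, here's a minimal sketch of a function sitting behind API Gateway. It assumes the Lambda proxy integration, where the request arrives as an `event` dict and the function returns a dict with `statusCode`, `headers`, and `body`; the greeting logic and the `name` query parameter are just filler I made up.

```python
import json


def handler(event, context):
    """Answer an API Gateway proxy request with a small JSON payload."""
    # queryStringParameters is None when the caller sends no query string.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

There's no web server, no port, and no instance anywhere in that file, which is pretty much the whole serverless pitch.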