AWS Placement Groups & SR-IOV — Like a Drum Circle for your Apps

In this quick blurb, I'm hitting most of the really technical stuff up front; muscle through it and we'll get to the AWS Cloud architectures, where the rubber meets the road of technical implementation.

In the physical IT space, we had a lot of control over how we set up our infrastructure and systems. It's not talked about very often any more, but the more layers of the Open Systems Interconnection (OSI) Model we specifically designed for optimization in our architectures, the more flexibility, control, and performance-tuning opportunities we had. For the newly minted IT folks out there reading along, there are 7 layers of the OSI model, counting from the bottom up: Physical Layer = 1, Application Layer = 7. They basically took you from the lowest level of an IT architecture right up to the application behaviors themselves.

Go check out http://www.webopedia.com/quick_ref/OSI_Layers.asp for a good write up of the OSI Model.

The Virtual Computing revolution obscured layers 1–2 almost entirely. The Cloud has taken the OSI model and just shredded it below Layer 5. Yes, folks try to make it 1:1 relevant, but the bottom line is that we have traded Cloud economy of scale, time to market, and flexibility for a lot of the utility we got from the OSI layer cake.

Alright, so there is more to this point than just some old IT artifact bemoaning the revolution of the Cloud: far more. I'm exceedingly pro-cloud, just short of an evangelist really; however, there were some real benefits to being forced to build, configure, and manage all the layers of the OSI model that we don't get with Cloud methodologies. The good news is that we are starting to get some of it back. The bad news is that we have a generation of newly minted IT folks who just don't understand the uber importance of why these OSI bridge features matter.

One such feature that is trying to bridge Old World controls and Cloud "you can have it in any color as long as it is black" feature inventories is the AWS Placement Group. By the book, AWS defines a Placement Group as follows:

A placement group is a logical grouping of instances within a single Availability Zone. Using placement groups with supported instance types enables applications to participate in a low-latency, 10 Gigabits per second (Gbps) network.

For those of us who've worked with AWS and cloud-like things from the start, this is a great feature that seeks to replace a wee bit of the functionality and performance tuning that we had in the physical world.

The meaning beyond the definition is that we can give AWS a strong hint that we are deploying a group of systems into a Virtual Private Cloud ("VPC") that need the lowest latency AWS can give us to communicate efficiently with each other. Application clustering, analytics applications, pseudo-HPC applications, Big Data apps: all these network-hungry drum circles that take large amounts of data and boil them down to a focused data set benefit from this feature.

AWS pulls this off by carefully launching the instances destined for a Placement Group in a relationally close networking loop. To do so, all of a placement group's instances must live in the same Availability Zone, which breaks the best practice of Multi-AZ deployment. However, if you look at the compute resources in a Placement Group as a temporary cog/assembly, and the data as the pillar of your architecture, you can wink clusters of systems in a Placement Group in and out of existence as you need them.
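To make this concrete, here's roughly what winking a Placement Group in and out of existence looks like with the AWS CLI. The group name, AMI ID, and subnet ID below are placeholders, and you'd need an instance type that supports Enhanced Networking:

```shell
# Create the placement group ("cluster" strategy packs instances
# onto relationally close networking for low latency)
aws ec2 create-placement-group \
    --group-name drum-circle \
    --strategy cluster

# Launch the instances directly into the group; they all land in the
# same Availability Zone. AMI and subnet IDs are placeholders.
aws ec2 run-instances \
    --image-id ami-12345678 \
    --instance-type c4.8xlarge \
    --count 4 \
    --placement GroupName=drum-circle \
    --subnet-id subnet-12345678

# When the workload is done, terminate the instances and
# dispose of the group
aws ec2 delete-placement-group --group-name drum-circle
```

One practical note: launch all the instances for the group in a single request. If you trickle them in over time, you raise the odds of an insufficient-capacity error because AWS may not be able to place the latecomers close to the rest.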

Now hooking up a fast network to a slow machine won't get you anywhere, in the physical or the virtual world. However, you can control this by choosing an AWS instance type that supports Enhanced Networking, and you're good to go.

Enhanced Networking uses single root I/O virtualization (SR-IOV) on both Windows and Linux instances to trick out the networking layer of your instances and keep the CPU as free from network I/O bottlenecks as possible. This lets you push throughput to 10 Gbps. A newcomer to the game is the X1 instance family, which supports an even more aggressive optimization of the networking layer to achieve 20 Gbps.
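As a quick sketch, you can check whether SR-IOV is enabled on an instance, and flip it on for a stopped instance, straight from the CLI. The instance ID here is a placeholder, and enabling it assumes an HVM AMI with the appropriate (e.g. ixgbevf) driver baked in:

```shell
# Check whether Enhanced Networking (SR-IOV) is enabled on an instance
aws ec2 describe-instance-attribute \
    --instance-id i-0123456789abcdef0 \
    --attribute sriovNetSupport

# Enable it on a *stopped* instance; "simple" is the value
# the sriovNetSupport attribute expects
aws ec2 modify-instance-attribute \
    --instance-id i-0123456789abcdef0 \
    --sriov-net-support simple
```

If the describe call comes back with an empty value, Enhanced Networking isn't on yet; if it returns "simple", you're already set.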

For you heavy HPC folks out there, these aren't HDR InfiniBand speeds we're talking about, but for a fully virtualized environment that costs far, far less than a single low-density InfiniBand switch, this is not a bad compromise. Besides, a constant 20 Gbps between nodes is not that shabby. And if you've dabbled in the hardware installation and config side of InfiniBand, setting up Enhanced Networking and Placement Groups is like an all-inclusive tropical island vacation in comparison.

cfnCluster (a Python-based cluster creation script) uses Placement Groups to optimize networking between compute nodes. You could also use the ElastiCluster utility, which also understands and utilizes Placement Groups. Regardless of your choice of cluster-creation utility, Placement Groups give you the ability to build clusters that match your workload needs, as opposed to shoe-horning your applications onto a cluster, by making it easy to build fully automated, disposable HPC grids.
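For a flavor of how hands-off this is, here's a sketch of the cluster section of a cfnCluster config file (the key name and instance type are illustrative; check the cfnCluster docs for the exact options your version supports):

```ini
[cluster default]
key_name = my-keypair
# DYNAMIC asks cfnCluster to create a Placement Group with the
# cluster and tear it down with it; you can also name an existing group
placement_group = DYNAMIC
# "cluster" puts master and compute nodes in the group,
# "compute" restricts it to the compute fleet
placement = cluster
compute_instance_type = c4.8xlarge
```

That's the whole "disposable HPC grid" story in a handful of lines: the Placement Group lives and dies with the cluster itself.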

Trust me — the words “disposable” and “HPC” haven’t been in the same sentence for very long.

Another interesting feature of Placement Groups is that you can deploy them across different VPCs, allowing temporary boosts in inter-VPC networking for applications working together with other systems and teams in your AWS accounts. (#GeekSqueal) You can't get the full 10G experience, but you do get benefits over just a VPC Peering connection.

Engineers are a whiny crowd. The older we get, the whinier we are, even if it is only in our own heads. Cloud methodologies really moved the cheese of a lot of folks when they hit. Once Engineers and Architects got to the point of adoption, the next stage of kvetching is feature parity.

As the Cloud matures, it will grow these features and move forward; it's up to the savvy technologist to stop asking "why" and start asking "how" when we feel that reflexive "No!" reaction coming. Placement Groups and SR-IOV networking aren't as sexy as some Cloud features, but they open up a number of HPC workloads that we couldn't service before. We've had Placement Groups since 2011, but the increasing functionality hasn't gotten much illumination.