Serverless is the PaaS I Always Wanted
In the early days of Cloud computing, there was a simple picture we would share to explain the Cloud Service Model, sometimes with a diagram like the pyramid above. I’m sure you’ve seen a version of it: a way of showing the different service types as a stack, highlighting the relationships between each. When I gave training for new associates beginning to use the Cloud, the instruction would cover the different service layers, and the definition of each would be spelled out.
Infrastructure as a Service — the foundation of the new Cloud computing model. I’d articulate how using this model differed from owning your own datacenter: no more managing facilities, plus the per-unit cost advantages that the major providers gain at scale from racking commodity servers and top-of-rack switches.
Platform as a Service — well, a work-in-progress. There were some real-world examples, but much of the story was still about what was possible vs. what was actually being used in the marketplace. The advantages were clear, as we needed more than just raw compute capacity from a provider, and I would cite this reference from NIST that carried great appeal.
Cloud Platform as a Service (PaaS). The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly application hosting environment configurations. — source: National Institute of Standards and Technology, ca. 2009.
Software as a Service — where speed and agility would really be unlocked. There were great examples to highlight in the success and growth of Office 365 and Workday, as well as many industry-specific services. The model and benefits were easy to articulate given the time-to-market savings, and it fit nicely into the microservices discussions that were being introduced at the same time.
Given the need to build custom servicing systems while leveraging standard engineering patterns, the concept was great. But of the three service types, the PaaS market was the slowest to develop, and without great success stories, its value proposition was the hardest to explain to others.
Cloud Building in 2017
Okay, fast forward to today, and let’s check in on our options for building infrastructure in the Cloud. How are they similar and how are they different from when I first gave my training?
Given that it’s 2017, let’s use a common contemporary investment project as an example — building a voice chat-bot. This hypothetical chat-bot will be a friendly voice-driven assistant that can operate on the Amazon Alexa platform and help walk individuals through finding and booking their ideal vacation getaway. Marketing says it will be a big hit this winter as people seek to escape the cold weather. Therefore, we should assume it can handle thousands of requests per second; and when running our ad campaigns, potentially service millions of prospects each day. Let’s also assume we’re looking to push the envelope for Amazon Alexa development so the returned content will not just be spoken words but will also include the ability to play sounds from exotic locales and use the card feature to show pictures of possible destinations.
Option #1 — Solution using IaaS
The first alternative is to use the foundational parts that a Cloud provider like AWS offers, much as we would in a traditional on-premises datacenter. We will benefit from having all of this infrastructure pre-provisioned for us, but will need to automate the provisioning of the different parts, as well as their assembly. The solution might end up looking like this.
Within the new platform are a series of stacks:
- API interface stack that bridges the Amazon Alexa platform to our business logic, written in Java. A typical Web/App layer hosted on separate EC2 instances, probably memory-optimized instance types.
- Content management to serve images and audio clips. These are likely general-purpose instance types.
- A NoSQL solution like MongoDB for persisting data for analytics. Might need some storage-optimized types here, provisioned with extra IOPS.
- An ELK (Elasticsearch, Logstash & Kibana) stack for real-time data collection and dashboarding. Once again, general-purpose instance types will do.
- A separate compute cluster for running analytic processing on the captured transaction data. Once again, storage-optimized with extra IOPS.
Each stack will have an ELB in each region, its own cluster of EC2 instances (complete with auto-scaling groups), as well as the requisite EBS volumes, subnets, and security groups. We will automate it all via CloudFormation.
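The CloudFormation automation for one of these stacks might be sketched as below, expressed as a Python dict for brevity. The resource names, instance type, sizes, and AMI ID are hypothetical placeholders, not values from an actual design.

```python
import json

# Minimal, illustrative CloudFormation template for one stack:
# an auto-scaling group of EC2 instances behind a classic ELB.
# All names, sizes, and the AMI ID are hypothetical placeholders.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "API/web layer stack (sketch)",
    "Resources": {
        "ApiLoadBalancer": {
            "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
            "Properties": {
                "Listeners": [{
                    "LoadBalancerPort": "443",
                    "InstancePort": "8080",
                    "Protocol": "HTTPS",
                }],
                "AvailabilityZones": {"Fn::GetAZs": ""},
            },
        },
        "ApiLaunchConfig": {
            "Type": "AWS::AutoScaling::LaunchConfiguration",
            "Properties": {
                "ImageId": "ami-12345678",   # placeholder AMI
                "InstanceType": "r4.large",  # memory-optimized, per the stack notes
            },
        },
        "ApiAutoScalingGroup": {
            "Type": "AWS::AutoScaling::AutoScalingGroup",
            "Properties": {
                "LaunchConfigurationName": {"Ref": "ApiLaunchConfig"},
                "LoadBalancerNames": [{"Ref": "ApiLoadBalancer"}],
                "MinSize": "2",
                "MaxSize": "8",
                "AvailabilityZones": {"Fn::GetAZs": ""},
            },
        },
    },
}

print(json.dumps(template, indent=2))
```

Multiply this by five stacks, plus the subnets, security groups, and EBS volumes, and the template inventory grows quickly.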
How much infrastructure is needed? Well, it depends on our non-functional requirements, but assuming extensive use of auto-scaling, workloads running across zones and regions for an HA solution, blue-green deployments for code rollouts, and non-production environments: around 60–80 EC2 instances. Let’s say half of those are on-demand, and the others we can convert to Reserved Instances. Throw in a dozen or so ELBs, and 120–150 EBS volumes, some with high IOPS for the analytics. Sizing matters too: while the web layer might get by with small general-purpose instances, the analytic and content management components may need specialty types with large capacity.
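To make the capacity math concrete, here is a back-of-envelope estimate of the monthly compute bill, using purely hypothetical blended hourly rates (real EC2 pricing varies by instance type, region, and RI term):

```python
# Back-of-envelope monthly compute cost for Option #1, using purely
# hypothetical hourly rates (real pricing varies widely).
HOURS_PER_MONTH = 730

instances = 70                  # midpoint of the 60-80 estimate
on_demand = instances // 2      # half stay on-demand
reserved = instances - on_demand

on_demand_rate = 0.20           # $/hr, hypothetical blended rate
reserved_rate = 0.12            # $/hr, hypothetical effective RI rate

monthly = (on_demand * on_demand_rate + reserved * reserved_rate) * HOURS_PER_MONTH
print(f"~${monthly:,.0f}/month before ELBs, EBS volumes, and data transfer")
```

Whatever the actual rates, the point is that the meter runs 730 hours a month on every provisioned instance, whether it's busy or idle.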
Option #2 — A Serverless (PaaS?) Design
The first option has many pieces and parts, and it is going to take multiple engineers familiar with those domains to go off and provision each stack. One of our aspirations in going to the Cloud was to get out of the complexity of managing infrastructure, so how else can we accomplish this? Let’s try the new serverless approach and see how it compares.
Rather than building “stacks” like in the first option, we’re using different services in the AWS catalog (numbers match stacks for option #1 above).
- API Gateway and Lambda — we can build the business rules without a server/load balancer and still handle peak volume. Sizing is addressed by the amount of memory allocated to the Lambda functions, and can be adjusted on the fly.
- Content (MP3, JPG) will be stored in S3, and since it’s elastic, no sizing is needed. Costs are driven by how much content we store and how often it is accessed.
- Transactional data we will stream out to DynamoDB. We don’t need to size capacity based on the number of rows, although we do need to set read/write throughput, as that’s how the product is priced.
- Monitoring is the tricky area, as there’s no single product that provides the same capabilities. We can do some of the work in CloudWatch, but that’s more around system-level details. Transaction-level monitoring can be done with ElastiCache, Lambda, and QuickSight.
- Several different analytic options, including Redshift. Management is done through a console provided by the service, or via the APIs. Pricing is more along the lines of traditional host-based processing, where costs are driven by the number of hosts used.
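As a sketch of how the compute piece of this option might look, here is a minimal Lambda handler for the chat-bot that returns SSML speech with an embedded audio clip plus a destination card. The skill logic, destination data, and S3 URLs are all hypothetical; a real skill would look these up (e.g. in DynamoDB) based on the user's intent.

```python
# Minimal sketch of an Alexa skill handler running on AWS Lambda.
# Destination data and URLs are hypothetical placeholders.

def lambda_handler(event, context):
    # A real handler would inspect event["request"]["intent"] here
    # to work out which vacation the user is asking about.
    destination = {
        "name": "Maui",
        "audio_url": "https://example-bucket.s3.amazonaws.com/maui-surf.mp3",
        "image_url": "https://example-bucket.s3.amazonaws.com/maui.jpg",
    }

    # SSML lets us mix spoken words with the exotic-locale audio clip.
    speech = (
        "<speak>How about {name}? "
        '<audio src="{audio}"/> '
        "I can show you some options on your screen.</speak>"
    ).format(name=destination["name"], audio=destination["audio_url"])

    # Alexa Skills Kit response envelope: SSML speech plus a Standard card.
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "SSML", "ssml": speech},
            "card": {
                "type": "Standard",
                "title": "Your getaway: " + destination["name"],
                "text": "Warm beaches await.",
                "image": {"largeImageUrl": destination["image_url"]},
            },
            "shouldEndSession": False,
        },
    }
```

Note there is no server, load balancer, or auto-scaling group anywhere in sight; the function and its memory setting are the whole compute footprint we manage.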
Now let’s go further in comparing the two options.
Infrastructure Costs — Option #1 vs. Option #2
One key concept distinguishing the options is the pricing model: whether we’re paying for capacity (option #1) or consumption (option #2). For example, the chart below describes the typical utilization of a servicing application that peaks during the day. In option #1, we’re paying for everything in the box; in option #2, we’re just paying for what is consumed — the “blue area”. There are features available with IaaS, like auto-scaling, where capacity can be dropped during off hours (see the dark line), but even when this is done really well, there will still be some excess capacity being paid for but not used. If it’s not done well, or the workload isn’t a great fit for auto-scaling, the delta between the two areas is huge — potentially 3–5x.
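The capacity-vs-consumption gap can be sketched with a few lines of arithmetic. The hourly load profile below is made up purely for illustration (a daytime peak, quiet overnight); the ratio it produces will vary with the workload's actual shape.

```python
# Illustrative capacity-vs-consumption comparison for a workload
# that peaks during the day. The load profile is hypothetical.
hourly_load = [2, 1, 1, 1, 1, 2, 4, 8, 12, 14, 15, 16,
               16, 15, 14, 12, 10, 8, 6, 5, 4, 3, 2, 2]  # arbitrary units

# Option #1 without auto-scaling: provision for peak, all day long.
provisioned = max(hourly_load)

# Option #2: pay only for what each hour actually consumes.
consumed = sum(hourly_load) / len(hourly_load)

print(f"peak capacity paid for:  {provisioned}")
print(f"average actually used:   {consumed:.2f}")
print(f"capacity:consumption:    {provisioned / consumed:.1f}x")
```

Even this mild example shows a 2x-plus gap; a spikier profile, or capacity padded for safety margins, pushes it toward the 3–5x end.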
Labor Costs — Option #1 vs. Option #2
As long as we have AWS experts, it’s far quicker to provision infrastructure in option #2. It gets us out of managing the low-level components that can be time-consuming to orchestrate, and each service exposes the key parts for us to tune. There’s also a labor impact in option #1 when we spend cycles configuring and testing the auto-scaling rules. While this might be time well spent, it drives up the labor effort (thus isn’t free). In option #2, this complexity is taken on by the cloud provider and built into the price of the service.
Option #3 — Containers to the Rescue!
Now after reading this, you might be inclined to improve on these design options using containers, with Swarm/Kubernetes/Mesos, etc. for orchestration. They are a good fit for stack #1, a moderate fit for stacks #4 & #5, and a poor fit for stacks #2 & #3. Containers are a great technology: they can simplify application provisioning and deployment, and drive up the utilization of our infrastructure, narrowing the gap between what’s been provisioned and what’s being consumed in the chart above.
First, let’s clarify an assumption: if we use ECS, it’s just a variation of option #2, since the Cloud provider is taking on the base infrastructure and tooling. As a customer, if I build container tooling on top of the base infrastructure myself, that is a new approach with merits over option #1. But let’s first ask ourselves a clarifying question about our broader goals as an organization.
What do you want your engineering talent to spend time on? Do you want them to become experts at building infrastructure abstraction layers, or do you want them creating product features?
For me, I don’t work for a Cloud hosting company, and building tools and infrastructure abstraction layers isn’t our core competency. I want my engineers to spend their time focusing on customers and building features that improve their financial lives. While my team could be building tools to manage containers, that effort takes away cycles that could be dedicated to product development. From my perspective, abstractions on top of infrastructure are great works of engineering, but isn’t that what the Cloud provider is supposed to be doing for me?
Other tradeoffs between Options #1 & #2?
Serverless patterns are still maturing, and while the products are evolving quickly, they still have limitations that can require workarounds to mimic features you could build from scratch with frameworks on top of IaaS. Over time, I’m assuming these limitations will go away as the services mature. There are also some significant shifts in infrastructure approach that groups must adopt for this new model to succeed. These include:
Network and Access Management
One of the biggest mindset shifts in the serverless model is in how the network is managed, and how different components communicate. Most current security models assume a private network namespace that mimics a traditional datacenter: small blocks of address space are allocated up-front into subnets, based on size estimations for the applications. Modeling tools for managing the software-defined network are robust, reproducing the complexity built up in legacy firewalls, but there is still an assumption of doing a local network design for the application, where IP address ranges are king.
The serverless model instead assumes role-based access as the authentication model between components, with a network space that is largely hidden from the application. Whoever is provisioning the infrastructure will need to understand the relationships between the components (i.e., which components can talk to which), and be able to author policies that enable this communication. For example, in option #2, the execution role for the Lambda function will need a policy enabling it to read and write only the relevant S3 buckets and DynamoDB tables. This is how we get to the principle of “least privilege” in information security.
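To make the least-privilege idea concrete, here is roughly what such an execution-role policy might look like, built as a Python dict. The bucket name, table name, account ID, and region are all hypothetical.

```python
import json

# Hypothetical least-privilege policy for the Lambda execution role:
# it may touch only the one S3 bucket and one DynamoDB table it uses.
BUCKET_ARN = "arn:aws:s3:::getaway-content"  # hypothetical bucket
TABLE_ARN = "arn:aws:dynamodb:us-east-1:123456789012:table/bookings"  # hypothetical table

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Read and write objects in the content bucket only.
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": BUCKET_ARN + "/*",
        },
        {
            # Read and write items in the bookings table only.
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:Query"],
            "Resource": TABLE_ARN,
        },
    ],
}

print(json.dumps(policy, indent=2))
```

Anything not explicitly allowed here — every other bucket, table, and service — is denied to the function by default.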
Cross Datacenter & Region Replication Patterns
When building highly redundant platforms that deploy applications across regions, we need to engineer how persisted data will be replicated in support of our RTO/RPO service levels. While S3 can easily be configured to transport data across regions, products like DynamoDB & RDS are designed to provide resiliency across zones (datacenters) and require extra tools and patterns to replicate across regions. When building applications on IaaS, the assumption is that you’re doing it all yourself, though in some cases there are frameworks or tools you can install to enable this. Long-term, I believe these capabilities will get built into the services as features by the Cloud providers; new features are being released every day.
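As one concrete example, S3's cross-region transport is driven by a replication rule attached to the source bucket. The sketch below builds such a rule as a plain dict; the bucket names, role ARN, and region are hypothetical, and the actual API call is left commented out so the sketch stands alone.

```python
# Sketch of an S3 cross-region replication rule. Bucket names and the
# role ARN are hypothetical; versioning must be enabled on both the
# source and destination buckets before replication can be configured.
replication_config = {
    "Role": "arn:aws:iam::123456789012:role/s3-replication",  # hypothetical role
    "Rules": [
        {
            "ID": "replicate-content",
            "Status": "Enabled",
            "Prefix": "",  # empty prefix: replicate every object
            "Destination": {
                "Bucket": "arn:aws:s3:::getaway-content-us-west-2",
            },
        }
    ],
}

# With boto3, this would be applied roughly like:
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_replication(
#     Bucket="getaway-content",
#     ReplicationConfiguration=replication_config,
# )
print(replication_config["Rules"][0]["ID"])
```

Note that this is a bucket-level feature the provider runs for us; the DynamoDB and RDS equivalents still require us to assemble the replication pipeline ourselves.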
Engineering Skills
Building efficient infrastructure using a Cloud provider requires good engineering skills and a training plan for associates getting started. Building robust platforms with just the IaaS services does get complex, requiring an in-depth understanding of the proper EC2 instance types, security groups, EBS options, auto-scaling, and ELB configuration. It may be easier to convert a legacy sysadmin over to IaaS, given that many of the patterns are the same (although most that I’ve met are gapped on the required networking skills).
In the serverless model, less detail is needed to get started with an individual service: the provisioning tooling provides the framework, and defaults are supplied to simplify the model. It’s only when we start combining many services that we catch up to the level of complexity of the IaaS model, which assumes significant depth in how to assemble a very fungible set of tools to support a specific pattern.
If I were to update the training dialogue from my Cloud Computing class to reflect the state of Cloud Service Models in 2017, the examples I would cite in the PaaS section would be offerings from AWS such as Lambda, DynamoDB, and Redshift. They are an excellent fit for the model of “using programming languages and tools supported by the provider” while not requiring me to “manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage” that the NIST definition laid out years ago. Getting out of managing the underlying infrastructure enables engineers to focus on the most important part of solution building: being responsive to the customer’s needs.
For more on APIs, open source, community events, and developer culture at Capital One, visit DevExchange, our one-stop developer portal. https://developer.capitalone.com/