Tim’s Take: A re:Invent 2020 Serverless Wishlist
It’s never too early to start planning for next re:Invent!
The AWS Lambda, Amazon API Gateway, AWS Step Functions, SAR and SAM teams, and many others launched an incredible array of features this year that expand the “addressable market” for AWS’s Serverless offerings. (Looking for a field guide to those launches? Check out my 2019 summary.) Collectively, those 2019 releases make AWS Serverless an ideal solution for low latency applications, users of Amazon RDS databases, bursty use of VPC resources, high frequency APIs, and more. As impressive as that list is, though, enterprise and startup developers still turn to “serverful” solutions for many of their applications, and even diehard serverless fans will find themselves wanting some things to be easier. So, with a huge bow to last year’s features, I want to look forward to what the 2020 re:Invent has in store. My wishlist falls into three broad thematic buckets:
- Eliminate the need to run containers due to missing functionality. The 2019 re:Invent launches took a huge bite out of this category, but unfortunately there are still some holes (like gRPC support) where a rational person will still pick containers as the easier solution.
- Tackle net-new markets. Lambda could be a supercomputer…but not without better networking, “reducer” capability, high-speed code customization, and other functionality. Decentralized data is going to be huge, and Lambda is so close to being an ideal platform for building it…but isn’t fully there yet.
- Delight developers. Developer satisfaction never gets old, and some basic capabilities (convenience libraries, copy commands, recursion support, routing, polling, etc.) are still missing. Looking at things like Logic Apps in Azure is inspiring to imagine how AWS Serverless offerings could be both simpler and more convenient.
Ok, enough with the preamble. Let’s start wishing!
How much these matter depends critically on the type of architecture you’re creating, so I haven’t placed them in any special order. But since AWS is fundamentally driven by business needs, I do include thoughts on pricing where relevant.
- 1ms duration granularity.
Why: Five years ago when we launched Lambda, 100ms seemed infinitely small compared to paying a minimum of an hour for an EC2 instance. But today, with Go functions able to do useful work in single-digit milliseconds, it feels like AWS is cheating by rounding up to 100ms on every call. Meanwhile, this limits Lambda’s addressable market, because smart devs rule it out right up front for short-lived calls. It’s time to get with the program and deliver an effective price cut by billing duration in tighter increments. Don’t wait for Azure or Google to beat you to the punch, AWS!
Pricing: Unchanged, just more fine-grained.
- EFS Integration in Lambda (and /tmp that scales along with memory)
Why: An infinite disk drive, mounted to an infinite computer? What could be better? Sadly, we still don’t have it, and Lambda’s restricted /tmp still doesn’t scale up like memory and CPU do when you pay more. As nice (and serverless) as S3 is as a blob store, its latency is high and it isn’t a real Unix filesystem…making it impossible to use with a lot of existing library code.
Pricing (for /tmp): Should scale without additional charges — users are already paying more for increased CPU and memory; the drive size should match that for no additional cost. (Save the ability to get a lot more storage for a lot more money for the EFS option.)
- Typechecking — An optional JSON Schema or protobuf definition for Lambda function arguments and results. (Bonus points for full gRPC support.)
Why: The most obvious reason is to remove boilerplate code from Lambda functions and shift it to the service, which is in a position to implement high speed, multi-tenanted JSON validation or gRPC* support for the masses better than any individual customer. It improves security (e.g., you could write IAM policies that prevent invocation of a Lambda function that doesn’t possess strong typing). It also offers a pleasing symmetry to the same feature in API Gateway (which obviously isn’t available for async invocation or when a function isn’t fronted by an API). Static typing offers all the usual benefits you’d get at the language level: knowing whether a function expects a number as a JSON integer versus a quoted string without reading the source code, or being able to process the result of a Lambda function call without guarding against malformed JSON (especially when you’re not the function’s owner).
I also can’t stress enough how much the absence of gRPC and protobuf support is costing AWS’s Serverless adoption. AWS teams don’t seem to realize this is the #1 checklist item when a Bay Area developer (and lots of others) make a technology choice. Every time someone fires up k8s, a Lambda PM should interpret that as hearing from a potential customer that gRPC support was missing in Lambda.
Pricing: Ideally calls that fail validation would result in an invoke charge but no duration charge. This also incents developers to add validation: it’s cheaper than paying to do it yourself.
*Yes, there’s a ‘g’ in it, but don’t worry AWS — it’s not really a Google thing.
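Until the service does this, the boilerplate in question looks something like the following sketch (the schema, field names, and `handler` are purely illustrative — this is the hand-rolled validation the wish would push down into Lambda itself):

```python
# Illustrative schema: this hypothetical "thumbnail" event must carry a
# string key and an integer width.
SCHEMA = {"key": str, "width": int}

def validate(event, schema):
    """Minimal stand-in for the service-side JSON validation being wished for."""
    for field, expected in schema.items():
        if field not in event:
            raise ValueError(f"missing field: {field}")
        if not isinstance(event[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")

def handler(event, context=None):
    validate(event, SCHEMA)  # today, this line lives in every single function
    return {"thumbnail": f"{event['key']}@{event['width']}px"}
```

With service-level typechecking, the `validate` call (and the schema drift that comes with hand-maintaining it) disappears from customer code entirely.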
- Integrate type checking with EventBridge (and other upstream and downstream event sources and sinks).
Why: Once type checking is present, it affords nice opportunities to integrate with event schema discovery and validation. Example: When another Lambda function is the target of a Lambda Destinations setting, the output of the first and the input of the second can be used to statically verify that the pair will succeed (at least syntactically) even before an event is passed, which is especially useful for infrequent (“cold”) error paths. When wiring up an event directly to S3 or another event source, it helps avoid silly errors, like choosing the wrong function. The schema discovery feature of EventBridge is a good start, but it doesn’t go far enough.
Pricing: Ideally free :).
- Support the “multiple entry points” design pattern directly through routing and multiple handlers.
Why: Another cool thing you can do once you have some built-in comprehension about the “type” of a function is support patterns like sending multiple events/APIs to a single piece of code. Religious debates aside, deployment is sometimes best done DRYly, which means you may have a single, more complex function to which multiple APIs, routes within an API, or events get routed. In this model it would be great to have systematic support for using a portion of the argument to get to the right handler, instead of having to write, test, and maintain a DIY dispatch table.
Pricing: If you run dispatch code as part of the function’s execution, then the overhead just rolls into the normal duration charge.
Bonus points for: Pre-invocation Lambda hooks (also see “BYOA” below) and post-invocation hooks (aka Destinations) for sync calls.
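For reference, here’s roughly what today’s DIY dispatch table looks like — a minimal Python sketch with made-up route keys and handlers, i.e., exactly the code this wish would let the service absorb:

```python
def create_user(event):
    return {"created": event["name"]}

def delete_user(event):
    return {"deleted": event["name"]}

# The hand-written, hand-tested, hand-maintained dispatch table that
# service-level routing on a portion of the argument would replace.
ROUTES = {
    "POST /users": create_user,
    "DELETE /users": delete_user,
}

def handler(event, context=None):
    route = event.get("routeKey")
    try:
        return ROUTES[route](event)
    except KeyError:
        return {"statusCode": 404, "error": f"no handler for {route}"}
```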
- Polled event sources beyond SQS, Kinesis, and DynamoDB Streams.
Why: These event sources are part of Lambda’s super power (though see below for what I consider to be the crippling problem with Kinesis and how to address it). But no matter how much we might all turn up our noses at it, polling is still a thing in the real world. APIs, databases, queues that aren’t SQS, streams that aren’t Kinesis…there are a LOT of data sources that have to get polled. Lambda’s timer events, ability to scale, concurrency controls, etc. help a lot here…but polling is still super manual. Why not expose more of that polling super power as its own (extensible and customizable) service for us? See Microsoft’s Logic Apps for a good idea of what this could look like (and how to make common tasks feel simple).
As a customer, I’m not hung up on *where* this happens…Lambda’s built-in pollers (like Kinesis and SQS), EventBridge, a new poller service, etc.
Pricing: Polling gets done mostly for free today, despite having very real costs. I’d expect to pay something for the privilege of polling a 3rd party event source.
- Idempotency protection / exactly once semantics / results retrieval / verification support for Lambda functions / memoization option.
Why: A common need in application code is to perform a non-idempotent action at most once. One of the most annoying characteristics of Lambda for developers is that it’s an at-least-once execution model, although most of the time it works in an exactly once fashion. This is a classic “garden path” problem…it’s e-x-t-r-e-m-e-l-y tempting to treat Lambda as if it were an exactly once service. One of the reasons people don’t work harder to “do the right thing” with Lambda is that it’s super annoying: Either you have to create (and pay for, both literally and in the sense of increased latency*) a one-action Step Function to do this for you, or you have to craft a DynamoDB table with auto-expiry and use it to record a moniker for each application invocation to protect against duplicate executions. Who wants to go through that hassle? This should be as simple as providing an optional JSON path for each Lambda function that indicates the route to a unique id along with an optional “de-duping” window length (which could default to 5 minutes). Bam! Idempotency protection for the masses, with no mistakes, and with way less cost than the DIY approach requires…after all, what’s the point of paying to persist this info to disk when it doesn’t need to live beyond a few minutes?
This is also another place where typed functions (see Typechecking above) help, because you can then describe the UID to uniquify on as a JSON path to a known (numeric or string-valued) datatype, or as protobuf metadata.
Closely related to idempotency, and super useful in building decentralized applications, is verification. This is essentially a generalization of duplicate invocation protection: Instead of just storing the ID of an invocation (and thus implicitly the Boolean value that indicates it has been executed previously), store the result of the function. This makes it easier to use Lambda to implement concepts like a smart contract, which needs to execute exactly once but which might need to have its results verified multiple times, using the cloud vendor (AWS) as a trusted authority to retrieve the outcome of the contract execution. It also makes it possible for Lambda to offer memoization (result caching) as a feature, albeit with more complex (storage-based) pricing implications.
Pricing: Exactly once semantics and verification both require storing additional information in a durable (though not persistent) form for a limited window of time, which likely requires a higher invocation charge per call (and one that gets larger the longer the information must be retained).
*Note that the more recently released Express Workflows don’t help here, since they also have at-least-once semantics.
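To make the proposed semantics concrete, here’s a sketch of the dedup-window logic with an in-memory dict standing in for the DynamoDB-with-TTL table (the `request_id` path and the 5-minute default are illustrative, not a real Lambda API):

```python
import time

DEDUPE_WINDOW_SECONDS = 300  # the hypothetical 5-minute default window
_seen = {}  # stand-in for a DynamoDB table with TTL-based auto-expiry

def invoke_once(event, business_logic, id_path="request_id"):
    """Run business_logic at most once per unique id within the window."""
    uid = event[id_path]
    now = time.time()
    # Expire entries older than the window, as a TTL would do for us.
    for key in [k for k, t in _seen.items() if now - t > DEDUPE_WINDOW_SECONDS]:
        del _seen[key]
    if uid in _seen:
        return None  # duplicate within the window; skip the side effect
    _seen[uid] = now
    return business_logic(event)
```

In the wished-for version, everything above the `business_logic` call is service infrastructure configured by a JSON path and a window length, not customer code.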
- Shutdown / Batching Hook
Why: Startup initialization is cleanly integrated with language mechanisms in Lambda, making it trivial to express “do this just once per function instance, before the first invocation gets to proceed”. Provisioned Concurrency increases the time between initialization and first run, but stays true to the concept.
Unfortunately, there’s no “shutting down” companion to the “starting up” hook. And even though it’s 2020 as I write this, fat client libs, stateful protocols, and resource locking are still very, very real in enterprise code. It’s not just old code, either — even “serverless” design patterns are impacted by the lack of a shutdown hook, because any time you want to batch downstream calls *across* Lambda invocations you’re out of luck — there is simply no reliable way to do it that doesn’t require using S3, Dynamo, or some other aggregator that often becomes self defeating in its cost and complexity.
There are a bunch of ways this could be retrofitted; an optional hook, set by a context method, is one. (Sending a special event would require everyone to start writing a new, top-level condition, which feels too intrusive.)
There are some corner cases (e.g., a function that crashed will probably not be in a position to execute its shutdown hook) but now that Destinations exists, there’s even a natural way to deal with that.
Pricing: Treat the shutdown hook as an invocation and charge for the duration it runs, just as if it were a “normal” event invocation.
- Ephemeral Functions
Why: Let’s say you have a Lambda function that wraps ImageMagick or some other image processing library. You have two directories in S3, “small” and “large”, and you want to thumbnail source images to different resolutions depending on the directory name. Easy breezy: With just two directories it’s easy enough to either code a conditional into a single Lambda function or just create two different Lambda functions (optionally by sharing a Layer for the common image transformation library). Problem solved.
But what if you had a *lot* of directories? Such as one per user for a large b2c app with millions of users, each one of whom gets to customize settings differently for their images? You still really only need one function, but you need a way to associate its arguments with the user. We could give the Lambda function access to a database of users and their configs, but that’s a lot of exposure that doesn’t seem truly required. Couldn’t there be a way to get those arguments passed in, leaving the image transform function stateless and avoiding having it read a database it doesn’t really need to see?
The real problem here is the hard split between Lambda’s control and data planes. In this case, we have a single, “master” function that we want to create once (the control plane part) and then partially evaluate by setting some but not all of its arguments many times over. These partially evaluated functions have to be created on the data plane because there could be many of them. The cardinality has to be massive: E.g., it should be possible for every object I store in S3 to have its own ephemeral Lambda function as an event handler without ever “running out of space” and needing to get one’s Lambda limits increased.
How could this be done? One way would be to expand Lambda aliases to include partial evaluation monikers, stored in DynamoDB. When I call foo:bar(), Lambda looks up bar, sees that it wants the first argument to be 5 and the second argument to be “yellow”, and then proceeds to call foo(5, “yellow”, <the rest of the args>).
Pricing: Since a little extra work is required, it would be reasonable to charge a premium on top of the normal invocation fee to use an ephemeral (partially evaluated) function. Then again, it saves on the increasingly expensive “real” control plane operations, so perhaps it could be a wash?
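The foo:bar example maps neatly onto partial evaluation as Python already knows it; here’s a sketch of how the data plane might resolve such a moniker, with a plain dict standing in for the DynamoDB-backed alias store suggested above:

```python
from functools import partial

def foo(size, color, *rest):
    # Stand-in for the single "master" image-transform function.
    return f"thumbnail at {size}px, tint {color}, extras {rest}"

# Stand-in for the partial-evaluation monikers stored in DynamoDB:
# "foo:bar" binds the first two arguments, exactly as in the example above.
ALIASES = {
    "foo:bar": partial(foo, 5, "yellow"),
}

def invoke(name, *args):
    """Resolve an ephemeral alias (if any) and call the underlying function."""
    fn = ALIASES.get(name)
    if fn is None:
        raise KeyError(f"no such function or alias: {name}")
    return fn(*args)
```

The point of the wish is that creating an entry in `ALIASES` would be a cheap data-plane write with effectively unbounded cardinality, not a control-plane CreateFunction call.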
- Affinity / Poller “State”
Why: To motivate this request, let’s imagine a really simple (if unfortunately hypothetical) model for processing a sharded event source, such as Amazon Kinesis. Suppose we’re processing some US domestic data, such as ecommerce sales, and we have 50 shards, one per state. We want to create a dynamic business dashboard that shows sales in the last hour on a state-by-state basis, say by illustrating it on a color-coded map of the US for the product specialists. We’d also like the code to be easy…ideally, it’s as simple as “hourly_sum += record.sale_amount” with an end-of-time-window hook that writes hourly_sum to the database powering our dashboard. Oh, if only it were that easy…
Lambda’s existing Kinesis poller integration handles a lot of the annoying work of dealing with streaming data, but it leaves two key problems unaddressed. With no shutdown hooks, there’s no way to tell a function that it’s done aggregating data. With no affinity, there’s no way to aggregate (batch) anything locally, because the next batch of records from the same shard might get processed by a different instance of the same Lambda function. So today, building this “simple” running counter means turning every Lambda call into a DynamoDB read & write. It feels downright silly to have to use one persistence mechanism to aggregate another.
With just shutdown hooks added, you could hack up a solution: Use the built-in Lambda Kinesis integration as the “frontend” while sending records to shard-affinitized Lambdas on the “backend”, each with a concurrency setting of 1. This gets very complicated once you have to start dealing with shard splitting and combining, limits on the number of Lambda functions (and the overhead of creating new ones), and so forth. To really work well, it needs to be a first-class, fully built in capability.
Pricing: This is an extra constraint for Lambda’s scheduler, but doesn’t require doing anything more while the function is running. So hopefully no upcharge at all, and worst case perhaps a slightly increased invocation charge for that additional tracking.
BTW, we talked about this in the form of Kinesis, but there’s nothing special about polling event sources; this technique would also be super useful to create “reducers” of any type where a subset of events need to be routed to a subset of instances in order to execute some kind of aggregation function unique to the shard. It’s way better than exposing addressable instances outside the service, while still providing a lot of the semantic richness that people believe can only be achieved with ‘stateful’ services and containers (or servers) today.
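To see why affinity matters, here’s the per-shard aggregation sketched in Python — note that the module-level `hourly_sum` is only correct if every record from a given shard lands on the same instance, which is precisely the guarantee (plus the hypothetical end-of-window hook) being wished for:

```python
# Per-state running totals, held in instance memory. Valid ONLY under the
# wished-for affinity guarantee; without it, another instance may hold part
# of the sum, which is why today this has to be a DynamoDB read & write.
hourly_sum = {}

def handler_batch(records):
    """The 'easy' code the request describes: hourly_sum += record.sale_amount."""
    for record in records:
        state = record["state"]
        hourly_sum[state] = hourly_sum.get(state, 0) + record["sale_amount"]

def end_of_window_hook(write_to_dashboard):
    """Hypothetical hook: flush aggregates once per time window, then reset."""
    for state, total in hourly_sum.items():
        write_to_dashboard(state, total)
    hourly_sum.clear()
```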
- Lifecycle events on the control plane.
Why: It’s a bit ironic that the ultimate event handling service doesn’t itself generate useful events. Function creation, state changes, etc. should all emit events. This would make it possible to do things like hook in compliance checks on code updates, security verifications for code changes, etc. (See above for lifecycle events on invokes, aka pre- and post-hooks.)
Pricing: Should be free (of course the functions that handle the events will be charged as normal).
- Bring Your Own Agent (BYOA)
Why: There are a lot of uses for “sidecars” that run alongside code, including logging, performance monitoring, security checks, and more. BYOA would also make it easier for serverless monitoring and security companies to offer products that are both easier to use and that perform better. Language and lifecycle considerations aside, the big needs for such agents are:
- Clean initialization (easy — already exists)
- Clean shutdown (could be a minor extension to shutdown hooks; see above)
- Heartbeat / checkpoint events (relatively easy, but net-new functionality)
Pricing: Assuming shutdown and heartbeats charge as if they were normal invocations, this could look just like normal pricing.
- From “Destinations” to “Continuations”
Why: Ever hear someone say, “If you run out of time on a Lambda function, just have it call itself recursively”? Except…that’s problematic — how do you represent how much progress has been made? What happens if a function dies for some reason…how does the system recover, especially if it represents a lot of work/time? What if a transient outage causes a single invocation to fail in the chain?
It’s interesting that Lambda already has a model for handling this: Kinesis and DynamoDB streams present a function with a batch (array) of events, and can re-present subsets of those arrays to handle partial work (see “bisect batch on function error” in the Lambda docs).
Why not generalize this? Allow *any* invocation to pass in and/or out an array of events, along with an optional continuation (a function to call if the current function doesn’t finish, just like a Lambda Destination). With static typing and ephemeral functions, this should be pretty straightforward. As with many of these suggestions, static typing is an effective prerequisite to getting the service to help out, since it needs to understand the array type and how to send it to the continuation.
Pricing: Should be free; this is basically just passing the result of one function to the next in a more controlled manner that the Lambda infrastructure can help track.
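Here’s a sketch of what that generalized batch-plus-continuation contract could look like (the `process_batch`/`run_to_completion` protocol is invented for illustration; in the wished-for version, the Lambda infrastructure itself would play the driver role):

```python
def process_batch(events, budget=3):
    """Process up to `budget` events, handing the rest to a continuation.

    Returns (results, remaining). In the wished-for service, `remaining`
    would be re-presented to the continuation function automatically, the
    way Kinesis re-presents unprocessed subsets of a batch on failure.
    """
    results = [e["value"] * 2 for e in events[:budget]]  # toy "work"
    remaining = events[budget:]
    return results, remaining

def run_to_completion(events):
    """Driver standing in for the infrastructure chaining continuations."""
    all_results = []
    while events:
        results, events = process_batch(events)
        all_results.extend(results)
    return all_results
```

Because the service tracks the remaining array, progress is explicit: a transient failure mid-chain loses at most one budget’s worth of work, not the whole job.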
- Lambda Spot.
Why: I’m pretty sure that Lambda has some spare capacity at 3am in any given region. Why not offer a price break to incent usage during off-peak hours with dynamically computed discounts?
Pricing: Hopefully lower!
- Serverless Networking.
Why: I’ve talked about this extensively elsewhere. Streaming data, using Lambda to build peer-to-peer networks, high-speed data processing, gRPC between Lambdas (especially to serve up computed or cached data), and so on.
Bonus points for: More polite (i.e., not forcibly symmetric) managed NAT inside VPCs.
Pricing: Free :)
- Intra- and x-region COPY command
Why: Copying Lambdas is annoying. Surely this could be more convenient.
Pricing: Free — it’s a control plane thing.
- “Global Lambdas” (aka x-region autocopying)
Why: Sometimes you really want to copy a function and then make a little tweak, and it’s a PITA to do that today, especially if you’ve been using the UI. Similarly, keeping functions consistently deployed across regions is annoyingly hard, requiring CodePipeline and CloudFormation and then straggler control on top of all that…can’t this just be a setting? The capability already exists (see SAR public apps and Lambda@Edge) but is cleverly hidden from customers today :).
- Encrypting payloads in transit.
Why: Because it’s the enterprise and it’s 2020, folks. Service-to-service payload encryption should be “1-click” (or its programmatic equivalent). IAM should be expanded to make it possible to set policies that prohibit calling/being called by functions/event sources/other services that aren’t configured for on-the-wire encryption. Frankly, I think all newly created functions should get this turned on by default.
Pricing: Free (because mostly duration charges, and where applicable, data transfer charges, will address it).
- More Layers containing popular language libraries.
Why: Take the top 50 libraries in every language and bundle them into Lambda convenience layers. Yes, yes, yes — I know, it’s a terrible idea because versions. Horrible. Inconceivable, almost. So, can we have it please, and without all the damn whining? Thanks, love ya. Seriously — if Amazon Linux can exist, then this can exist. AWS, stop fussing and make Lambda easier to use.
Pricing: Free of course.
- Additional support for building decentralized services and applications.
Why: Data and code that has to be shared among / across companies and organizations is increasingly common…and super hard to build on AWS unless one account can own everything, which often isn’t viable. Lambda is in better shape than most services with its immutable versions (and verification support would help even more; see Idempotency above).
But just like peer-to-peer networking requires capabilities not needed in client/server connections (like NAT punching), decentralized software requires capabilities that only the cloud vendors are in a position to offer. Some additional features that would make Lambda even easier and more powerful in building decentralized applications and services are:
- Fix cross-account resource policies — Inside an account, Lambda resource policies are nicely fine grained. For example, as the owner of a function you can specify the specific workflows, APIs, and other functions that can call it. But across accounts, where you’d think AWS would care about granularity and security even more, it’s a disaster — to enable cross-account use of a Lambda function, you have to open it to *ANYTHING* in the calling account. Hello, enterprise security anyone?
- WORM functions — make it possible to mark a function so that it can’t be deleted, just as immutable data in S3 is handled. (To be useful, this fact, and its expiry, must also be easily and programmatically shareable across accounts as a critical piece of metadata in addition to the actual data.)
- Equivalent functions — create a namespace where a function — if it exists in an account — must be identical in its code & config to every other AWS account. This makes hash checks to convince yourself that you (or someone else) is running the “right” code a thing of the past. (To put it another way: This is like a provable COPY command that can cross accounts, but without the operational hassles of actually doing the copy; you just have to say you want to allow a subdirectory of the namespace to be “mounted” into your Lambda account.) Ideally this doesn’t count against your storage limit! Also, super cool feature for SAR to consider…
- Permission monotonicity and immutability — a bit you can set when creating a function that allows its permissions (both resource and role) to contract but never expand. (This is really a broad ask for all IAM, but I’ll list it here for completeness, as having it even just for Lambda would be super beneficial in building decentralized databases.) Note that this is very different than permission boundaries: There is no explicit document required, and the owner of the function / account cannot subvert the setting; all they can do is delete the function (unless it’s been WORMed as described above). Similarly, a set-on-create bit that makes the role and resource policy of a function completely immutable, even by its owner. (The function can still be deleted unless it’s WORMed.) These make it possible to give another company/organization durable, predictable access to a function with the confidence that not only won’t its code change, but neither will its capabilities. It’s a critical set of guarantees that makes AWS a platform for decentralized systems and application coding.
Pricing: Free. This is all security related or control plane magic.
- Serverless Redis
Why: Low latency storage that can seamlessly scale in an impedance-matched fashion with Lambda’s compute would be a game changer. EFS and a “serverless” Redis would make it possible to convert existing enterprise applications to Lambda that simply can’t be moved out of containerland today.
Pricing: Great question, but let’s focus on having it exist first. Hopefully just by making it 100% utilized by definition (like S3), it will be super cost effective compared to the inevitably overscaled “infrastructure Redis” of today.
Bonus Round: Everything else (outside of Lambda) that would make Lambda better
API Gateway
- Built-in support for gRPC/protobuf (similar to how websockets work).
- Put humpty-dumpty together again: The v1/v2 API split is a mess.
- Add resource policies to API Gateway websockets.
- Make it possible to get the AWS account number when IAM auth is in use.
EventBridge / CloudWatch Events / SNS
- Ordered events (the SQS FIFO model would be fine). C’mon already…not everything is arbitrarily reorderable and at-least-once-able.
- “Auto-region” copying for private as well as public apps.
- Lifecycle events.
- Lambda@Edge done right aka CloudFlare Workers aka “Serverless Wavelength”
- Immutable permissions (can be set only on create; irreversible even by owner)
- Monotonically contracting permissions (set any time, but irreversible once set)
- A serverless backend
- Seamless interoperability with VS Code
- A shell mode that uses Lambda to run an Amazon Linux bash shell (so I can stop having to own and maintain an EC2 instance just for the privilege of running AWS commands)
Other AWS Services That Lambda Connects To:
- X-Ray support
- Resource policies, especially fine-grained x-account ones
- In-transit / on-the-wire encryption support
- Built-in chaos testing