Thanks for the response, Tim!
I tried to keep the article concise, probably at too high a cost in clarity. I think maybe it deserves to be a whitepaper with fully fleshed-out use cases to really enumerate the breadth of pain points such a service would solve.
There are two parts to my response: more detail on the pain points, and then answers to your specific questions.
First, some terminology:
I’m talking primarily here about service discovery within applications (i.e., between microservices), rather than discovery for clients of the application, which most likely use some other discovery mechanism to look up the endpoint for the application-level API (note: a future article covers how using a single client-facing API Gateway to a set of independent microservices has its own pain points).
There are at least three reasons to have multiple copies or versions of a given piece of code deployed at the same time. First, environments like QA, staging, and prod may be stages of a pipeline (CI/CD or not). Second, we may have multiple pipelines, say for long-lived development branches (at iRobot, a new product may need its own cloud software before it stabilizes enough to merge with other projects working on updates to existing robot and cloud software), and we may also have sandbox environments that are disconnected from any pipeline. Third, within an environment, a deployment is a “constellation” of Lambda functions and other cloud resources that interact with each other, and we may have multiple deployments within an environment in service of a blue/green or red/black update. (A note on this: while FaaS often gets talked about in terms of “functions as the unit of deployment”, what I’ve seen is that constellations, e.g., a CloudFormation stack, are the real unit of deployment, because any given function is coupled to the resources it depends on and perhaps to other functions it calls.)
Let’s call a new deployment within an environment “evolution”, a deployment to the next step in a pipeline “promotion”, and the existence of multiple pipelines and isolated environments “separation”.
Next, the pain points:
We have found that stages in API Gateway and versions/aliases in Lambda really only support the use case of promotion without evolution, and even for that, there’s still some friction. It’s not possible to use stages or aliases to perform phased or incremental rollouts; they provide indirection, but the pointer can only be updated in place, all at once. I wrote an article recently about how this could be improved.
Additionally, the lack of upsert in the Lambda API means that separation is not well-supported, either. If multiple separated environments use the same Lambda names, but separate themselves using aliases, the first environment must know to create the Lambda, and when tearing down an environment, it must be known whether other environments still exist that use that Lambda name (if so, it can’t be deleted).
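To make the bookkeeping burden concrete, here is a minimal in-memory sketch of what each environment would have to track. It does not call AWS at all; the class name, method names, and returned action strings are hypothetical and exist only for illustration.

```python
class SharedFunctionBookkeeping:
    """Tracks which environments (aliases) currently share each function name.

    Illustrative only: the returned strings name the action a deployment
    tool would have to choose, they are not AWS API calls.
    """

    def __init__(self):
        self.aliases = {}  # function name -> set of environment aliases

    def deploy(self, function_name, environment):
        # Without upsert, the caller must know whether it is the first user.
        first = function_name not in self.aliases
        self.aliases.setdefault(function_name, set()).add(environment)
        return "create-function" if first else "update-function"

    def teardown(self, function_name, environment):
        # The function may only be deleted once no other environment uses it.
        environments = self.aliases[function_name]
        environments.discard(environment)
        if environments:
            return "delete-alias-only"
        del self.aliases[function_name]
        return "delete-function"
```

The point is that this shared state has to live somewhere outside any single environment, which is exactly what separated environments are supposed to avoid.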
API Gateway suffers from similar restrictions, not least of which is that there’s only one “staging” area to set up an API before deployment, leading to crosstalk if multiple environments are trying to create their own resources for their own deployments/stages at the same time. Swagger import is more transactional, but only works at the API level, not the deployment or stage level.
I think this all applies whether “deployment” means a single microservice, or a collection of microservices that form an application. For an individual microservice deployment, all the resources within it are probably known at deployment time, and so it’s most likely feasible to wire them together directly (e.g., putting resource ARNs in env vars), and it is probably possible to eliminate circular references (which make deployment-time injection impossible). But for an application, each microservice would be deployed independently, and circular references between microservices may be necessary — which means that for any given microservice, there may be information it needs that is not available at deployment time. This means that some information has to be looked up at runtime.
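One way to see why circular references rule out deployment-time injection is to treat it as an ordering problem: each service can only be given the ids of services that already exist. The sketch below uses Python’s standard graphlib module; the service names and the two helper functions are made up for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def deployment_order(depends_on):
    """Return an order in which every service is deployed after the
    services it references. Raises CycleError if no such order exists."""
    return list(TopologicalSorter(depends_on).static_order())


def can_inject_at_deploy_time(depends_on):
    """True if all references can be resolved before each deployment."""
    try:
        deployment_order(depends_on)
        return True
    except CycleError:
        return False


# Hypothetical services: each key maps to the services it needs to call.
acyclic = {"orders": {"inventory"}, "inventory": {"pricing"}, "pricing": set()}
cyclic = {"orders": {"inventory"}, "inventory": {"orders"}}

print(deployment_order(acyclic))          # pricing first, orders last
print(can_inject_at_deploy_time(cyclic))  # no valid order exists
```

In the cyclic case, at least one of the two services has to learn the other’s physical id after it has already been deployed, which is the runtime lookup.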
(Aside: using API Gateway as the interface between microservices in an application has its own issues, which I’ll cover in the next article.)
This runtime lookup, then, is the mapping from logical id to physical resource id. And while the environment or deployment id could be injected into a Lambda at deployment time, and used to index into an external store, this requires the Lambda code to be more aware of infrastructure/operational details than I think is useful. I would rather the lookup use the “who” (i.e., IAM) or “where” (e.g., via tags) of the calling Lambda to do this indexing, because this can be mapped very directly to the deployment and environment concepts above, and leaves the Lambda code completely agnostic to it.
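As a rough sketch of the mapping I have in mind (an in-memory stand-in, not an existing service), the registry below indexes on the caller’s identity first and the logical id second. The class, the role ARNs, and the resource ids are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DiscoveryRegistry:
    # caller role ARN -> deployment id (the "who" decides the deployment)
    role_to_deployment: dict = field(default_factory=dict)
    # (deployment id, logical id) -> physical resource id
    resources: dict = field(default_factory=dict)

    def register_caller(self, role_arn, deployment):
        self.role_to_deployment[role_arn] = deployment

    def register_resource(self, deployment, logical_id, physical_id):
        self.resources[(deployment, logical_id)] = physical_id

    def resolve(self, caller_role_arn, logical_id):
        """Map a logical id to a physical id based on who is asking."""
        deployment = self.role_to_deployment[caller_role_arn]
        return self.resources[(deployment, logical_id)]


registry = DiscoveryRegistry()
registry.register_caller("arn:aws:iam::123456789012:role/orders-blue", "blue")
registry.register_caller("arn:aws:iam::123456789012:role/orders-green", "green")
registry.register_resource("blue", "InventoryQueue", "inventory-queue-blue")
registry.register_resource("green", "InventoryQueue", "inventory-queue-green")
```

The calling code only ever asks for "InventoryQueue"; which deployment it belongs to never appears in the function’s own code or configuration.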
So, to finally answer your questions:
I could probably make this service using Cloud Directory — its caller-identity features are around permissions, rather than values, but I’m sure I could figure out a way to cobble it together.
However, I could definitely build a service for runtime lookup on top of API Gateway! The request context provides the caller identity (for any IAM caller, unlike the Lambda context, which only gives the identity if the invoker is Cognito; I would love parity there), and that could be used to index into data stored in DynamoDB, possibly even skipping a Lambda call by using the service integrations (or even, theoretically, in a mock integration on API Gateway itself!). I could use a custom domain to make this API Gateway endpoint well-known. This solution would work for server-based architectures as well!
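Here is a minimal sketch of what the Lambda-backed version might look like, assuming a Lambda proxy integration with IAM authorization, where the caller’s ARN arrives in requestContext.identity.userArn. The logicalId path parameter, the role names, and the TABLE dict (a stand-in for a DynamoDB table) are assumptions for illustration.

```python
import json

# Stand-in for DynamoDB: (caller role name, logical id) -> physical id.
TABLE = {
    ("orders-blue", "InventoryQueue"): "inventory-queue-blue",
    ("orders-green", "InventoryQueue"): "inventory-queue-green",
}


def role_name_from_arn(user_arn):
    """Extract the role name from an STS assumed-role ARN such as
    arn:aws:sts::123456789012:assumed-role/orders-blue/session-name."""
    resource = user_arn.split(":", 5)[5]
    kind, _, rest = resource.partition("/")
    if kind != "assumed-role":
        raise ValueError("caller is not an assumed role")
    return rest.split("/")[0]


def handler(event, context, table=TABLE):
    # The caller never states its deployment; it is derived from identity.
    try:
        user_arn = event["requestContext"]["identity"]["userArn"]
        logical_id = event["pathParameters"]["logicalId"]
        physical_id = table[(role_name_from_arn(user_arn), logical_id)]
    except (KeyError, IndexError, ValueError, TypeError):
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
    return {"statusCode": 200, "body": json.dumps({"physicalId": physical_id})}
```

Keying on the role name rather than the full ARN is just one choice here; it ignores the session name so that every invocation under the same role resolves the same way.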
But I shouldn’t have to. While the architecture of the solution is straightforward, building the functionality for administration of the data, returning nondeterministic values for incremental rollouts, etc. is complicated. The fact that server-based architectures on EC2 use Consul, Envoy, Istio, etc. indicates to me that this is a need in that space unmet by AWS; at a fundamental level, why would serverless architectures not need a direct equivalent to this architecture component? Everybody I talk to has their own version of this in their application, and making it ourselves doesn’t add value for us.
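For what “nondeterministic values for incremental rollouts” could mean in the simplest case, here is a sketch of a weighted pick between deployments using the standard library’s random.choices. The function name, the deployment names, and the weights are hypothetical, and a real service would also need stickiness, auditing, and safe weight updates, which is where the complexity comes from.

```python
import random


def pick_deployment(weights, rng=random):
    """Pick a deployment name at random, in proportion to its weight.

    weights: dict of deployment name -> non-negative weight.
    rng: the random module or a random.Random instance (for repeatable tests).
    """
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical rollout state: send roughly 10% of lookups to the new deployment.
rollout = {"blue": 90, "green": 10}
target = pick_deployment(rollout)
```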