Skip Lambda, Save Data to DynamoDB Directly Using API Gateway; Process Later With Streams
AWS’ API Gateway allows you to connect it directly to (proxy) many other AWS services. This article discusses doing this with DynamoDB, as a way to create an API that adds data to DynamoDB without needing a Lambda function. There are existing AWS docs on using API Gateway as a proxy for DynamoDB; however, as usual, those only cover how to do this in the AWS console. In particular, I’ll show how I set this up using the Serverless Framework (or CloudFormation, as the bulk is really just CloudFormation code), and how you transform the web request’s JSON so it can be PUT directly into DynamoDB. Finally, I’ll talk about how to then do post-processing of the data via DynamoDB Streams.
The use case I have is an authenticated web API that takes in a potentially significant volume of events from mobile devices. This data will be stored in DynamoDB. As an additional constraint, the mobile app is sending via regular HTTP web calls, and doesn’t have the ability to use GraphQL (i.e. AppSync isn’t a possibility for this case). Finally, I want this particular API to be simple and very fast, and all the (time consuming) processing of the data will be done async. Thus, we can simply have the data come in via API Gateway and get injected directly into DynamoDB (with some basic data transformation, and integration of the user’s ID).
This alleviates the need for a Lambda, and avoids the cost of that. Not that Lambda is that expensive, but if this does wind up scaling to say millions (or hundreds of millions) of events per day, then that will be a meaningful savings. Furthermore, this is more maintainable and a simpler architecture, as it’s one less component to build and maintain.
Update 4 Jan 2021
In the full gist (see link below in the first sentence of “Show Me Already”), I added a CloudFormation resource for AWS::ApiGateway::Deployment, as I hadn’t had that in there, and without it, your API won’t actually get deployed!
Update 25 Nov 2020
A quick update since I originally published this story. Ben Duong pointed out a Serverless Framework plugin, Serverless Apigateway Service Proxy. However, it doesn’t support DynamoDB’s batch writes, so it cannot be used in this case. I’m also not sure how it handles auth needs. However, if you are simply taking a single event/record into your API, it should cover it.
Show Me Already
A full serverless.yml config file for this can be found in this gist. Ultimately, the bulk of this is CloudFormation within Serverless Framework config (if there’s a plugin I missed, or some more direct Serverless way to do it, let me know!). I refer to line numbers from this gist below. The key parts are:
- Cognito user pool (optional/may not be needed for your case) and IAM policies
- DynamoDB table configuration
- API Gateway API configuration
- API Gateway VTL mapping template
For this example, the JSON body in the POST request to this API looks like the following:
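A sketch of that payload (the attribute names here are illustrative; see the gist for the actual example):

```json
{
  "events": [
    {
      "time_utc": "2020-11-25T14:30:00Z",
      "sensor_name": "accelerometer",
      "value": "0.92"
    },
    {
      "time_utc": "2020-11-25T14:30:05Z",
      "sensor_name": "gyroscope",
      "value": "1.07"
    }
  ]
}
```

Note that the body is a batch: a single top-level list of events, each a flat object of attributes.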
The Interesting Parts
To me, the interesting parts for this whole thing really come down to how to do the VTL mapping template (i.e. take an incoming HTTP request’s payload and transform it into what DynamoDB needs to do an insert), and how to get the Cognito user ID and include that in the data (since all the authentication happens “automatically” for you via API Gateway’s Cognito integration). Well, and of course how to do this all in code/Serverless instead of via the AWS console.
A Note About DynamoDB Batches
A key thing to note is that we use batch writes for Dynamo. These are limited to 25 items at a time. As such, our mobile clients are limited to sending events in batches of 25. But, the key is that it’s batch, even if it’s a batch with just one event. You’ll see more on this below with the VTL template iterating the incoming events.
First up is Cognito (line 51). If you do not need authentication on your API, you can skip this. There is a fair bit of setup at the beginning of the resources section to configure a user pool and the policies needed for this.
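A minimal sketch of that user pool setup (resource and property names here are illustrative; the gist has the full version, including the IAM policies):

```yaml
resources:
  Resources:
    # Cognito user pool that will back the API's authorizer
    YourProductUserPool:
      Type: AWS::Cognito::UserPool
      Properties:
        UserPoolName: your-product-users
        AutoVerifiedAttributes:
          - email
    # App client the mobile app uses to sign in and obtain tokens
    YourProductUserPoolClient:
      Type: AWS::Cognito::UserPoolClient
      Properties:
        ClientName: your-product-app
        GenerateSecret: false
        UserPoolId:
          Ref: YourProductUserPool
```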
Next, if you look in the “API Gateway + VTL template to put events into above DynamoDB table” section (line 193), you’ll see a YourProductAPIAuthorizer section. This sets up the use of Cognito user authentication for the API Gateway API.
Creating a DynamoDB table (line 170) is standard, and you can find plenty of docs in Serverless or CloudFormation for it. I recommend checking out the Serverless DynamoDB Local plugin as well, which makes it easy to use a local DynamoDB for testing. You’ll see the table creation under the “DynamoDB events table” comment. This is a very simple one with just a single PK (UserID) and SK (TimeUTC), but sufficient for this example. You’ll note that it is configured in full serverless mode via the BillingMode: PAY_PER_REQUEST line.
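A sketch of that table definition (the table name is illustrative; the StreamSpecification is there for the post-processing discussed later):

```yaml
    # DynamoDB events table: simple PK/SK pair, on-demand billing
    EventsTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: your-product-events
        BillingMode: PAY_PER_REQUEST
        AttributeDefinitions:
          - AttributeName: UserID
            AttributeType: S
          - AttributeName: TimeUTC
            AttributeType: S
        KeySchema:
          - AttributeName: UserID
            KeyType: HASH      # partition key
          - AttributeName: TimeUTC
            KeyType: RANGE     # sort key (ISO timestamps sort correctly as strings)
        StreamSpecification:
          StreamViewType: NEW_IMAGE
```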
The meat of things :) This is under the comment “API Gateway + VTL template to put events into above DynamoDB table” in the resources section (line 193).
It starts off with an IAM role setting up what actions API Gateway is allowed to perform on DynamoDB, and specifically just for the EventsTable. In this case, it’s allowing 5 actions, with the most important being the BatchWriteItem action, as that’s what will actually do the insert (of multiple events, in this case).
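A sketch of that role (the exact set of allowed actions in the gist may differ; BatchWriteItem is the essential one, and the Resource is scoped to just the events table):

```yaml
    # Role API Gateway assumes when proxying requests to DynamoDB
    APIGatewayDynamoDBRole:
      Type: AWS::IAM::Role
      Properties:
        AssumeRolePolicyDocument:
          Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Principal:
                Service: apigateway.amazonaws.com
              Action: sts:AssumeRole
        Policies:
          - PolicyName: events-table-write
            PolicyDocument:
              Version: '2012-10-17'
              Statement:
                - Effect: Allow
                  Action:
                    - dynamodb:BatchWriteItem   # the action that does the insert
                    - dynamodb:PutItem
                    - dynamodb:GetItem
                    - dynamodb:Query
                    - dynamodb:DescribeTable
                  Resource:
                    - Fn::GetAtt: [EventsTable, Arn]
```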
Next you’ll see the Authorizer (line 230). You’ll notice the IdentitySource: method.request.header.Authorization line, which means the API uses the Authorization header. More details can be found in the CloudFormation docs for the API Gateway Authorizer.
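A sketch of that resource (the authorizer name and the referenced API/user pool resource names are illustrative):

```yaml
    YourProductAPIAuthorizer:
      Type: AWS::ApiGateway::Authorizer
      Properties:
        Name: your-product-authorizer
        Type: COGNITO_USER_POOLS
        # The JWT is expected in the Authorization header
        IdentitySource: method.request.header.Authorization
        RestApiId:
          Ref: YourProductAPI
        ProviderARNs:
          - Fn::GetAtt: [YourProductUserPool, Arn]
```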
Then comes the EventsResource (line 253) item, which defines the URL path of the API, events in this case. Thus, the API URL path is /events.
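A sketch of that resource (the parent API’s resource name is illustrative):

```yaml
    EventsResource:
      Type: AWS::ApiGateway::Resource
      Properties:
        RestApiId:
          Ref: YourProductAPI
        ParentId:
          Fn::GetAtt: [YourProductAPI, RootResourceId]
        PathPart: events   # makes the API path /events
```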
Following that is the real meat of the API, the EventsAPI resource (line 257). This defines the Authorizer to use, the HTTP method (POST, line 265), and then the really interesting part, the VTL template, RequestTemplates, that maps the incoming JSON to a DynamoDB request:
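A sketch of that template (the table name and the event attribute names are illustrative; the gist has the real thing):

```vtl
{
  "RequestItems": {
    "your-product-events": [
      #foreach($event in $input.path('$.events'))
      {
        "PutRequest": {
          "Item": {
            ## UserID comes from Cognito, not the request body
            "UserID": { "S": "$context.authorizer.claims.sub" },
            "TimeUTC": { "S": "$event.time_utc" },
            "SensorName": { "S": "$event.sensor_name" },
            "Value": { "N": "$event.value" }
          }
        }
      }#if($foreach.hasNext),#end
      #end
    ]
  }
}
```

Each element of this structure is discussed in the notes below.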
After that come the MethodResponses. These were a little confusing and unclear to set up at first. The IntegrationResponses handle the proxy/request to Dynamo, mapping its responses for API Gateway, which then get mapped to the MethodResponses, which API Gateway uses for the actual HTTP response.
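A sketch of how those two pieces relate inside the Method resource (status codes and selection patterns here are illustrative):

```yaml
    EventsAPI:
      Type: AWS::ApiGateway::Method
      Properties:
        # ...Authorizer, HttpMethod: POST, and RequestTemplates omitted here...
        Integration:
          # Maps DynamoDB's responses to API Gateway status codes
          IntegrationResponses:
            - StatusCode: 200
            - StatusCode: 400
              SelectionPattern: '4\d{2}'   # client-side Dynamo errors
            - StatusCode: 500
              SelectionPattern: '5\d{2}'   # server-side Dynamo errors
        # Each integration response must have a matching method response
        MethodResponses:
          - StatusCode: 200
          - StatusCode: 400
          - StatusCode: 500
```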
A few key notes:
- RequestItems (line 277) is the root element of a DynamoDB BatchWriteItem operation. As mentioned above, you can have at most 25 individual requests within this (and these can differ, e.g. you can mix Put and Delete, although here we’re obviously only doing PutRequest items). We’re not enforcing this batch size limit here, which is one downside: you’re relying on the clients that call this to behave properly. When you do send more than 25, the Dynamo request will fail, and it’ll fail the API Gateway call/return an error. This is something to consider when doing these proxy-style APIs, as you clearly get less control over how you handle errors and how you might want to respond in such a case. I believe there is likely a way with the VTL template to map it differently, or maybe immediately return an error if the count of items is higher, but I haven’t explored that yet.
- A VTL #foreach loop (line 279) is used to iterate over the incoming list of events, and map each one to a PutRequest. Note that the incoming events are just a simple JSON array/list with a single level of attributes, but if they had nested elements, you’d just use this same dot syntax to traverse deeper as needed.
- The user’s Cognito ID can be extracted from the $context.authorizer.claims.sub element (line 283). As you can see, this inserts additional data for DynamoDB that wasn’t part of the original HTTP request’s JSON, as well as showing how to get at the Cognito data.
- The TimeUTC element (line 284) is a string (in DynamoDB), and the incoming JSON already has it in standard ISO format, so it can just be set directly like this. It is used as the Sort Key in this table, so having it in ISO format makes it properly sortable.
- The rest of the elements are just a straight mapping from the incoming JSON to the value for the DynamoDB attribute. Note that of course you can use different names for the DynamoDB attributes vs. the JSON attributes, e.g. the incoming sensor_name can be stored under a differently named DynamoDB attribute.
- Lastly, a subtle one. Note the code #if($foreach.hasNext),#end (line 293). That’s a way to add a trailing comma after each item in the batch of items for the DynamoDB request. Dynamo is particular, though, and does not allow a comma after the last item, which is why this is wrapped in the conditional (i.e. only add the comma if there will be more items after it). Without this, DynamoDB will fail your request.
Post-Processing via DynamoDB Streams
While not required, as mentioned early on, I am doing asynchronous post-processing of these incoming events. This is handled via DynamoDB Streams. This setup involves a Lambda function that listens to the DynamoDB stream, which provides all events from Dynamo (insert, delete, update, etc.). Thus, in my case, for this post-processing, you do need to filter to just the INSERT events.
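A sketch of wiring that up in serverless.yml (function and handler names are illustrative; this assumes the table was created with a StreamSpecification, and the INSERT filtering happens in the handler code by checking each record’s eventName):

```yaml
functions:
  processEvents:
    handler: handler.processEvents
    events:
      - stream:
          type: dynamodb
          arn:
            Fn::GetAtt: [EventsTable, StreamArn]
          # Up to 1000 records per invocation for DynamoDB streams
          batchSize: 1000
          startingPosition: TRIM_HORIZON
```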
The post-processing we do takes longer and is fairly involved, and thus I wouldn’t want it being done synchronously on receipt of each of these events (never mind on a batch of 25 events). Therefore, this architecture creates a very simple API that just worries about storing the raw data. Clients either get the format of that data right or they don’t, which is about the only error they can get from the API. Then later, we process these events (which is more time consuming).
You may be thinking — wait, you said we eliminate the need for a Lambda, but now you have one doing the post-processing. True! But you may or may not need that step, AND, the key here is that you are avoiding doing potentially time consuming processing during the API call (thus creating a slow/long response time for your API). Furthermore, with the streams API, you can fetch up to 1000 records for a single Lambda invocation (vs the limit of 25 on the incoming/batch write aspect). Therefore, you potentially could have 40x fewer Lambda invocations (if you can process all 1000 records in the 15 minute Lambda time limit). That said, the real key here for me was not doing the heavy processing we do during the API call, keeping the API itself very fast and having the fewest possible error scenarios.
An interesting note about this as well is that the way DynamoDB streams work, they are sharded by the PK (primary key), so you can be sure to get events, at least by primary key, in the proper order. i.e. in my case, I’m sure to be processing events for a given user in the order they occur. It is a nice benefit and one to consider when looking at alternatives such as queues and SNS, etc. See this AWS blog article, “How to perform ordered data replication between applications by using Amazon DynamoDB Streams.”
Pros and Cons
Obviously not all your APIs can or should be built this way. But, it’s definitely an interesting ability that AWS has provided. Combining this with DynamoDB streams to post-process these is a great option as well.
The primary cons of this, in my mind, are the limited error checking and data manipulation of VTL templates vs. a full code solution. If this is a public API where you have no control of the clients making calls, that error checking alone may be worth inserting a Lambda. The other con to me is use of VTL templates in general, and testing this. I’ll be the first to admit that this is more difficult to test. That said, I’ve found that this is one place the AWS console is handy, as they have a way to directly test such VTL in this setup. I’d rather have a unit test in my code, but at least there’s something.
The pros for me are about the overall architecture and fast API responses for this particular use case. Due to the heavy processing involved for these events, I would have to do that async regardless. This architecture simply leverages the abilities of the AWS platform, and makes the API itself very simple.
If you have better or different ways to orchestrate this in Serverless, or other suggestions, let me know!