The backbone of a serverless app: Lambda functions and DynamoDb tables (Detecting Paris’ locked bicycle stations 2/5)

Jean Baptiste Muscat · Published in CodeX · Aug 2, 2021

This series of articles is about me spending way too much time trying to solve a niche problem (detecting locked bicycle stations in Paris, see Part 1) while learning how to use the AWS Serverless stack. To find the other articles, skip to the bottom of the page.

Fetching the Velib’ data

To be able to detect the locked Velib stations, the first step is to ingest the Velib API data. As we have seen in Part 1, this API exposes two main endpoints:

  • one that returns each station’s current content: how many bikes are docked there and the station’s status;
  • one that returns each station’s characteristics (such as its name and location).

To accurately track the evolution of each station’s content, I need to call the first endpoint every minute. The other one can be called less often, as the characteristics rarely change. To do so, I’ll create two Lambda functions: FetchStationsContent and FetchStationsCharacteristics. The corresponding data will be stored in DynamoDb tables. To “trigger” my functions every minute (or hour), I will use EventBridge rules.

Here is what it would look like:

Data ingestion pipeline

First function

Lambda functions can be written in virtually any language. For some (like JavaScript or Python), you can even code your function straight into the AWS web console. That’s fine for a quick prototype, but not really recommended for bigger projects.

The web editor for a JavaScript function

Most of the code for the first Kafka prototype was written in Java. But I will use Angular for the frontend, and I want to use a single language if possible, so I will write my functions in TypeScript. TypeScript is not officially supported by AWS Lambda, but JavaScript is (you can choose a Node.js runtime for your function), so I will simply need to transpile my TypeScript code into JavaScript and upload it to my Lambda.

Let’s start working on the first function (fetching the stations’ content). I’ll first define some types representing the domain objects.
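They could look something like this (a minimal sketch; the field names are my assumptions, not the exact ones from the Velib API):

// The content of a single station at a given time.
export interface StationContent {
  stationCode: string;
  electricBikes: number;    // electric bikes currently docked
  mechanicalBikes: number;  // mechanical bikes currently docked
  isRenting: boolean;       // can bikes be rented right now?
  isReturning: boolean;     // can bikes be returned right now?
  deltaBikes?: number;      // filled in later (see “Computing the delta”)
  inactiveSince?: string;   // filled in later (see “Computing the delta”)
}

// The content of the whole network, as fetched once per minute.
export interface StationsContent {
  datetime: string; // ISO-8601 timestamp of the fetch
  byStationCode: { [stationCode: string]: StationContent };
}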

Then I will create a basic REST client using Axios.

Simplified Axios client
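A minimal sketch of what that client could look like (the URL is a placeholder; the real endpoint and payload shape come from the Velib open-data portal and need a bit of mapping):

import axios from 'axios';
import { StationContent, StationsContent } from './domain';

// Placeholder URL: the real endpoint is published on the Velib open-data portal.
const STATIONS_CONTENT_URL = 'https://velib.example.com/station_status.json';

export async function fetchStationsContent(): Promise<StationsContent> {
  const response = await axios.get(STATIONS_CONTENT_URL);
  // Assumption: the payload is a list of stations that maps to our domain type.
  const stations: StationContent[] = response.data.data.stations;
  const byStationCode: { [code: string]: StationContent } = {};
  for (const station of stations) {
    byStationCode[station.stationCode] = station;
  }
  return { datetime: new Date().toISOString(), byStationCode };
}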

The Lambda function itself simply uses the client and logs the content. I will save it in the database later.
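A sketch of the handler, reusing the client from above (logging only, for now):

import { fetchStationsContent } from './stations-content-client';

// Entry point invoked by the Lambda runtime.
export const lambdaHandler = async (): Promise<void> => {
  const content = await fetchStationsContent();
  console.log(`Successfully fetched ${Object.keys(content.byStationCode).length} stations' content`);
  // Saving to DynamoDb will come later.
};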

The function lambdaHandler is what will be called by the Lambda infrastructure. After that, it’s a simple matter of transpiling this into JavaScript, zipping the resulting folder, creating the function, and uploading the code to it using the AWS web console.

From there I can manually run the function to make sure everything is OK. The log of each run will be visible in CloudWatch (the AWS logging and alerting service). I can see that I successfully fetched 1,410 stations’ content. Perfect!

Function logs in CloudWatch

To run the function every minute, I need to go back to my function’s page and follow the procedure to add a trigger. I will then choose to create an EventBridge rule and specify a schedule expression. Cron expressions can be used, as well as simpler “rate” expressions.

A simple EventBridge expression for “once every minute”.
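That expression is simply:

rate(1 minute)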

If I go back to my function’s page, I can see that the EventBridge trigger has been attached to it. The function will now run every minute.

My Lambda details page on the AWS web console

Perfect. Now I have a very simple TypeScript lambda that will automatically run every minute and fetch the current data about each Velib station. Next, I need to store that data somewhere.

DynamoDb, a serverless document database

DynamoDb is a serverless key/value and document database. The data is stored in tables. Each table contains items, which are simple JSON payloads. For each table, you must define which fields of the items will be used as keys to uniquely identify each item. You can choose between two options:

  • a single partition key: for example a country code if you store countries' characteristics. This is like a simple primary key in SQL.
  • a partition key and a sort key: for example a stock identifier and a timestamp if you store historical prices for different stocks. This is like a composite primary key in SQL.

By default, DynamoDb allows you to perform CRUD operations based on the keys you chose for your table. You can also scan all the items or run queries with more complex filters that will, under the hood, also scan the items or leverage the chosen keys if possible. As you pay for each item scanned or queried (more on that later), it’s important to avoid doing full scans of your table. And so, having a sort key is especially useful if you want to be able to fetch your items in a given order. For example, if you want to load the “last” item, you could use the insertion timestamp as the sort key and do a simple query ordered by the timestamp with a limit of 1. Without a proper sort key, you would have to scan through all the items and sort them manually in your code.
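For example, with the aws-sdk DocumentClient, fetching that “last” item can be a single query. A minimal sketch, using the stock example from above (the table and key names are placeholders):

import { DynamoDB } from 'aws-sdk';

const documentClient = new DynamoDB.DocumentClient();

// Load the latest price for a given stock, leveraging the timestamp sort key.
async function getLastPrice(stockId: string) {
  const result = await documentClient
    .query({
      TableName: 'stocks', // placeholder name
      KeyConditionExpression: 'stockId = :stockId',
      ExpressionAttributeValues: { ':stockId': stockId },
      ScanIndexForward: false, // order by the sort key, descending
      Limit: 1, // we only want the most recent item
    })
    .promise();
  return result.Items ? result.Items[0] : undefined;
}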

As the choice of keys impacts how the items are physically stored, changing which fields are used as keys is not possible once the table has been created. You would need to create a new table and migrate your data to it.

But you can define additional indexes if you often have to query your items using fields other than the keys:

  • Local Secondary Indexes (LSI) allow you to define an additional sort key (but you must keep the same partition key).
  • Global Secondary Indexes (GSI) allow you to define a whole new pair of partition and sort keys. In fact, a GSI is almost a duplicate of your table, stored differently. That means a GSI can lag a little behind and is considered only eventually consistent with the main table.

DynamoDb also offers additional quality-of-life features, like automated removal of old items. For that, you need to add a field to your items that will contain the timestamp at which the items should be deleted.

A ‘stock’ table with partition and sort keys along with a time-to-live field for automated deletion a year later.
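Setting such a field from TypeScript could look like this (a sketch; ‘ACME’ and the field names are mine, and the TTL attribute must hold a Unix epoch timestamp in seconds):

// TTL value: one year from now, as a Unix epoch in seconds.
const oneYearFromNow = Math.floor(Date.now() / 1000) + 365 * 24 * 60 * 60;

const item = {
  stockId: 'ACME',                   // partition key (a hypothetical stock)
  timestamp: '2021-08-02T10:00:00Z', // sort key
  price: 42.5,
  expirationDate: oneYearFromNow,    // the TTL field DynamoDb will watch
};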

DynamoDb pricing or the way of the constant flow

As a serverless service, DynamoDb’s pricing model is not based on the underlying server instances but on how you use the service. You pay for the quantity of data you store and for how many read and write operations you perform. More precisely, you have two choices:

  • on-demand: you pay for the actual number of reads/writes you have done this month
  • provisioned: you define in advance how many reads/writes per second your table will need. If you go over, DynamoDb will throw an error, stopping you from overconsuming.

On-demand seems way simpler but is also more costly: for the same number of reads/writes, a provisioned table costs about six times less. Furthermore, the DynamoDb free tier only covers provisioned tables, so that’s what I will use.

The free tier for DynamoDb and Lambda. Note that only “provisioned” tables are covered.

So, how does this provisioning stuff work, concretely?

For each table, you need to define a number of RCUs (Read Capacity Units) and WCUs (Write Capacity Units). You can also set up automated scaling of those two properties if you want them to be adjusted automatically, up to a given limit.

Both RCUs and WCUs represent the number of reads and writes your table can sustain per second. To be more precise:

  • 1 RCU = 1 consistent read of a 4KB item per second, or 2 non-consistent reads of a 4KB item per second. (DynamoDb uses a non-locking replication process under the hood, which means a write operation can take a few seconds to propagate to all the nodes in the cluster. So, depending on your use case, you may absolutely need consistent data, or you may be fine with getting slightly stale data.) If you read items larger than 4KB, you will need 1 RCU for each additional 4KB block.
  • 1 WCU = 1 write of a 1KB item per second. If you write items larger than 1KB, you will need 1 WCU for each additional 1KB block.

Example:
If you read (consistently) two items of 8KB each, every second, it would take 4 RCUs. Reading (inconsistently) the same items would take only 2 RCUs. Writing them would take 16 WCUs.
Sadly, if you read/write an item of 0.01KB every second, you will still need at least 1 RCU and 1 WCU, as 4KB and 1KB are the minimum sizes of a read and a write operation respectively.

But planning your reads and writes second by second seems near impossible. That’s why DynamoDb allows “bursts” of reads and writes: the capacity you don’t use can accumulate for up to 300 seconds (5 minutes).

Example:
With 1 RCU, if you don’t read anything for 2 minutes (120 seconds), you can then read (consistently) 120s * 1 RCU * 4KB = 480KB worth of data from your table in a short period of time.

So, if you know in advance the amount of data that will be read from and written to your table about every 5 minutes, you should be able to select the perfect number of RCUs and WCUs. In my case, the data represents an (almost) fixed number of stations and is fetched every minute, so that should be doable.

Designing for cost

With only 25 RCUs and 25 WCUs for the whole application, I really need to optimize each table.

Let’s look at the first table that will store each station's content (number of available bikes and status) every minute. How I structure the data (and which keys I choose) will impact the number of RCU/WCU I’ll need.

So, what keys could I use?

#1 Using the station code as the partition key and the datetime as the sort key:

Seems pretty obvious: accessing a specific station’s content becomes very simple. The problem is that I have ~1,500 stations, and each station’s content is a very small payload (about 160 bytes). I’ll need to perform ~1,500 writes each time I want to update the content of all the stations. Even with bursting (I only fetch the content every 60s), it would amount to 1,500 / 60 = 25 WCUs!

#2 Using the datetime as the partition key:

To minimize the number of write operations, I could store every station’s content as a single item (using a JavaScript dictionary). That way, I’ll end up with a big 230KB item (below DynamoDb’s 400KB size limit) and make much better use of my RCUs and WCUs. But partition keys cannot be used to sort data, which means I could not efficiently get the last inserted content.

#3 Using just the datetime as the sort key:

This is technically forbidden, as a partition key is mandatory. But nothing prevents me from using a simple constant as the partition key. That way, I can have the same size optimization as for solution #2, and I can easily sort based on the datetime.

To be fair, this is not a good solution in terms of performance. Having a single partition means DynamoDb will not be able to distribute my data optimally. But cost is my most important constraint here.
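Concretely, the write for solution #3 could look like this (a sketch; the table name, the pk attribute, and the ‘ALL’ constant are mine):

import { DynamoDB } from 'aws-sdk';
import { StationsContent } from './domain';

const documentClient = new DynamoDB.DocumentClient();

// One item per minute for the whole network, stored under a constant partition key.
export async function saveStationsContent(content: StationsContent): Promise<void> {
  await documentClient
    .put({
      TableName: 'stations-content', // placeholder name
      Item: {
        pk: 'ALL',                  // the constant partition key
        datetime: content.datetime, // the sort key
        byStationCode: content.byStationCode,
      },
    })
    .promise();
}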

Let’s recap the different characteristics of each solution.

Different solutions for storing each station’s content, and their respective costs and characteristics

Solution #3 is the obvious winner.

Creating the table takes only a few seconds using the AWS web console.

I won’t detail how to write a DynamoDb repository in TypeScript, as the official SDK documentation is already quite complete. Just be careful when marshalling/unmarshalling between TypeScript objects and JSON. Type-transformer can be useful for that. And don’t forget to attach the correct roles to your function so that it can access your table.

And it works! The write capacity used is just below the 4 WCUs that I provisioned for this table.

Provisioned vs. consumed WCUs

Those cost concerns introduced a major shift between how “events” were implemented in my first Kafka prototype and how they are done in the serverless solution: before, with Kafka, I could afford to have one event per station, resulting in about 1,500 events per minute. Now, one event represents the whole network of stations.

Computing the delta

I will do one last thing in the FetchStationsContent function.

I will compute, for each station, the difference between the number of bikes in the station now and the number of bikes a minute ago, and include it in the content data. If nothing happened to a given station, I will also keep track of the last time it saw some activity. This inactiveSince attribute will be carried over for as long as a station has no activity.

That way, the content will not only represent the state of the stations at a given time, but also what happened to those stations (was a bike rented or returned? Since when has a station stopped moving?).

To do that, I need to fetch the current stations’ content (with the Axios client) and the last inserted content in my Dynamo table (with the Dynamo repository). Here, my Dynamo sort key will come in especially handy to avoid scanning my whole table. Those two operations can be done in parallel using Promise.all and once both are finished I can compute the delta.
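A sketch of that logic, reusing the hypothetical names from the earlier snippets (getLastStationsContent would be the “ScanIndexForward: false, Limit: 1” query shown earlier):

import { fetchStationsContent } from './stations-content-client';
import { getLastStationsContent, saveStationsContent } from './stations-content-repository';

export const lambdaHandler = async (): Promise<void> => {
  // Both calls are independent, so they can run in parallel.
  const [current, previous] = await Promise.all([
    fetchStationsContent(),   // live data, via the Axios client
    getLastStationsContent(), // last stored item, via the Dynamo repository
  ]);
  if (previous) {
    for (const station of Object.values(current.byStationCode)) {
      const before = previous.byStationCode[station.stationCode];
      if (!before) continue; // a brand-new station has no delta
      const bikesNow = station.electricBikes + station.mechanicalBikes;
      const bikesBefore = before.electricBikes + before.mechanicalBikes;
      station.deltaBikes = bikesNow - bikesBefore;
      if (station.deltaBikes === 0) {
        // No activity: carry over the previous inactiveSince, or start counting now.
        station.inactiveSince = before.inactiveSince ?? previous.datetime;
      }
    }
  }
  await saveStationsContent(current);
};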

Infrastructure as code

The work on the other function, FetchStationsCharacteristics, is very similar. I need to create a function, upload some code, create an EventBridge rule, create a table and attach the correct roles to the function.

Already, we see a limitation of the serverless approach: instead of having a few “big” applications, we need to create many small functions, and for each one, we need to define the correct infrastructure.

Using the AWS web console is fine for the first time, but this does not scale. Am I supposed to manually create tens of functions and tables? What if I want to change some common characteristics of all my functions? How can I easily roll back an infrastructure change? This can get ugly fast. Fortunately, several tools are here to help me.

CloudFormation is an AWS service that allows you to define, in YAML or JSON, the different resources needed for an application and, even better, to create and update those resources as necessary. Need a new function and a new DynamoDb table? Just add them to your CloudFormation file (or template), deploy it, and everything will be created or updated accordingly.

Here is an example of a template defining a single Lambda function, a DynamoDb table, and the needed role for the function to update and query the table:
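A sketch along those lines (the resource names and the S3 code location are placeholders of mine):

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  StationsContentTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: stations-content
      AttributeDefinitions:
        - AttributeName: pk
          AttributeType: S
        - AttributeName: datetime
          AttributeType: S
      KeySchema:
        - AttributeName: pk
          KeyType: HASH
        - AttributeName: datetime
          KeyType: RANGE
      ProvisionedThroughput:
        ReadCapacityUnits: 4
        WriteCapacityUnits: 4

  FetchStationsContentRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        # Lets the function write its logs to CloudWatch.
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: stations-content-table-access
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - dynamodb:PutItem
                  - dynamodb:Query
                Resource: !GetAtt StationsContentTable.Arn

  FetchStationsContentFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: FetchStationsContent
      Runtime: nodejs14.x
      Handler: app.lambdaHandler
      Role: !GetAtt FetchStationsContentRole.Arn
      Code:
        S3Bucket: my-deployment-bucket    # placeholder
        S3Key: fetch-stations-content.zip # placeholder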

Defining the function and table is quite straightforward, but defining the role takes some work.

Fortunately, we can simplify that using SAM.

SAM, the Serverless Application Model

SAM, or the Serverless Application Model, is an open-source framework used to build serverless applications on AWS (and developed by AWS). It consists of two things:

  • An extension of CloudFormation, offering simplified resources for the most common serverless services
  • A CLI that allows you to build, test locally, and deploy your application

Thanks to SAM’s simpler resources, the same Lambda function and DynamoDb table can be described as such:
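Again a sketch, with the same placeholder names as above (the Schedule event replaces the hand-made EventBridge rule, and the DynamoDBCrudPolicy policy template replaces the hand-made role):

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  StationsContentTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: stations-content
      AttributeDefinitions:
        - AttributeName: pk
          AttributeType: S
        - AttributeName: datetime
          AttributeType: S
      KeySchema:
        - AttributeName: pk
          KeyType: HASH
        - AttributeName: datetime
          KeyType: RANGE
      ProvisionedThroughput:
        ReadCapacityUnits: 4
        WriteCapacityUnits: 4

  FetchStationsContentFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.lambdaHandler
      Runtime: nodejs14.x
      CodeUri: src/fetch-stations-content/ # placeholder
      Policies:
        # SAM policy template: generates the role and table-access policy for us.
        - DynamoDBCrudPolicy:
            TableName: !Ref StationsContentTable
      Events:
        EveryMinute:
          # SAM creates the EventBridge rule and attaches the trigger for us.
          Type: Schedule
          Properties:
            Schedule: rate(1 minute)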

No more role to painstakingly define: SAM will translate the simplified template into a fleshed-out CloudFormation template and create the needed roles. And you can still use “full-blown” CloudFormation resources in the same template if you need to.

To build and deploy my two functions and their corresponding tables, I just need to write the SAM template and run the following commands.

sam build
sam deploy

And it won’t work… because SAM does not really know how to handle the TypeScript compilation process. There are several solutions to address that. I could create a dedicated Layer (a way to add additional capabilities to a Lambda), but I found it easier to use sam-webpack-plugin.
So, now my process will be:

npm run build
sam deploy

This will transpile my TypeScript functions into JavaScript, upload the source code to an S3 bucket, transform the simplified SAM template to a full CloudFormation template and deploy everything.

The build and deployment process

And I can sip my coffee as my new functions and tables are created.

Other tools exist to ease serverless application development. For example, the Serverless Framework offers features similar to SAM’s while allowing you to work with other cloud providers. It also offers more powerful testing and alerting functionalities.

That’s a good start. Now that I’m able to store the content of each station over time, I should be able to compute the usual activity of a given station and use that to detect when a station seems locked.

See you in part 3!

  • Part 1: Choosing the AWS serverless stack for a prototype
  • Part 2: The backbone of a serverless app: Lambda functions and DynamoDb tables
  • Part 3: Implementing a real-time detection algorithm with Lambda functions and DynamoDb streams
  • Part 4: Creating a serverless API and hosting a frontend with S3
  • Part 5: Performance tuning for a Lambda-based API
