Serverless Architectures with Java 8, AWS Lambda, and Amazon DynamoDB — Part 1
By Brent Rabowsky, Startup Solutions Architect, AWS
In a startup or any other company that must do fast prototyping and frequent production releases, serverless architectures that eliminate the need to deploy and manage servers provide a substantial development speed boost. In this blog post series, I describe a serverless architecture for a common use case on AWS: a Java-based API backed by Amazon DynamoDB as its data store. By using AWS Lambda to implement the API together with DynamoDB, you don’t have to deploy or manage servers for either the application tier or database tier. If the front end consists of mobile devices and a web app statically hosted on Amazon S3, the result is a completely serverless architecture with no server deployment or management required anywhere in the system, front end or back end.
Lambda is central to building a serverless architecture on AWS. Lambda functions have a wide range of uses, from developing APIs to developing event-driven architectures. Indeed, Lambda functions can be considered the “connective tissue” that links together the many AWS services that provide event sources for triggering Lambda functions. In the context of API development, each Lambda function implements a single API call, allowing for rapid, iterative API development. For RESTful API development in particular, Lambda forms a dynamic duo with Amazon API Gateway, which, among many other features, maps API calls to their implementing Lambda functions.
In addition to Lambda, DynamoDB is another critical component of a serverless architecture. With DynamoDB, you don’t have to worry about managing a database cluster at scale, which often is the downfall of other NoSQL database solutions. DynamoDB is fully managed and can be integrated and scaled without the need to allocate and provision any cluster nodes. To use DynamoDB, simply create tables and set write and read capacity for them. AWS handles the rest.
This post is part 1 of a two-part series. In this post, I focus on data modeling with DynamoDB. I describe an example use case to demonstrate alternative ways of modeling the same data, and the pros and cons of each approach. Proper data modeling is an essential prerequisite to beginning the development of a back end service. The second post in this series will demonstrate the use of Lambda functions to implement an API, and how to interact with DynamoDB using the AWS SDK for Java. All of the code referenced in this series can be found on GitHub at https://github.com/awslabs/lambda-java8-dynamodb.
Example Use Case
For the purposes of this post, let’s suppose there is a company that maintains a catalog of sports events. The company has decided to build an API backed by DynamoDB to access the catalog. For each event, the company must have a record that includes the name of the home team, the event date, the name of the other (away) team, the sport (such as basketball or baseball), city, country, and so on. The company is building a new home page for the application that will display all local events for a user’s favorite home team, as well as all other sports events in the user’s home city.
Because most users of the company’s application will spend most of their time in the application checking for local events via the home page, these queries are the most frequent. Other queries regarding events must be supported, but queries to support the home page are the most important and should be the most performant. With these design requirements in mind, let’s proceed to data modeling for DynamoDB. (Before you proceed, however, you should be familiar with the DynamoDB core components such as tables, items, attributes, keys, and indexes.)
With a NoSQL database such as DynamoDB, it is helpful to think of data modeling with respect to how to structure data to efficiently support the queries required by the application. This is very different from data modeling for a relational database, which involves structuring data around relationships between domain objects while normalizing the data, thereby reducing data duplication as much as possible.
NoSQL data modeling, by contrast, often involves at least some duplicated data within and between tables, which is referred to as denormalized data in a relational database context. You can sometimes avoid this data duplication in DynamoDB by adding indexes to a table, or by performing more than one query to gather a result set. An index in DynamoDB can be either a local secondary index (LSI), which uses the same partition key as the table but has a different sort key, or a global secondary index (GSI), where both the partition and sort keys can be different from the table’s keys. You can create up to five LSIs and five GSIs per DynamoDB table, but note that unlike an LSI, each GSI must be allocated its own provisioned capacity separate from the underlying table’s provisioned capacity.
For the example in this post, I use a single EVENT table in DynamoDB to model the data, which essentially is a catalog of available events. Within this table, events are modeled using the home team name as the partition key and the event date as the sort key, with a secondary index on the away team name. Each event is modeled as a single item in the table. The design for the table and its indexes is shown below. In regard to the indexes, the AwayTeam-Index is a GSI that enables looking up all of a team’s away events, while the City-Index is a GSI that enables looking up all events in a city.
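As a rough plain-Java sketch of this layout (no AWS SDK involved; the attribute names HomeTeam, EventDate, AwayTeam, and City, and the sample data, are assumptions for illustration), an in-memory map can stand in for the table and its away-team index:

```java
import java.util.*;
import java.util.stream.*;

class EventTableSketch {
    // EVENT table: partition key = HomeTeam, sort key = EventDate.
    // A TreeMap per partition keeps items ordered by sort key, as DynamoDB does.
    static final Map<String, TreeMap<String, Map<String, String>>> table = new HashMap<>();

    static void putEvent(String homeTeam, String date, String awayTeam, String city) {
        Map<String, String> item = new HashMap<>();
        item.put("HomeTeam", homeTeam);
        item.put("EventDate", date);
        item.put("AwayTeam", awayTeam);
        item.put("City", city);
        table.computeIfAbsent(homeTeam, k -> new TreeMap<>()).put(date, item);
    }

    // AwayTeam-Index: in DynamoDB this would be a GSI with AwayTeam as its
    // partition key, maintained automatically; here it is derived on demand.
    static List<Map<String, String>> queryAwayTeamIndex(String awayTeam) {
        return table.values().stream()
                .flatMap(byDate -> byDate.values().stream())
                .filter(item -> awayTeam.equals(item.get("AwayTeam")))
                .sorted(Comparator.comparing((Map<String, String> item) -> item.get("EventDate")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        putEvent("TeamA", "2017-05-01", "TeamB", "Seattle");
        putEvent("TeamB", "2017-06-01", "TeamA", "Portland");
        System.out.println(table.get("TeamA").size());          // TeamA's home events
        System.out.println(queryAwayTeamIndex("TeamA").size()); // TeamA's away events
    }
}
```

The key point the sketch captures is that each event exists exactly once in the table, keyed by its home team, and the away-team lookup path is provided by the index rather than by duplicated items.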
However, note that this makes querying for all of a team’s events more complex: instead of a single query based on team name, two queries must be used to gather all of a team’s events. Specifically, one query on the table itself is for the case where the team is the home team, and another query on AwayTeam-Index is for the case where the team of interest is an away team. The results of the two queries are then combined to produce the complete result set for all of the team’s events.
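The two-query pattern can be sketched in plain Java as follows, with hypothetical result lists standing in for the responses of the table query and the AwayTeam-Index query (the dates are made-up sample data):

```java
import java.util.*;
import java.util.stream.*;

class TeamEventsQuery {
    public static void main(String[] args) {
        // Hypothetical result of querying the EVENT table with the team
        // name as the partition key: events where the team plays at home.
        List<String> homeEventDates = Arrays.asList("2017-05-01", "2017-07-15");

        // Hypothetical result of querying AwayTeam-Index with the same
        // team name: events where the team plays away.
        List<String> awayEventDates = Collections.singletonList("2017-06-01");

        // Combine the two result sets client-side and restore date order
        // to produce the complete list of the team's events.
        List<String> allEventDates = Stream
                .concat(homeEventDates.stream(), awayEventDates.stream())
                .sorted()
                .collect(Collectors.toList());
        System.out.println(allEventDates);
    }
}
```

The merge itself is cheap; the cost of this design is the extra round trip and the client-side bookkeeping, which is exactly the tradeoff weighed below.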
By contrast, to support retrieving all of a team’s events in a single query of one table, it’s necessary to redesign the EVENT table as an alternative table, EVENT_ALTERNATE. Instead of separate home team and away team attributes, a single “team” attribute represents all of a team’s events, both at home and away. With this design, the data is modeled using two items per event. For example, given an event involving Team A and Team B, one item records the event as a home event for Team A (with Team A as the partition key), while the second item records it as an away event for Team B (with Team B as the partition key). The sort key remains the event date. The table design is as follows (note that the index design is the same as for the first approach, minus the AwayTeam-Index):
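In the same plain-Java style (again, attribute names and sample data are illustrative assumptions, not the post’s definitive schema), the two-items-per-event write pattern looks like this:

```java
import java.util.*;

class EventAlternateSketch {
    // EVENT_ALTERNATE table: partition key = Team, sort key = EventDate.
    static final Map<String, TreeMap<String, Map<String, Object>>> table = new HashMap<>();

    // Writing one event produces two items, one per participating team.
    static void putEvent(String homeTeam, String awayTeam, String date, String city) {
        putItem(homeTeam, date, city, true);   // home-team copy of the event
        putItem(awayTeam, date, city, false);  // away-team copy of the event
    }

    static void putItem(String team, String date, String city, boolean isHomeEvent) {
        Map<String, Object> item = new HashMap<>();
        item.put("Team", team);
        item.put("EventDate", date);
        item.put("City", city);
        item.put("isHomeEvent", isHomeEvent);
        table.computeIfAbsent(team, k -> new TreeMap<>()).put(date, item);
    }

    public static void main(String[] args) {
        putEvent("TeamA", "TeamB", "2017-05-01", "Seattle");
        putEvent("TeamB", "TeamA", "2017-06-01", "Portland");
        // A single lookup per team now returns all of that team's events,
        // home and away, with no second query or client-side merge.
        System.out.println(table.get("TeamA").size());
        System.out.println(table.get("TeamB").size());
    }
}
```

Note that both copies of an event must be written (and kept in sync on updates) for the single-query read to stay correct, which is the synchronization burden discussed below.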
Accordingly, a query for all of a team’s events simply becomes a query for the entire range of sort keys associated with the partition key that corresponds to the team’s name. To support queries for home events only, a Boolean attribute isHomeEvent in the EVENT_ALTERNATE table could be used to filter the result set to only home events.
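The home-events-only filter can be sketched as a predicate over a query’s result set (in DynamoDB a filter expression is applied server-side after the key condition; here the same predicate is applied client-side for illustration, with made-up sample items):

```java
import java.util.*;
import java.util.stream.*;

class HomeEventFilter {
    public static void main(String[] args) {
        // Hypothetical result of querying EVENT_ALTERNATE for one team's
        // partition key: every event, each carrying the isHomeEvent flag.
        List<Map<String, Object>> teamEvents = Arrays.asList(
                item("2017-05-01", true),
                item("2017-06-01", false),
                item("2017-07-15", true));

        // Filtering on isHomeEvent narrows the result set to home events only.
        long homeEventCount = teamEvents.stream()
                .filter(e -> Boolean.TRUE.equals(e.get("isHomeEvent")))
                .count();
        System.out.println(homeEventCount);
    }

    static Map<String, Object> item(String date, boolean isHomeEvent) {
        Map<String, Object> m = new HashMap<>();
        m.put("EventDate", date);
        m.put("isHomeEvent", isHomeEvent);
        return m;
    }
}
```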
The advantage of modeling the data with the first approach (each event appears in only one item) is that it reduces by half the number of items in the table, and thus reduces the table size. It also eliminates the need to duplicate data within the table, and avoids the issues involved in synchronizing data updates within the table. However, data modeling often involves tradeoffs.
In this case, the tradeoff for avoiding data duplication is the need to create an index on the away team attribute. Otherwise, there would be no ability to query for a team’s events away from its home locale. This is because a DynamoDB query must be based on the partition key of either the table or an index (and optionally a sort key as well). By contrast, for a relational database, a SELECT query can be based on any table column even if the query is not supported by an index (and thus is not performant).
For some use cases, the duplicated data approach could be a reasonable way to model the data. The approach taken depends on the kinds of queries to be supported, the relative frequency with which the queries will be made, and the relative importance of table size versus latency of retrieving the complete result set. In the example use case for this post, however, the most important queries are for a user’s local home team events and queries for the user’s city (for all events in that city), while queries for all of a team’s events (home and away) are less frequent and less important. For that reason, the first approach, with a single item per event, is the better fit for this use case.
On to Part 2!
Proper data modeling is critical to developing a performant back end service. Before you start to code, always think carefully about how to structure your DynamoDB tables, items, and indexes so they can support your application’s queries with the greatest efficiency.
In the next post in this series, I dive deep into the details of how to use the Lambda Java 8 runtime, along with the AWS SDK for Java, to implement a back end service for the example use case. Along the way I’ll make use of the data model I designed in this post.