Apache Spark With DynamoDB Use Cases

Code examples of Java Spark applications, running in an AWS EMR cluster, that write data to and read data from DynamoDB tables.

Leonardo Carvalho
The Startup
May 1, 2020


Introduction

After some years using Apache Spark to develop data pipelines, primarily on data stored in Cassandra tables or AWS S3 buckets, I found myself surprised to discover how few references are out there for using it with DynamoDB tables.

As we shifted towards cloud-based infrastructures, DynamoDB increasingly became a no-brainer choice for applications that need to deal with a lot of data while providing availability and scalability. These goals are closely aligned with the distributed nature of Apache Spark applications, and putting these technologies together to ingest, transform and generate data seems a logical and almost inevitable choice.

References

In this article we will write Java Spark applications ready to run in an AWS EMR cluster using two different connectors:

  • The AWS Labs connector;
  • The Audience Project connector.

Later we will show some preliminary comparisons between the two connectors' performance, but for now we'll stick with the following statements:

First, the AWS Labs connector is a very straightforward way of creating RDDs in Spark that point directly at a DynamoDB table's data, which delivers a more efficient way of accessing it.

As for the Audience Project connector, the high points are a set of great quality-of-life features such as schema inference, throughput control and GSI support, all of which come in exchange for performance, due to its use of the common, HTTP-based DynamoDB API.

Provisioning The Infrastructure

In order to run the applications described in this article, we will need some infrastructure resources on Amazon Web Services:

  • DynamoDB tables for storing the data;
  • An EMR cluster for running the applications;
  • An S3 Bucket for storing files with input data that will be written to the DynamoDB table;
  • Permission configurations;

Follow the instructions of the spark-dynamodb-infrastructure project to build these resources in your AWS account.

Writing Data With The AWS Labs Connector

In this first example, we'll develop a Spark application that reads a file containing a list of studies with citations to COVID-19 cataloged by the World Health Organization (updated data can be found here), does some cleaning of the data and stores it in a DynamoDB table named Covid19Citations.

For the sake of simplicity, in this article we will focus on the Java code, but a detailed step-by-step guide on how to execute the application can be found here.

1. Creating the Citations Dataset

In this step we'll create a Dataset that points to the citations file uploaded to the S3 Bucket. The data is already organized in CSV format, so all we have to do is use the csv method of the Spark SQL DataFrameReader.

The original data has a lot of columns with irrelevant values for our purposes, so we’ll clean it up by selecting just the main columns. Lastly, we’ll remove the citations that didn’t have titles.
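The original gist is not reproduced here, but a minimal sketch of this step could look like the following. The bucket path and column names are placeholders, not necessarily the ones used in the project.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

SparkSession spark = SparkSession.builder()
        .appName("CitationsWriter")
        .getOrCreate();

// Read the citations CSV uploaded to the S3 bucket (path is a placeholder).
Dataset<Row> citations = spark.read()
        .option("header", "true")
        .csv("s3://<your-bucket>/covid19-citations.csv")
        // Keep only the columns relevant for our purposes (names are illustrative).
        .select("Title", "Authors", "Journal", "Year")
        // Remove citations that have no title.
        .filter(col("Title").isNotNull());
```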

2. Building an RDD With DynamoDB Writable Items

The next step is to transform the citations Dataset into an RDD composed of DynamoDB writable items. We'll assign a UUID to each citation and make sure not to create attributes with null values.
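A sketch of this transformation, continuing from the previous snippet and assuming the emr-dynamodb-connector's DynamoDBItemWritable class with the same illustrative column names:

```java
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import org.apache.hadoop.dynamodb.DynamoDBItemWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Turn each citation row into a DynamoDBItemWritable keyed by an empty Text,
// the pair shape expected by the Hadoop output format used in the next step.
JavaPairRDD<Text, DynamoDBItemWritable> items = citations.javaRDD().mapToPair(row -> {
    Map<String, AttributeValue> attributes = new HashMap<>();
    // Assign a UUID as the item's key attribute.
    attributes.put("Id", new AttributeValue().withS(UUID.randomUUID().toString()));
    // Only create attributes for non-null, non-empty values.
    for (String column : new String[]{"Title", "Authors", "Journal", "Year"}) {
        String value = row.getAs(column);
        if (value != null && !value.isEmpty()) {
            attributes.put(column, new AttributeValue().withS(value));
        }
    }
    DynamoDBItemWritable item = new DynamoDBItemWritable();
    item.setItem(attributes);
    return new Tuple2<>(new Text(""), item);
});
```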

3. Store The Data In The DynamoDB Table

Finally, we'll write our RDD right into the Covid19Citations table. Pay attention to the creation of the JobConf instance containing the DynamoDB table information.
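A sketch of the write, assuming the usual emr-dynamodb-connector configuration keys ("dynamodb.output.tableName" and the output format class); on EMR the region and credentials are normally resolved from the cluster's instance profile.

```java
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.spark.api.java.JavaSparkContext;

// Build a JobConf pointing at the target table.
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
JobConf jobConf = new JobConf(jsc.hadoopConfiguration());
jobConf.set("dynamodb.output.tableName", "Covid19Citations");
jobConf.set("dynamodb.servicename", "dynamodb");
jobConf.set("mapred.output.format.class", DynamoDBOutputFormat.class.getName());

// Write the pair RDD built in the previous step straight into the table.
items.saveAsHadoopDataset(jobConf);
```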

Reading Data With The AWS Labs Connector

Now that we have a DynamoDB table populated with data we can build a Spark application to do some operations on it.

The steps described below will read the citations data, filter only the ones published in 2020 and count the number of times each word appears in the citation titles.

Similar to the previous example, detailed instructions can be found here.

1. Reading The DynamoDB Data

To read the data stored in the DynamoDB table, we’ll use the hadoopRDD() method of the SparkContext. With the citations RDD created, we’ll filter the ones published in 2020.
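A sketch of the read, again assuming the emr-dynamodb-connector classes and a "Year" string attribute on the items:

```java
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import org.apache.hadoop.dynamodb.DynamoDBItemWritable;
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("CitationsReader").getOrCreate();
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

// Point the Hadoop input format at the table populated earlier.
JobConf jobConf = new JobConf(jsc.hadoopConfiguration());
jobConf.set("dynamodb.input.tableName", "Covid19Citations");
jobConf.set("dynamodb.servicename", "dynamodb");

// hadoopRDD yields (key, item) pairs; the DynamoDBItemWritable carries the attributes.
JavaPairRDD<Text, DynamoDBItemWritable> citations = jsc.hadoopRDD(
        jobConf, DynamoDBInputFormat.class, Text.class, DynamoDBItemWritable.class);

// Keep only the citations published in 2020.
JavaPairRDD<Text, DynamoDBItemWritable> citations2020 = citations.filter(pair -> {
    AttributeValue year = pair._2().getItem().get("Year");
    return year != null && "2020".equals(year.getS());
});
```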

2. Counting The Title Words

First we'll build an RDD containing only the data in the title column. Then, we'll use the flatMap operation to transform it into an RDD of all the words in the titles.

Lastly, we'll run countByValue on the words RDD and print the results to the console.
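A minimal sketch of the word count, continuing from the previous snippet and assuming the "Title" attribute name:

```java
import java.util.Arrays;
import java.util.Map;
import org.apache.spark.api.java.JavaRDD;

// Extract the title of every 2020 citation.
JavaRDD<String> titles = citations2020
        .filter(pair -> pair._2().getItem().containsKey("Title"))
        .map(pair -> pair._2().getItem().get("Title").getS());

// Break every title into individual words.
JavaRDD<String> words = titles.flatMap(
        title -> Arrays.asList(title.toLowerCase().split("\\s+")).iterator());

// Count the occurrences of each word and print them to the console.
Map<String, Long> wordCounts = words.countByValue();
wordCounts.forEach((word, count) -> System.out.println(word + ": " + count));
```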

Reading And Writing Data With The Audience Project Connector

Like the previous application, in this example we will count the number of times each word appears in the citation titles, but with two main differences:

  • We will use the Audience Project data source to connect to the DynamoDB tables;
  • And we will store the results in a table rather than just print them to the console.

Detailed instructions and the full application code can be found here.

1. Reading The DynamoDB Data

Creating a Dataset with the contents of a DynamoDB table is pretty easy using the Audience Project connector. In fact, we just use the read method of the SparkSession, set the table name, and the heavy lifting is done under the hood by the connector, with no changes to the usual Spark code.
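A sketch of the read, assuming the connector's "dynamodb" data source name and "tableName" option:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("WordCounter").getOrCreate();

// The connector exposes a "dynamodb" data source; the table is selected via an option.
Dataset<Row> citations = spark.read()
        .format("dynamodb")
        .option("tableName", "Covid19Citations")
        .load();

// The schema is inferred by the connector, so the Dataset can be used right away.
citations.show(5);
```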

2. Counting The Title Words

This time we'll count the number of appearances of each word using a JavaPairRDD. Despite looking more complicated at first glance, this is a more efficient approach, as we'll use lazy transformations that only run when we call an action, such as count.

In short, we could arrange the code in a way that Spark would only count the words when we called an operation to write the results to the table, doing it in the most efficient and parallelized way.
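A sketch of the pair-based word count, continuing from the Dataset read above (the "Title" column name remains an assumption):

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Build (word, 1) pairs from the titles and reduce by key. These are all lazy
// transformations; nothing runs until an action (count, collect, a write) is called.
JavaPairRDD<String, Integer> wordCounts = citations
        .select("Title")
        .javaRDD()
        .filter(row -> !row.isNullAt(0))
        .flatMap(row -> Arrays.asList(row.getString(0).toLowerCase().split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);
```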

3. Writing The Result To A Table

To store the results to a DynamoDB table, we will first create a model object that matches the table structure:
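A minimal WordCount bean could look like this; the field names are an assumption about the result table's structure.

```java
import java.io.Serializable;

// Simple bean mirroring the result table: a word and how many times it appeared.
public class WordCount implements Serializable {
    private String word;
    private long count;

    public WordCount() { }

    public WordCount(String word, long count) {
        this.word = word;
        this.count = count;
    }

    public String getWord() { return word; }
    public void setWord(String word) { this.word = word; }
    public long getCount() { return count; }
    public void setCount(long count) { this.count = count; }
}
```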

Then we'll transform the JavaPairRDD into a Dataset of WordCount objects, remove the empty-string words and store it using the write operation.
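A sketch of the final transformation and write; the result table name "WordCounts" is hypothetical, and the write options follow the same data source conventions assumed for the read.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// Convert the (word, count) pairs into WordCount beans and drop empty-string words.
Dataset<WordCount> result = spark.createDataset(
                wordCounts.map(pair -> new WordCount(pair._1(), pair._2())).rdd(),
                Encoders.bean(WordCount.class))
        .filter(wc -> !wc.getWord().isEmpty());

// Write the Dataset back through the same data source.
result.write()
        .format("dynamodb")
        .option("tableName", "WordCounts")  // hypothetical result table name
        .save();
```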

Comparing The Word Count Applications

Speaking of performance, how do the two Word Count applications stand against each other?

Total execution time is a poor measure, as the Audience Project connector application performed some operations, such as showing Dataset previews and writing the result to a table, that the AWS Labs connector application did not.

Nevertheless, we can take a closer look at each Spark operation and find some insights.

AWS Labs Application Spark Operations:

Audience Project Application Spark Operations:

The count operations are basically the same between the two applications. For instance, we can see that the AWS Labs connector was able to count the whole table in only 6 seconds (row with ID 0), while the Audience Project connector took 17 seconds (row with ID 1).

This is a pattern that repeats across all operations, and while it wouldn't hurt to do some more research and experimentation, it's pretty clear that performance-focused applications will do better with the AWS Labs connector.

But if performance isn't your bottleneck and you want to enjoy some great features that ease the coding process, the Audience Project connector has what it takes.

Conclusion

We are just scratching the surface, but I hope to have provided an overview of what can be done by connecting Spark applications to DynamoDB tables. There are already a lot of possibilities here, and I expect to go a little deeper next time, as we evolve our applications using this kind of integration.
