Scaling DynamoDB for Big Data using Parallel Scan

Engineering@ZenOfAI · Published in ZenOf.AI · 9 min read · Apr 7, 2019

DynamoDB is a fully managed NoSQL key-value and document database provided by Amazon. Creating a table requires only a primary key, not a full schema, and performance remains predictable as the database scales. It can store virtually any amount of data and serve any level of traffic through a small, simple API for storing, accessing, and retrieving data by key.

DynamoDB keys and attributes: partition keys and sort keys

DynamoDB comprises three fundamental units: tables, items, and attributes. A table holds a set of items, an item holds a set of attributes, and an attribute is the smallest element, storing a single piece of data without further division.

Indexing

In a relational database, an index is a data structure that lets you perform fast queries on different columns in a table. You can use the CREATE INDEX SQL statement to add an index to an existing table, specifying the columns to be indexed. After the index has been created, you can query the data in the table as usual, but now the database can use the index to quickly find the specified rows in the table instead of scanning the entire table.

In DynamoDB, you can create and use a secondary index for similar purposes. Indexes in DynamoDB are different from their relational counterparts. When you create a secondary index, you must specify its key attributes — a partition key and a sort key. After you create a secondary index, you can Query it or Scan it just as you would with a table. DynamoDB does not have a query optimizer, so a secondary index is only used when you Query it or Scan it.

DynamoDB supports two different kinds of indexes:

  • Global secondary index — an index with a hash and range key that can be different from those on the table. A global secondary index is considered “global” because queries on the index can span all of the data in a table, across all partitions.
  • Local secondary index — an index that has the same hash key as the table, but a different range key. A local secondary index is “local” in the sense that every partition of a local secondary index is scoped to a table partition that has the same hash key.

DynamoDB ensures that the data in a secondary index is eventually consistent with its table. You can request strongly consistent Query or Scan operations on a table or a local secondary index; global secondary indexes, however, support only eventual consistency. You can add a global secondary index to an existing table using the UpdateTable action and specifying GlobalSecondaryIndexUpdates. After you create an index, the database maintains it for you: whenever you modify data in the table, the index is automatically modified to reflect the changes.
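As a sketch of the UpdateTable call described above, the parameter shape for adding a global secondary index looks like the following. The table name ("orders") and attribute names ("status", "created_at") are hypothetical, chosen for illustration:

```python
# Parameter shape for adding a GSI via UpdateTable.
# "orders", "status", and "created_at" are hypothetical names.
gsi_update_params = {
    "TableName": "orders",
    "AttributeDefinitions": [
        {"AttributeName": "status", "AttributeType": "S"},
        {"AttributeName": "created_at", "AttributeType": "S"},
    ],
    "GlobalSecondaryIndexUpdates": [
        {
            "Create": {
                "IndexName": "status-created_at-index",
                "KeySchema": [
                    # index partition key, can differ from the table's
                    {"AttributeName": "status", "KeyType": "HASH"},
                    # index sort key
                    {"AttributeName": "created_at", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
            }
        }
    ],
}

# With real credentials, this would be applied as:
#   boto3.client("dynamodb").update_table(**gsi_update_params)
```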

Partition key: Mandatory for every table and item. DynamoDB distributes items across partitions using this key, which is why it is sometimes also referred to as a hash key.

Sort key: Optional. It is useful when querying the data related to a partition key; you can apply several condition functions to the sort key, such as begins_with and between. It is sometimes referred to as a range key.

Primary key: Either the partition key alone, or the combination of the partition key and the sort key.
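To make the key concepts concrete, here is a sketch of the boto3 parameter shapes for a table with a composite primary key, and a Query that uses a begins_with condition on the sort key. The table and attribute names ("orders", "customer_id", "order_date") and the sample values are hypothetical:

```python
# Composite primary key: partition key + sort key.
# "orders", "customer_id", and "order_date" are hypothetical names.
create_params = {
    "TableName": "orders",
    "KeySchema": [
        {"AttributeName": "customer_id", "KeyType": "HASH"},   # partition key
        {"AttributeName": "order_date", "KeyType": "RANGE"},   # sort key
    ],
    "AttributeDefinitions": [
        {"AttributeName": "customer_id", "AttributeType": "S"},
        {"AttributeName": "order_date", "AttributeType": "S"},
    ],
    "BillingMode": "PAY_PER_REQUEST",
}

# Query scoped to one partition key, with a begins_with condition on the sort key.
query_params = {
    "TableName": "orders",
    "KeyConditionExpression": "customer_id = :cid AND begins_with(order_date, :prefix)",
    "ExpressionAttributeValues": {
        ":cid": {"S": "C-1001"},
        ":prefix": {"S": "2019-04"},
    },
}

# With real credentials:
#   client = boto3.client("dynamodb")
#   client.create_table(**create_params)
#   client.query(**query_params)
```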

Advantages of DynamoDB:

  1. Fast: Each table in a NoSQL database is independent of the others. NoSQL gives us the ability to scale tables horizontally, so frequently required information can be stored together in one table; any table joins are handled at the application level. As a result, data retrieval is fast.
  2. Scalable: As the user base grows and the database must handle the added load, most NoSQL databases can scale with the data. There are various types of NoSQL databases on the market, and scalability varies among them, so we have to choose the database that fits our application's needs.

Read scaling: handling a large number of read operations

Write scaling: handling a large number of write operations

3. Schemaless: In relational databases, we have to define a schema for each table, specifying the columns and the type of data each holds. Changing a column's data type is difficult, and adding a new column leaves many null values in existing rows. In NoSQL databases, adding or removing an attribute is easy because no schema is specified at table creation.
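To illustrate the schemaless point, two items in the same table can carry entirely different attribute sets; only the primary key attribute is required. The table name ("users") and attributes here are hypothetical:

```python
# Two items destined for the same hypothetical "users" table.
# Only the primary key ("user_id") must be present; other attributes vary freely.
item_a = {"user_id": {"S": "u1"}, "name": {"S": "Ada"}}
item_b = {
    "user_id": {"S": "u2"},
    "name": {"S": "Alan"},
    "country": {"S": "UK"},   # attribute that item_a simply doesn't have
    "age": {"N": "41"},
}

# Each would be written with:
#   boto3.client("dynamodb").put_item(TableName="users", Item=item_a)
```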

Differences between RDBMS and DynamoDB:

  1. An RDBMS stores data in a completely structured way, whereas DynamoDB stores data in an unstructured (schemaless) way.
  2. In an RDBMS, the amount of data you can store largely depends on the physical capacity of the system, while DynamoDB has no such limit because you can scale the system horizontally.
  3. DynamoDB is a fully managed service with no minimum fee: you pay only for the resources you provision, and AWS takes care of delivering millisecond latency at any scale. RDS MySQL, by contrast, lets you set up, operate, and scale a database on AWS, but you must choose a Multi-AZ or Single-AZ deployment, a DB instance class (micro, small, large, xlarge), storage, and so on.
  4. DynamoDB is a distributed NoSQL solution designed for very large datastores and extremely high-throughput NoSQL applications, while RDS shines at smaller scale, offering the query and design flexibility of a traditional RDBMS.
  5. For a simple application with a small data set, DynamoDB works well. For a large, complex application, go with DynamoDB if you need high throughput, or choose RDS if you are looking for a cheaper option.

Other competing NoSQL technologies:

MongoDB:

MongoDB is the next-generation NoSQL database that helps businesses transform their industries by harnessing the power of data.

DynamoDB uses a key-value model with JSON support, has a 400 KB item size limit, and supports a limited set of data types. MongoDB uses JSON-like (BSON) documents with a 16 MB document size limit.

DynamoDB runs only on AWS, whereas MongoDB can be installed and run anywhere (including an engineer’s computer).

MongoDB performs best when the working set fits in memory; if your data sets are much larger than the available memory, performance can degrade significantly.

Amazon ElastiCache:

ElastiCache is a web service that makes it easy to deploy, operate, and scale an in-memory cache in the cloud. If you care about the durability of your data, DynamoDB is the way to go. Basically, if you want a NoSQL system of record, use DynamoDB. If you want a cache whose contents you don't care about losing, use ElastiCache.

Apache Cassandra:

Apache Cassandra is an open source distributed database.

Amazon DynamoDB is a key-value and document-oriented store, while Apache Cassandra is a column-oriented data store.

Although DynamoDB can store numerous data types, Cassandra’s list of supported data types is more extensive: it includes, for instance, tuples, varints, timeuuids, etc.

In DynamoDB, the partition key and sort key can each contain only one attribute, while Cassandra allows more than one column (attribute) in its partition key and clustering columns.

Couchbase:

An open-source, NoSQL, document-oriented database optimized for interactive applications. Couchbase is a much better option for applications demanding high performance, consistency, and flexible querying. By default, a fully managed cache layer is integrated under the covers to make reads and writes really fast, and the data can be easily queried using a SQL-like language called N1QL.

Reading large volumes of data via scan vs parallel scan

How does scan work in AWS DynamoDB?

Scan operation returns one or more items.

By default, Scan operations proceed sequentially.

By default, Scan uses eventually consistent reads when accessing the data in a table.

If the total number of scanned items exceeds the 1 MB data set size limit, the scan stops and the results returned to the caller include a LastEvaluatedKey value that can be used to continue the scan in a subsequent operation.

A Scan operation performs eventually consistent reads by default, and it can return up to 1 MB (one page) of data. Therefore, a single Scan request can consume up to 128 read capacity units (1 MB page size / 4 KB per read unit, halved for eventually consistent reads).

Code Sample for Scan Operation:

import boto3
import json

def lambda_handler(event, context):
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('master')
    response = table.scan()
    data = response['Items']
    # Keep scanning until LastEvaluatedKey is no longer present
    while 'LastEvaluatedKey' in response:
        response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
        data.extend(response['Items'])
    return {
        'statusCode': 200,
        'headers': {
            'Access-Control-Allow-Origin': '*',
        },
        'body': json.dumps(data)
    }

How does parallel scan work in AWS DynamoDB?

For faster performance on a large table or secondary index, applications can request a parallel Scan operation.

You can run multiple worker threads or processes in parallel; each worker scans a separate segment of the table concurrently with the others. DynamoDB's Scan operation accepts two additional parameters:

TotalSegments denotes the number of workers that will access the table concurrently.

Segment denotes the segment of the table to be accessed by the calling worker.

The two parameters, when used together, limit the scan to a particular block of items in the table. You can also use the existing Limit parameter to control how much data is returned by an individual Scan request.

import threading
import boto3

access_key = 'change here'
access_secret = 'change here'
region = 'us-west-2'

def scan_foo_table(segment, total_segments):
    print('Looking at segment ' + str(segment))
    session = boto3.session.Session(aws_access_key_id=access_key,
                                    aws_secret_access_key=access_secret,
                                    region_name=region)
    dynamoDbClient = session.client('dynamodb')
    response = dynamoDbClient.scan(
        TableName='table_name',
        FilterExpression='classification = :classification',
        ExpressionAttributeValues={
            ':classification': {'S': 'DIRECT'}
        },
        Segment=segment,
        TotalSegments=total_segments,
    )
    print('Segment ' + str(segment) + ' returned ' + str(len(response['Items'])) + ' items')

def create_threads():
    thread_list = []
    total_threads = 50
    for i in range(total_threads):
        # Instantiate and store the thread
        thread = threading.Thread(target=scan_foo_table, args=(i, total_threads))
        thread_list.append(thread)
    # Start threads
    for thread in thread_list:
        thread.start()
    # Block main thread until all threads are finished
    for thread in thread_list:
        thread.join()

create_threads()

As with a sequential scan, if the scanned items in any segment exceed the 1 MB data set size limit, that segment's results include a LastEvaluatedKey value for continuing the scan in a subsequent operation; the response also reports the count of items returned and the count of items scanned. A scan can result in no table data meeting the filter criteria.
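The LastEvaluatedKey loop used in both code samples can be factored into a small reusable helper. A minimal sketch follows; the `scan_fn` callable stands in for a real call such as `table.scan`, and the fake two-page scan exists only to show the loop in action:

```python
# A minimal pagination helper for Scan-style APIs. `scan_fn` stands in for a
# real call such as table.scan: any callable that accepts an optional
# ExclusiveStartKey and returns a dict with 'Items' and, while more pages
# remain, a 'LastEvaluatedKey'.
def scan_all(scan_fn):
    items = []
    kwargs = {}
    while True:
        response = scan_fn(**kwargs)
        items.extend(response['Items'])
        if 'LastEvaluatedKey' not in response:
            return items
        # Resume the next page from where the previous one stopped.
        kwargs = {'ExclusiveStartKey': response['LastEvaluatedKey']}

# A fake two-page scan, for illustration only:
def fake_scan(ExclusiveStartKey=None):
    if ExclusiveStartKey is None:
        return {'Items': [1, 2], 'LastEvaluatedKey': 'k'}
    return {'Items': [3]}

print(scan_all(fake_scan))  # → [1, 2, 3]
```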

Scan vs Parallel Scan in AWS DynamoDB

A sequential scan operation can only read one partition at a time, so a parallel scan is needed for faster reads across multiple partitions at once. A sequential scan also might not always fully utilize the table's provisioned read throughput capacity, which a parallel scan can. For certain types of queries and scans, a parallel scan can cut the time taken by up to 4x.
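The segment mechanics can be demonstrated without AWS at all: divide a keyspace into TotalSegments slices and scan the slices concurrently. The per-segment function below is a stand-in for a real Scan(Segment=..., TotalSegments=...) call (DynamoDB assigns segments by internal hash ranges; the modulo split here is only for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

# 100 fake item keys standing in for a table's contents.
ALL_KEYS = list(range(100))

def scan_segment(segment, total_segments):
    # Stand-in for Scan(Segment=segment, TotalSegments=total_segments):
    # each worker sees only its own disjoint slice of the keyspace.
    return [k for k in ALL_KEYS if k % total_segments == segment]

total_segments = 4
with ThreadPoolExecutor(max_workers=total_segments) as pool:
    futures = [pool.submit(scan_segment, s, total_segments)
               for s in range(total_segments)]
    results = [f.result() for f in futures]

# Every key is scanned exactly once across all segments.
scanned = sorted(k for seg in results for k in seg)
print(len(scanned))  # → 100
```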

Scenarios in which Parallel Scan is preferred?

A parallel scan can be the right choice if the following conditions are met:

  • The table size is 20 GB or larger.
  • The table's provisioned read throughput is not being fully utilized.
  • Sequential Scan operations are too slow.

This story is authored by Ajay Kudikala. Ajay is a full stack developer who also specializes in the AWS dev stack.
