Coming up with an idea for a startup is not an easy exercise, and bringing it to life is a beast of its own, especially if your business is going to be based on a data-driven application. Consider companies like Google, Microsoft, Amazon, Facebook, LinkedIn, Netflix, and Twitter. Each of these handles huge volumes of data and traffic.
When it comes to data-driven applications, companies need to be agile, test hypotheses cheaply, and respond quickly to new market insights by keeping development cycles short and their data models flexible.
Data-driven applications are continuously pushing the boundaries of what is possible by making use of new technological developments. An application is data-driven if data is its primary challenge (the quantity of data, the complexity of data, or the speed at which it is changing), as opposed to compute-intensive, where CPU cycles are the bottleneck.
The tools and technologies that help data-driven applications store and process data have been rapidly adapting to these changes. As a result, newer types of database systems have been getting much attention. Database technology has had to evolve to keep up with the increasing demands of big data. Traditional relational databases cannot support the speed, scale, and performance levels demanded by new applications that transact with this huge amount of data. In her article Big Data and NoSQL: The Problem with Relational Databases, April Reeves says the following:
“Social networking and Big Data organisations such as Facebook, Yahoo, Google, and Amazon were among the first to decide that relational databases were not good solutions for the volumes and types of data that they were dealing with, hence the development of the Hadoop file system, the MapReduce programming language, and associated databases such as Cassandra and HBase.”
You’re likely to go back and forth on which database system to land on as you consider the technical and business-related factors and the impact they would have on the bigger picture of your application’s requirements. In this post, I want to give some technical insight into DynamoDB in the hope that you and your team will consider it.
What is DynamoDB?
It is a serverless NoSQL cloud database that can seamlessly scale on demand. We need a NoSQL database because traditional relational databases are largely inefficient at processing mostly unstructured data at high volume and high frequency. What does it mean that it is a serverless cloud database? It simply means that you don’t have to specify how many servers you need or what backend infrastructure you need.
Let’s take a deeper dive into what DynamoDB offers.
- A serverless cloud NoSQL DB
- Fast, flexible, cost effective
- Highly scalable, fault tolerant, and secure
Serverless — you don’t have to provision or manage any servers or infrastructure. There is minimal admin, and you only pay for what you use.
Cloud — it is available as a service on the AWS (Amazon Web Services) cloud. You don’t need to install anything and can start using it right away.
NoSQL — it is a non-relational type of database designed to work with big data.
Big data properties are: high volume, high variety (largely unstructured or semi-structured data), and high velocity (a huge number of concurrent read/write operations).
Fast — very high throughput with very low latency. Response times are typically in the single-digit milliseconds and can be reduced to microseconds using DAX (DynamoDB Accelerator), an in-memory cache provided by AWS.
Flexible — being a NoSQL database, it can store unstructured data without enforcing a strict schema, and it provides a rich data model that lets you store a variety of data types to support your needs.
Cost effective — you only pay for what you use and no more. It is priced on the capacity provisioned for each table and not on the quantity of servers or infrastructure. You can scale it up or down depending on your needs, with very fine-grained control.
Highly scalable — it can scale on demand to support unlimited read/write operations. It can also scale down on demand when/if you don’t need the high capacity.
Fault tolerant — it automatically replicates the data to multiple availability zones and thus reduces the risks associated with failures. It also supports cross-region replication for even more safety.
Secure — it is secure, with fine-grained access control.
A NoSQL Database Service
You can think of DynamoDB as a JSON document store that stores a collection of JSON objects. Each object within a table is an item, and each item has several attributes.
{ "item": "Go to sleep" }
DynamoDB tables must have a primary key with a minimum of one attribute and a maximum of two attributes. The mandatory attribute is known as the Partition Key or Hash Key. The optional attribute is known as the Sort Key or Range Key. You cannot query a table without specifying its primary key or one of its indexes. You can technically scan a table without these keys, but you will soon discover that this is not an efficient or cost-effective way to use DynamoDB.
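To make the key rules above concrete, here is a minimal sketch that builds the key-related part of a table definition. The dictionary mirrors the `TableName`, `KeySchema` and `AttributeDefinitions` parameters that boto3's `create_table` call accepts, but the helper `table_definition` and the table and attribute names (`Todos`, `user_id`, `created_at`) are hypothetical, chosen only for illustration. A real call would also need capacity settings.

```python
def table_definition(table_name, partition_key, sort_key=None):
    """Build the key-related arguments for a table (hypothetical helper).

    partition_key and sort_key are (name, type) tuples, where type is
    "S" (string), "N" (number) or "B" (binary).
    """
    keys = [partition_key] + ([sort_key] if sort_key else [])
    for name, attr_type in keys:
        # Key attributes only support the string, number and binary types.
        if attr_type not in ("S", "N", "B"):
            raise ValueError(f"{name}: key attributes must be S, N or B")

    # The partition key is mandatory (HASH); the sort key is optional (RANGE).
    key_schema = [{"AttributeName": partition_key[0], "KeyType": "HASH"}]
    if sort_key:
        key_schema.append({"AttributeName": sort_key[0], "KeyType": "RANGE"})

    return {
        "TableName": table_name,
        "KeySchema": key_schema,
        "AttributeDefinitions": [
            {"AttributeName": name, "AttributeType": attr_type}
            for name, attr_type in keys
        ],
    }


# Composite primary key: partition key plus sort key.
todo_table = table_definition("Todos", ("user_id", "S"), ("created_at", "N"))
# Simple primary key: partition key only.
simple_table = table_definition("Users", ("user_id", "S"))
```

Only the key attributes are declared up front. Every other attribute of an item can vary from item to item.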
In DynamoDB, we have Local Secondary Indexes and Global Secondary Indexes:
Local secondary indexes — indexes that share the Partition Key with the primary key but have a different Sort Key.
Global secondary indexes — indexes whose Partition Key can be different from that of the primary key, in contrast to Local Secondary Indexes.
Data Types in DynamoDB
These are broadly grouped into 3 types:
- Scalar types — Exactly one value, e.g. string, boolean, number, binary, and null. Key and index attributes only support the string, number, and binary scalar types.
- Set types — These represent multiple scalar values of the same type, e.g. string set, number set, binary set.
- Document types — Complex structures with nested attributes, e.g. list and map.
Remember that DynamoDB allows us to interact with it using JSON. However, it doesn’t actually store the data as JSON; its data types are a superset of the data types supported by JSON. So DynamoDB automatically maps JSON documents onto the native DynamoDB data types.
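The mapping from JSON-like values to DynamoDB's typed attributes can be sketched as follows. The type tags (`S`, `N`, `BOOL`, `NULL`, `SS`, `NS`, `L`, `M`, `B`) are DynamoDB's own, but the helper `to_attribute_value` and the sample item are hypothetical and simplified (binary sets are left out, for example). SDKs ship their own serializers, so treat this as an illustration rather than something to use in production.

```python
def to_attribute_value(value):
    """Convert a plain Python value to DynamoDB's typed representation."""
    if value is None:
        return {"NULL": True}
    if isinstance(value, bool):  # checked before int, since bool subclasses int
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}  # numbers are transmitted as strings
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, bytes):
        return {"B": value}
    if isinstance(value, (set, frozenset)):
        if not value:
            raise ValueError("sets must not be empty")
        if all(isinstance(v, str) for v in value):
            return {"SS": sorted(value)}
        if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in value):
            return {"NS": [str(v) for v in sorted(value)]}
        raise TypeError("a set must contain only strings or only numbers")
    if isinstance(value, (list, tuple)):
        return {"L": [to_attribute_value(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_attribute_value(v) for k, v in value.items()}}
    raise TypeError(f"unsupported type: {type(value).__name__}")


# Hypothetical to-do item; the set type has no direct JSON equivalent.
item = {"item": "Go to sleep", "done": False, "priority": 2, "tags": {"home", "night"}}
typed_item = to_attribute_value(item)["M"]
```

Here `typed_item["priority"]` becomes `{"N": "2"}` and `typed_item["tags"]` becomes a string set, which shows where the native types go beyond plain JSON.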
DynamoDB Consistency Model
Amazon data centres are hosted in multiple locations worldwide. These locations are composed of regions and availability zones. Each region is a separate geographic area and each such region has multiple isolated locations known as availability zones. Each Availability Zone may have one or more Facilities or Data Centres.
AWS Region => Availability Zones => Facilities
DynamoDB automatically replicates your data across multiple Facilities within an AWS Region. Even if a Facility experiences failure or downtime, DynamoDB is still able to deliver consistent performance at scale.
DynamoDB Read Consistency
DynamoDB supports two types of read operations:
Strongly Consistent Reads
- The most up-to-date data
- Must be requested explicitly
Eventually Consistent Reads
- May or may not reflect the latest copy of data
- Default consistency for all read operations
- 50% cheaper
By default, all read operations in DynamoDB are eventually consistent, unless you specifically request a strongly consistent read.
Most applications should be fine with Eventual Consistency, whilst making use of Strong Consistency as and when needed.
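As a small illustration of what "requested explicitly" means, the sketch below builds the parameters for a single-item read. `ConsistentRead` is the flag DynamoDB's GetItem operation uses for this. The helper `get_item_request` and the table and key names are hypothetical.

```python
def get_item_request(table_name, key, strongly_consistent=False):
    """Build GetItem parameters; reads are eventually consistent unless asked otherwise."""
    request = {"TableName": table_name, "Key": key}
    if strongly_consistent:
        # Opt in to a strongly consistent read for this request only.
        request["ConsistentRead"] = True
    return request


key = {"user_id": {"S": "u-1"}, "created_at": {"N": "1700000000"}}
default_read = get_item_request("Todos", key)
strong_read = get_item_request("Todos", key, strongly_consistent=True)
```

With boto3, a dictionary like this would be passed to the low-level client as `client.get_item(**strong_read)`. Because consistency is chosen per request, an application can stay on the cheaper default and opt in only for the reads that need the latest data.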
DynamoDB Capacity Units
Tables in DynamoDB are the top-level entities and are independent of each other. There are no strict inter-table relationships and no concept of foreign keys.
DynamoDB does enforce mandatory use of primary keys in all query operations. This approach ensures that we always write highly efficient queries. Another benefit of this approach is that it allows us to control performance at the table level. Because tables are independent of each other, their performance can be controlled and tuned individually.
To do this, we must provision throughput capacity for each table.
- Allows for predictable performance at scale based on needs
- Used to control read/write throughput
- Supports auto-scaling
- Defined using RCUs and WCUs
- Major factor in DynamoDB pricing
- 1 capacity unit = 1 request/sec (for an item of up to 4 KB per strongly consistent read, or up to 1 KB per write)
We classify throughput in terms of capacity units: Read Capacity Units (RCU) and Write Capacity Units (WCU).
These capacity units are the major factor on which DynamoDB is priced. It is important to remember that DynamoDB is based on a pay-per-use concept.
DynamoDB charges for the number of RCUs and WCUs provisioned to your tables, plus storage fees that depend on the volume of your data. You can control the provisioning of these capacity units. When provisioned efficiently, DynamoDB can be very cost effective.
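To get a feel for how much capacity a table needs, here is a rough sizing sketch. It assumes the standard unit sizes: one RCU covers one strongly consistent read per second of an item up to 4 KB (an eventually consistent read costs half), and one WCU covers one write per second of an item up to 1 KB. The function names and the sample workload are hypothetical.

```python
import math

READ_UNIT_BYTES = 4 * 1024   # one RCU covers up to 4 KB per read
WRITE_UNIT_BYTES = 1024      # one WCU covers up to 1 KB per write


def read_capacity_units(item_bytes, reads_per_second, strongly_consistent=False):
    """RCUs needed to read items of a given size at a given rate."""
    units_per_read = max(1, math.ceil(item_bytes / READ_UNIT_BYTES))
    total = units_per_read * reads_per_second
    # Eventually consistent reads consume half the capacity.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_bytes, writes_per_second):
    """WCUs needed to write items of a given size at a given rate."""
    return max(1, math.ceil(item_bytes / WRITE_UNIT_BYTES)) * writes_per_second


# Hypothetical workload: 6 KB items, 100 reads/sec and 20 writes/sec.
item_size = 6 * 1024
strong_rcu = read_capacity_units(item_size, 100, strongly_consistent=True)
eventual_rcu = read_capacity_units(item_size, 100)
wcu = write_capacity_units(item_size, 20)
```

For this workload, a 6 KB item rounds up to two read units and six write units, so the table would need 200 RCUs for strongly consistent reads, 100 RCUs for eventually consistent reads, and 120 WCUs. Keeping items small is one of the easiest ways to keep the bill down.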
DynamoDB stores data in partitions. A partition is simply a block of storage allocated by DynamoDB. A table can have one or more partitions depending on its size and provisioned throughput. These are the two attributes that control how many partitions a table will have.
We don’t have to worry about these partitions. However, we can influence the partition behaviour indirectly by paying attention to the provisioned throughput and the table’s size.
Each partition in DynamoDB can hold a maximum of 10GB of data and can deliver up to 1000 WCUs (Write Capacity Units) and 3000 RCUs (Read Capacity Units) worth of throughput. If your application exceeds one or more of these limits, DynamoDB will allocate additional partitions (this happens in the background and without any downtime).
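Using only the per-partition limits quoted above, we can sketch a rough estimate of how many partitions a table might need. DynamoDB manages partitions internally and does not expose this number, so the formula below is an assumption for building intuition, and `estimated_partitions` is a hypothetical helper.

```python
import math

MAX_PARTITION_GB = 10      # storage limit per partition
MAX_PARTITION_RCU = 3000   # read throughput limit per partition
MAX_PARTITION_WCU = 1000   # write throughput limit per partition


def estimated_partitions(table_size_gb, rcu, wcu):
    """Rough partition estimate: the larger of the size and throughput needs."""
    by_size = math.ceil(table_size_gb / MAX_PARTITION_GB)
    by_throughput = math.ceil(rcu / MAX_PARTITION_RCU + wcu / MAX_PARTITION_WCU)
    return max(1, by_size, by_throughput)


# A 25 GB table with modest throughput is driven by its size.
size_bound = estimated_partitions(table_size_gb=25, rcu=4000, wcu=500)
# A small table with heavy traffic is driven by its throughput.
throughput_bound = estimated_partitions(table_size_gb=5, rcu=6000, wcu=2000)
```

In the first case the table's size dominates (three partitions), while in the second the throughput does (four partitions). Either attribute alone can cause a table to be split.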
DynamoDB stores data within a partition and it uses two types of index:
Primary Index — also known as Table Index
Secondary Index — has two types: Local Secondary Index and Global Secondary Index which have been covered earlier in this post.
Any DynamoDB Index can either have a Simple Key or a Composite Key.
Simple Key — has a single attribute known as a Partition Key or Hash Key
Composite Key — has two attributes: Partition Key and Sort Key
When we create a table, we have to choose which type of key our table should have. We have to specify the Partition Key in order to query the table data. You could carry out scan operations on Tables without specifying the Partition Key, but this approach should only be used when necessary and is not recommended.
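The difference between a query and a scan can be illustrated with a toy in-memory model. This is not how DynamoDB is implemented; `ToyTable` and its data are hypothetical and exist only to show how much data each operation has to examine.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a table with a composite key (not DynamoDB itself)."""

    def __init__(self):
        # partition key -> list of (sort key, item), kept ordered by sort key
        self._partitions = defaultdict(list)

    def put(self, partition_key, sort_key, item):
        rows = self._partitions[partition_key]
        rows.append((sort_key, item))
        rows.sort(key=lambda row: row[0])

    def query(self, partition_key, sort_key_from=None):
        """Look only at items that share the given partition key."""
        rows = self._partitions.get(partition_key, [])
        matched = [
            item for sort_key, item in rows
            if sort_key_from is None or sort_key >= sort_key_from
        ]
        return matched, len(rows)  # (items returned, items examined)

    def scan(self, predicate):
        """Look at every item in the table and filter afterwards."""
        matched, examined = [], 0
        for rows in self._partitions.values():
            for _, item in rows:
                examined += 1
                if predicate(item):
                    matched.append(item)
        return matched, examined


table = ToyTable()
for user, created_at, text in [
    ("u1", 3, "Go to sleep"), ("u1", 1, "Wake up"), ("u1", 2, "Write post"),
    ("u2", 1, "Buy milk"), ("u2", 2, "Call mum"),
]:
    table.put(user, created_at, {"user": user, "item": text})

queried, query_examined = table.query("u1")
scanned, scan_examined = table.scan(lambda item: item["user"] == "u1")
```

Both calls return the same three items, but the query examines three items while the scan examines all five. On a real table with millions of items, that gap is what makes scans slow and expensive.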
That being said, DynamoDB is not a silver bullet for every data-driven use case. I would recommend that you read Why Amazon DynamoDB isn’t for everyone by Forrest Brazeal and You probably shouldn’t use DynamoDB by Jono MacDougall to get a more thorough picture of DynamoDB’s limitations and form a more holistic view of the kind of database solution it is. As mentioned before, the aim of this post was to showcase its technical offerings for the data challenges you may face as you build your application.
If you would like a simple walkthrough of DynamoDB in action, you can check out one of my previous posts called Store & Fetch from DynamoDB with AWS Lambda.