Exploring a Tech Stack : Part 2 — Databases

My Journey Into a World-class Website’s Tech Stack

Aug 5, 2020

In parts 1A and 1B I covered the Production environment used by Medium back in 2015, as detailed in this article. This week I’ll be looking into the data storage they were using.

The technologies in this section are:

NoSQL databases (Amazon DynamoDB)
SQL databases (Amazon RDS for Aurora)
In-memory databases (Redis)
Graph databases (Neo4J)

My first question is what are the advantages and disadvantages of each type of database? Why use multiple types of databases?

In one case it looks like it’s done for speed (Redis in front of DynamoDB); in the others (Aurora and Neo4J), for the way the data is queried and how records relate to one another. Dan states DynamoDB is Medium’s primary data store.

What is DynamoDB?

Amazon DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale. It’s a fully managed, multiregion, multimaster, durable database with built-in security, backup and restore, and in-memory caching for internet-scale applications. DynamoDB can handle more than 10 trillion requests per day and can support peaks of more than 20 million requests per second. [1]

Pulling out some keywords:

key-value — stores data as a simple key and a value. The value is “opaque”: the database cannot access or index the data inside it [2]. “Key-value databases are the simplest of the NoSQL databases. The basic data structure is a dictionary or map”. [3]
document — data is stored in a structured, transparent format that allows the database to access and index the data inside [2].
fully managed — no need for server provisioning (setting up the server for initial launch), patching (updating the server software for security and bug fixes), management or software installation, maintenance and operation.
multiregion — ability to replicate data to servers located in different regions around the world so users can access data locally
multimaster — every regional replica of a table accepts writes, rather than a single master. With Global Tables, data that would normally be replicated only within one geographic region is replicated across multiple regions.
durable — one of the ACID properties (the requirements for being ACID-compliant). Durability ensures that once a database transaction has finished, its changes persist even if there is a system failure.
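To make the key-value vs document distinction concrete for myself, here is a minimal sketch in plain Python. It does not use DynamoDB or any real database client; the dictionaries, the build_index helper and the sample users are all made up for illustration.

```python
import json

# Key-value store: the database sees each value only as opaque bytes.
kv_store = {}
kv_store["user:1"] = json.dumps({"name": "Ada", "city": "London"}).encode()
kv_store["user:2"] = json.dumps({"name": "Grace", "city": "New York"}).encode()

# The only cheap operation is a lookup by key; the application decodes the value.
ada = json.loads(kv_store["user:1"])

# Document store: the database understands the structure of each value,
# so it can maintain an index on a field inside the document.
documents = {
    "user:1": {"name": "Ada", "city": "London"},
    "user:2": {"name": "Grace", "city": "New York"},
    "user:3": {"name": "Alan", "city": "London"},
}

def build_index(docs, field):
    """Map each value of `field` to the keys of the documents that hold it."""
    index = {}
    for key, doc in docs.items():
        index.setdefault(doc[field], []).append(key)
    return index

city_index = build_index(documents, "city")
londoners = city_index["London"]
print(ada["name"], londoners)
```

With the opaque values, finding everyone in London would mean fetching and decoding every record in application code. With the document form, the store itself can answer the question from the index.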

And adding:

highly available — refers to uptime: how likely we are to find the server running the way it should. SLAs (Service Level Agreements) define what counts as uptime and downtime, and what service credits are owed for outages (DynamoDB SLA). ISPs (Internet Service Providers) offer similar SLAs to cover outages, mostly for business and enterprise level services (Verizon SLA).
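Uptime percentages are easier to appreciate once they are converted into minutes. The sketch below does that arithmetic for a 30-day period. The percentages are generic examples I picked, not figures from the DynamoDB or Verizon SLAs, and allowed_downtime_minutes is a name I made up.

```python
def allowed_downtime_minutes(uptime_percent, days=30):
    """Minutes of downtime permitted in a period of `days` at the given uptime."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% uptime allows about {minutes:.2f} minutes of downtime per 30 days")
```

Each extra nine cuts the allowance by a factor of ten: 99.9% leaves roughly 43 minutes a month, while 99.99% leaves a little over 4.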

Additional Questions:

Q. (Related to key-value) What is the difference between a map and a dictionary?

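A. In most contexts they are two names for the same thing, an associative array that maps keys to values; which word is used depends on the programming language (Stack Overflow link).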

Q. (related to multimaster, multiregion) What regions are available for DynamoDB?

A. They are:

US East (N. Virginia)
US East (Ohio)
US West (Northern California)
US West (Oregon)
Asia Pacific (Mumbai)
Asia Pacific (Seoul)
Asia Pacific (Singapore)
Asia Pacific (Tokyo)
Canada (Central)
Europe (Frankfurt)
Europe (Ireland)
Europe (London)
Europe (Paris)
South America (São Paulo)
AWS GovCloud (US-East)
AWS GovCloud (US-West)

Q. What is ACID with regard to relational databases?

A. It stands for:

Atomic — all or nothing. All operations of the database transaction must occur or none may occur.
Consistent — valid data, or rolled back. Data written to the database must follow the rules of the database. Transactions that produce inconsistent or invalid data must be rolled back.
Isolated — transactions don’t interfere with each other. Concurrent transactions must produce the same result as if they had been applied one after another, even when multiple users initiate transactions targeting the same tables. The database must handle concurrent operations gracefully.
Durable — completed transactions are saved, even in the event of failure. Once the database says it has completed the transaction, the changes should survive a crash or restart.
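Atomicity and consistency are easiest to see in a small experiment. This is a minimal sketch using Python’s built-in sqlite3 module rather than Aurora; the accounts table, the names and the transfer helper are hypothetical. The CHECK constraint plays the part of “the rules of the database”.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, sender, receiver, amount):
    """Move `amount` between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

ok = transfer(conn, "alice", "bob", 30)         # valid transfer
rejected = transfer(conn, "alice", "bob", 500)  # would push alice below zero
print(ok, rejected, balances(conn))
```

In the rejected transfer, bob’s credit has already been applied when alice’s debit violates the constraint. The rollback undoes the credit as well, so the balances are exactly what they were after the first transfer.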

Why use RDS (SQL) when you already have a DynamoDB (NoSQL) database?

It looks like it comes down to the way data is stored and queried. The key-value and document-based storage of NoSQL databases is useful for fast, simple queries. These databases also scale horizontally, which helps performance and flexibility as load grows. However, that simplicity also makes complex data queries difficult to construct. Automation for Jira switched to SQL databases because complex data queries were easier to implement. Here’s an interesting read on the resurgent popularity of SQL, with lots of followup links on database history. And two more articles about the differences between SQL and NoSQL and when to use each:

Guru 99: SQL vs NoSQL: What’s the difference?
Sitepoint: SQL vs NoSQL: The differences by Craig Buckler
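To get a feel for what a “complex data query” means here, I put together a minimal sketch with Python’s built-in sqlite3 module. The users and posts tables, their columns and the sample rows are all invented for illustration and say nothing about Medium’s actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        author_id INTEGER REFERENCES users(id),
        claps INTEGER
    );
    INSERT INTO users VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Alan');
    INSERT INTO posts VALUES (1, 1, 10), (2, 1, 25), (3, 2, 5), (4, 3, 40);
""")

# One declarative query: join two tables, aggregate, filter and sort.
rows = conn.execute("""
    SELECT u.name, COUNT(p.id) AS post_count, SUM(p.claps) AS total_claps
    FROM users AS u
    JOIN posts AS p ON p.author_id = u.id
    GROUP BY u.id
    HAVING SUM(p.claps) >= 20
    ORDER BY total_claps DESC
""").fetchall()
print(rows)
```

In a pure key-value store the same question would typically mean reading every post, grouping and summing in application code, and then looking up each author by key.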

Why use Redis when DynamoDB has DAX?

Firstly, what is Redis?

“Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker.” (redis.io) Because the data is held in memory, access is faster than reading from even the fastest SSD.

Next, what is DAX?

“Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache for DynamoDB that delivers up to a 10x performance improvement — from milliseconds to microseconds — even at millions of requests per second. DAX does all the heavy lifting required to add in-memory acceleration to your DynamoDB tables, without requiring developers to manage cache invalidation, data population, or cluster management.” — aws.amazon.com

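What’s a message broker? IBM has an in-depth explanation. In short, it is an intermediary that sits between applications: it receives messages from senders and routes them to the right receivers, translating between messaging formats or protocols where needed. To me that translation role seems very much like a modem modulating data from one side’s language and demodulating it for the other. A translator, or interpreter of sorts, for data streams.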
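The routing half of that description can be sketched in a few lines of Python. This is a toy, in-process stand-in and not how Redis implements publish/subscribe; the Broker class and the topic names are hypothetical.

```python
from collections import defaultdict

class Broker:
    """Toy in-process broker: routes messages published on a topic to subscribers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        """Register `handler` to be called with every message on `topic`."""
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        """Deliver `message` to every handler on `topic`; return the delivery count."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)

broker = Broker()
emails, audit_log = [], []
broker.subscribe("post.published", emails.append)
broker.subscribe("post.published", audit_log.append)

delivered = broker.publish("post.published", {"post_id": 42})
ignored = broker.publish("post.deleted", {"post_id": 7})  # nobody is listening
print(delivered, ignored)
```

The publisher never learns who receives the message, which is the decoupling a broker provides. A real broker adds what this sketch leaves out, such as network transport, queuing and delivery guarantees.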

How much faster is an in-memory cache than a disk-backed cache or database?

A lot.

Here’s a great introduction at Brown University about caching (not just database servers, but on a computer in general).

Here’s a nice illustrative example of a 1TB Samsung SSD vs a RAM disk running on DDR4-3000 dual channel RAM.

Here’s a good overview of IMDBs (in-memory databases), covering performance, size constraints (a machine can only hold so much RAM) and persistence (what happens when the computer crashes or reboots?).

What are some pitfalls when it comes to caching databases?

Caching involves the need for constant synchronization and validation of data to make sure it remains consistent with the data store. Microsoft’s article on Cloud Design Patterns — Cache-aside pattern contains a useful list of considerations including:

Data lifetime — how long data should be kept. It’s important to find a balance between updating too frequently (which takes resources and time) and too infrequently (which leaves stale data).
Data eviction — how to handle the situation when new data is ready to be added but the cache is full. Some data needs to go, to make room for the new.
Data consistency — how consistent do you want your data to be? The more consistent it needs to be, the more frequently the cache must be updated. Eventual consistency is a balanced approach.
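Those three considerations map onto code fairly directly. Here is a minimal cache-aside sketch in plain Python, assuming a time-to-live for data lifetime, least-recently-used eviction for a full cache, and explicit invalidation after writes for consistency. The CacheAside class, its parameters and the dictionary standing in for the data store are all hypothetical, and a real setup would put something like Redis where the OrderedDict is.

```python
import time
from collections import OrderedDict

class CacheAside:
    """Cache-aside sketch with a time-to-live and least-recently-used eviction."""

    def __init__(self, load_from_store, max_items=2, ttl_seconds=60.0,
                 clock=time.monotonic):
        self._load = load_from_store
        self._max_items = max_items
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (value, expiry time)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            self._entries.move_to_end(key)  # mark as recently used
            return entry[0]
        value = self._load(key)  # miss or expired: go to the data store
        self._entries[key] = (value, self._clock() + self._ttl)
        self._entries.move_to_end(key)
        if len(self._entries) > self._max_items:
            self._entries.popitem(last=False)  # evict the least recently used
        return value

    def invalidate(self, key):
        """Call after writing to the data store so the next read reloads."""
        self._entries.pop(key, None)

store = {"a": 1, "b": 2, "c": 3}  # stands in for the real database
loads = []                        # records every trip to the store

def load(key):
    loads.append(key)
    return store[key]

cache = CacheAside(load)
first = cache.get("a")   # miss: loaded from the store
second = cache.get("a")  # hit: served from the cache
store["a"] = 100         # write to the store...
cache.invalidate("a")    # ...and drop the stale cached copy
third = cache.get("a")   # miss again: reloads the fresh value
cache.get("b")
cache.get("c")           # the cache holds two items, so "a" is evicted
print(first, second, third, loads)
```

Forgetting the invalidate call is the classic pitfall: the cache would keep serving the old value of "a" until its time-to-live ran out.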

So, why use Redis when DynamoDB already has DAX?

After some research I found out that DAX was launched in June of 2017, so it looks like it wasn’t available at the time the article was written.

Neo4j — Graph Database

Apparently graph databases are the future of humanity and without them society will not be able to evolve. At least, that’s the impression I get from the literature and videos about graph databases.

They have a distinct advantage when it comes to mapping and storing relationships between data, and querying data from these databases, in certain cases, is simplified.
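To see why relationship queries are a natural fit for a graph, here is a minimal sketch of a “who is within two hops of this person” lookup in plain Python. It uses an adjacency list and a breadth-first search, not Neo4j or its query language; the follows data and the within_hops function are invented for illustration.

```python
from collections import deque

# Hypothetical "follows" relationships, stored as an adjacency list.
follows = {
    "ada": ["grace", "alan"],
    "grace": ["linus"],
    "alan": ["linus", "ada"],
    "linus": [],
}

def within_hops(graph, start, max_hops):
    """Breadth-first search: nodes reachable from `start` in at most `max_hops` edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    del distance[start]  # the starting node is not part of the answer
    return distance

reachable = within_hops(follows, "ada", 2)
print(reachable)
```

In a relational schema the same question usually needs one self-join per hop, whereas a graph model treats following relationships as the basic operation.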

The “cleanest” examination of graph databases I found is the video “A Skeptics Guide to Graph Databases” by David Bechberger. (Dave’s Twitter is a good one to follow for graph-related tweets.)

A lot of the material out there seems like marketing material from Neo4j, which I find semi-useful, but I have to constantly ask “Is that really how it is? Or are you holding back useful information that doesn’t promote your product?”

Footnotes:

[1] “Amazon DynamoDB”. https://aws.amazon.com/dynamodb/. Accessed 8/4/20.
[2] “What does “Document-oriented” vs. Key-Value mean when talking about MongoDB vs Cassandra?”. Answer by Pascal Thivent. Stack Overflow. https://stackoverflow.com/questions/3046001/what-does-document-oriented-vs-key-value-mean-when-talking-about-mongodb-vs-c. Accessed 8/4/20.
[3] “NoSQL Key-Value Database Simplicity vs. Document Database Flexibility”. Dan Sullivan and James Sullivan. 09/16/15. InformIT. https://www.informit.com/articles/article.aspx?p=2429466. Accessed 8/4/20.