We Live on the Planet of the Apps…

Published in

NoSQL in Azure

9 min readSep 9, 2016

Meet a modern app developer — Grace. She is trying to build an app that is fast, global and flexible. Let’s call it GraceApp. You never really understand a person until you consider things from her (or his) point of view, so let’s climb into Grace’s shoes and walk around in them for a little bit to understand her true pain points in developing GraceApp.

As Grace (aka modern app developer), what do I really need for my app? I need performance, elasticity, availability, global scale and an easy way to make changes in my app. So put it simply, a survival circa 2016 and beyond is something as follows:

The reality is that most relational systems cannot fulfill these requirements for a GraceApp. And it’s not because they are being naive or stupid, but rather they are burdened by their implicit architecture constraints and legacy:

If you are still skeptical (that’s good! you are paying attention), let’s consider a few very simple examples that you may encounter today:

E-commerce: Order-processing systems are largely “done” (that’s what RDBMSs are good at). The primary focus is on better search and recommendations or adapting content or prices on the fly (that’s where NoSQL shines).
Mobile (e.g., payments): Processing a credit card is an ‘easy’ part (that’s what RDBMSs are good at). The hard part is making it location-aware, so it knows where you are and what you are buying and when and in what context (that’s where NoSQL comes in).
Web: The focus today is on better recommendations, personalization (NoSQL’s strength), not processing monthly bills (RDBMS’s strength).

So let’s go back to a fundamental set of “gravitational problems” that a modern app developer faces.

‘Gravitational Problems’ for a Modern App Developer

1. Elasticity

1.a Elasticity of Storage

And the story goes like this (let’s assume GraceApp is incredibly successful):

Day 0: storage needs = 500GB
Day 30: storage needs = 3TB
Day 90: storage needs = 100TB
Day 360: storage needs = 5PB

The exact numbers above aren’t as important as the relative trend, going from GBs to TBs to 100s of TBs to PBs in a matter of days. The typical solution for relational or any other on-prem NoSQL system is “time to buy and configure more hardware”. You buy more RAM, CPU, hard disks, so that you can scale to your app needs. Obviously, once you scale up, it’s hard to scale down (you own the damn hardware), so elasticity of storage is out of the question.

Oh, and one more App caveat: as you are scaling your backend data platform, the partitioning of data (e.g., with horizontal scaleout) must happen extremely fast, without effecting any of the queries that might be mid-flight. Add to that TBs and PBs of data and hundreds or thousands (potentially millions) of queries — it’s a hard problem!

1.b Elasticity of Throughput

Throughout elasticity = it’s complicated! To give you a mental picture analogy, imagine a blood circulatory system. It’s kind of like that. Even for a single app, throughput elasticity requirements are very diverse and dynamic.

Elasticity of Throughput — Visual Analogy

Both read and write throughput latencies may vary depending on the geographical location and time of the day, from single digit read and write latencies to tens of thousands per second. How so, you may wonder? Well, consider a simple example, Grace’s app has users both in New York City, USA and in New Delhi, India — all using it simultaneously.

When it’s 8PM in New Delhi, the city is buzzing, there is lots of activity: reads and writes are pounding the App. At the same time, it’s 5AM in New York City, and the activity is very slow there. So the the throughput requirements for the same App are very versatile and have a high degree of volatility.

2. Performance Must be “Kick-Ass”

When it comes to performance, GraceApp needs lightning speed. How do you achieve that and “defeat the speed of light”? You serve local! In other words, you put your data where your users are. Here is a Globally Distributed Store Front simulation (let’s call it a GraceApp simulation) running against Azure DocumentDB.

The closer you are to the source of your data, the smaller is your latency. Go play with it and see for yourself! Btw, here are a few links that are powered by DocumentDB providing low latency regional access:

MIXIT — this is a place where you can select the paint product lines and documentations used in your shop. The system will remember these settings next time you log in (regardless of where you are on the globe).
Xbox Design Lab —this is a place where you can design your own personal controller with over 8 million possible color combinations. Give it a try. It’s actually pretty neat.

3. Availability : Nobody wants their App down, right?

That’s a “no-brainer”. An app must be available everywhere and at all times. And if things should go down, I (the App developer) should be able to specify a policy based fail-over.

You can see… the plot really thickens now…

4. Consistency: Trade-offs or Choices

With all of the above requirements about performance, scalability, elasticity, and availability, the choice for most distributed Apps today boils down to a tradeoff:

Red Pill: Strong consistency w/ high latency
Green Pill: Eventual consistency w/ low latency

There is no such thing as a free lunch. In theoretical computer science, the CAP theorem, also named Brewer’s theorem after computer scientist Eric Brewer, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantee. You have to pick 2.

Why is this important? CAP theorem describes the trade-offs involved in distributed systems. A proper understanding of CAP theorem is essential to making decisions about the future of distributed database system design. Misunderstanding can lead to erroneous or inappropriate design choices. Brewer proposed CAP theorem to describe the system’s behavior - consistency/availability/latency tradeoffs (in the face of failures/partitions or in steady state). 12 years later he revisited it again in CAP Twelve Years Later: How the “Rules” Have Changed.

Daniel Abadi’s Consistency Tradeoffs in Modern Distributed Database System Design proposed an alternative formulation of CAP theorem is PACELC, an acronym for Partition — Availability vs. Consistency, Else — Latency vs. Consistency. Abadi’s suggested improvements over CAP by proposing PACELC as a better way to reason over the steady state vs. failure cases.

If there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency (L) and consistency (C)? As such, PACELC divides the situation by partition, giving you more options: PA/EL, PA/EC, PC/EL, PC/EC.

Martin Kleppmann in “Please stop calling databases CP or AP” and a few others have realized that PACELC is too over-simplified since the real-world systems are too complex/subtle.

Doug Terry’s definitions in his “Replicated Data Consistency Explained Through Baseball” are better and illustrate a more precise approach for capturing the consistency aspects. Please watch Doug Terry’s talk sometime.

The phenomenon of the consistency/availability/latency tradeoffs (in the face of failures/partitions or in steady state) is a continuum (thank you, @dharmashukla for this great point!).

The main takeaway I want you to remember is that when faced with CAP theorem (i.e., Low Latency + Consistency vs. Availability Tradeoff) you should pick what makes the most practical sense for your App.

That was the motivation behind having Programmable Consistency Levels in DocumentDB.

An application developer decides which consistency level makes the most sense for her App out of the following 4 available today (Strong, Bounded Staleness, Session and Eventual) consistency levels. You can change them later at any time, if the application needs change:

Strong: Strong consistency offers a linearizability guarantee with the reads guaranteed to return the most recent version of data. Strong consistency guarantees that a write is only visible after it is committed durably by the majority quorum of replicas.
Bounded staleness: Bounded staleness consistency guarantees that the reads may lag behind writes by at most K versions or prefixes of a document or t time-interval. Bounded staleness offers total global order except within the “staleness window”. For globally distributed applications, bounded staleness is recommended for scenarios where you would like to have strong consistency but also want 99.99% availability and low latency.
Session Consistency: Session consistency is ideal for all scenarios where a device or user session is involved since it guarantees monotonic reads, monotonic writes, and read your own writes (RYW) guarantees. Session consistency provides predictable consistency for a session, and maximum read throughput while offering the lowest latency writes and reads.
Eventual: Eventual consistency guarantees that in absence of any further writes, the replicas within the group will eventually converge. Eventual consistency is the weakest form of consistency where a client may get the values that are older than the ones it had seen before. Eventual consistency provides the weakest read consistency but offers the lowest latency for both reads and writes.

5. Global Scale

We live in a global society. I may be living in Western US, but most of my customers may be in Western Europe, and most of my usage growth may be in Asia, and some of my most loyal customers may be narrowed down to a particular country.

Keeping the whole world satisfied is no easy task… for an App developer.

Developing an application that is backed up by a managed cloud service gives and App developer a unique opportunity to reach new markets and customers that span the globe. The reach of an App is no longer constrained by a business’ ability to build its IT infrastructure but rather how quickly it can develop applications that scale globally. Not only must these applications be accessible from anywhere, but in order to win customer loyalty they must be responsive and highly available.

So Let’s Recap for Now

GraceApp, in order to be successful, needs:

But wait… Grace (still in App developer shoes, remember…) wants all of the above, with Enterprise Level SLAs for every single one of these things!

6. Enterprise Level SLAs

Why SLAs? Well, Grace wants to focus on her application without having to deal with data infrastructure. All she wants to do is create a collection for data, set her perf/scale/availability/consistency requirements, and let the data service handle the rest.

By now (if you are human :)) you should be wondering: