Twenty years ago, RAM was really expensive and disk was comparatively cheap. As a result, most applications served data off hard disk, with only a small fraction kept in RAM. Disks had an interesting property: random reads were very slow, because the bulk of the time was spent on a seek or rotation. So if you wanted high read bandwidth from a disk, you had to avoid random reads and do large sequential scans. The same went for writes: large sequential writes gave much higher throughput than random writes scattered anywhere.
Some very smart folks invented a data structure called the Log-Structured Merge (LSM) tree in 1996. This data structure can be used to build databases that maximize read/write throughput by avoiding random seeks. Many of today's modern database systems, BigTable, HBase, Cassandra, LevelDB, and RocksDB to name a few, are based on this idea. As an example, Google's BigTable was built on it and had a couple of salient properties:
- Storing keys with the same prefix together on the physical disk
- Completely avoiding random writes and instead doing append-only writes and large batch writes
The first one meant that if you had to scan a part of a dataset, you could design its key space so that the data ends up physically consecutive on disk, after which it can be read in one large sequential read yielding very high read bandwidth. The second ensured that the system could keep up with a very high write throughput. The LSM tree was an ingenious idea, invented precisely to avoid random reads/writes on disk.
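The core mechanics can be sketched in a few lines of Python. This is a toy illustration of the idea, not any real database's implementation; all the names here (`ToyLSM`, `memtable_limit`, and so on) are made up for the example:

```python
import bisect

class ToyLSM:
    """Toy LSM store: buffered writes in RAM, sorted immutable runs on 'disk'."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}              # recent writes, held in RAM
        self.runs = []                  # sorted (key, value) lists, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # One large sequential write of sorted data -- never a random seek.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:        # freshest data first
            return self.memtable[key]
        for run in reversed(self.runs): # then newest run to oldest
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

A real LSM tree also merges (compacts) runs in the background and uses Bloom filters to skip runs during lookups, but the write path, buffer in RAM, then flush sorted data sequentially, is exactly the point of the design.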
This idea of doing append-only writes and only large sequential scans is actually pretty common even outside of databases. All MapReduce systems avoid small random reads/writes like the plague. Queues are nice because they naturally have this property (in addition to another nice property of maintaining a total order), and we all know how ubiquitous distributed queueing systems like Kafka have become in the last few years.
My main point is this: a lot of architectural design decisions have conventionally been driven by the constraint of minimizing random reads/writes in favor of sequential access.
This is all nice and dandy. Meanwhile, here are some interesting trends that have been playing out in the last 15–20 years:
RAM prices have come down substantially.
The price per GB of RAM came down by 6000x (!) between 1995 and 2015. In fact, RAM now costs less per GB than disk did 15–20 years ago. Because of this, we are seeing larger and larger parts of full datasets stored, manipulated, and accessed directly from RAM. AWS has boxes that contain ~2 TB of RAM! Yes, that's a fucking T and not a G! At Quora, we too routinely use AWS boxes with 100+ gigs of RAM.
The cost of developer time isn’t going down as quickly as that of hardware.
If anything, I would suspect that developers have become costlier over time, at least in the last 10 years or so. (But I'm too lazy to look up data to verify this, and it sounds obvious enough that I think many people will readily agree.) Either way, the cost of developers isn't decreasing as quickly as that of RAM.
As a result of this, it's become more important than ever to support fast, highly interactive ad-hoc analysis queries. With a steady decline in RAM prices, it is no wonder that memory-based systems like Presto, Flink, and Spark are quickly replacing their conventional disk-based counterparts.
Machine learning has become more mainstream.
Data mining and machine learning algorithms have become more and more mainstream. During the training phase, these algorithms have to do a lot of heavy computation over very large volumes of data. Thanks to the Von Neumann bottleneck in computing, any iterative, data-intensive computation is almost always bottlenecked on memory bandwidth. As a result, it's critical to maintain high memory bandwidth while training ML models. I cannot overstate the importance of memory bandwidth for machine learning. This is one of the (many) reasons why GPUs work so much better than CPUs. In fact, researchers at UC Berkeley developed the Roofline model, built around hardware memory bandwidths, to evaluate the performance of compute kernels like those used in ML training.
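The Roofline idea fits in one line of arithmetic: attainable throughput is the minimum of peak compute and memory bandwidth times the kernel's arithmetic intensity (flops per byte of memory traffic). The peak numbers below are illustrative assumptions, not measurements of any particular machine:

```python
def roofline_gflops(intensity_flops_per_byte,
                    peak_gflops=500.0,          # assumed peak compute
                    peak_bw_gbytes_per_s=50.0): # assumed memory bandwidth
    """Attainable GFLOP/s, capped by either compute or memory traffic."""
    return min(peak_gflops, peak_bw_gbytes_per_s * intensity_flops_per_byte)

# A double-precision dot product does 2 flops (multiply + add) per
# 16 bytes loaded, i.e. an arithmetic intensity of 0.125 flops/byte:
print(roofline_gflops(0.125))  # 6.25 -- memory-bound, nowhere near peak
print(roofline_gflops(100.0))  # 500.0 -- compute-bound once intensity is high
```

On these assumed numbers, the dot product can never get within two orders of magnitude of peak compute: the memory bus, not the ALUs, sets the ceiling.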
All in all, this means that the developers of these algorithms have to be careful to avoid random RAM lookups and instead favor large sequential reads.
The latency of random RAM lookups hasn't improved much while our data grows every day.
While memory bandwidth went up by more than 100x between 1995 and 2015, the latency of a random RAM lookup has mostly stayed flat. Meanwhile, our datasets have become larger and larger. According to some estimates, the total size of digital data is doubling every two years. There are plenty of companies today with datasets significantly larger than the largest datasets of 20 years ago. At Quora too, our product usage, and hence the size of our data, has been growing exponentially for many years.
Due to CPU cache hierarchies, the empirical cost of a random memory access is not O(1) but closer to O(sqrt N). Since our datasets (the N in O(sqrt N)) have grown over time, random memory reads have become increasingly bad for latency-sensitive applications. As discussed earlier, machine learning is quickly becoming mainstream. In the prediction phase, an ML algorithm often needs to look at a lot of data. For instance, the recommendation engine powering Quora's feed routinely looks at tens of MBs of data (and sometimes far more) during a single feed request, with a latency budget of only 100 ms. So it is very important to optimize data placement such that data accessed together is close together in memory.
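The effect is easy to demonstrate. The sketch below visits the same array once in sequential order and once in random order; exact timings vary by machine, and the gap is muted by Python's interpreter overhead (it is far larger in a compiled language), but the random pass is reliably slower because most accesses miss the CPU caches:

```python
import array
import random
import time

n = 2_000_000
data = array.array("q", range(n))   # ~16 MB of 64-bit ints, far bigger than cache

seq_order = list(range(n))
rand_order = seq_order[:]
random.shuffle(rand_order)          # same indices, cache-hostile order

def visit(order):
    """Sum the array elements in the given index order, returning (seconds, sum)."""
    t0 = time.perf_counter()
    total = 0
    for i in order:
        total += data[i]
    return time.perf_counter() - t0, total

t_seq, s1 = visit(seq_order)
t_rand, s2 = visit(rand_order)
print(f"sequential: {t_seq:.3f}s  random: {t_rand:.3f}s")
```

Both passes touch exactly the same bytes and compute exactly the same sum; only the access order differs.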
In fact, I think this is a broader trend: more developers will have to be careful about the locality of their data in RAM in order to keep their applications fast as datasets keep growing.
My main takeaways from all of this are:
- We will continue using RAM in a lot of places where we have conventionally used disk. To be fair, disks cannot be fully replaced by RAM, because of their ability to persist data cheaply. We will continue to store logs and full ground-truth datasets on disk. However, I believe we will see more and more systems serve larger and larger portions of their data directly from RAM.
- Just as it has always been with disk, more and more architectural decisions will be motivated by the need to optimize memory read/write bandwidth. This will mean a reduction in completely random access of RAM in favor of large sequential (or at least local) accesses.
It is perhaps only fitting that HBase 2.0, the next major version of HBase (an open-source version of Google's BigTable), is introducing an LSM-style pattern even in the in-RAM part of the database, the MemStore!
References:
- https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html
- http://www.crucial.com/usa/en/memory-performance-speed-latency