Disk can be faster than memory

Elliott Jin
3 min read · Mar 24, 2018


Conventional wisdom says that “disk is slow and memory is fast”, but the Kafka documentation mentions that this isn’t nuanced enough. Access patterns matter a lot; for example, sequential disk reads can be faster than random memory reads.

Let’s demonstrate this by benchmarking the simple problem of counting DNA nucleotides: given a DNA string (e.g. “GATTACA”), we want to count the occurrences of “A”, “T”, “G”, and “C”.
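As a quick sanity check of what we're computing, here's a minimal standalone Go program (my own illustration, not from the benchmark code) that counts the nucleotides in “GATTACA”; it prints A=3 T=2 G=1 C=1:

package main

import "fmt"

func main() {
    A, T, G, C := 0, 0, 0, 0
    for _, nucleotide := range []byte("GATTACA") {
        switch nucleotide {
        case 'A':
            A++
        case 'T':
            T++
        case 'G':
            G++
        case 'C':
            C++
        }
    }
    fmt.Printf("A=%d T=%d G=%d C=%d\n", A, T, G, C) // A=3 T=2 G=1 C=1
}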

We’ll start by generating some test data. For disk reads, we’ll generate test files (error checking is omitted for readability):

filename := fmt.Sprintf("dna-%d.txt", size)
f, _ := os.Create(filename)
buf := bufio.NewWriter(f)
for i := 0; i < size; i++ {
    buf.WriteByte("ATGC"[rand.Intn(4)])
}
buf.Flush()
f.Close()
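Since the benchmarks below cover sizes from 2¹⁵ to 2²⁵, this step presumably runs once per size. A minimal sketch of a complete generator program (the writeTestFile helper and the surrounding main are my additions, not the author's code):

package main

import (
    "bufio"
    "fmt"
    "math/rand"
    "os"
)

// writeTestFile writes size random nucleotides to dna-<size>.txt,
// exactly as in the snippet above.
func writeTestFile(size int) {
    filename := fmt.Sprintf("dna-%d.txt", size)
    f, _ := os.Create(filename) // error checking omitted, as in the article
    buf := bufio.NewWriter(f)
    for i := 0; i < size; i++ {
        buf.WriteByte("ATGC"[rand.Intn(4)])
    }
    buf.Flush()
    f.Close()
}

func main() {
    for exp := 15; exp <= 25; exp++ {
        writeTestFile(1 << exp)
    }
}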

And for memory reads, we’ll generate test arrays, together with index arrays that specify access patterns for the test arrays:

dna := make([]byte, size)
for i := 0; i < size; i++ {
    dna[i] = "ATGC"[rand.Intn(4)]
}
index := make([]int, size)
for i := 0; i < size; i++ {
    index[i] = i
}
if randomAccess {
    shuffle(index)
}
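The shuffle helper isn't shown in the original. Assuming Go 1.10 or later, a minimal implementation can delegate to rand.Shuffle, which performs a Fisher–Yates shuffle:

// shuffle randomly permutes the index array in place.
func shuffle(index []int) {
    rand.Shuffle(len(index), func(i, j int) {
        index[i], index[j] = index[j], index[i]
    })
}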

The actual benchmark is straightforward:

start := time.Now()
A, T, G, C := 0, 0, 0, 0
for i := 0; i < size; i++ {
    // TODO: Read a nucleotide from either disk or memory.
    switch nucleotide {
    case 'A':
        A++
    case 'T':
        T++
    case 'G':
        G++
    case 'C':
        C++
    }
}
elapsed := time.Since(start)

Reading from disk just involves creating a buffered reader:

filename := fmt.Sprintf("dna-%d.txt", size)
f, err := os.Open(filename)
if err != nil {
    log.Fatal(err)
}
buf := bufio.NewReader(f)

and using it to read one byte at a time (we can safely ignore the error here: the file contains exactly size bytes, so the loop never reads past EOF):

nucleotide, _ := buf.ReadByte()

Reading from memory is even more straightforward:

nucleotide := dna[index[i]]
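To show how these fragments fit together, here's one way the memory benchmark might be assembled. The countFromMemory name and the choice to time only the counting loop are my assumptions, not the author's code; the disk version is analogous, with nucleotide, _ := buf.ReadByte() in place of the array access:

// countFromMemory tallies nucleotides in dna, visiting elements in the
// order given by index, and returns the elapsed time for the loop.
func countFromMemory(dna []byte, index []int) time.Duration {
    start := time.Now()
    A, T, G, C := 0, 0, 0, 0
    for i := 0; i < len(dna); i++ {
        nucleotide := dna[index[i]]
        switch nucleotide {
        case 'A':
            A++
        case 'T':
            T++
        case 'G':
            G++
        case 'C':
            C++
        }
    }
    elapsed := time.Since(start)
    fmt.Println(A, T, G, C) // report the counts so the loop isn't dead code
    return elapsed
}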

Benchmarking a range of sizes from 2¹⁵ to 2²⁵ produces the following result on my laptop:

[chart: elapsed time vs. input size (log-log axes), laptop]

We probably want more than one data point, so here’s what happens on an Amazon EC2 t2.micro instance with a Linux AMI:

[chart: elapsed time vs. input size (log-log axes), EC2 t2.micro]

As claimed, sequential disk reads become faster than random memory reads as we increase the amount of data being processed.

By the way, the log-log axes make it easy to miss the magnitude of this result. On EC2, random memory reads were roughly 1.7× slower than sequential disk reads (0.94 seconds vs. 0.55 seconds) when processing 2²⁵ bytes (about 33 MB)!

Lesson learned: always question assumptions and benchmark things.

Future investigations

  • What happens with random disk reads from an SSD?
  • What happens with sequential disk reads from a spinning disk?
  • What exactly causes random memory read performance to degrade (relative to sequential reads) as the amount of data increases?
