A Brief History of MapReduce

Oscar Stiffelman
3 min read · Apr 6, 2018

MapReduce has been the most important technological innovation that I have witnessed in my career. To say that I was an early adopter is an understatement. My office was next to Jeff and Sanjay back when they were building it. One weekend, they killed a huge job I’d been running (I was working on the weekend because there were more computers available). I went to their office to complain and they explained that they needed the computers to test this new framework called MapReduce. After they described what it would do, I was no longer unhappy, because I realized I could wait for them to finish and then rewrite everything I’d been building with much less complexity using their code. The crazy thing about MapReduce was that it was designed to solve just one problem (inverting links to build the index), but it turned out to be an incredibly expressive computational building block sufficient for almost every large-scale data problem. There were certainly things that could not be expressed easily in it (multi-key joins, stateful updates…), but it became like a logic or an algebra that could be composed to cover most of the space that was important.

Over the years, I’ve thought a lot about why it was so effective. Most of us wanted to take computing techniques that we were familiar with for single computers and try to extend or generalize them to the large scale distributed setting. But MapReduce was very different because it started from the question of what could be done efficiently and reliably. It was part of a paradigm of computing based on transformations of immutable data into sorted records. The key insight in this paradigm: sorting > hashing.

Most developers only know of the Hadoop implementation which I am very critical of because of unfortunate design and performance problems. This is the downside of network effects. Sometimes bad technologies win. I had coffee with one of the founders of Cloudera years ago and I asked him how his customers could accept such a poor implementation. Ever the salesman, he responded “I tell them if it’s this bad now, imagine how much better it will get.” But that’s not how things usually work … bad design is hard to fix.

The simplicity of the fundamental operations (sorting and hashing) means that it is possible to build it yourself. And the simplicity further means that it’s possible to aggressively optimize the implementation. MapReduce is essentially just a parallel external sort, and there are a lot of clever ways to make that more efficient. For example, Knuth wrote about algorithms that were developed back in the 1960s to make sorting on tape drives more efficient (to use limited working memory more efficiently). He said that even though the technology is no longer relevant, the mathematical analysis was so beautiful that he decided to keep it in his books. Well, it turns out to be useful again, and I’m very glad that he kept that analysis. (By the way, if you study your Knuth, then you will also learn why sorting is so useful at scale … it is more efficient to sort than to hash even if all you care about is bucketizing.)
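To make the "parallel external sort" framing concrete, here is a minimal sketch (my own toy illustration, not Jeff and Sanjay's implementation) of the sort-then-merge core of a shuffle: each mapper's output is sorted locally into a run, the runs are merged in streaming fashion, and the reduce step is just a sequential scan over adjacent equal keys — no hash table ever has to hold the full key space in memory.

```python
import heapq
from itertools import groupby

# Hypothetical mapper outputs: lists of (key, value) pairs.
mapper_outputs = [
    [(b"b", 2), (b"a", 1), (b"c", 3)],
    [(b"a", 4), (b"b", 5)],
]

# Phase 1: sort each mapper's output locally, producing sorted runs.
runs = [sorted(run) for run in mapper_outputs]

# Phase 2: streaming k-way merge of the runs (the "shuffle").
merged = heapq.merge(*runs)

# Phase 3 (reduce): equal keys are now adjacent, so grouping and
# aggregating is a single sequential pass.
result = {k: sum(v for _, v in grp)
          for k, grp in groupby(merged, key=lambda kv: kv[0])}
# result == {b"a": 5, b"b": 7, b"c": 3}
```

At scale the runs live on disk and the merge streams from many machines, but the structure is the same — which is why the tape-era merge analysis applies directly.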

Another trick that I use is to maintain the intermediate data in byte-reversed order. This lets me interpret the data as streams of little-endian integers, so I can do 8x as many comparisons per cycle while preserving the same lexical ordering. I imagine the Hadoop community would protest that this requires commitment to lexical comparisons rather than a pluggable comparator. But there is a cost to generality, something which is hard to appreciate when looking at how rich an ecosystem is. At this point, I’ve probably written and run more mapreduces than just about anybody else on the planet. And after all this time, I’ve never needed anything more than lexical.
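The reason this works: lexical byte order coincides with numeric order of big-endian integers, and on a little-endian machine, reading byte-reversed data as a native integer is exactly a big-endian read of the original. A small demonstration (my own sketch with hypothetical 8-byte keys, not code from the post):

```python
import struct

# Hypothetical fixed-width 8-byte keys; bytewise (lexical) order is
# the ground truth we want to preserve.
keys = [b"\x00\x01ABCDEF", b"\x7fZZZZZZZ", b"\x00\x01ABCDEE"]

def as_reversed_le_int(key: bytes) -> int:
    # Store the key byte-reversed, then read it back as a single
    # little-endian 64-bit integer.  Reversal puts the original
    # first (most significant) byte in the integer's high-order
    # position, so one integer compare covers all 8 bytes at once.
    return struct.unpack("<Q", key[::-1])[0]

lexical = sorted(keys)
numeric = sorted(keys, key=as_reversed_le_int)
assert lexical == numeric  # one 64-bit compare replaces 8 byte compares
```

That single 64-bit comparison standing in for eight one-byte comparisons is where the 8x comes from.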

This post is already too long. My main point is that the simplicity and expressiveness are what justify investing so much work into it. Only when something is simple can you make it really efficient, and in this case it’s also extraordinarily expressive.


Oscar Stiffelman

I was an early Google engineer. Now I think about (and sometimes work on) prediction.