Did Google Send the Big Data Industry on a 10 Year Head Fake?

In the data chronicles, The Google File System paper in 2003 marked a seminal moment for software development. It was one of the first times Google detailed the technical underpinnings that helped it catapult from clever idea to global force.

The paper showed the world how Google managed all of its data, in a distributed fashion, using the Google File System. Distributed computing was not new at the time, but the Google File System demonstrated the power of distributed computing to a wide technical audience.

The following year in 2004, Google shared another paper on MapReduce, further cementing the genealogy of big data. MapReduce was a new technique to move computation to data, and it allowed large web companies including Google to operate with enormous amounts of information, such as the entire internet. Soon after that, Doug Cutting, Hadoop’s initial creator, began to implement MapReduce and the Hadoop Distributed File System while at Yahoo, and in 2006 Hadoop 0.1.0 was released.
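The core idea of MapReduce is easy to sketch: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The following is a minimal single-process sketch of that model (the canonical word-count example, not Google's or Hadoop's actual implementation); a real deployment runs the phases in parallel across many machines, moving the computation to where the data lives.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group; here, sum the counts per word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts["the"] == 2
```

The appeal of the model is that map and reduce are pure functions, so the framework can split the input, run them anywhere, and retry failures without changing the result.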

Six years later in 2012, Hadoop 1.0 became available.

Now in 2017, the Hadoop hype has come and gone.

But let’s jump back to 2012. What was Google doing then? It was talking about what we know today as Spanner: Google’s Globally-Distributed Database. From the Spanner team, “Spanner supports general-purpose transactions, and provides a SQL-based query language.” Note that this differs from Hadoop’s original plan; we will return to this point.

Now let’s jump forward again to May 2017, when the Google team released the second major Spanner paper, this time titled Spanner: Becoming a SQL System. The authors state:

Google’s Spanner started out as a key-value store offering multi-row transactions, external consistency, and transparent failover across datacenters. Over the past 7 years it has evolved into a relational database system. In that time we have added a strongly-typed schema system and a SQL query processor, among other features.

Note the reference to seven years earlier, which puts the shift at 2010. Even before Hadoop 1.0 (an implicit nod to the Google File System and MapReduce) was released in 2012, Google was recognizing the limits of non-relational systems and had already begun to course correct.

Industry watchers will note that Hadoop was intended to be analytics-focused, while Spanner began and has evolved as a transactional system, so perhaps the two represent different approaches. However, the Spanner team notes in its most recent paper:

We describe replacing our Bigtable-like SSTable stack with a blockwise-columnar store called Ressi which is better optimized for hybrid OLTP/OLAP query workloads

Spanner’s ambitions go beyond transactional workloads to analytics as well. These mixed-workload environments are highly efficient at processing data in real time, because they often eliminate the need for multiple datastores and the need to move data between them. You can ingest, store, and analyze data in one place.
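The hybrid OLTP/OLAP idea can be illustrated with a toy example (using SQLite here purely for illustration, not Spanner itself): transactional writes and analytical reads run against the same datastore, so no pipeline is needed to copy data into a separate warehouse.

```python
import sqlite3

# One datastore serves both workload types in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# OLTP-style ingest: many small transactional writes.
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("us", 10.0), ("us", 15.0), ("eu", 7.5)],
)
conn.commit()

# OLAP-style aggregation against the same store; no data movement.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
# rows == [("eu", 7.5), ("us", 25.0)]
```

What makes a system like Spanner notable is doing this at global scale with strong consistency; the Ressi blockwise-columnar store mentioned above is what lets the analytical side of such queries run efficiently.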

The timing of technical papers from Google does seem ironic, given the rise and reality of Hadoop. Especially poignant is that even before the industry had caught up to the first wave of technology Google shared, Google was well on to the next. As the second most valuable company in the world today, Google is clearly doing several things right.