What went so wrong with big data?

Saeed Zareian
3 min read · Nov 17, 2019


TL;DR

Baggage of historical choices and the JVM/GC. Let's learn some lessons from the telecom industry.

The Hadoop ecosystem: many of its components were written in Java.

A) Baggage of historical choices

1. Hadoop

The story starts with Hadoop's Java implementation, backed by Yahoo. The movement began with Google's MapReduce paper. The cloud computing community got very excited, and Doug Cutting, an open-source contributor at Yahoo, made the project public. Hadoop's main focus was its file system, HDFS. The first working iteration was as small as 5,000 lines of code.

However, Google's MapReduce was C++ code that was never released, and its focus was divide-and-conquer data processing over networked disks without excessive I/O. Nowadays, successful companies like MapR have re-implemented these Java-based open-source projects in C++ as proprietary intellectual property. MapR claims 6x the speed at 30% less cost. What a surprise!

At the same time, the community got excited and created connectors for Hadoop based on its nice interfaces. A whole industry has grown around this area, with companies like Simba, Confluent*, and others.

2. Spark

Then we saw the new Hadoop (MapReduce v2) and then Spark. Spark was at first similar to Hadoop, but its promise was to keep data in RDDs (Resilient Distributed Datasets): data structures that can live in memory across multiple machines. This advantage gave Spark a 10x performance gain over Hadoop. However, to keep the "Hadoop ecosystem benefits", i.e. the connectors, many parts stayed untouched. Spark's old bundled JSON library is a famous example that many of us have tried to override, along with the challenge of the distributed classpath and submitting a fat JAR file to avoid runtime errors.
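
To make the RDD promise concrete, here is a minimal Scala sketch (the HDFS path and filter terms are hypothetical) of how caching keeps data in executor memory between actions, instead of paying the disk round-trip that MapReduce makes between every stage:

```scala
import org.apache.spark.sql.SparkSession

object RddCacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-cache-sketch")
      .master("local[*]") // local run, just for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    // An RDD built from a (hypothetical) log file on HDFS.
    val lines = sc.textFile("hdfs:///data/events.log")

    // cache() keeps the filtered partitions in executor memory,
    // so later actions reuse them instead of re-reading the disk.
    val errors = lines.filter(_.contains("ERROR")).cache()

    // First action materializes and caches the RDD...
    println(errors.count())
    // ...second action reuses the cached partitions.
    println(errors.filter(_.contains("timeout")).count())

    spark.stop()
  }
}
```

Without the cache() call, the second count() would re-read the file from the source, which is essentially the stage-by-stage disk behavior Spark set out to avoid.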

B) JVM and GC

The JVM can be as performant as C++ if it is used right. That's true; I witnessed it at eBay while working with my experienced colleagues. In practice, though, I have rarely seen JVM developers who care about the scope of their variables, immutability, thread safety, or the length of their if/else blocks.
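
A contrived Scala sketch of what that carelessness costs on the JVM: both paths compute the same sum, but the first allocates tens of millions of short-lived objects, while the second, with a tightly scoped mutable local, allocates nothing per element:

```scala
// A contrived hot path: summing a field over millions of "events".
final case class Event(value: Long)

object GcPressure {
  def main(args: Array[String]): Unit = {
    val values: Array[Long] = Array.tabulate(10000000)(i => (i % 100).toLong)

    // Careless style: wraps every element in an Event and builds
    // intermediate Lists, creating tens of millions of short-lived
    // objects for the GC to trace and collect.
    val boxedSum = values.toList.map(Event(_)).map(_.value).sum

    // Careful style: a primitive loop over the array with a tightly
    // scoped mutable local, with no per-event allocation at all.
    var sum = 0L
    var i = 0
    while (i < values.length) { sum += values(i); i += 1 }

    println(boxedSum == sum) // same result, very different GC profiles
  }
}
```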

This is not a big deal when developing APIs for small components, because the garbage collector (GC) is advanced enough to handle the common patterns of these issues. But how about a big data project with millions of events processed each second? Should we expect the GC to clean up after us? Well, the death of executors due to GC slowness is one of the first errors that rookie big data developers experience.
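
When those executor deaths show up, a typical first response is tuning the executor JVMs rather than fixing the allocation behavior. Here is a sketch using Spark's real configuration keys; the values are illustrative, not recommendations:

```scala
import org.apache.spark.SparkConf

object GcTuningSketch {
  // Illustrative values only; the right numbers depend on the workload.
  val conf = new SparkConf()
    .setAppName("gc-tuning-sketch")
    // More heap per executor makes long full-GC pauses less frequent...
    .set("spark.executor.memory", "8g")
    // ...and a low-pause collector plus GC logging lets you trace
    // "lost executor" errors back to pause times.
    .set("spark.executor.extraJavaOptions",
      "-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -verbose:gc")
}
```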

Java is a strong language and you can develop many projects with it, but nowadays you can do the same with Python, so do you really? Java's promise was to write code once and run it on any hardware through the Java Virtual Machine. Back in the 1990s, CPU architecture was a crazy topic, especially for corporations with tonnes of varied hardware. But do you really need that extra layer to run your code on CPU architectures that have stayed fairly similar for many years (ARM, x86, or x64)? Docker gives you that portability, if you need it.

C) Lessons from Telco

The telecom industry has faced big data challenges for a long time. Imagine processing millions of SMS messages every day or monitoring petabytes of cellular data. How did they solve it?

They used tools, programming languages, and designs that minimize the risk of implementation mistakes. Erlang, C, and C++ are the programming languages of choice. Ericsson, for example, designed Erlang specifically for massive telephony communication systems. The programming language itself should force good design practices.

The movement has started. Someone has reimplemented Spark in Rust (a C++-like language) and called it FastSpark. The initial benchmarks look very impressive.

This idea might look like madness to you, but the tools at the foundation of big data projects were never really a deliberate choice. I can imagine that C++ in the 2000s was not easy for Hadoop's single-handed founder to work with, and it would have hurt adoption among the hyped corporate Java developers, but we should rethink the foundations, or the Hadoop ecosystem will fade, taking lots of big data jobs with it.

PS:

  • Confluent's main business is Kafka and its connectors, but there are still many Hadoop-like connectors.
