Overview of efficiency concepts in Big Data Engineering

Julien Kervizic
Published in Hacking Analytics
Mar 7, 2019 · 11 min read


Big data operates differently from traditional relational database structures: indexes and keys are not usually present in big data systems, where distributed-systems concerns tend to have the upper hand. Nevertheless, there are specific ways to operate on big data, and understanding how best to work with these types of datasets can prove key to unlocking insights.

Map Reduce

Map-reduce is one of the fundamental paradigms of big data. Understanding map-reduce gives insight into how parallel operations and processing work at large scale.

As its name indicates, map-reduce is based on two sets of tasks: a map task and a reduce task.

The map task is responsible for “mapping” key-value pairs, essentially translating a set of key-value pairs into a different domain, i.e. an intermediate set of key-value pairs for processing purposes.

Let’s take the example above and say we wanted to count the number of times each category occurs in the dataset. By default, the input would consider each row as a new key and the content of the row as its value; the map task’s role, in this…
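To make the category count concrete, here is a minimal single-machine sketch in plain Python. It is an illustration under stated assumptions, not how any particular framework implements it: the `rows` data, the `item,category` row layout, and the helper names `map_task`, `shuffle` and `reduce_task` are all hypothetical. It assumes the map task emits one `(category, 1)` pair per row and the reduce task sums the values collected for each category.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical input: (row_number, row_content) pairs, i.e. each row is a
# key and its content is the value. The "item,category" layout is assumed.
rows = [
    (0, "apple,fruit"),
    (1, "carrot,vegetable"),
    (2, "banana,fruit"),
]


def map_task(key, value):
    """Translate one (row_number, row) pair into intermediate (category, 1) pairs."""
    _item, category = value.split(",")
    yield (category, 1)


def shuffle(pairs):
    """Group intermediate values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Collapse all values collected for one key into a single count."""
    return (key, sum(values))


# Run every map task, group the intermediate pairs, then reduce each group.
intermediate = chain.from_iterable(map_task(k, v) for k, v in rows)
counts = dict(reduce_task(k, vs) for k, vs in shuffle(intermediate).items())
print(counts)  # {'fruit': 2, 'vegetable': 1}
```

In a real cluster the map calls would run in parallel on different partitions of the data, and the grouping step would move intermediate pairs across the network so that all values for a given key reach the same reducer. The sketch keeps everything in one process only to show the shape of the data at each stage.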


Julien Kervizic
Hacking Analytics

Living at the interstice of business, data and technology | Head of Data at iptiQ by SwissRe | previously at Facebook, Amazon | julienkervizic@gmail.com