Overview of efficiency concepts in Big Data Engineering

Published in Hacking Analytics

11 min read · Mar 7, 2019

Big data operates differently from traditional relational databases: indexes and keys are usually absent from big data systems, where distributed-systems concerns tend to have the upper hand. Nevertheless, there are specific ways to work with big data, and understanding how best to operate on these types of datasets can prove key to unlocking insights.

Map Reduce

Map-reduce is one of the fundamental paradigms of big data. Understanding map-reduce gives insight into how parallel operations and processing work at large scale.

As its name indicates, map-reduce is based on two sets of tasks: a map task and a reduce task.

The map task is responsible for “mapping” key-value pairs, essentially translating a set of key-value pairs into a different domain, i.e. an intermediate set of key-value pairs for processing purposes.

Let’s take the example above and say we wanted to count the number of times each category occurs in the dataset. By default, the input would consider each row as a new key and the content of the row as its value. The map task’s role, in this…
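To make the category count more concrete, here is a minimal single-machine sketch in plain Python. It is only an illustration under assumptions: the `rows` dataset, the function names `map_task`, `shuffle` and `reduce_task`, and the choice of emitting a `(category, 1)` pair per row are all hypothetical, and a real framework would distribute each step across many workers rather than run them in one process.

```python
from collections import defaultdict

# Hypothetical input: (row_id, row_content) pairs, mirroring the default
# "row as key, row content as value" input described above.
rows = [(0, "fruit"), (1, "vegetable"), (2, "fruit"), (3, "dairy"), (4, "fruit")]


def map_task(row_id, content):
    """Translate one input pair into an intermediate (category, 1) pair."""
    yield content, 1


def shuffle(pairs):
    """Group intermediate values by key (done by the framework in practice)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Collapse all the values collected for one key into a single count."""
    return key, sum(values)


# Map phase: every row is processed independently, so it could run in parallel.
intermediate = [
    pair for row_id, content in rows for pair in map_task(row_id, content)
]

# Shuffle + reduce phase: one reduce call per distinct category.
counts = dict(reduce_task(key, values) for key, values in shuffle(intermediate).items())

print(counts)  # {'fruit': 3, 'vegetable': 1, 'dairy': 1}
```

Because each call to `map_task` only looks at its own row, and each call to `reduce_task` only looks at the values for its own key, both phases can in principle be split across machines, which is what lets the paradigm scale.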

Written by Julien Kervizic