How MapReduce Lets You Deal With Petabyte Scale With Ease

Kartikeya Sharma
Published in Analytics Vidhya
10 min read · Oct 17, 2020


MapReduce is the core idea behind the systems used in today's world to analyse and manipulate petabyte-scale datasets (Spark, Hadoop). Knowing the core concept gives a better understanding of why these systems perform certain operations on the data at certain stages, which in turn helps you design efficient processes when manipulating data with them: why the data was repartitioned on user_id somewhere before a join, or why people are told to avoid reduce tasks as much as possible (yes, they are expensive, but how much and why).
The goal of this article is to understand how a MapReduce system works, using character frequency counting as an example.

Contents

  • Distributed System Abstraction
  • MapReduce
  • Get Frequency of Characters on a Distributed System
  • MapReduce System Architecture
  • Scalability
  • Thoughts on Optimization
  • Resources

Distributed System Abstraction

Here I'll present a distributed system abstraction that will be enough for understanding the sections that follow. I will be hiding most of the complications that arise in building and operating such systems.

Distributed system

A node represents a computer which has some computation resources (processor) and some storage resources (RAM, hard disk). The switch and arrows represent the fact that the nodes are connected over a network and can communicate with each other to pass data and instructions.

For the time being we will assume that nodes don't fail and that the network is always reachable, with sufficient bandwidth to avoid any congestion.

The following figure illustrates a general scenario of a system that deals with data.

Most common scenario on how data resides in the world

The data exists in some storage (Amazon S3, Azure storage accounts); S3 is just a bunch of computers that are specifically designed to store data. We have some compute nodes which can fetch this data, perform the desired operations, and in some cases write the results back to storage. Cloud service providers also offer systems which combine compute and storage (Amazon Redshift), but these are quite expensive if the system is not used at high throughput. In that case it is better to spin up a compute cluster, get the computation done, and bring the cluster down.

So the idea above is to take the data from storage, distribute it over the compute cluster nodes (which have their own storage space), do the computation on that data, and send the results back to storage.

MapReduce

MapReduce, as the name suggests, has two tasks.

  • Map Task: apply some operation to the data and emit key-value pairs.
  • Reduce Task: for a specific key, take all the values associated with it and apply some operation over those values.
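These two tasks are easiest to see on the character-count problem used in the rest of this article. A minimal sketch in plain Python (no framework; the names map_fn and reduce_fn are just illustrative) could look like this:

```python
# Minimal sketch of the two MapReduce tasks for a character-count job.
# map_fn and reduce_fn are illustrative names, not a real framework API.

def map_fn(line: str):
    """Map task: emit a (key, value) pair for each input record."""
    character = line.strip()
    if character:              # skip empty lines
        yield character, 1     # key = the character itself, value = always 1

def reduce_fn(key: str, values):
    """Reduce task: combine all the values collected for a single key."""
    return key, sum(values)    # total number of occurrences of this character
```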

Is that it? Not really! Let's understand how the system works using an example.

Get Frequency of Characters on a Distributed System

The problem is simple: we have multiple files in a distributed system, each containing one character per line, and we want to know how many times each character occurs across the entire set of files.

Throughout this section I'll describe the operations that need to be done. In the next section we will see what takes on this job and how it performs it. Note that applying the operations efficiently isn't the goal here; the goal is to understand which operations need to be performed.

A lot of optimizations will come to mind while reading ahead, and in practical systems most of them actually are applied. The only thing to note is that those optimizations bring complications, and we don't want to get distracted here figuring out how to make them work. I've included a section at the end to discuss a few of them.

  • Initial data: The following image represents how the data looks in the compute cluster. There are two worker nodes in our cluster (we will call these nodes Map workers, as they will be responsible for performing the map task on the data), and each contains two files with one character per line.
Initial data
  • Map: For each character, create a key-value pair where the key is the character itself and the value is always 1.
After applying map operation on entire data
  • Grouping: The data we got above has a lot of repeating keys, so let's combine the keys and create a list of all the values associated with each key. We will combine the contents of the files to get a single file on each worker.

The problem here is that Worker 1 doesn't know whether key A exists somewhere else, so based on the data it has, it cannot tell the exact count for that character. The solution that comes to mind is to send all the data over to one worker, which can do further aggregation to obtain the final results. The problem with that approach is that the data might be too large to fit on a single system, and what does the other system do during that time?

If we could share the work roughly equally between the two workers, we would halve the amount of data any single worker has to receive and, at the same time, compute on the data in parallel, roughly halving the time taken. It's a win-win situation, but how do we achieve this?

Let's say we have spawned two other workers, Worker 3 and Worker 4, for this job (we will call these workers Reduce workers, as they will perform the reduce task on the data).

So now we know that the data needs to be shared between two systems. Next, let's choose, based on the key, which worker the data should go to. We want to make sure that all data with a given key goes to only one worker. What comes to mind is a hash function. Before applying a hash function, let's convert characters to numbers, which will make it easier to construct one. We will use the ASCII value of a character for this purpose.

Let the hash function be (ASCII value of the character) mod 2. If the value is 0 we send the pair to Worker 3, and if it is 1 we send it to Worker 4 (if we have n reduce workers we can change mod 2 to mod n).

A: 65 (65 mod 2 = 1): Worker 4
B: 66 (66 mod 2 = 0): Worker 3
C: 67 (67 mod 2 = 1): Worker 4
D: 68 (68 mod 2 = 0): Worker 3
E: 69 (69 mod 2 = 1): Worker 4
F: 70 (70 mod 2 = 0): Worker 3
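A minimal sketch of this routing rule in Python (the function name is illustrative; num_reduce_workers is 2 in our example):

```python
# Pick a reduce worker for a key: (ASCII value of the character) mod (number of reducers).
def reduce_worker_for(key: str, num_reduce_workers: int = 2) -> int:
    return ord(key) % num_reduce_workers

print(reduce_worker_for("A"))  # 65 mod 2 = 1 -> Worker 4
print(reduce_worker_for("B"))  # 66 mod 2 = 0 -> Worker 3
```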

The data after this operation looks as follows: for each reduce worker a separate file is created at the map worker, and then these files are sent to the reduce workers for the reduce operation to be applied.

Files for reduce workers at map workers

The data at Worker 3 and Worker 4 looks as follows:

Data files received at reduce workers
  • Reduce: The reduce function will get all the values for a key and sum those values. For example, at Worker 3, for key B we will get the values from two files:
B: [1, 1, 1] sent by Worker 1
B: [1, 1] sent by Worker 2

The reduce function will take this key and sum over the values, giving 5 as the output. The following image shows the final outputs.

Output after applying the reduce operation

The output above can be written to some external storage, where it can be combined into one file or kept as two file parts.
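To tie the steps together, here is a small single-process simulation of the whole flow (map, partitioning, grouping, reduce). The input data is illustrative (each worker's two files are merged into one list for brevity, and chosen so that key B ends up with a total of 5, as above); a real system would move actual files across the network instead:

```python
from collections import defaultdict

# Illustrative input: the lines held by the two map workers.
map_worker_lines = {
    "worker_1": ["A", "B", "A", "C", "B", "B"],
    "worker_2": ["B", "D", "B", "A"],
}
NUM_REDUCERS = 2

# Map + partition + group: collect, per reduce worker, the values for each key.
# In a real cluster each map worker writes one such file per reducer and ships it.
shuffle = defaultdict(lambda: defaultdict(list))  # reducer -> key -> [values]
for worker, lines in map_worker_lines.items():
    for char in lines:
        key, value = char, 1                 # the map step
        reducer = ord(key) % NUM_REDUCERS    # the partitioning step
        shuffle[reducer][key].append(value)  # the grouping step

# Reduce: each reduce worker sums the values it received for every key.
for reducer, grouped in sorted(shuffle.items()):
    for key, values in sorted(grouped.items()):
        print(f"reducer {reducer}: {key} -> {sum(values)}")
# Key B lands on reducer 0 with values [1, 1, 1, 1, 1] and prints a count of 5.
```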

MapReduce System Architecture

Here I'll describe one possible architecture for a MapReduce system. Different systems will have variations around this basic architecture.

  • Master Node: The master node is responsible for monitoring and managing the other nodes. It knows how many map workers and how many reduce workers there are, and it has all the details required to communicate with them. The user who creates the cluster can usually configure the number of workers of each type. The diagram lists the responsibilities of this node. The client (user) connects to the master node and sends over the code containing the functions (or, in some cases, objects) that define the map and reduce operations. The master node takes this code and sends it to the worker nodes, so that each worker knows, based on its assigned role (Map worker / Reduce worker), what code to execute on the given data. The master node is also responsible for initially distributing the data over the map worker nodes. After a map worker is done with its tasks, the master node triggers a data push to the reducer nodes (this can also be a pull from the reducer nodes).
  • Map Node: A map node is responsible for taking the initial data and forming key-value pairs which can be consumed by the reduce function. Map workers get the required information on how many output files to create for the reduce workers; the output is one file for every reducer.
  • Reduce Node: A reduce node is responsible for applying the reduce function over the data it receives from the map nodes. Once it generates the output, that output can be sent to the data store or to the master node as required.
  • Client Node: The client node is where the developer writes code and uses it to trigger jobs in the cluster by sending the code to the master node of the cluster.
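As a purely hypothetical illustration of the client-to-master handshake (real systems such as Hadoop or Spark each have their own job-submission APIs, so none of these names are real), a job submission might boil down to something like this:

```python
# Hypothetical job description a client node could send to the master node.
# The field names do not come from any real framework; they only illustrate
# the information the master needs in order to coordinate the workers.
job = {
    "map_fn": map_fn,            # the map function from the earlier sketch
    "reduce_fn": reduce_fn,      # the reduce function from the earlier sketch
    "input_path": "s3://some-bucket/characters/",  # illustrative input location
    "num_map_workers": 2,
    "num_reduce_workers": 2,
}
# The master would ship the code to the workers, assign each one a role
# (map or reduce), distribute the input data, and trigger the shuffle
# once the map workers finish.
```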

Scalability

During the map or reduce tasks, the worker nodes don't have to care about data at other nodes. This independence makes parallel computation easier, as we don't have to worry about locks or shared data, which can cause deadlocks or slow down execution. Within a worker we generally have a few cores, and each core can run multiple threads. We can create enough small files (partitions) to make sure all these resources are utilized, maximizing throughput. The data doesn't have to move across the cluster before the map tasks are applied; it only needs to be redistributed to the reduce workers for the reduce tasks, to make sure all the values for a particular key end up in one place.

The scalability of this approach comes from the fact that if the data grows, all we need to do is create more partitions and increase the number of Map and Reduce workers. The data gets distributed over more machines, so more work is done in parallel.

A point to note is that increasing the number of reduce workers might not always be a good idea: too many files get created at the map workers (each map worker creates one file for every reducer), and these files may be sent over the network carrying very little data for the reduce workers to act upon, decreasing throughput. For example, 100 map workers and 1,000 reduce workers already produce 100,000 shuffle files.

Thoughts on Optimization

As you may have noticed, we could have summed up the values before sending the data for the reduce task, and that would have reduced the amount of data that needs to be sent to the reduce workers. The point to note here is that for this to work, the reduce function needs to be commutative and associative. So what we would do is simply apply the reduce function to the files generated for the reduce workers at the map workers, or make our map function complicated enough to carry out this operation.
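A sketch of this map-side pre-aggregation (commonly called a combiner), assuming the reduce operation is a plain sum and is therefore commutative and associative:

```python
from collections import Counter

def combine(pairs):
    """Map-side pre-aggregation: sum the values for each key locally on the
    map worker before the per-reducer files are written, so far less data
    has to cross the network."""
    combined = Counter()
    for key, value in pairs:     # pairs: (character, 1) emitted by the map task
        combined[key] += value
    return list(combined.items())

# Worker 1 would now ship ("B", 3) to the reduce worker instead of ("B", [1, 1, 1]).
```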

Map and reduce workers were kept separate above, but it is possible to use the same workers for both the map and the reduce tasks. That way we utilize the entire cluster, both in terms of computation and storage. Doing so also reduces the number of files that need to be sent over the network, which reduces communication cost and hence the time taken for the job to complete.

Creating files in a tabular manner might not be ideal for performing these operations. We can store the data for the intermediate stages in other formats (JSON, Parquet) which allow storage structures like maps and lists that provide much more efficient access to the data. We can also apply encodings and compression to reduce the storage space required, while also reducing the amount of data that needs to be sent over the network.
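As a small illustration of the idea, intermediate results could be written as compressed Parquet with a library such as pyarrow (the file name and columns are just an example of what an intermediate character-count file might hold):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Example intermediate output: per-character counts produced at a map worker.
table = pa.table({"character": ["A", "B", "C"], "count": [2, 3, 1]})

# A columnar format plus compression shrinks both the on-disk footprint and
# the amount of data sent over the network to the reduce workers.
pq.write_table(table, "intermediate_counts.parquet", compression="snappy")
```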

Instead of creating files, we can try to keep most of the data in main memory (RAM), since main memory access is usually 10x to 20x faster than even good-quality disks today. Spark uses this optimization and, with it plus many other optimizations beyond the scope of this article, won the Daytona GraySort contest, which is about sorting big data: it beat Hadoop's record by getting the task done 3x faster with 10x fewer machines.

Resources:
