Evaluation of In-Mapper and Combiner Functions in MapReduce

HAORAN WAN
Princeton Systems Course
9 min read · May 19, 2024

Kiyosu Maeda & Haoran Wan, COS 518, Spring 2024

Abstract

Since MapReduce processes large amounts of data, it often imposes a high load on the network during the shuffle phase. The original MapReduce paper and subsequent research mitigate this problem by aggregating intermediate results locally in the map phase, using Combiner and In-Mapper functions. Despite their potential efficacy, prior work did not analyze how the task performance of these solutions is affected by fault tolerance or the number of nodes. We evaluated the task completion time of MapReduce with the Combiner and In-Mapper functions on word count tasks, varying the number of datanodes and injecting node failures. Our evaluation showed that Combiner functions reduce the task finish time by 40.7%, 61.8%, and 68.5% in 2-, 4-, and 8-node clusters respectively. Compared with Combiner functions, In-Mapper functions further reduce the task finish time by 12.1%, 9.6%, and 4.1% in 2-, 4-, and 8-node clusters respectively. However, under machine failure the Combiner and In-Mapper conditions suffered performance drops of 13.17% and 14.46% respectively, larger than the Baseline condition's drop.

Introduction

MapReduce is a programming model that is designed for generating and analyzing large datasets with distributed systems. Its abstraction enables users to easily handle parallelization, fault tolerance, and load balancing while using a cluster with multiple machines. Due to its advantages, MapReduce has been adopted for myriad tasks, such as calculating PageRank and analyzing frequently accessed URLs, with various platforms (e.g., Hadoop).

However, MapReduce puts a high load on the network, since it needs to read, write, and send large amounts of data over the network. To deal with this problem, the original MapReduce paper proposes a Combiner function that merges the intermediate output of each map task to reduce the data sent over the network [1]. Subsequent research further improved the algorithm and proposed an In-Mapper function that stores and merges the intermediate results of a map task in memory [2]. Both lines of work found that Combiner and In-Mapper functions reduce task completion time.

Despite these results, the In-Mapper function is not always effective, since it causes additional read/write and memory overhead. Furthermore, it might hurt task performance when machines fail: if a machine is killed during a map task, the intermediate results held in memory are lost and the task must be rerun. Another situation where the In-Mapper function might not be effective is when the cluster has relatively many nodes, since each machine then processes less data, making disk writes less expensive for each node.

In this report, we try to understand how node failures and the number of nodes affect the task performance of Combiner and In-Mapper functions. We compare four conditions: Baseline (with neither Combiner nor In-Mapper), Combiner, In-Mapper, and Combiner+In-Mapper.

We found that, without machine failure, Combiner functions reduce the task finish time by 40.7%, 61.8%, and 68.5% in clusters with 2, 4, and 8 nodes respectively. Compared with Combiner functions, In-Mapper functions further reduce the task finish time by 12.1%, 9.6%, and 4.1% in clusters with 2, 4, and 8 nodes respectively. On the other hand, machine failures hurt the performance of the Combiner and In-Mapper conditions more than the Baseline.

Algorithms

Combiner

Each Mapper writes intermediate key-value pairs to disk. These intermediate outputs are then sent over the network to be processed by Reducers. This original pipeline often imposes a high load on the network. The original MapReduce paper proposes a Combiner function to solve this problem: it aggregates intermediate key-value pairs to reduce the size of the outputs before they are sent to Reducers. Figure 1 shows how an example input is processed with Mappers and the Combiner on a simple word count task. In this example, two Mappers first process the input sentences ("advanced computer system" and "distributed system distributed system") and produce intermediate results. The Combiner function then aggregates these results, so the counts of the words "distributed" and "system" become "2".

Figure 1. The Combiner function.
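The combining step can be sketched in a few lines of Python. This is a simulation of the logic only, not Hadoop code; the function names `map_words` and `combine` are ours, and the inputs are the two example sentences from Figure 1.

```python
from collections import Counter

def map_words(sentence):
    # Mapper: emit a (word, 1) pair for every word in the record
    return [(word, 1) for word in sentence.split()]

def combine(pairs):
    # Combiner: aggregate this mapper's pairs locally, before the
    # shuffle phase sends anything over the network
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())

# Two mappers process the example inputs from Figure 1
mapper1_out = map_words("advanced computer system")
mapper2_out = map_words("distributed system distributed system")

# Without a combiner, the second mapper ships four pairs;
# with one, it ships only two aggregated pairs.
print(combine(mapper2_out))  # [('distributed', 2), ('system', 2)]
```

The network saving is exactly the difference between the raw pair count and the number of distinct keys per mapper, which is why the technique pays off most for skewed, repetitive keys.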

In-Mapper

While the Combiner function reduces the amount of data sent over the network after the Mappers, one of its biggest disadvantages is that it requires unnecessary object creation/destruction and serialization/deserialization, as mentioned in [3]. To address this issue, Lin and Dyer [3] proposed an In-Mapper design pattern that holds key-value pairs in memory during a map task and produces an intermediate output only after the map task finishes. The In-Mapper thereby also avoids the additional read/write overhead that the Combiner function incurs. The algorithm below shows pseudocode of the In-Mapper for a word count task.

Algorithm: In-Mapper word count

class Map
    initialize
        countMap <- new AssociativeArray
    method map(words)
        for word in words
            if word not in countMap
                countMap[word] <- 1
            else
                countMap[word] <- countMap[word] + 1
    method cleanup
        for (word, count) in countMap
            emit(word, count)
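The pseudocode above translates directly into a small runnable Python class. This is our own sketch mirroring the pseudocode's structure, not Hadoop's Mapper API; the class and method names are illustrative.

```python
class InMapperWordCount:
    """In-mapper combining: counts are accumulated in memory across
    the whole map task and emitted only once, in cleanup()."""

    def __init__(self):
        # Corresponds to the AssociativeArray in the pseudocode
        self.count_map = {}

    def map(self, words):
        # Called once per input record; nothing is emitted here,
        # so no per-pair object creation or serialization occurs
        for word in words:
            self.count_map[word] = self.count_map.get(word, 0) + 1

    def cleanup(self):
        # Called once at the end of the map task; emits the
        # aggregated (word, count) pairs as the intermediate output
        return sorted(self.count_map.items())

mapper = InMapperWordCount()
mapper.map("distributed system".split())
mapper.map("distributed system".split())
print(mapper.cleanup())  # [('distributed', 2), ('system', 2)]
```

Note that `count_map` lives in memory for the entire map task, which is precisely the source of the memory pressure and failure sensitivity discussed below.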

Despite its efficiency, the In-Mapper also has disadvantages. One drawback is that it requires more memory during the map phase: the more data a machine processes, the more memory it consumes. Furthermore, since the In-Mapper holds results in memory, it can lose data before writing to disk and must rerun the whole map task when a node fails. Previous work did not measure the task performance of the In-Mapper under failures. We are also interested in how the performance of these algorithms changes as the number of nodes increases. The number of nodes determines how much data each node processes, which suggests that the Combiner and In-Mapper will become less effective as the cluster grows.

Implementation

We leverage Hadoop to run MapReduce on Amazon EC2. We simply use the job.setCombinerClass function, Hadoop's built-in API for setting a Combiner, to register the Reducer as the Combiner function.

We employ t2.xlarge EC2 instances, each with 4 vCPUs, 16 GB of memory, and 64 GB of EBS disk. All the EC2 instances are in the same region, meaning they are in the same data center but not necessarily in the same rack; this setting better mimics a real production environment.

Evaluation

Conditions

We compared the following four conditions:

  • Baseline: Run MapReduce with neither the Combiner nor the In-Mapper function
  • Combiner: Run MapReduce with the Combiner function
  • In-Mapper: Run MapReduce with the In-Mapper function
  • Combiner+In-Mapper: Run MapReduce with both the Combiner and In-Mapper functions

For each condition, we varied the number of nodes (two, four, and eight). We also rebooted a node once during the map phase to evaluate the fault tolerance of each condition. Fault tolerance was measured only in the four-node case.

Tasks

We measured the average task completion time of each condition on a word count task. We used a subset of the En-Wiki dataset preprocessed for the MapReduce word count task. The dataset was 12.84 GB, partitioned into 32 MB shards. We ran MapReduce five times for each condition and calculated the average task completion time and its standard deviation.

Results

End-to-end Task Finish Time

We ran the MapReduce tasks on clusters of 2, 4, and 8 nodes to evaluate the efficacy of the Combiner and In-Mapper. For each algorithm, we repeated the computation five times and recorded the mean, minimum, and maximum end-to-end finish time.

Figure 2. Tasks finish time of different algorithms.

Figure 2 shows the end-to-end task finish times for the different algorithms and cluster sizes, where the top and bottom whiskers indicate the maximum and minimum finish times across the five repetitions. As we can observe, the task finish time in every scenario varies within a very small margin. As expected, the Combiner significantly reduces the task finish time, by 40.7%, 61.8%, and 68.5% on average in clusters with 2, 4, and 8 nodes respectively. The In-Mapper further reduces the task finish time by 12.1%, 9.6%, and 4.1% on average in clusters with 2, 4, and 8 nodes respectively. The improvement of the In-Mapper over the Combiner shrinks as more nodes are added to the cluster, because each node holds a smaller portion of the data, so the gain from aggregating data in memory rather than on disk is smaller, as elaborated in the previous sections. In the Combiner+In-Mapper scenario, since both the Combiner and In-Mapper functions work on the same partition of data, using both is redundant, making the end-to-end task finish time slightly worse than with the In-Mapper alone. Our initial expectation was that the Combiner function would produce a globally aggregated output, which this result shows to be wrong.

Fault-tolerance

To evaluate the performance of each condition under machine failures, we intentionally disconnected one node from the Hadoop cluster for one minute during the Map phase. We ran five MapReduce jobs with four nodes for each condition and analyzed the task performance drop rate, which represents how much worse the average task performance becomes relative to the no-failure runs in the previous section. The task performance drop rate was calculated as:

    drop rate = (t1 − t2) / t2 × 100%

where t1 is the average execution time with machine failures and t2 is the average execution time without machine failures.
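As a quick sanity check of the drop-rate arithmetic, here is the calculation in Python. The times used are made-up illustrative values, not our measurements.

```python
def drop_rate(t_with_failure, t_without_failure):
    # Performance drop rate: relative slowdown under failure, in percent
    return (t_with_failure - t_without_failure) / t_without_failure * 100

# Hypothetical example: a job taking 500 s without failure and
# 560 s with one injected failure drops performance by 12%.
print(round(drop_rate(560.0, 500.0), 2))  # 12.0
```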

Figure 3. The performance drop rate by injecting machine failures

Figure 3 shows the result of the evaluation. Higher values indicate a worse performance drop relative to the non-failure setting. The performance drops of two conditions, Combiner (Mean = 13.17%) and Combiner+In-Mapper (Mean = 17.79%), were higher than that of the Baseline condition (Mean = 11.99%). When a machine fails during a map phase that uses a Combiner, the system needs additional computation in the other map or reduce phases, which contributes to the performance drop. Moreover, the In-Mapper condition (Mean = 14.46%) dropped task performance more than the Combiner condition. One reason is that the In-Mapper function temporarily stores intermediate results in memory, which imposes additional overhead when a machine fails. However, considering that the differences in drop rate between the Baseline and the other conditions were within 6%, and that task completion time was shorter in the Combiner, In-Mapper, and Combiner+In-Mapper conditions than in the Baseline condition, machine failures were not critical to overall task performance.

Individual Map Task Finish Time

We collected records of each map task's finish time and the Combiner execution time from the Hadoop system and plotted their distributions.

Figure 4. Single map task finish time of different algorithms.

Figure 4 shows the results: the violin plot shows the distribution of map task finish times, and the center mark and the top and bottom whiskers indicate the mean, 75th percentile, and 25th percentile of the distribution. Since the Combiner functions process data from disk while the In-Mapper functions aggregate data in memory, the In-Mapper's individual map finish times are slightly shorter than the Combiner's. Also, the Combiner+In-Mapper's individual task finish times are slightly worse than the In-Mapper's, for the same reason as in the end-to-end results.

Conclusion

In this project, we evaluated the effectiveness of different data aggregation methods in the MapReduce framework: Combiner, In-Mapper, and Combiner+In-Mapper. Data aggregation functions significantly improve the task finish time over the baseline without any combiner, and In-Mapper functions further improve on Combiner functions. However, their effectiveness decreases as the number of nodes in the cluster increases. In summary, it is worth using Combiner or In-Mapper functions in tasks such as WordCount and PageRank: they significantly improve the end-to-end task finish time with only a minor sacrifice in fault-tolerance performance.

References

[1] Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04).

[2] Jimmy Lin and Michael Schatz. 2010. Design patterns for efficient graph algorithms in MapReduce. In Proceedings of the Eighth Workshop on Mining and Learning with Graphs (MLG ‘10). Association for Computing Machinery, New York, NY, USA, 78–85. https://doi.org/10.1145/1830252.1830263

[3] Jimmy Lin and Chris Dyer. 2010. Data-Intensive Text Processing with MapReduce. Morgan and Claypool Publishers.
