Incremental Computation on Hadoop and MapReduce at Scale

The MapReduce framework is not designed for incremental computation. Incremental computation targets large-scale datasets that keep growing as new entries arrive, while existing and historic entries are deleted or modified as the data evolves. Google's Percolator is one system built to perform such incremental computation, and Incoop is a generic MapReduce-based framework that can be leveraged for incremental computations. Advanced data analytics tasks run by web search engines, such as crawling the web to build an index or running the PageRank algorithm, typically see only modest changes to their input between runs; by detecting the delta between the old and new data and processing only that delta, such tasks can run 10 to 1000 times faster than recomputing from scratch.
The incremental MapReduce approach can be applied in several fields, such as web crawling, PageRank computation, life-science computing, graph processing, text processing, machine learning, data mining, and relational data processing. The IncMR framework embeds within the original MapReduce APIs, so parallel algorithms can be developed without redesigning the APIs or writing new application algorithms to take advantage of incremental processing. These algorithms support incremental data processing by detecting data modifications in the inputs and reusing the intermediate states of the data, while retaining the existing map and reduce functions. They also detect any newly arrived inputs quickly and trigger jobs automatically on the master node.
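As a rough illustration of how unmodified map and reduce functions can be wrapped for incremental processing, the following sketch (plain Python rather than the actual IncMR API) hashes each input split, re-runs the map function only on splits whose content has changed since the last run, and reuses cached map output for the rest. The incremental_run driver, the on-disk cache layout, and the split format are assumptions made for illustration.

import hashlib
import pickle
from collections import defaultdict
from pathlib import Path

CACHE_DIR = Path("map_output_cache")  # hypothetical on-disk cache of per-split map output


def incremental_run(splits, map_fn, reduce_fn):
    """Run unmodified map_fn/reduce_fn, re-mapping only the splits whose content changed.

    splits    -- dict of split_id -> list of input records
    map_fn    -- record -> iterable of (key, value) pairs, ordinary user code
    reduce_fn -- (key, list of values) -> (key, result), ordinary user code
    """
    CACHE_DIR.mkdir(exist_ok=True)
    intermediate = defaultdict(list)

    for split_id, records in splits.items():
        digest = hashlib.sha1(pickle.dumps(records)).hexdigest()
        cache_file = CACHE_DIR / f"{split_id}.{digest}.pkl"

        if cache_file.exists():
            # Split content unchanged since a previous run: reuse cached map output.
            map_output = pickle.loads(cache_file.read_bytes())
        else:
            # New or modified split: run the user's map function and cache its output.
            map_output = [kv for record in records for kv in map_fn(record)]
            cache_file.write_bytes(pickle.dumps(map_output))

        for key, value in map_output:
            intermediate[key].append(value)

    # The reduce phase here is still recomputed in full.
    return dict(reduce_fn(key, values) for key, values in intermediate.items())


# Example: word count written as ordinary, non-incremental user functions.
def wc_map(line):
    for word in line.split():
        yield word, 1


def wc_reduce(word, counts):
    return word, sum(counts)


if __name__ == "__main__":
    splits = {"part-0": ["to be or not to be"], "part-1": ["be quick"]}
    print(incremental_run(splits, wc_map, wc_reduce))

In this sketch the reduce phase is still recomputed in full; making the reduce side incremental as well is sketched at the end of this section.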
The following are a few principles for leveraging the MapReduce framework and Hadoop for incremental computation through approaches such as continuous bulk processing, incremental algorithms, and the IncMR framework:
· Ordinarily, MapReduce stores job input on the HDFS file system. With Incoop, however, the data is stored on Inc-HDFS, a variant of HDFS that detects modifications to the job inputs, and resources are allocated dynamically based on the state of the inputs and intermediate objects (see the chunking sketch after this list).
· A second principle is to gain control over the granularity of tasks by breaking larger tasks into smaller chunks, so that unaffected tasks can be reused once data modifications are detected.
· The IncMR framework extends the job scheduling options: a one-time run; a full initial run that loads all the data in the initial phase; and a delta run that detects incremental changes relative to the previously computed results and adds the newly discovered data by updating the state of the objects in the current run. Depending on the computing field, some jobs require a continuous run so that newly arriving inputs are added and submitted to the jobs as they appear (see the run-mode sketch after this list).
· Scheduling also optimizes the distribution of workloads and the movement of data across commodity clusters: it detects the available machines and, rather than redistributing the workloads, places tasks on the machines that already hold the earlier computed results, leveraging their locality.
· Incoop, as part of the MapReduce framework, monitors newly added log entries and processes queries over them with the Pig framework, alongside hot data arriving in real time.
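The Inc-HDFS point in the first bullet rests on content-based chunking: chunk boundaries are derived from the data itself (for example, wherever a rolling hash over a small window matches a fixed pattern), so a local edit disturbs only the chunks around the edit while the remaining chunks keep their boundaries and fingerprints and can be reused. The sketch below is a minimal, illustrative version using a simple polynomial rolling hash; the window size, boundary mask, and function names are assumptions, not the actual Inc-HDFS implementation.

import hashlib
import random
from collections import deque

WINDOW = 16            # size of the rolling window in bytes (illustrative value)
BOUNDARY_MASK = 0xFF   # boundary when the low 8 bits of the hash are zero (~256-byte chunks)
PRIME = 257
MOD = 1 << 32
P_OUT = pow(PRIME, WINDOW - 1, MOD)   # weight of the byte that leaves the window


def content_defined_chunks(data: bytes):
    """Split data into chunks whose boundaries depend only on a local window of content."""
    chunks, start = [], 0
    window, rolling = deque(), 0
    for i, byte in enumerate(data):
        window.append(byte)
        rolling = (rolling * PRIME + byte) % MOD
        if len(window) > WINDOW:
            rolling = (rolling - window.popleft() * P_OUT * PRIME) % MOD
        # Declare a boundary when the hash of the last WINDOW bytes matches the mask.
        if len(window) == WINDOW and (rolling & BOUNDARY_MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def chunk_fingerprints(data: bytes):
    """Stable fingerprints used to decide which chunks must be reprocessed."""
    return [hashlib.sha1(c).hexdigest() for c in content_defined_chunks(data)]


if __name__ == "__main__":
    random.seed(0)
    old = bytes(random.randrange(256) for _ in range(4000))
    new = old[:2000] + b"EDIT" + old[2000:]   # a small local insertion
    old_fp, new_fp = set(chunk_fingerprints(old)), set(chunk_fingerprints(new))
    # Only the chunk(s) around the insertion differ; other chunks keep their fingerprints.
    print(f"changed chunks: {len(new_fp - old_fp)} of {len(new_fp)}")

If boundaries were fixed-size offsets instead, an insertion would shift every later boundary and invalidate almost all downstream chunks, which is exactly what content-based chunking avoids.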
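To make the scheduling options in the third bullet concrete, the next sketch models an initial run, a delta run, and a continuous run around a generic batch job. The manifest file, the run-mode names, and the polling loop are illustrative assumptions rather than the actual IncMR scheduler interface.

import json
import time
from pathlib import Path

INPUT_DIR = Path("job_input")       # directory the job reads from (assumed layout)
MANIFEST = Path("processed.json")   # records which input files earlier runs consumed


def load_manifest():
    return set(json.loads(MANIFEST.read_text())) if MANIFEST.exists() else set()


def save_manifest(processed):
    MANIFEST.write_text(json.dumps(sorted(processed)))


def run_job(files, prior_state):
    """Placeholder for submitting a MapReduce job over `files` and folding the results
    into the state computed by earlier runs (here: a simple line count per file)."""
    state = dict(prior_state)
    for f in files:
        state[f.name] = len(f.read_text().splitlines())
    return state


def schedule(mode, state=None, poll_seconds=30, max_polls=None):
    """mode 'initial': process everything and record it; 'delta': process only files
    not listed in the manifest; 'continuous': repeat delta runs as new inputs arrive."""
    state = state or {}
    processed = set() if mode == "initial" else load_manifest()
    polls = 0
    while True:
        pending = [f for f in sorted(INPUT_DIR.glob("*.txt")) if f.name not in processed]
        if pending:
            state = run_job(pending, state)            # submit only the new inputs
            processed.update(f.name for f in pending)
            save_manifest(processed)
        if mode != "continuous":
            return state                               # one-time / initial / delta: stop here
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return state
        time.sleep(poll_seconds)                       # continuous: wait for more input


if __name__ == "__main__":
    INPUT_DIR.mkdir(exist_ok=True)
    print(schedule("initial"))   # full run over everything currently present
    print(schedule("delta"))     # later run: touches only files added since the manifest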
Adopting the principles of self-adjusting computation within the MapReduce paradigm allows the original MapReduce and Hadoop stack to work without any major changes to the system. This mechanism discovers only the incremental changes to the inputs and limits recomputation to them. Self-adjusting computation identifies the delta changes in the affected sub-computations and propagates the change through them, rebuilding the affected sub-components. Prior to every rebuild, however, the change-propagation technique consults a memoization store to recover unaffected sub-computations for reuse. The efficiency of self-adjusting computation in responding to input modifications is determined by the stability of the computation: when many of the sub-computations performed on similar, homogeneous datasets remain the same, stability is high and the sub-computations can be widely reused. The following principles can be implemented to ensure higher stability of a self-adjusting computation (a small sketch of change propagation with memoization follows the list):
· Each computation is divided into small sub-computations.
· Sub-computations may depend on one another, but the dependencies are kept minimal, without deep hierarchical dependency chains.
· Parallel programming paradigms such as the original MapReduce framework and Hadoop naturally maintain few dependencies so that mappers and reducers can run their workloads in isolation. However, a change to a single input can cascade into a large number of changes to the reduce tasks, and the resulting recomputation can be extremely large, depending on the size of those reduce tasks.
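A minimal way to picture stability and change propagation, without assuming anything about Hadoop's internals: each sub-computation is keyed by a fingerprint of its input block, results are memoized, and after an input change only the sub-computations whose fingerprint is no longer in the memo table are rebuilt. The class and function names below are illustrative.

import hashlib


def fingerprint(block: str) -> str:
    return hashlib.sha1(block.encode()).hexdigest()


class SelfAdjustingWordCount:
    """Toy self-adjusting computation: one sub-computation per input block,
    with memoized partial results reused through change propagation."""

    def __init__(self):
        self.memo = {}        # fingerprint of a block -> its partial result
        self.rebuilds = 0     # how many sub-computations actually re-ran

    def _sub_compute(self, block: str) -> dict:
        counts = {}
        for word in block.split():
            counts[word] = counts.get(word, 0) + 1
        return counts

    def run(self, blocks):
        total, next_memo = {}, {}
        for block in blocks:
            key = fingerprint(block)
            if key in self.memo:
                partial = self.memo[key]             # memoization hit: reuse the old result
            else:
                partial = self._sub_compute(block)   # change propagation rebuilds this one
                self.rebuilds += 1
            next_memo[key] = partial
            for word, n in partial.items():
                total[word] = total.get(word, 0) + n
        self.memo = next_memo                        # drop results for blocks that disappeared
        return total


if __name__ == "__main__":
    job = SelfAdjustingWordCount()
    blocks = ["a b a", "c d", "e f g"]
    job.run(blocks)
    print("initial rebuilds:", job.rebuilds)         # 3: everything is new

    blocks[1] = "c d d"                              # modify one block only
    job.rebuilds = 0
    job.run(blocks)
    print("rebuilds after one change:", job.rebuilds)  # 1: the two stable blocks are reused

If the same input were treated as one large block, the single edit would force the entire computation to rerun; splitting it into many small sub-computations with few dependencies is what keeps the computation stable, as the bullets above suggest.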
To avoid sweeping changes to the system when applying Incoop, retaining the basic design of the original MapReduce framework and Hadoop yields better results; this also addresses the twin concerns of transparency and efficiency. Self-adjusting computation needs an interface that tracks updates to the inputs and triggers cascading incremental updates to the output. Incremental HDFS with content-based chunking, together with incremental MapReduce using incremental map and incremental reduce strategies, provides that support on top of the original MapReduce framework and Hadoop.
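For the incremental reduce strategy just mentioned, Incoop's contraction approach relies on combiner-style functions: when the reduce operation is associative and commutative, partial results for unchanged groups of values can be served from a cache and merged with freshly computed partials instead of re-reducing the whole value list. The sketch below illustrates that idea for a summation; the cache structure and class names are assumptions for illustration.

import hashlib
import pickle


def group_fingerprint(key, values):
    """Fingerprint of one key's value group, used to look up cached partial results."""
    return hashlib.sha1(pickle.dumps((key, tuple(values)))).hexdigest()


class IncrementalReducer:
    """Caches partial reductions per (key, value-group) so that unchanged groups
    are merged from the cache rather than re-reduced. Requires the combine step
    to be associative and commutative (true for sums, counts, max, and so on)."""

    def __init__(self, reduce_fn, combine_fn):
        self.reduce_fn = reduce_fn      # (key, values) -> partial result
        self.combine_fn = combine_fn    # (partial, partial) -> partial
        self.cache = {}

    def reduce_key(self, key, value_groups):
        """value_groups: the per-mapper (or per-chunk) value lists for this key."""
        result = None
        for group in value_groups:
            fp = group_fingerprint(key, group)
            if fp not in self.cache:
                self.cache[fp] = self.reduce_fn(key, group)   # only new groups are reduced
            partial = self.cache[fp]
            result = partial if result is None else self.combine_fn(result, partial)
        return key, result


if __name__ == "__main__":
    wc = IncrementalReducer(reduce_fn=lambda k, vs: sum(vs),
                            combine_fn=lambda a, b: a + b)

    # First run: two mappers emitted counts for the word "hadoop".
    print(wc.reduce_key("hadoop", [[1, 1, 1], [1, 1]]))        # ('hadoop', 5)

    # Second run: one mapper's output changed, the other is served from the cache.
    print(wc.reduce_key("hadoop", [[1, 1, 1], [1, 1, 1, 1]]))  # ('hadoop', 7)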
References
Bhatotia, P., Wieder, A., Rodrigues, R., Acar, U. A., & Pasquin, R. (2011). Incoop: MapReduce for incremental computations. Proceedings of the 2nd ACM Symposium on Cloud Computing. http://dx.doi.org/10.1145/2038916.2038923
Sakr, S., & Gaber, M. (2014). Large Scale and Big Data: Processing and Management. Boca Raton, Florida: Auerbach Publications.
Yan, C., Yang, X., Yu, Z., Li, M., & Li, X. (2012). IncMR: Incremental data processing based on MapReduce. Retrieved April 27, 2016, from http://www.s3lab.ece.ufl.edu/publication/cloud12.pdf