Distributed Computing and Apache Hadoop: Why? and How?
This blog explores WHY a project like Apache Hadoop was created, and HOW it negotiates resources in a distributed computing infrastructure. By the end of it, you will understand the fundamental reasons why we needed distributed computing platforms, and how Apache Hadoop implements distributed computing. We will especially focus on resource negotiation among the distributed computing nodes.
This question is for you if you were born in the 90’s. Do you remember when it was all about CDs or DVDs, and downloading anything from the internet was super slow 😩 …
Back then, software was written and released targeting personal computers. There was basically no such thing as software updates.
Then came the internet as we know it.
The internet connected billions of users and devices in ways people had never imagined. Social media, e-commerce, travel apps and e-learning platforms were widely adopted by the masses and became an integral part of our lives.
With that, people started generating more data.
No one knew the sheer scale of this data until it started piling up. Every click a user makes online can generate some sort of data. Every video, image and piece of text shared generates data.
People wanted to make sense of all this, but the computing paradigms available at the time could not process such amounts of data efficiently.
To solve this, the distributed computing paradigm came into the picture. It identified this massive amount of data as Big Data, and promised to provide tools to process it efficiently.
Distributed computing paradigm
Distributed computing utilizes clusters of computing nodes and provides the computing power to process applications involving data at a much larger scale, like Big Data.
Various clients can submit jobs with massive amounts of data to the computing cluster, and obtain the processed results within a practical amount of time.
Apache Hadoop
Apache Hadoop is an open-source software project implementing a distributed processing framework. Its architecture includes a component called the Resource Manager, which manages a set of computing nodes in a cluster.
Each job submitted to Hadoop has some processing logic and a potentially massive amount of data. In order to utilise the nodes effectively, this processing logic needs to be written in a special way, which in Hadoop is known as MapReduce.
Hadoop requires job logic to follow the MapReduce model, which is essentially a way of splitting a massive amount of data into practically sized slices, processing each slice separately, and aggregating the results to obtain the final outcome.
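To make that concrete, here is a minimal sketch of such logic using Hadoop's Java MapReduce API, with the classic word-count example (the class names WordCountMapper and WordCountReducer are just illustrative, and both classes are shown together for brevity):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: runs once per input slice (split) and emits (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: aggregates all counts emitted for the same word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

The map phase corresponds to "process each slice separately", and the reduce phase is the "aggregate results" step.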
In addition to that, Hadoop uses a special file system called HDFS (Hadoop Distributed File System) to distribute Big Data across the nodes.
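As a rough illustration, here is what copying input data into HDFS from Java can look like (the NameNode address and file paths below are made up); the same thing is commonly done with the hdfs dfs -put command:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; use your cluster's fs.defaultFS value.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // HDFS splits the file into blocks and spreads them across the DataNodes.
        fs.copyFromLocalFile(new Path("/data/clickstream.log"),
                             new Path("/jobs/input/clickstream.log"));
        fs.close();
    }
}
```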
How Hadoop operates
First, clients submit jobs to Hadoop.
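For example, a client-side driver for the word-count logic sketched earlier could look roughly like this (class names and HDFS paths are again illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input already copied to HDFS; the output directory must not exist yet.
        FileInputFormat.addInputPath(job, new Path("/jobs/input"));
        FileOutputFormat.setOutputPath(job, new Path("/jobs/output"));

        // Submits the job to the cluster and waits for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```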
These jobs are then processed by the Scheduler component, which decides how jobs are prioritised and how resources are allocated to them. Hadoop ships with several scheduling policies by default, and this is also an extension point.
Once the next job is decided, the job data is copied to HDFS, which then distributes slices of the data among the cluster nodes.
HDFS has a parameter called the replication factor, which ensures that each data slice is duplicated multiple times. This makes HDFS fault-tolerant in case of node failures. The default factor is 3.
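The replication factor can be set cluster-wide via the dfs.replication property, or per file; a small sketch of the per-file variant, reusing the hypothetical paths from above:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"),
                                       new Configuration());

        // Ask HDFS to keep 3 copies of each block of this file (the default).
        fs.setReplication(new Path("/jobs/input/clickstream.log"), (short) 3);
        fs.close();
    }
}
```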
After that, the App Manager component in the Resource Manager talks to the computing nodes via a per-node component called the Node Manager. Based on the negotiations with the Node Managers in the cluster, the Resource Manager finds a node with enough resources (processing power, RAM, etc.) to create a Container, and then launches an App Master for the job in that Container.
Container — Essentially, a JVM capable of running tasks.
App Master — A task daemon which takes care of the whole lifecycle of the submitted job. This runs within a Container.
Node Manager — Per-node component which takes care of its node’s Containers.
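You never write this negotiation yourself for a MapReduce job, the framework handles it for you, but here is a rough sketch of what it looks like at the level of YARN's client API (the application name, resource sizes and launch command below are all made up):

```java
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the Resource Manager for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");

        // Resources requested for the App Master's Container: 1 GB RAM, 1 vcore.
        ctx.setResource(Resource.newInstance(1024, 1));

        // Command the Node Manager will run inside the Container (hypothetical).
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                Collections.emptyMap(),                        // local resources (jars, files)
                Collections.emptyMap(),                        // environment variables
                Collections.singletonList("java MyAppMaster"), // launch command (placeholder)
                null, null, null);
        ctx.setAMContainerSpec(amContainer);

        // The Resource Manager finds a node with enough capacity
        // and launches the App Master there.
        ApplicationId appId = yarnClient.submitApplication(ctx);
        System.out.println("Submitted application " + appId);
    }
}
```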
The App Master then talks to HDFS and identifies which nodes hold the job’s data slices.
It then negotiates with those nodes via the Resource Manager to create Containers, and executes the MapReduce logic within each Container to process the data slice residing on that node.
Once all Containers have finished processing their data slices, MapReduce aggregates the processed data to obtain the final outcome of the submitted job.
A few thoughts on job scheduling
Hadoop provides three policies for job scheduling.
First-in First-out (FIFO)
Submitted jobs are inserted into a queue and processed in FIFO order. This is the simplest policy and requires no configuration, but it can pose serious bottlenecks: if a large, low-priority job arrives first and a high-priority job is submitted later, the high-priority job has to wait until the first one finishes.
Capacity Scheduler
The Capacity Scheduler is more convenient for a Hadoop cluster shared among organizations. This mechanism introduces a queue for each organization and defines a pre-allocated share of the cluster for each of them. The scheduler then arranges slots for incoming jobs based on the resources allocated to the corresponding organization. This is more efficient in terms of usability, but it increases configuration complexity.
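On the client side, targeting one of these queues is just a job property. Here is a tiny sketch (the queue name "engineering" is made up; the queues themselves are defined by the cluster admin in capacity-scheduler.xml, and the scheduler is selected via yarn.resourcemanager.scheduler.class in yarn-site.xml):

```java
import org.apache.hadoop.conf.Configuration;

public class QueueConfig {
    // Builds a configuration that routes jobs to a specific Capacity Scheduler queue.
    // "engineering" is a hypothetical queue name; the default queue is called "default".
    public static Configuration forQueue(String queueName) {
        Configuration conf = new Configuration();
        conf.set("mapreduce.job.queuename", queueName);
        return conf;
    }

    public static void main(String[] args) {
        Configuration conf = forQueue("engineering");
        System.out.println("Jobs using this configuration go to: "
                + conf.get("mapreduce.job.queuename"));
    }
}
```

A driver like the word-count one shown earlier could be created with this configuration (Job.getInstance(conf, ...)) so its tasks land in the organization's pre-allocated share of the cluster.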
Fair Scheduler
This is similar to the Capacity Scheduler, but instead of queues it uses job pools, with resources shared fairly among them. This way, a high-priority job in a certain pool gets resources right away rather than waiting in a queue behind other jobs.
Beyond these three, custom policies can be plugged into Hadoop, since the scheduler is an extension point.
In the end
We talked about the fundamental problem that called for the distributed computing paradigm. Then we explored how Apache Hadoop implements it and provides an efficient framework to process Big Data.
I hope you got a better understanding of what we talked about here. If you have any questions, please do leave a comment below.
Until we meet again in a future blog, Cheers 🍻!