How to Hadoop

Following understanding the problem Hadoop is meant to fix, I now want to know more about Hadoop. I don’t want to dive into crazy detail… yet.

When we approach the limits of individual computers, we have begun to step into big data. When you start using unstructured data types and data from a variety of sources and hence in a variety of shapes, you need a clever solution.

Unsplash image by Christopher Alvarenga

Hadoop

Hadoop is an open source, java software that is built on Map-Reduce and is named after a toy elephant.

It aims to coordinate parallel processing and does this through giving batches of work to complete. Once all batches are completed, it can then give a response to that query. As such, they are not immediate but are very fast compared. Because of this speed issue, it can not can not handle immediacy like in OnLine Transaction Processing (OLTP) but is good for giving other analytics.

Hadoop is not a replacement for a database, it is a replacement for time (complicated, long computation time).

I will give an example here about “replacement for time” because I don’t like this definition. The New York Times wanted to store all stories from 1851–1922 on their website. They had scanned images which resulted in 4 Terabytes (1,000 Gigabytes) of image data. To host all these is expensive and potentially wasteful if no-one ever looks at them. An employee ingested all these images into Hadoop and Hadoop co-ordinated batches of work to be sent to a load of computers. The output was 1.5 Terabytes of PDF files, over 2/3 reduction in file size and hence cost. This was all done within 24 hours. Without Hadoop it could have taken weeks, months or years to process all those images and convert them to PDFs on a computer. Using Hadoop saved / replaced the time.

Limits of Hadoop

Hadoop is great but it isn’t a silver bullet.

If the data/operation needs to have “random access”, it shouldn’t be applied. WTF does that mean.

Again I am no expert but if I was looking at an excel sheet and you just shouted out particular cells I would be randomly accessing those cells. Essentially, I skip over a load and then do and operation, and do that again and again. Hadoop hates this on its own. It wants a list of “cells” which it can sequentially do those actions to. Again I am no expert can you tell. Please comment if you can help!

If the data/operation cannot be done parallel, it shouldn’t be applied. If actions can be done an item at a time and those do not affect any other items we can run them parallel. For example, we can add all the numbers to 10. It doesn’t matter what order we do them in. We can add them up in step or in any order. You can not do that if have multiplications, as the order matters. Each result is dependant on the ones around it. It can not be done in parallel.

(1+2)+(3+4)+(5+6)+(7+8)+(9+10) = 55
(1+10)+(3+8)+(5+6)+(7+4)+(9+2) = 55
(1+2)x(3+4)x(5+6)x(7+8)x(9+10) = 65,835
(1+10)x(3+8)x(5+6)x(7+4)x(9+2) = 161,051

If the data/operation are really simple, it shouldn’t be applied. It is overkill. Kill two birds with one small thermonuclear device. Swatting flies with a sledgehammer. Don’t do it.

How it works

If Hadoop is an Orchestra, there needs to be areas of players like the Woodwind section. If Hadoop is a football manager, there needs to be a defense and midfield, with players in each section.

Like the analogy above, players can be known as Nodes or more simply the place where some computation will happen like a computer. They are in teams/sections/areas know as Racks.

Why is this a thing? In an orchestra or a team you want you teams/sections/areas/racks to work together as they are doing a similar thing. They need to be able to speak to each other more often than between sections. Same thing in Hadoop.

A cluster is the group as a total — the entire football team, the entire orchestra, every computer you are controlling.

Nodes<Racks<Cluster.

You also need two more things:

  • A distributed file system, like Hadoop Distributed File System (HDFS)
  • A MapReduce Engine — something that can run the MapReduce algorithm — essentially a Java environment (place to run Java files)
  • Some sort of Tracking System, like YARN.

WTF is HDFS

Good question. It is like any normal file system. It is a bunch of folders where data is saved in them, and it allows you to jump about between these folders. HDFS though is a bit clever. It has been designed to be robust. If you have many nodes it may be likely one will suddenly stop for whatever reason. This is a pain as you will have lost some of the data, you won’t know what it has completed etc etc.

HDFS replicates all data sent to that node so it always knows what it was doing at that time and hence can reroute that data to be completed by a working node.

It does this through breaking data down into something called blocks. The default size of this block is 64MB (approx. 128,000 rows of a excel sheet I was using the other day). Though in many implementations 128MB is suggested. This means your huge “big data” file can be broken down and sent to a variety of nodes in these data packets / blocks. These are known as atomic which means Hadoop will never break a block down further. If you have more space on you drive on a node it doesn’t matter — it will not be used if it is less than the block size. If you file size doesn’t divide by 128MB it doesn’t matter you will just have one last block which is undersized.

These data blocks are replicated on multiple nodes, but only as storage.

WTF is MapReduce

It essentially scans/maps your data for key figures, and then organises/reduces it and ready for processing. Here’s a weird video. Here is a better weird video relating to Hadoop.

Why use MapReduce? We it sorts and simplifies processes so they can be streamlined and hence quicker to compute.

WTF is YARN

First, if you have a JavaScript background this is not the Yarn you are looking for. YARN stands for Yet Another Resource Negotiator. It is a resource management and job scheduling tool which has its own name so that, if needed, it isn’t fully tied to Hadoop versions.

To be simplistic, YARN looks at how much work each node can do and supplies the right amount to it. Additionally, Hadoop is always trying to match jobs to locations. This means if there is some data on one node, and a work item to be applied to this data, it will try to place the work item in the same node. If the node is full it will place it in the nearest neighbour in the rack. If it cannot do this it will have to place it in the next rack. This is suboptimal and less common.

Administration

Adding Nodes to the cluster

It is as simple as giving the IP address through a ssh connection.

Add services to a Node

Done through an Ambari Console (UI dashboard for Hadoop). We will talk about services later.

Remove Nodes

Remove the services in Ambari, then remove the Nodes in Ambari.

Check health and Disk Space

Ambari.

Configuration

Hadoop is configured with XML files. These are essentially lists on what you want to run. There are many different types we may need to use. An example of the configurations may be the block size you want to use (64MB or 128MB?).

Services/Components

There are many components and I would learn nothing from just scoring the web for them. Plus I am sure someone has done this already. I am going to look at some challenges and list solutions.

“Why do I have to write a query in Java! It is overly complex for what I want to do!”

Well say hello to my friends Pig (aka Pig Latin) and Hive. Both are higher level languages. Being overly simplistic, essentially Pig and Hive are shortcuts for writing Java. It can be easier to learn and will have most of the functionality for when you get started. Both can also be extended with custom Java if you feel up to it. Pig is its own language, but Hive is similar to SQL. Pig can deal with unstructured data much easier than Hive. Pig can be run within the execution of work, whereas Hive can be run after the work has finished. They both kind of do the same job but have different features and ease depending on what you want to query.

Apologies for how creepy they both are.

“I want to query but I have JSON structured data, why would I use SQL stuff? I like JSON”

Jaql would like to say hello. It also can help with other file formats now which means it is in a similar space to Pig and Hive. Pig and Hive essentially become MapReduce jobs where as Jaql can become something else and hence can potentially do more.

“My node finished its work but was really slow and then died trying to pass its results / log file to me. Why can’t it send data quicker!”

I know something than can help! Flume. It can help gather data and save it in your HDFS. Essentially you say to each Node, I will give you this data and you have to put your results over in this box. “This box”, aka “sink” in big boy terminology, can be your HDFS or it could be to something known as a collector. This collector takes inputs from many places and then does a bulk-conversion process to save it to HDFS or a similar.

“I have done the work now but I need the summarised results/data, how can I get it back!!!”

You need Sqoop. That is not a typo. Sqoop allows bulk transfers from your Hadoop system (typically HDFS) and your structured database.

“I have a lot of parallel tasks but then I need to do something that depends on the previous to continue.”

You need Oozie. Oozie allows you to mark work as a dependant on other work items and hence wont start until the conditions you define are satisfied.

Note: MapReduce is a service/component albeit a vital one for Hadoop. It does the organising of work to help simplify the load.

We are done

So my target was:

I now want to know more about Hadoop. I don’t want to dive into crazy detail… yet.

I am happy with that, I get the services and things and understand some of how Hadoop works.

Detail will be the topic for the next post!