Notes: Hadoop Platform and Application Framework

These are my notes from the course given by UC San Diego on the Coursera online course platform. I was already familiar with these concepts and had worked with them extensively on a Hadoop cluster at Vodafone Turkey, but refreshing them with a training course is always good practice. Also, please note that the course contains a lot of important information; these notes cover only the parts I needed to write down again.

Sqoop: Used for migrating relational databases to HDFS. It provides a command that connects to a MySQL database and launches MapReduce jobs to migrate the RDBMS data into HDFS. By setting parameters such as the Avro file format and the warehouse path, the data becomes ready to analyze with Hive or Impala queries. However, we need to put the automatically generated schema files into HDFS before running the queries, and then create tables using those schemas.
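A rough sketch of that workflow (the connection string, table, and paths are made up for illustration):

```
# Import a MySQL table into HDFS as Avro data files
sqoop import \
  --connect jdbc:mysql://dbhost/retail_db \
  --username dbuser -P \
  --table orders \
  --as-avrodatafile \
  --warehouse-dir /user/hive/warehouse

# Sqoop writes the generated Avro schema (orders.avsc) to the local
# working directory; copy it into HDFS so tables can reference it
hdfs dfs -mkdir -p /user/examples
hdfs dfs -put orders.avsc /user/examples/
```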

Hive and Impala are both SQL-like query languages used to query data stored in HDFS. Although they share the same metastore, the difference is that Hive executes queries as MapReduce jobs, whereas Impala performs the data analysis directly on the HDFS files. As a result, Impala executes queries faster than Hive.
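Continuing the sketch above, a table can be defined over the imported Avro files and then queried from either engine (all names are illustrative):

```
-- Create a table over the Sqoop output, using the uploaded schema file
CREATE EXTERNAL TABLE orders
STORED AS AVRO
LOCATION '/user/hive/warehouse/orders'
TBLPROPERTIES ('avro.schema.url'='/user/examples/orders.avsc');

-- Runs as a MapReduce job in Hive, or directly against HDFS in Impala
-- (in Impala, run INVALIDATE METADATA first so it sees the new table)
SELECT COUNT(*) FROM orders;
```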

Beeline lets you open a JDBC connection to Hive tables from the terminal (shell).
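For example (the host and user are assumptions; 10000 is HiveServer2's default port):

```
beeline -u jdbc:hive2://localhost:10000 -n cloudera -e "SHOW TABLES;"
```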

With Hadoop 2, there are multiple NameNodes rather than the single one in the first version of Hadoop, which increases namespace scalability; each NameNode has its own block pool. Hadoop 2 also brings a High Availability feature for the NameNode and the ResourceManager (to overcome the single point of failure). In addition, HDFS can use extra storage types such as SSD and RAM_DISK.
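A minimal hdfs-site.xml sketch of a High Availability NameNode pair, with a storage-type tag on a data directory (the nameservice, hosts, and paths are made up):

```
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<!-- Tag data directories with a storage type, e.g. SSD -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>[SSD]/mnt/ssd/dfs,[DISK]/mnt/disk/dfs</value>
</property>
```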

Hadoop 1: master node (JobTracker, NameNode); compute/data nodes (TaskTracker)

Hadoop 2: With YARN, job scheduling and resource management are separated. There is now a global ResourceManager, a NodeManager on each node, and an ApplicationMaster for each application. For each job submitted by a client, an ApplicationMaster is started on one of the data nodes, and it allocates containers on its own node or on other data nodes. The containers communicate with the ApplicationMaster, and the ApplicationMaster communicates with the ResourceManager, which reduces the ResourceManager's workload.
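For example, submitting a MapReduce job and watching YARN manage it (the jar name and directories are illustrative):

```
# Submit a job; YARN starts an ApplicationMaster for it on a data node
yarn jar hadoop-mapreduce-examples.jar wordcount /user/demo/in /user/demo/out

# List running applications, their states, and tracking URLs
yarn application -list
```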

For tasks that cannot be executed with the classical MapReduce approach, or can be executed only at a high cost (lots of mappers and reducers), there are special engines, namely Tez and Spark. The Tez engine decreases the overall number of mappers and reducers and enables faster processing; it also supports Directed Acyclic Graphs (DAGs). Spark, on the other hand, enables more advanced DAGs as well as cyclic data flows. Spark jobs can be written in Java, Scala, Python, and R. The most important benefit of Spark is in-memory computing, which speeds up iterative algorithms such as machine learning algorithms.
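A minimal PySpark sketch of why in-memory computing helps iteration (the input path is a made-up example):

```python
from pyspark import SparkContext

sc = SparkContext(appName="iterative-sketch")

# Parse the dataset once and keep it in memory across iterations
points = (sc.textFile("hdfs:///user/demo/points.txt")  # hypothetical path
            .map(lambda line: [float(x) for x in line.split(",")])
            .cache())

total = 0.0
for i in range(10):
    # Each pass reads the cached RDD from memory instead of re-reading
    # and re-parsing the files from HDFS, as plain MapReduce would
    total += points.map(lambda p: p[0]).sum()

print(total)
sc.stop()
```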


Originally published at Emre Calisir.