Hadoop

Ubex AI · Published in Ubex · Apr 19, 2019 · 6 min read


The Ubex project utilizes the Hadoop software as part of its infrastructure. In this installment, we will talk about what Hadoop is and how it is applied.

Hadoop is a project of the Apache Software Foundation: a freely distributed set of utilities, libraries and a framework for developing and executing distributed programs that run on clusters of hundreds or thousands of nodes. It is used to implement the search and contextual mechanisms of many high-load websites, including Yahoo! and Facebook. It is developed in Java within the MapReduce computational paradigm, according to which an application is divided into a large number of identical elementary tasks that can be executed on the cluster nodes and then naturally combined (reduced) into the final result.
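
To make the paradigm concrete, below is a minimal word-count sketch in Hadoop's native Java MapReduce API; the class names and input/output paths are illustrative and are not taken from the Ubex codebase. Each mapper handles one elementary task over its input split, and the reducers combine the partial results into the final answer.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // The "elementary task": each mapper processes one split of the input
        // and emits (word, 1) pairs.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // The reduce step: partial results from all mappers are combined
        // into the final count for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /data/input
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /data/output
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Submitted with hadoop jar, the framework schedules the map tasks close to the blocks of the input data and runs the reduce phase once the maps have completed.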

As of 2014, the project consists of four modules: Hadoop Common (middleware, a set of infrastructure software libraries and utilities used by the other modules and related projects), HDFS (a distributed file system), YARN (a system for scheduling jobs and managing the cluster) and Hadoop MapReduce (a platform for programming and executing distributed MapReduce computations). Previously, Hadoop included a number of other projects that have since become independent within the Apache Software Foundation project system.

Hadoop is considered one of the fundamental technologies of “big data”. A whole ecosystem of related projects and technologies has formed around Hadoop, many of which initially developed within the project and later became independent. Since the second half of the 2000s, the technology has been actively commercialized: several companies build their business entirely on creating commercial Hadoop distributions and technical support services for the ecosystem, and virtually all major providers of information technology for organizations include Hadoop, in one form or another, in their product strategies and solutions.

Hadoop Common includes libraries for managing the file systems supported by Hadoop and scripts for creating the necessary infrastructure and managing distributed processing. For convenience, a specialized simplified command-line interpreter (the FS shell, filesystem shell) has been created; it is launched from the operating system shell as hdfs dfs -command URI, where command is an interpreter command and URI is a list of resources with a prefix indicating the type of file system, for example hdfs://example.com/file1 or file:///tmp/local/file2. Most of the interpreter commands are implemented by analogy with the corresponding Unix commands (for example cat, chmod, chown, chgrp, cp, du, ls, mkdir, mv, rm, tail); moreover, some flags of the corresponding Unix commands are supported, such as the recursion flag -R for chmod, chown and chgrp. There are also commands specific to Hadoop (for example, count counts the number of directories, files and bytes under a given path, expunge empties the trash, and setrep changes the replication factor for a given resource).
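
The FS shell is a thin layer over Hadoop's Java FileSystem API, so the same operations can also be performed programmatically. The following is a rough sketch of equivalents for mkdir, ls, setrep and count; the host name and paths are illustrative.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FsShellEquivalents {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Equivalent of addressing "hdfs://example.com/..." from the FS shell;
            // the host name is illustrative.
            FileSystem fs = FileSystem.get(URI.create("hdfs://example.com/"), conf);

            // hdfs dfs -mkdir /user/demo
            fs.mkdirs(new Path("/user/demo"));

            // hdfs dfs -ls /user/demo
            for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }

            // hdfs dfs -setrep 3 /user/demo/file1
            fs.setReplication(new Path("/user/demo/file1"), (short) 3);

            // hdfs dfs -count /user/demo
            ContentSummary summary = fs.getContentSummary(new Path("/user/demo"));
            System.out.println(summary.getDirectoryCount() + " dirs, "
                    + summary.getFileCount() + " files, "
                    + summary.getLength() + " bytes");

            fs.close();
        }
    }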

HDFS (Hadoop Distributed File System) is a file system designed to store large files, distributed block by block across the nodes of a computing cluster. All blocks in HDFS (except the last block of a file) have the same size, and each block can be placed on several nodes; the block size and the replication factor (the number of nodes on which each block must be placed) are set at the level of the individual file. Replication ensures that the distributed system remains resilient to failures of individual nodes. Files in HDFS can be written only once (modification is not supported), and only one process can write to a file at any given time. The organization of files in the namespace is the traditional hierarchical one: there is a root directory, nested directories are supported, and files and other directories can reside in the same directory.
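
Because the block size and replication factor are per-file settings, a client can supply them when the file is created. The sketch below assumes a configured HDFS client and uses an illustrative path; the 128 MB block size and replication factor of 3 are example values, not recommendations.

    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteOnce {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/demo/events.log");

            // create(path, overwrite, bufferSize, replication, blockSize):
            // each 128 MB block of this file will be stored on 3 data nodes.
            short replication = 3;
            long blockSize = 128L * 1024 * 1024;
            try (OutputStream out = fs.create(file, true, 4096, replication, blockSize)) {
                out.write("written exactly once\n".getBytes(StandardCharsets.UTF_8));
            }
            // HDFS files are write-once: the contents can be read back, but the
            // existing bytes cannot be modified in place.
        }
    }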

An HDFS deployment consists of a central name node (NameNode), which stores the file system metadata and meta-information about block allocation, and a set of data nodes (DataNodes), which directly store the file blocks. The name node is responsible for file- and directory-level operations (opening and closing files, manipulating directories), while the data nodes handle the actual writing and reading of data. The name node and data nodes run embedded web servers that display the current status of the nodes and allow the contents of the file system to be browsed. Administrative functions are available from the command-line interface.
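
Since the name node holds the block-allocation metadata, a client can ask it where the blocks of a given file are physically stored. A small sketch, with an illustrative path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/demo/events.log"));

            // The name node answers this query from its in-memory metadata;
            // the hosts listed are the data nodes actually holding each block.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + ", length " + block.getLength()
                        + ", hosts " + String.join(",", block.getHosts()));
            }
        }
    }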

HDFS is an integral part of the project; however, Hadoop can also work with other distributed file systems without using HDFS, and support for Amazon S3 and CloudStore is implemented in the main distribution. On the other hand, HDFS can be used not only for running MapReduce jobs but also as a general-purpose distributed file system; in particular, the distributed NoSQL DBMS HBase is implemented on top of it, and Apache Mahout, a scalable machine learning system, runs in its environment.
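
Because the FileSystem abstraction is pluggable, the same client code can target a non-HDFS store simply by using a different URI scheme. A hedged sketch, assuming the S3A connector (the hadoop-aws module) is on the classpath and using an illustrative bucket name:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3Listing {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Credentials would normally come from core-site.xml or the environment.
            FileSystem s3 = FileSystem.get(URI.create("s3a://example-bucket/"), conf);
            for (FileStatus status : s3.listStatus(new Path("/logs"))) {
                System.out.println(status.getPath());
            }
        }
    }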

One of the main goals of Hadoop was from the outset to provide horizontal scalability of the cluster by adding inexpensive nodes (commodity hardware), without resorting to powerful servers and expensive storage area networks. Functioning clusters of thousands of nodes confirm the feasibility and economic efficiency of such systems; as of 2011, large Hadoop clusters were operating at Yahoo (more than 4,000 nodes with a total storage capacity of 15 PB), Facebook (about 2,000 nodes storing 21 PB) and eBay (700 nodes with 16 PB). Nevertheless, horizontal scalability in Hadoop systems is considered limited: for Hadoop before version 2.0 the practical maximum was estimated at about 4,000 nodes with 10 MapReduce tasks per node. This restriction was largely due to the concentration of task life-cycle monitoring functions in the MapReduce module; with their move to the YARN module in Hadoop 2.0 and the accompanying decentralization (distributing some of the monitoring functions to the processing nodes), horizontal scalability is considered to have improved.

Another limitation of Hadoop systems is the amount of RAM on the name node (NameNode), which stores the entire namespace of the cluster used to distribute processing; the total number of files that the name node is able to handle is about 100 million. To overcome this limitation, work is underway to split the name node, which in the current architecture is a single node shared by the entire cluster, into several independent nodes. Another way around this limitation is to use a distributed DBMS on top of HDFS, such as HBase, in which the role of files and directories is played, from the application's point of view, by records in one large database table.

As of 2011, a typical cluster was built from single-processor, multi-core x86-64 nodes running Linux with 3–12 disk storage devices, connected by a network with 1 Gbps bandwidth. There are trends both toward reducing the computational power of nodes by using low-power processors (ARM, Intel Atom) and toward using high-performance compute nodes together with high-bandwidth network solutions (InfiniBand in the Oracle Big Data Appliance, high-performance 10 Gbps Fibre Channel and Ethernet storage networks in FlexPod “big data” reference configurations).

The scalability of Hadoop systems largely depends on the characteristics of the data being processed, above all its internal structure and the features of extracting the necessary information from it, and on the complexity of the processing task, which in turn dictate the organization of the processing cycles, the computational intensity of atomic operations and, ultimately, the level of parallelism and the cluster load. The Hadoop manual (for the first versions, before 2.0) indicated that an acceptable level of parallelism is 10–100 instances of basic handlers (maps) per cluster node, and up to 300 for tasks that do not require significant CPU time; for reduce tasks it was considered optimal to use a number equal to the number of nodes multiplied by a coefficient in the range from 0.95 to 1.75 and by the constant mapred.tasktracker.reduce.tasks.maximum. With the higher coefficient value, the fastest nodes, having finished the first round of reduces, receive a second portion of intermediate pairs for processing earlier; increasing the coefficient thus loads the cluster more heavily but provides more efficient load balancing. In YARN, configuration constants are used instead, which define the amounts of RAM and the numbers of virtual processor cores available to the resource scheduler and on the basis of which the level of parallelism is determined, as sketched below.
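
As a worked example of the pre-2.0 guidance on reduce counts, the recommended number of reduce tasks can be computed and applied to a job roughly as follows; the node count and per-node slot value are illustrative, not measurements from a real cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReduceSlotsExample {
        public static void main(String[] args) throws Exception {
            // Illustrative cluster parameters, not taken from a real deployment.
            int workerNodes = 40;
            int maxReduceSlotsPerNode = 2;   // mapred.tasktracker.reduce.tasks.maximum
            double coefficient = 0.95;       // 0.95..1.75 per the old manual

            int reduceTasks = (int) (workerNodes * maxReduceSlotsPerNode * coefficient);
            // 40 * 2 * 0.95 = 76 reduce tasks

            Job job = Job.getInstance(new Configuration(), "tuned job");
            job.setNumReduceTasks(reduceTasks);
            System.out.println("Requested " + reduceTasks + " reduce tasks");
        }
    }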

The scalability and flexibility offered by Hadoop are a perfect fit for the Ubex project and the massive scale of its operations. Stay tuned for more updates from Ubex as the project continues to develop.
