Gazanfur ali Mohammed
Published in The Startup · Oct 15, 2019

Big Data: How to Plan and Manage a Multi-tenant Hadoop Cluster Seamlessly

Big data is a broad term for massive volumes of both structured and unstructured data that cannot be stored and processed in a more “traditional” manner. Hadoop is a distributed data store that provides a platform for implementing powerful parallel processing frameworks.

Today, nearly all companies are data-dependent, and those that are not yet will be soon. Corporations want to store big data, perform analytics, and draw conclusions from the data. Health care ventures save patients’ vital readings, and financial companies store and track every activity performed on their portals. Big data is even used in fields like sports to analyze games and prepare game plans. Airline companies track every moment of their flights. Everyone understands the significance of big data and its impact on business strategies.

There are many factors to consider when building big data platforms such as Hadoop clusters. A Hadoop cluster can be extended seamlessly as long as the hardware is readily available. However, it is essential to estimate the initial size of a cluster so one does not have to worry about cluster expansion immediately. Based on one’s organizational needs, one can have multiple small clusters, one for each organization, or one massive cluster serving all the various organizations. There are pros and cons to both approaches.

Hadoop Cluster Architecture

It is common for Hadoop clusters to serve multiple users, groups, and application types. Supporting multi-tenant clusters involves several important considerations, such as cluster architecture, security, monitoring, and manageability. It is necessary to prepare an inventory of applications that will run on the cluster. Different applications need different compute resources, such as CPU, memory, storage, and network. Although it will be difficult to estimate the end-state compute capacity requirement at the beginning, it is still possible to get a fair understanding of how much initial compute capacity is needed.

It is more common for organizations to deploy a vendor-supported Hadoop distribution, such as Cloudera, Hortonworks, or MapR. In this article, some of the features I refer to are related to the MapR distribution. Sizing a Hadoop cluster involves sizing the storage, memory, network, and other resources that are part of the cluster.

There are four types of nodes in a Hadoop cluster.

Master nodes: They contain all of the primary services making up the backbone of Hadoop.

Worker nodes: They handle the bulk of what a Hadoop cluster does, which is store and process data.

Management Nodes: These nodes provide the mechanism to install, configure, monitor, and otherwise maintain the Hadoop cluster.

Edge Nodes: These nodes host web interfaces, proxies, and client configurations that ultimately provide the mechanism for users to take advantage of the combined storage and computing system that is Hadoop.

Depending on high availability requirements, the number of master nodes will vary. Technically, all primary services can be run on a single node; however, based on workloads and high availability requirements, one can run each service on multiple dedicated nodes. It is important to plan for high availability for the major master services, such as ZooKeeper, the ResourceManager, and the Hive metastore database. Though there are several options for the Hive metastore database, MySQL is commonly used for this purpose.
Edge node capacity depends on how many users will be connecting and what kinds of jobs will be run from the edge nodes.
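
As a minimal sketch of what master-service high availability looks like in plain Apache Hadoop YARN, ResourceManager HA is typically configured along the following lines; the hostnames, IDs, and ports below are placeholders, and MapR and other distributions handle much of this through their own tooling.

yarn-site.xml

<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>yarncluster1</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
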
A sample three-rack cluster

Storage: By default, Hadoop uses a replication factor of three, which means it creates and maintains three copies of the data. Based on the topology, those copies can be placed on different nodes within the same rack or spread across racks. A multi-rack cluster is preferred for high data availability: in that configuration, if an entire rack goes down, two copies of the data still reside on nodes in other racks. During storage capacity planning, one also needs to account for storage used outside of HDFS. For example, the OS needs space for its logs, additional software installations, and maintenance, and specific ecosystem services may require a considerable amount of storage for their logs and core dumps for troubleshooting.
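
As a rough, purely illustrative sizing exercise (the numbers are assumptions): to hold 100 TB of data at a replication factor of three, HDFS needs about 300 TB of raw disk; if roughly 25% of each node’s capacity is reserved for the OS, logs, and temporary or intermediate data, the raw requirement rises to about 300 TB / 0.75 ≈ 400 TB, before leaving any headroom for growth.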

Topology: A rack-aware topology helps to segregate resource and storage usage.
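
As a minimal sketch for plain Apache Hadoop (vendor distributions usually manage this through their own tooling), rack awareness is enabled by pointing the cluster at a topology script that maps each host or IP address to a rack ID; the script path below is only a placeholder.

core-site.xml

<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.sh</value>
</property>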

Data replication in a multi-rack configuration

How Do You Share Resources Between Organizations?

Generally, organizations like to have their own cluster for two main reasons.

1. They want to have full control of the cluster resources so they can allocate more resources to their jobs and run more tasks in parallel, particularly during month/quarter ends to meet their SLAs.

2. They do not want another organization’s applications consuming more resources and causing their own operations to run slowly.

Two resource-sharing mechanisms, the Capacity Scheduler and the Fair Scheduler, can address these concerns. Both schedulers use the concept of queues. One can create one or more queues for each organization and set up resource limits per queue. Furthermore, one can create sub-queues in either of the schedulers. Each queue is allocated a fraction of the cluster’s capacity, in the sense that a certain share of the resources will be at its disposal. Soft and hard limits can be configured per queue.

The Fair Scheduler is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. The Capacity Scheduler is designed to guarantee each queue a minimum capacity.
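
For comparison, a minimal capacity-scheduler.xml along the following lines would give each queue a guaranteed minimum share (capacity) and a hard cap (maximum-capacity); the queue names and percentages are illustrative assumptions only.

capacity-scheduler.xml

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>finqueue,mktqueue</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.finqueue.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.finqueue.maximum-capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.mktqueue.capacity</name>
  <value>40</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.mktqueue.maximum-capacity</name>
  <value>60</value>
</property>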

I prefer to use the Fair Scheduler with time-sharing. With this approach, each organization can utilize the maximum resources it needs without impacting the others. Sample Fair Scheduler configuration files:

yarn-site.xml

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>

fair-scheduler.xml

<queue name="finqueue">
  <aclSubmitApps>finqueue</aclSubmitApps>
  <minResources>5000 mb,5 vcores,5 disks</minResources>
  <maxResources>25000 mb,15 vcores,15 disks</maxResources>
</queue>

<queue name="mktqueue">
  <aclSubmitApps>mktqueue</aclSubmitApps>
  <minResources>4096 mb,1 vcores,1 disks</minResources>
  <maxResources>8192 mb,10 vcores,4 disks</maxResources>
</queue>
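
Applications usually land in a queue either because the submitter names one explicitly (for example, via the mapreduce.job.queuename property or spark-submit --queue) or because a placement policy routes them. A placement policy along the lines of the sketch below, added to fair-scheduler.xml, is one illustrative way to keep tenants inside their predefined queues; these rules are an assumption for illustration, not part of the original setup.

<queuePlacementPolicy>
  <rule name="specified" create="false" />
  <rule name="primaryGroup" create="false" />
  <rule name="reject" />
</queuePlacementPolicy>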

How to Manage Cluster Security and Accessibility

Just like in any other information system or computing platform, data confidentiality, integrity, and availability are crucial in Hadoop. In a multi-tenant environment where the cluster is shared among multiple organizations, such as HR, finance, and marketing, controls need to be in place to restrict data access based on pre-defined authorization.

Authentication, authorization, and accounting refer to an architectural pattern in computer security where users of a service prove their identity, are granted access based on rules, and a record of each user’s actions is maintained for auditing purposes.

While it is necessary for all users of the cluster to be provisioned on all of the servers in the cluster, it is not required to enable local or remote shell access for all of those users. Hadoop and the related ecosystem can be authenticated using Kerberos and delegation tokens. Some of the vendor-supported Hadoop distributions support Linux pluggable authentication modules (PAM) for plain username and password authentication.
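
As a minimal sketch for plain Apache Hadoop (vendor tooling usually generates this as part of enabling secure mode), Kerberos authentication and service-level authorization are switched on in core-site.xml roughly as follows.

core-site.xml

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>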

The Hadoop ecosystem consists of many components, such as Sqoop, Hive, Impala, Pig, and Flume. Traditionally, each of these components has its own authorization mechanism. Apache Sentry provides centralized fine-grained role-based access control (RBAC), giving administrators more flexibility to control what users can access.

Below is a sample access model.
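
As a purely hypothetical illustration (the groups, roles, and privileges below are assumptions, not the original model), such a model could map each organization’s groups to Sentry roles along these lines:

Group            Sentry role      Access granted
fin_analysts     fin_read_role    SELECT on the finance databases
fin_etl          fin_etl_role     ALL on the finance databases
mkt_analysts     mkt_read_role    SELECT on the marketing databases
hadoop_admins    admin_role       ALL on the server (cluster administration)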

Monitoring and Alerting: Hadoop clusters consist of many ecosystem components and services. A sophisticated monitoring mechanism is required to cover various aspects of the cluster, such as service outages, performance metrics, component-specific issues, job failures, long-running jobs, and infrastructure issues such as node, disk, and network failures. A single monitoring tool may not be able to support all these metrics, in which case one will have to use two or more monitoring tools based on one’s specific monitoring needs.

Tools such as Nagios and Ganglia can be used to monitor the cluster’s resource utilization and trends. Ganglia is a scalable open-source cluster performance monitoring tool that runs on a wide range of operating systems. It helps to monitor CPU and memory usage across the cluster’s nodes and to track which nodes are up or down.

There are several built-in alerts available on various metrics, such as volume advisory quotas, high resource utilization, unresponsive services, nodes going down, disk controller issues, etc. By implementing sophisticated custom monitoring and self-healing tools, we were able to reduce manual intervention drastically by addressing the alerts automatically. For example, if an organization’s resource utilization approaches a threshold, the system generates an automated notification with details such as the currently running jobs, their resource utilization, and upcoming scheduled jobs. No single tool was found that could cover all the custom requirements, so a few different tools for monitoring, performance management, and configuration management were used.

Configuration Management: Configuration management typically includes creating users, granting privileges, creating volumes, setting up quotas, scheduling snapshots, setting up alerts, etc. These tasks can be performed either through a vendor-provided console or with CLI commands. In the MapR distribution, this configuration can be done using the MapR Control System (MCS).

Performance Management: It is instrumental to have cluster performance metrics handy. OS-level performance metrics, such as CPU and memory utilization, can be accessed using open-source tools such as Ganglia. The resource manager provides per-job resource utilization. Third-party tools are also available, such as Pepperdata, which provide a more granular breakdown of resource utilization, for example component-wise resource utilization, job-specific performance-tuning recommendations, and more.

Upgrades/Patching: Automating maintenance activities not only reduces the overall execution time but also helps to prevent human errors. Most bug fixes or patches can be applied in a rolling-upgrade fashion. A rolling upgrade is a method in which a subset of nodes is upgraded at a time, without bringing down the entire cluster and causing an outage. Ecosystem component patches may need to be applied only on the nodes where the given services are running. Sometimes major version upgrades require the entire cluster to be brought down. Keep in mind that any maintenance activity on the Hive metastore can restrict access to metadata and cause cluster-wide outages. Ansible was used to automate upgrades, bug fixes, node additions, and other maintenance activities.

Conclusion

Before architecting a Hadoop cluster, it is essential to understand the user requirements. It is not necessary to focus on the end state at the beginning of the project; instead, start with a small cluster and keep extending it as the data grows. Apache Hadoop is the base for all vendor-supported Hadoop distributions, so good knowledge of Apache Hadoop will help in the long run. There are many ecosystem components available to ingest, process, and present the data; one’s specific requirements will determine which components to use. There are open-source and third-party tools available to monitor the Hadoop cluster. A good understanding of the user requirements, ecosystem components, and monitoring tools will help to facilitate a quick start.
