Cloudera deployment from Magic button

Dmytro Vedetskyi
DevOops World … and the Universe
Aug 31, 2020

Overview:

Suppose your business runs the Cloudera ecosystem in its own on-premises Big Data infrastructure and plans to migrate to AWS, or you need to deploy more than one Cloudera cluster in different regions to bring in new customers. If so, this topic is for you. It describes how to deploy a Cloudera cluster using automation tools, with on-demand scaling of the worker nodes.

Capabilities:

  • Create required resources in AWS
  • Provisioning instances
  • Installing Cloudera Manager
  • Adding workers to the Cloudera Cluster
  • Enabling data encryption and secure connections between services
  • Scaling YARN nodes on-demand

Cloudera

Cloudera, Inc. is a US-based software company that provides a software platform for data engineering, data warehousing, machine learning and analytics that runs in the cloud or on premises.

Cloudera started as a hybrid open-source Apache Hadoop distribution, CDH (Cloudera Distribution Including Apache Hadoop), that targeted enterprise-class deployments of that technology. Cloudera states that more than 50% of its engineering output is donated upstream to the various Apache-licensed open source projects (Apache Spark, Apache Hive, Apache Avro, Apache HBase, and so on) that combine to form the Apache Hadoop platform.

Ansible

Ansible is an open-source software provisioning, configuration management, and application-deployment tool enabling infrastructure as code. It runs on many Unix-like systems, and can configure both Unix-like systems as well as Microsoft Windows. It includes its own declarative language to describe system configuration. Ansible was written by Michael DeHaan and acquired by Red Hat in 2015. Ansible is agentless, temporarily connecting remotely via SSH or Windows Remote Management (allowing remote PowerShell execution) to do its tasks.

Architecture diagram:

Create required resources in AWS

The automation playbooks and roles are stored in a Git repository. A CI/CD tool runs the roles, which:

  • create EC2 instances (depending on the role properties)
  • create volumes
  • create Route53 (DNS) zones
  • create A and PTR records for instances
  • set hostnames
  • connect instances to the LDAP/AD server
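
As a rough illustration of the DNS step in the list above, the following Python sketch derives the A and PTR record pair for each instance. The zone name, hostname, and IP address are hypothetical, and the roles themselves do this work through Ansible rather than through this code.

```python
import ipaddress


def dns_records(hosts, zone):
    """Return A and PTR records for each host.

    hosts: mapping of short hostname -> private IPv4 address (hypothetical values).
    zone:  DNS zone name (an assumed internal zone, not taken from the article).
    """
    records = []
    for name, ip in sorted(hosts.items()):
        fqdn = f"{name}.{zone}."
        addr = ipaddress.ip_address(ip)
        # Forward record: name -> address.
        records.append({"type": "A", "name": fqdn, "value": str(addr)})
        # Reverse record: address (in in-addr.arpa form) -> name.
        records.append(
            {"type": "PTR", "name": addr.reverse_pointer + ".", "value": fqdn}
        )
    return records


example = dns_records({"worker-1": "10.0.1.15"}, "cluster.example.internal")
for record in example:
    print(record)
```

Keeping forward and reverse records in one pass matters here because Hadoop services and Kerberos generally expect forward and reverse lookups for a host to agree.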

Provisioning instances

A few steps are needed before installing the Cloudera cluster:

  • create manager nodes
  • create worker nodes
  • tune OS performance
  • enable NTP (vital for the cluster)
  • install required packages (Java, pip, etc.)
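
The steps above run against two kinds of hosts, managers and workers. One way to express that split for Ansible is an INI-style inventory with one group per node type. The sketch below only builds such a file as text; the group names and hostnames are assumptions made for illustration, not the names used in the original playbooks.

```python
def build_inventory(managers, workers):
    """Render a minimal INI-style Ansible inventory with two host groups.

    The group names are hypothetical; any names work as long as the
    playbooks target the same ones.
    """
    lines = ["[cloudera_managers]"]
    lines += managers
    lines += ["", "[cloudera_workers]"]
    lines += workers
    return "\n".join(lines) + "\n"


inventory = build_inventory(
    ["manager-1.example.internal"],
    ["worker-1.example.internal", "worker-2.example.internal"],
)
print(inventory)
```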

Installing Cloudera Manager

Cloudera Manager manages all services and nodes inside the cluster through the Cloudera agent, which is a Python application. Here is the schema with an example of the main node. As a baseline, the cluster should have at least four manager nodes:

  • One Cloudera Manager node that also hosts several services (yum repo, CDH repo, database, HDFS DataNode, etc.)
  • Three manager nodes (ZooKeeper, YARN ResourceManager, HBase RegionServer, HDFS DataNode, YARN, etc.)
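
The layout above can be written down as data and checked before deployment. The following is a minimal sketch: the hostnames are hypothetical, the role names are taken loosely from the list above, and placing every listed role on each of the three manager nodes is an assumption.

```python
# Hypothetical hostnames; role placement per node is an assumption.
SHARED_ROLES = ["ZooKeeper", "YARN ResourceManager", "HBase RegionServer", "HDFS DataNode"]

LAYOUT = {
    "manager-0": ["Cloudera Manager", "yum repo", "CDH repo", "Database", "HDFS DataNode"],
    "manager-1": list(SHARED_ROLES),
    "manager-2": list(SHARED_ROLES),
    "manager-3": list(SHARED_ROLES),
}


def validate_layout(layout, min_managers=4):
    """Return a list of problems; an empty list means the layout passes."""
    problems = []
    if len(layout) < min_managers:
        problems.append(
            f"expected at least {min_managers} manager nodes, got {len(layout)}"
        )
    cm_hosts = [host for host, roles in layout.items() if "Cloudera Manager" in roles]
    if len(cm_hosts) != 1:
        problems.append(
            f"expected exactly one Cloudera Manager host, got {len(cm_hosts)}"
        )
    return problems


print(validate_layout(LAYOUT))
```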

Adding workers to the Cloudera Cluster

By default, the cluster uses its existing nodes to run MapReduce (MR) jobs and to store data in HDFS and HBase. If MR jobs take too long, you can easily add nodes to increase cluster capacity.

A node needs only a couple of services to join the cluster and execute jobs in parallel:

  • Cloudera agent
  • YARN NodeManager

Enabling data encryption and secure connections between services

Data encryption is required for all projects and is a vital part of protecting personal data. Cloudera uses SSL/TLS certificates for connections between services. You can use self-signed certificates as well.

Java Secure Socket Extension overview:

Data that travels across a network can easily be accessed by someone who is not the intended recipient. When the data includes private information, such as passwords and credit card numbers, steps must be taken to make the data unintelligible to unauthorized parties. It is also important to ensure that the data has not been modified, either intentionally or unintentionally, during transport. The Secure Sockets Layer (SSL) and Transport Layer Security (TLS) protocols were designed to help protect the privacy and integrity of data while it is being transferred across a network.

The Java Secure Socket Extension (JSSE) enables secure Internet communications. It provides a framework and an implementation for a Java version of the SSL and TLS protocols and includes functionality for data encryption, server authentication, message integrity, and optional client authentication. Using JSSE, developers can provide for the secure passage of data between a client and a server running any application protocol (such as HTTP, Telnet, or FTP) over TCP/IP.
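
To make the ideas of server authentication and optional client authentication concrete, here is a small sketch using Python's standard `ssl` module rather than JSSE itself. It only builds the TLS contexts and opens no connection; the function names are made up for this example, and the TLS 1.2 minimum is an assumption, not a setting taken from the deployment.

```python
import ssl


def client_context():
    """Client side: verify the server certificate and hostname (server authentication)."""
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # assumed minimum
    return ctx


def server_context(require_client_cert=False):
    """Server side: client authentication stays off unless explicitly requested."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # assumed minimum
    if require_client_cert:
        # A real server would also load the CA that signed the client
        # certificates with ctx.load_verify_locations(...).
        ctx.verify_mode = ssl.CERT_REQUIRED
    # A real server would also call ctx.load_cert_chain(certfile, keyfile).
    return ctx


client = client_context()
server = server_context()
print(client.verify_mode, client.check_hostname, server.verify_mode)
```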

Secure Cloudera:

As a system designed to support vast amounts and types of data, Cloudera clusters must meet ever-evolving security requirements imposed by regulating agencies, governments, industries, and the general public. Cloudera clusters comprise both Hadoop core and ecosystem components, all of which must be protected from a variety of threats to ensure the confidentiality, integrity, and availability of all the cluster’s services and data.

More details are provided here: https://docs.cloudera.com/documentation/enterprise/5-14-x/topics/sg_edh_overview.html

Secure HDFS:

The Ansible roles enable Kerberos authentication for Hadoop and HDFS without manual actions. A role renders Jinja2 (j2) templates with the correct settings and uploads the result to the cluster configuration.

When Kerberos authentication is enabled, you must use a Kerberos keytab to get data from or put data into an HDFS folder. The keytab is tied to the same AD that the hosts use for SSH access.
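
In practice, this means obtaining a ticket from the keytab before touching HDFS. The sketch below assembles the two commands involved, `kinit -kt` followed by `hdfs dfs -put`, without running them. The keytab path, principal, and HDFS directory are hypothetical placeholders.

```python
import shlex


def hdfs_put_commands(keytab, principal, local_path, hdfs_dir):
    """Return the commands to authenticate with a keytab and then upload a file."""
    return [
        # Obtain a Kerberos ticket non-interactively from the keytab.
        ["kinit", "-kt", keytab, principal],
        # Copy the local file into the HDFS directory.
        ["hdfs", "dfs", "-put", local_path, hdfs_dir],
    ]


commands = hdfs_put_commands(
    "/etc/security/keytabs/etl.keytab",  # hypothetical keytab path
    "etl@EXAMPLE.COM",                   # hypothetical principal
    "data.csv",
    "/data/incoming",                    # hypothetical HDFS folder
)
for cmd in commands:
    print(shlex.join(cmd))
```

To actually execute the commands on a cluster host, each list could be passed to `subprocess.run(cmd, check=True)`.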

On-demand YARN nodes

When your data grows and you need to process it faster, the solution can add nodes to the cluster and dynamically start or stop them on demand to reduce costs. A few jobs manage the cluster: the first job adds nodes to the cluster and activates them; the next one checks the cluster status and acts according to the scaling algorithm.

For example, if a YARN job is running and has no free containers (resources), the job starts additional worker nodes to reduce processing time.
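
The article does not spell out the scaling algorithm, so the following is only one possible rule, written as a sketch. It reads two fields, `appsRunning` and `containersPending`, from the `clusterMetrics` object that the YARN ResourceManager REST API returns at `/ws/v1/cluster/metrics`. The function name, the `containers_per_worker` capacity figure, and the pool of stopped workers are assumptions.

```python
import math


def workers_to_start(metrics, containers_per_worker, stopped_workers):
    """Decide how many stopped worker nodes to start.

    metrics: dict shaped like the "clusterMetrics" object from the
             ResourceManager REST API; only two fields are read.
    containers_per_worker: assumed number of containers one worker can host.
    stopped_workers: number of pre-registered workers that are currently stopped.
    """
    running_apps = metrics.get("appsRunning", 0)
    pending = metrics.get("containersPending", 0)
    if running_apps == 0 or pending == 0:
        return 0  # nothing is waiting for resources
    needed = math.ceil(pending / containers_per_worker)
    # Never ask for more nodes than are available to start.
    return min(needed, stopped_workers)


sample = {"appsRunning": 2, "containersPending": 10}
print(workers_to_start(sample, containers_per_worker=4, stopped_workers=5))
```

A matching scale-down rule would stop workers once no containers are pending; it is left out here because the source gives no details about it.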

Summary

If your goal is to deploy a cluster on premises or migrate to AWS while spending less time and fewer resources, or to deploy multiple clusters, this solution is definitely for you. A cluster deploys from scratch in approximately 40 minutes to 1 hour. The job deploys the cluster automatically, without manual steps, and provides the URL and credentials for accessing the new cluster.

The main advantages of the solution are IaC and templating. Everything is described in code, which makes the templates easier to read, edit, and understand.

URLs:

https://www.cloudera.com/
https://en.wikipedia.org/wiki/Cloudera
https://www.ansible.com/
https://en.wikipedia.org/wiki/Ansible_(software)
https://docs.oracle.com/javase/9/security/java-secure-socket-extension-jsse-reference-guide.htm#JSSEC-GUID-93DEEE16-0B70-40E5-BBE7-55C3FD432345
https://docs.oracle.com/cd/E12440_01/rpm/pdf/141/html/merch_sg/apps-chapter%207.htm
https://docs.cloudera.com/documentation/enterprise/5-14-x/topics/sg_edh_overview.html
https://docs.cloudera.com/documentation/enterprise/5-2-x/topics/cdh_sg_secure_hdfs_config.html
