Cloudera Cluster Deployment Automation

Upstream Engineering · Dec 18, 2020

Written by Fanis Korlos

Photo by Taylor Vick on Unsplash

Big Data can be defined using the famous 3 Vs: Volume, Velocity and Variety. One might think that such a notion would interest only global behemoths, the likes of Google and Amazon. Nevertheless, during the last 10 years, which have seen an unprecedented boom in software startups, many small to medium-sized companies have come to realize the merits of crunching the data they produce. Such a practice can lead to enhanced user experience, faster troubleshooting, and better decision making overall.

At Upstream we have chosen well-established, open-source, Apache-licensed platforms such as Hadoop, Spark and Kafka to handle the events that our applications produce. As part of our effort to evaluate reliable engines for SQL processing over data stored in HDFS, we decided to try the Impala query engine.

Since the existing documentation on installing the Apache-licensed Impala daemons was quite limited, we chose to install Impala using the version shipped by Cloudera.

Cloudera to the rescue

Cloudera is a US-based company which offers CDH (Cloudera's Distribution including Apache Hadoop), an open-source platform distribution. At the heart of CDH lies Cloudera Manager, which allows managing CDH clusters end-to-end and comprises the following main components:

  • Cloudera Manager Server: An application that includes an Admin Console Web UI as well as an API, and is responsible for installing, configuring and managing the cluster on which the Big Data services run.
  • Cloudera Manager Agent: An application installed on each host of the cluster, responsible for monitoring the host and managing the running services based on instructions from the Server.

Setting up a Cloudera Manager Cluster

The process of installing a Cloudera Manager Cluster consists of two steps:

  1. The first one has to do with the installation of the Cloudera Manager Server and Agent services on our hosts. At Upstream we use the Ansible orchestration tool for OS provisioning and software installation, hence it was the obvious choice for the implementation of this step.
  2. The second one has to do with the installation of the Big Data services (Hadoop, Spark, Impala etc.) via the Cloudera Manager Server. The easiest and best-documented way to do this is via the Cloudera Manager Server Web UI, an intuitive wizard-like interface that allows the user to select which service will be installed on which host, and to perform the required configuration.

But what if, apart from our main production cluster, we need to set up a second one for new feature testing, or a third one for performance testing? What if we need to spin up a whole short-lived cluster just to investigate the feasibility of a new idea? In each case, the administrator has to manually repeat the tedious task of applying the required settings via the UI, adjusting them every time to the needs of the specific cluster.

That is where the Cloudera Manager API comes in handy. Since the step-by-step process of creating a Cloudera cluster via the API is not clearly documented, and the API offers an overwhelming number of options, we decided to investigate how to utilize it to produce a fully automated deployment procedure.

Utilizing the Cloudera Manager API

In order to showcase our approach to utilizing the API, we will provide an example of setting up an HDFS service that spans three hosts, one NameNode and two DataNodes, following discrete and intuitive steps.

We assume that the hosts used in our demonstration have been set up with the FQDNs host-master-fqdn, host-slave1-fqdn and host-slave2-fqdn respectively. As a prerequisite, the Cloudera Manager Agent service has been installed on all three of them, and a Cloudera Manager Server instance is running on host host-master-fqdn. Finally, the parcel including the software versions of the components we need to install has been successfully distributed and activated. A parcel is a Cloudera-specific binary distribution format containing the program files, along with additional metadata used by Cloudera Manager.

Cloudera Manager Server by default listens on port 7180, which provides access to both the Web UI and the REST API. All the following commands are issued on host host-master-fqdn. Note that the output of the issued commands corresponds to Cloudera Manager 6.3.0 (API v33). [API Ref]

First, we retrieve the API version, which is needed to construct the API calls that follow:

$ curl -XGET -u admin:admin http://localhost:7180/api/version
v33

Cloudera Manager assigns a hostId UUID value to each host that runs the Cloudera Manager Agent daemon and sends heartbeats to the Cloudera Manager Server (output abridged to show only the fields of interest):

$ curl -XGET -u admin:admin http://localhost:7180/api/v33/hosts
{
  "items" : [ {
    "hostId" : "e7e98d00-e7d1-4ab9-a2be-5716e30c1346",
    "hostname" : "host-master-fqdn"
  }, {
    "hostId" : "33376508-c3d8-452f-a0f9-f50f770c2bea",
    "hostname" : "host-slave1-fqdn"
  }, {
    "hostId" : "412f8599-f5f4-4193-b454-5f42506011e6",
    "hostname" : "host-slave2-fqdn"
  } ]
}

This hostId-to-FQDN mapping will come in handy in various subsequent API calls, since Cloudera Manager recognizes hosts by their hostId.
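To give a feel for how this mapping is consumed programmatically, here is a minimal sketch (assuming the default admin:admin credentials and the v33 endpoint used throughout this post; the helper name is ours) that builds the FQDN-to-hostId dictionary with the Python requests library:

import requests

CM_URL = "http://localhost:7180/api/v33"
AUTH = ("admin", "admin")  # default Cloudera Manager credentials

def fqdn_to_host_id():
    """Map each host FQDN to the hostId assigned by Cloudera Manager."""
    response = requests.get(f"{CM_URL}/hosts", auth=AUTH)
    response.raise_for_status()
    return {host["hostname"]: host["hostId"]
            for host in response.json()["items"]}

# e.g. host_ids["host-slave1-fqdn"] -> "33376508-c3d8-452f-a0f9-f50f770c2bea"
host_ids = fqdn_to_host_id()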

A Cloudera Manager installation can orchestrate multiple clusters, but we use the convention of one cluster per installation to keep things simple. In order to add a new cluster named upstream:

$ curl -XPOST -u admin:admin -H "content-type:application/json" -d @cm-cluster http://localhost:7180/api/v33/clusters
$ cat cm-cluster
{
  "items" : [ {
    "name" : "upstream",
    "version" : "CDH6",
    "fullVersion" : "6.3.0"
  } ]
}

We add our hosts to the newly created cluster, using the hostId values retrieved in the previous step:

$ curl -XPOST -u admin:admin -H "content-type:application/json" -d @cm-cluster-hosts http://localhost:7180/api/v33/clusters/upstream/hosts
$ cat cm-cluster-hosts
{
  "items" : [ {
    "hostId" : "e7e98d00-e7d1-4ab9-a2be-5716e30c1346"
  }, {
    "hostId" : "33376508-c3d8-452f-a0f9-f50f770c2bea"
  }, {
    "hostId" : "412f8599-f5f4-4193-b454-5f42506011e6"
  } ]
}

The next step is to check the serviceTypes that our CDH cluster supports (the output might differ depending on the CDH and parcel versions):

$ curl -XGET -u admin:admin http://localhost:7180/api/v33/clusters/upstream/serviceTypes
{
  "items" : [ "SOLR", "ACCUMULO_C6", "ADLS_CONNECTOR", "LUNA_KMS", "HBASE", "SENTRY", "HIVE", "KUDU", "HUE", "FLUME", "DATA_CONTEXT_CONNECTOR", "SPARK_ON_YARN", "THALES_KMS", "HIVE_EXEC", "HDFS", "OOZIE", "ISILON", "SQOOP_CLIENT", "KS_INDEXER", "ZOOKEEPER", "YARN", "KMS", "KEYTRUSTEE", "KEYTRUSTEE_SERVER", "KAFKA", "IMPALA", "AWS_S3" ]
}

In order to create the HDFS service, the caller can pack all the relevant information (service configuration, roles, configuration groups) into one API call, but we choose to set up the service piecemeal to maintain simplicity and readability.
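For reference, here is a minimal sketch (same credentials and endpoint as above) of what such a combined call could look like, nesting the service-wide configuration and the role assignments that the following steps introduce individually:

import requests

CM_URL = "http://localhost:7180/api/v33"
AUTH = ("admin", "admin")

payload = {
    "items": [{
        "name": "hdfs",
        "type": "HDFS",
        # service-wide configuration, applied at creation time
        "config": {"items": [{"name": "dfs_replication", "value": "2"}]},
        # role assignments; the DATANODE entries follow the same pattern
        "roles": [{
            "name": "hdfs-NAMENODE_master",
            "type": "NAMENODE",
            "hostRef": {"hostId": "e7e98d00-e7d1-4ab9-a2be-5716e30c1346"},
        }],
    }]
}
requests.post(f"{CM_URL}/clusters/upstream/services",
              json=payload, auth=AUTH).raise_for_status()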

First, we enable the HDFS service in our cluster:

$ curl -XPOST -u admin:admin -H "content-type:application/json" -d @cm-service http://localhost:7180/api/v33/clusters/upstream/services
$ cat cm-service
{
  "items" : [ {
    "name" : "hdfs",
    "type" : "HDFS"
  } ]
}

Checking the available RoleTypes of our HDFS service requires the following call:

$ curl -XGET -u admin:admin http://localhost:7180/api/v33/clusters/upstream/services/hdfs/roleTypes
{
  "items" : [ "DATANODE", "NAMENODE", "SECONDARYNAMENODE", "BALANCER", "GATEWAY", "HTTPFS", "FAILOVERCONTROLLER", "JOURNALNODE", "NFSGATEWAY" ]
}

Our goal is to assign the NAMENODE role to host host-master-fqdn, and the DATANODE role to our slave hosts host-slave1-fqdn and host-slave2-fqdn. Once again we utilize the hostIds that we retrieved earlier. We define a name for each role that we assign; by convention, one that includes the name of the service, the role type and an indicative part of the host FQDN. If such a name is not provided in the API call, Cloudera Manager will automatically generate one.

$ curl -XPOST -u admin:admin -H "content-type:application/json" -d @cm-roles http://localhost:7180/api/v33/clusters/upstream/services/hdfs/roles
$ cat cm-roles
{
  "items" : [ {
    "name" : "hdfs-NAMENODE_master",
    "type" : "NAMENODE",
    "hostRef" : {
      "hostId" : "e7e98d00-e7d1-4ab9-a2be-5716e30c1346"
    }
  }, {
    "name" : "hdfs-DATANODE_slave1",
    "type" : "DATANODE",
    "hostRef" : {
      "hostId" : "33376508-c3d8-452f-a0f9-f50f770c2bea"
    }
  }, {
    "name" : "hdfs-DATANODE_slave2",
    "type" : "DATANODE",
    "hostRef" : {
      "hostId" : "412f8599-f5f4-4193-b454-5f42506011e6"
    }
  } ]
}

As a final step, we need to define the configuration that will be applied to our newly installed HDFS service.

Service configuration is separated into Service-wide and RoleType-wide configuration.

The distinction between the two lies in the fact that the former typically includes settings that affect multiple role types, such as the HDFS Replication Factor, whereas the latter is a template that is inherited by the instances of a specific role type, for example each DataNode.

Service-wide configuration

As part of our example, we will update the value of the HDFS Replication Factor, which is represented by the dfs_replication parameter (default value: 3):

$ curl -XPUT -u admin:admin -H "content-type:application/json" -d @cm-service-config http://localhost:7180/api/v33/clusters/upstream/services/hdfs/config
$ cat cm-service-config
{
  "items" : [ {
    "name" : "dfs_replication",
    "value" : "2"
  } ]
}

RoleType-wide configuration

Finally, we need to define the configuration that will be applied to each of our role types. For that purpose we make use of the roleConfigGroups resource of the API:

$ curl -XGET -u admin:admin http://localhost:7180/api/v33/clusters/upstream/services/hdfs/roleConfigGroups | jq '.items[].name'
"hdfs-NAMENODE-BASE"
"hdfs-FAILOVERCONTROLLER-BASE"
"hdfs-SECONDARYNAMENODE-BASE"
"hdfs-DATANODE-BASE"
"hdfs-BALANCER-BASE"
"hdfs-GATEWAY-BASE"
"hdfs-JOURNALNODE-BASE"
"hdfs-HTTPFS-BASE"
"hdfs-NFSGATEWAY-BASE"

Cloudera Manager creates one roleConfigGroup per supported roleType, using the naming convention <service_name>-<RoleType>-BASE. We can review the current values of our configured parameters (e.g. for NAMENODE):

$ curl -XGET -u admin:admin http://localhost:7180/api/v33/clusters/upstream/services/hdfs/roleConfigGroups/hdfs-NAMENODE-BASE/config
{
  "items" : [ {
    "name" : "dfs_namenode_servicerpc_address",
    "value" : "8022",
    "sensitive" : false
  } ]
}

The call returns only the parameters for which we have overridden the default value. In order to review the full list of supported parameters (which can include dozens or hundreds of items) along with a short description of each one, we can append the view=FULL query string to the URL (output omitted for brevity):

$ curl -XGET -u admin:admin http://localhost:7180/api/v33/clusters/upstream/services/hdfs/roleConfigGroups/hdfs-NAMENODE-BASE/config?view=FULL
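When exploring the available parameters, it helps to filter that large FULL payload down to names and descriptions. A small sketch of doing so with requests (assuming, as before, the default credentials; the description field is part of the FULL view metadata):

import requests

CM_URL = "http://localhost:7180/api/v33"
AUTH = ("admin", "admin")

url = (f"{CM_URL}/clusters/upstream/services/hdfs"
       "/roleConfigGroups/hdfs-NAMENODE-BASE/config")
response = requests.get(url, params={"view": "FULL"}, auth=AUTH)
response.raise_for_status()
for param in response.json()["items"]:
    # each FULL-view item carries metadata such as a short description
    print(param["name"], "-", param.get("description", "")[:70])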

We notify the Cloudera Manager Server of the desired configuration values for both the NAMENODE and DATANODE role types:

$ curl -XPUT -u admin:admin -H "content-type:application/json" -d @cm-nn-config http://localhost:7180/api/v33/clusters/upstream/services/hdfs/roleConfigGroups/hdfs-NAMENODE-BASE/config
$ cat cm-nn-config
{
  "items" : [ {
    "name" : "dfs_name_dir_list",
    "value" : "/data/hdfs/namenode"
  }, {
    "name" : "namenode_java_heapsize",
    "value" : "1073741824"
  } ]
}
$ curl -XPUT -u admin:admin -H "content-type:application/json" -d @cm-dn-config http://localhost:7180/api/v33/clusters/upstream/services/hdfs/roleConfigGroups/hdfs-DATANODE-BASE/config
$ cat cm-dn-config
{
  "items" : [ {
    "name" : "dfs_data_dir_list",
    "value" : "/data/hdfs/datanode"
  }, {
    "name" : "datanode_java_heapsize",
    "value" : "1073741824"
  } ]
}

The above calls do not introduce new configuration to the Cloudera Manager Server; rather, they override the existing default values, which explains the HTTP PUT method used (in contrast to the POST method of the previous steps).

Making the most of the API Functionality using Ansible

The Ansible orchestration tool provides a vast collection of modules: standalone scripts that interact with the local machine or a remote system to perform specific tasks. Modules can also interact with APIs, and it is common for users to develop their own custom modules to handle such interactions.

Since we chose Ansible to perform the OS-level provisioning and the Cloudera software installation, it was only natural for us to develop our own custom module to handle the communication with the Cloudera Manager API and perform a fully automated configuration of the Cloudera cluster and services.

Cloudera Manager 6.0 introduced a Swagger-based Python API client named cm_client, which is compatible with all CM API versions and replaces the now-deprecated cm-api client.

In order to avoid such external dependencies, and also to gain a deeper understanding of the API functionality, we developed our module using the ubiquitous Python requests library, following the steps of the previous section.
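In outline, the module wraps the API calls of the previous section in a thin helper. A simplified sketch follows (the helper name api_call is ours, and the production module adds error handling and idempotency checks on top of it):

import requests

CM_URL = "http://localhost:7180/api/v33"
AUTH = ("admin", "admin")

def api_call(method, path, payload):
    """Issue a JSON request against the Cloudera Manager API."""
    response = requests.request(method, f"{CM_URL}{path}",
                                json=payload, auth=AUTH)
    response.raise_for_status()
    return response.json()

# POST creates new resources, e.g. enabling the HDFS service...
api_call("POST", "/clusters/upstream/services",
         {"items": [{"name": "hdfs", "type": "HDFS"}]})
# ...while PUT overrides existing defaults, e.g. the replication factor.
api_call("PUT", "/clusters/upstream/services/hdfs/config",
         {"items": [{"name": "dfs_replication", "value": "2"}]})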

The role of each host is declared in the INI-like static Ansible inventory file:

[hdfs-namenode]
host-master-fqdn
[hdfs-datanodes]
host-slave1-fqdn
host-slave2-fqdn

The configuration that was applied step by step in the previous section is assembled using the YAML syntax below:

name: 'hdfs'
type: 'HDFS'
cluster: 'upstream'
config:
  - name: dfs_replication
    value: 2
roles:
  - type: namenode
    hosts: {{ groups['hdfs-namenode'] }}
  - type: datanode
    hosts: {{ groups['hdfs-datanodes'] }}
roleConfigGroups:
  namenode:
    - name: dfs_name_dir_list
      value: "/data/hdfs/namenode"
    - name: namenode_java_heapsize
      value: "1073741824"
  datanode:
    - name: dfs_data_dir_list
      value: "/data/hdfs/datanode"
    - name: datanode_java_heapsize
      value: "1073741824"

Since some configuration parameters require different values per deployed cluster, we utilize Jinja2 templating to enable dynamic access to variables:

datanode:
  - name: datanode_java_heapsize
    value: "{{ datanode_heapsize_bytes }}"

As part of our Ansible tasks, we first use the template module to create a local copy of the configuration YAML file, in which all dynamic variables have been replaced with the desired values:

- name: Create Service configuration file from template
  template:
    src: 'hdfs/config.yml.j2'
    dest: '/tmp/hdfs-config.yml'
  delegate_to: localhost

This file is then provided as input to our custom module, which interacts with the Cloudera Manager API to perform the requested setup:

- name: Perform Cloudera Cluster setup (via the API)
  api_custom_module:
    conf_file: '/tmp/hdfs-config.yml'
  delegate_to: localhost
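For illustration, here is a heavily simplified skeleton of what such a module can look like (the real implementation performs all the API steps shown earlier; PyYAML parses the rendered configuration file):

from ansible.module_utils.basic import AnsibleModule
import yaml

def main():
    module = AnsibleModule(
        argument_spec=dict(conf_file=dict(type='path', required=True)),
    )
    # Load the rendered service definition produced by the template task
    with open(module.params['conf_file']) as handle:
        service = yaml.safe_load(handle)
    # ... resolve FQDNs to hostIds, create the service, assign roles and
    # apply the config groups via the Cloudera Manager API (requests) ...
    module.exit_json(changed=True,
                     msg="Configured service %s" % service['name'])

if __name__ == '__main__':
    main()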

Next steps

During the last few years Kubernetes has emerged as the de facto standard workload scheduler in the cloud. At Upstream we are responding to the momentum behind it by deploying and managing our own on-premises cluster, with the ultimate goal of migrating all our live applications to it, including our Big Data stack.

CDH does not inherently support Kubernetes deployments, yet Cloudera recently announced CDP Private Cloud, which is designed to work with Red Hat's OpenShift Kubernetes-based private cloud environment.

Combining the ability of the Cloudera API to simplify the complex and intertwined configuration of Big Data services with the efficient resource management provided by Kubernetes is a prospect that we are eager to start exploring and applying to our workloads.

Conclusion

With the process that we have developed, in order to deploy a new Cloudera cluster from scratch, all the administrator has to do is define in the Ansible inventory file which role each host will undertake, and specify the values of the cluster-specific configuration variables, such as the desired Java heap size per process.

The custom Ansible module takes care of the communication with the Cloudera Manager API, giving us the ability not only to perform a very fast first-time deployment, but also to safely update configuration values on an existing cluster, ensuring consistent configuration across all hosts and clusters.

References

https://cloudera.github.io/cm_api/docs/python-client-swagger/
http://cloudera.github.io/cm_api/docs/quick-start/
https://github.com/cloudera/cm_api/blob/master/python/examples/auto-deploy/deploycloudera.py
https://docs.cloudera.com/documentation/enterprise/latest/topics/cm_intro_automation_api.html#xd_583c10bfdbd326ba--7f25092b-13fba2465e5--7f17
https://docs.cloudera.com/documentation/enterprise/latest/topics/cm_intro_api.html#xd_583c10bfdbd326ba--7f25092b-13fba2465e5--7f20
