CDAP Services for Apache Ambari


October 21, 2015

Chris Gianelloni is a DevOps Engineer at Cask and works on automating all the things to improve developer productivity, including building a self-service cluster architecture with Coopr. Previously, he worked at several startups and as a consultant at large companies such as Apple and Yahoo. Chris has been an open source developer for over a decade and has contributed to multiple projects over his career.

Cask is excited to announce easy CDAP integration for Apache Ambari users. Previously, we introduced you to integration with Cloudera Manager. This post will familiarize you with CDAP integration with Apache Ambari, the open source provisioning system for the Hortonworks Data Platform (HDP).

Adding the CDAP service to Ambari

To install CDAP on a cluster managed by Ambari, we provide packages for RHEL-compatible and Ubuntu systems, which are installed onto the Ambari management server. This package adds CDAP to the list of services that Ambari can install. To install the cdap-ambari-service package, first add the appropriate CDAP repository to your system’s package manager by following the procedure in the CDAP manual for installing the Cask repository on your Ambari server.

The repository version must match the CDAP version you’d like installed on your cluster. To get the CDAP 3.0 series, you would install the CDAP 3.0 repository. The default is CDAP 3.2, which has the widest compatibility with Ambari-supported Hadoop distributions.
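As a sketch of what the repository setup looks like on a YUM-based Ambari server (the baseurl below is a placeholder; take the real URL and GPG settings from the CDAP manual):

```shell
# Sketch: generate a YUM repo definition for the CDAP 3.2 series.
# The baseurl is illustrative only -- substitute the one from the CDAP manual.
CDAP_VERSION=3.2
REPO_FILE=cask.repo   # on the Ambari server this file belongs in /etc/yum.repos.d/

cat > "${REPO_FILE}" <<EOF
[cask]
name=Cask Packages
baseurl=http://repository.example.com/centos/6/x86_64/cdap/${CDAP_VERSION}
enabled=1
gpgcheck=0
EOF

# The version in the baseurl is what pins the CDAP series Ambari will install.
grep baseurl "${REPO_FILE}"
```

Changing `CDAP_VERSION` to `3.0` or `3.1` would pin the corresponding series instead; enable GPG checking as described in the manual for production use.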

Supported distributions

CDAP Version    Hadoop Distributions
CDAP 3.0.x      HDP 2.0, HDP 2.1
CDAP 3.1.x      HDP 2.0, HDP 2.1, HDP 2.2
CDAP 3.2.x      HDP 2.0, HDP 2.1, HDP 2.2, HDP 2.3

The CDAP Ambari service has been tested on Ambari Server 2.0 and 2.1, as supplied from Hortonworks.

Installing via APT

$ sudo apt-get install -y cdap-ambari-service
$ sudo ambari-server restart

Installing via YUM

$ sudo yum install -y cdap-ambari-service
$ sudo ambari-server restart

Adding CDAP to your cluster

Dependencies

CDAP depends on certain services being present on the cluster. There are core dependencies, which must be running for CDAP system services to operate correctly, and optional dependencies, which may be required for certain functionality or program types.

The host running the CDAP Master service must have the HDFS, YARN, and HBase clients installed, as CDAP uses these command-line clients for initialization and for obtaining connectivity information for its external service dependencies. Also, CDAP currently requires Internet access on the CDAP service nodes until CDAP-3957 or AMBARI-13456 is resolved.
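A quick sanity check for the client requirement above, run on the host that will carry the CDAP Master service, could look like this (a minimal sketch; it only checks that the client launchers are on the PATH, not that they are configured correctly):

```shell
# Verify that the Hadoop command-line clients CDAP Master needs are on the PATH.
missing=""
for client in hdfs yarn hbase; do
  if ! command -v "$client" >/dev/null 2>&1; then
    missing="$missing $client"
  fi
done

if [ -n "$missing" ]; then
  echo "Missing clients:$missing"
else
  echo "All required clients found"
fi
```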

Core Dependencies

  • HDFS, used as the backing file system for distributed storage
  • MapReduce2, used for batch operations in workflows and data exploration
  • YARN, used for running system services in containers on cluster NodeManagers
  • HBase, used for system runtime storage and queues
  • ZooKeeper, used for service discovery and leader election

Optional Dependencies

  • Hive, used for data exploration using SQL queries via CDAP Explore system service
  • Spark, used for running Spark programs within CDAP applications

Installing CDAP

1. In the Ambari UI, start the Add Service Wizard.

2. Select CDAP from the list and click Next. If any core dependencies are not installed on the cluster, Ambari will prompt you to install them.

3. Next, we will assign CDAP services to hosts.

CDAP consists of four daemons:

  • Kafka Server, used for storing CDAP metrics and CDAP system service log data
  • Master, coordinator service which launches CDAP system services into YARN
  • Router, serves HTTP endpoints for CDAP applications and REST API
  • UI, web interface to CDAP and Cask Hydrator (for CDAP 3.2 installations)

It is recommended to install all CDAP services onto an edge node (or, for smaller clusters, the NameNode). After assigning the CDAP services to hosts, click Next.

4. Select hosts for the CDAP CLI client. It should be installed on every edge node of the cluster, or on the same node as the CDAP services for smaller clusters.

5. Click Next to continue with customizing CDAP.

6. On the Customize Services screen, click Advanced to bring up the CDAP configuration. Under Advanced cdap-env, you can configure heap sizes, and log and pid directories for the CDAP services which run on the edge nodes.

7. Under Advanced cdap-site, you can configure all options for the operation and running of CDAP and CDAP applications.

If you wish to use the CDAP Explore service to query CDAP data with SQL, you must have Hive installed on the cluster and the Hive client on the same host as CDAP, and you must set the explore.enabled option to true.
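In cdap-site terms, that setting corresponds to the following property (rendered here as the underlying XML for reference; in the Ambari UI you set it as a key/value pair under Advanced cdap-site):

```xml
<!-- cdap-site.xml: enable the CDAP Explore service (requires Hive) -->
<property>
  <name>explore.enabled</name>
  <value>true</value>
</property>
```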

For a complete explanation of these options, refer to the CDAP documentation. After making any configuration changes, click Next.

8. Review the desired service layout and click Deploy to begin installing CDAP.

9. Ambari will install CDAP and start the services. After the services are installed and started, click Next to reach the Summary screen.

10. This screen shows a summary of the changes that were made to the cluster. No services should need to be restarted following this operation.

11. Click Complete to finish the CDAP installation.

12. Now, you should see CDAP listed on the main summary screen for your cluster.

13. Selecting CDAP from the left or choosing it from the Services drop-down menu will take you to the CDAP service screen.

Congratulations! CDAP is now running on your cluster and managed by Ambari.
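As a quick smoke check beyond the Ambari service screen, you can hit the CDAP Router's ping endpoint. This is a sketch: the host name is a placeholder, and port 10000 is the CDAP 3.x Router default, so adjust both for your layout.

```shell
# Build the Router URL for a post-install smoke check.
# Host is a placeholder; 10000 is the CDAP 3.x Router default port (assumption).
CDAP_HOST=cdap-edge.example.com
ROUTER_URL="http://${CDAP_HOST}:10000"

echo "Router endpoint: ${ROUTER_URL}"
# On a live cluster, uncomment to verify the Router responds (expects HTTP 200):
# curl -s -o /dev/null -w "%{http_code}\n" "${ROUTER_URL}/ping"
```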

Roadmap and Future Features

CDAP integration with Ambari is still evolving and improving. Some features are planned for upcoming versions of the CDAP Ambari Service, including a full smoke test of CDAP functionality after installation, pre-defined alerts for CDAP services, CDAP component HA support, select CDAP metrics, support for Kerberos-enabled clusters, and integration with CDAP Auth Server.

The CDAP Ambari service definition is open source, and contributions are always welcome and encouraged!
