Accessing secure HBase on HDP from IBM Analytics Engine powered by Apache Spark

Rishi S Balaji
IBM Data Science in Practice
Dec 16, 2020 · 7 min read

This blog is written in collaboration with Rachit Arora, Software Architect, IBM Analytics Engine (Cloud Pak for Data) and IBM Watson Studio Spark Environments; Dharmesh Jain, Senior Architect, IBM Analytics Engine; and Deepashree Gandhi, Lead Developer, IBM Analytics Engine.

Introduction

HBase is a column-oriented non-relational database management system that runs on top of Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases. It is well suited for real-time data processing or random read/write access to large volumes of data.

Some of the use cases relevant to HBase are:

  • daily stock market data that needs to be queried from the user interface in real time
  • insurance claim data whose schema varies based on the analytics requirements of the insurance provider
  • customer communication data that has both dynamic schema requirements (based on communication type) and real-time query requirements from the user interface

This document provides step-by-step details of how to access HBase running on secure HDP from a Spark job that is running on IBM Cloud Pak for Data.

Spark requires connectors to access HBase. The two most popular connectors in use are the Apache Spark HBase Connector (hbase-spark) and the Hortonworks Spark HBase Connector (shc-core). While the high-level steps in this document apply to both connectors, the Spark job level details refer to the SHC.
This document assumes a basic understanding of Kerberos based authentication.

Pre-requisites:
  • Product versions: HDP 3.1.0.0, IBM Cloud Pak for Data 3.5, Apache Spark 2.4 (part of IBM Analytics Engine powered by Apache Spark)
  • Kerberos needs to be enabled on HDP.
  • The keytab and principal details for the user that needs to access HBase are available. Typically this keytab is the hbase.headless.keytab.

This document assumes a basic understanding of submitting Spark jobs using IBM Analytics Engine powered by Apache Spark (hereafter referred to as IBM Analytics Engine). Please refer to the product documentation for more details on using IBM Analytics Engine.

Overview

IBM Analytics Engine supports accessing secure services on HDP through delegation tokens.

Delegation tokens complement Kerberos by providing a way to pass (delegate) client credentials to the various nodes and services involved in running a job, without each of those services having to contact a centralized Kerberos server (KDC) for authentication.

The following figure shows the high level architecture of the various components involved in the integration of IBM Analytics Engine with Hadoop.

IBM Analytics Engine and Secure Hadoop — Architectural view

As seen in the figure, the Spark job runs on a cluster on IBM Cloud Pak for Data and connects to a secure Hadoop cluster to access data from HDFS, Hive, or HBase. The focus of this document is limited to the integration with HBase.

In order to access the secure HBase, a delegation token has to be obtained using a keytab and principal that has access to HBase. This is done outside the Spark job. The token is then passed to the Spark job through the job submission payload. Thereafter, all communication between the job and the Hadoop components happens using the delegation token, without having to use a keytab. HBase works closely with Zookeeper to manage the RegionServers; this is handled internally, and no specific action has to be taken for Zookeeper authentication from a delegation token perspective.

The high-level steps involved in accessing secure HBase from IBM Analytics Engine are as follows:
1. Obtain the HBase delegation token from the HBase server using the keytab and principal of the hbase user (or any other user that has been configured to access HBase).

2. Encode the obtained token (base64)

3. Create the Spark job (providing the necessary HBase configurations).

4. Prepare the job payload and submit the Spark job.

The rest of the document explains the above steps in detail.

Steps:

1. Obtain the HBase Delegation Token:
The HBase token has to be obtained from the HBase server using the TokenUtil API. This requires the keytab and principal of the user that has access to HBase. The following sections break down the task of obtaining the token.

a. Create the HBase connection:
Create an HBaseConfiguration with the HBase and HDFS details. This is done by adding the hbase-site.xml and core-site.xml to the config. Use the Kerberos ticket cache to get the user information for the logged-in user. The ticket cache is generated by running the kinit command with the keytab for the user that has access to HBase. The code snippet below assumes the cache is available at the default location, which is listed in the /etc/krb5.conf file on the HDP servers.
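For example, the ticket cache can be created as follows before running the token generation code (the keytab path and principal shown here are placeholders for your environment):

>kinit -kt /etc/security/keytabs/hbase.headless.keytab hbase-mycluster@MY.REALM

The HBaseConfiguration can then be created and the connection established: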

Configuration config = HBaseConfiguration.create();
config.addResource(new Path("/usr/hdp/current/hadoop-client/conf/core-site.xml"));
config.addResource(new Path("/usr/hdp/current/hbase-client/conf/hbase-site.xml"));

User user = null;
try {
    UserGroupInformation.setConfiguration(config);
    // Get the logged-in user from the Kerberos ticket cache created by kinit
    user = User.create(UserGroupInformation.getUGIFromTicketCache("/tmp/krb5cc_1006", principalValue));
} catch (IOException e1) {
    e1.printStackTrace();
}

Connection conn = ConnectionFactory.createConnection(config);
Alternatively, the config can also be created by setting the following properties explicitly on the config object instead of pointing to the hbase-site.xml:

config.set("hadoop.security.authentication", "kerberos");
config.set("hbase.zookeeper.quorum", "myhdpmaster.com,myhdpworker.com");
config.set("hbase.zookeeper.property.clientPort", "2181");
config.set("zookeeper.znode.parent", "/hbase-secure");
config.set("hbase.security.authentication", "kerberos");
config.set("zookeeper.sasl.client", "false");
config.set("hbase.master.kerberos.principal", principalValue);
config.set("hbase.regionserver.kerberos.principal", principalValue);

b. Obtain the HBase token using the TokenUtil:
The TokenUtil class provides basic methods to handle HBase tokens. Use this class to obtain the token for the user that the Spark job will run as.

Token<AuthenticationTokenIdentifier> token = TokenUtil.obtainToken(conn, user);

c. Persist the Token to the file system:
The obtained token must be persisted to a file for downstream processing. Use the Credentials API to write the token to a file.

// Location on the file system where the delegation token will be persisted
Path tokenFile = new Path("tokenFile.dt");
FileSystem fs = FileSystem.get(conn.getConfiguration());

Credentials creds = new Credentials();
// If a token file already exists, load it so that existing tokens are preserved
if (fs.exists(tokenFile)) {
    creds = Credentials.readTokenStorageFile(tokenFile, conn.getConfiguration());
}

creds.addToken(new Text("hbase"), token);
creds.writeTokenStorageFile(tokenFile, conn.getConfiguration());

2. Encode the Token:
IBM Analytics Engine requires the token to be base64-encoded in the payload. Use the base64 utility (on Linux) to encode the saved token file; the -w 0 option prevents line wrapping in the output.
>base64 -w 0 <path to saved token file> > <destination path for encoded token>

Alternatively, use a simple Java method like the one below to read the file and encode the token:

public static String readHBaseDelegationToken(String tokenFilename) throws IOException {
    // Read the entire token file and return its contents as a base64-encoded string
    byte[] bytes = Files.readAllBytes(Paths.get(tokenFilename));
    return Base64.getEncoder().encodeToString(bytes);
}

3. Create the Spark Job:
Now that the delegation token has been generated, code the Spark job that needs to access HBase. There is nothing specific in the job code related to delegation tokens; the properties that are generally set for any HBase instance secured with Kerberos apply.
The following snippet shows the code that sets these properties. As seen below, the code uses the Hortonworks shc-core connector to access HBase.

// SHC options: the table catalog and the HBase connection configuration
Map<String, String> optionsMap = new HashMap<>();
String htc = HBaseTableCatalog.tableCatalog();
optionsMap.put(htc, catalog);
optionsMap.put(HBaseRelation.HBASE_CONFIGURATION(),
    "{ \"hadoop.security.authentication\": \"kerberos\", \"hbase.security.authentication\": \"kerberos\", \"hbase.zookeeper.quorum\": \"hdp-master.com,hdp-worker.com\", \"hbase.zookeeper.property.clientPort\": \"2181\", \"zookeeper.znode.parent\": \"/hbase-secure\", \"zookeeper.sasl.client\": \"false\" }");

Dataset<Row> dataset = sqlContext.read().options(optionsMap)
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load();
dataset.show();

In the above code snippet, the HBase configuration is provided in the options map. It can also be provided through the hbase-site.xml, which would need to contain the same HBase configuration details shown in the snippet. The folder containing the hbase-site.xml has to be part of the extraClassPath for the driver and the executor in the Spark job payload. The payload details are provided in the next section.
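For reference, the catalog variable passed in the options map above is the SHC table catalog: a JSON document that maps an HBase table and its column families to Spark SQL columns. The sketch below assumes a hypothetical HBase table named mytable in the default namespace with a single column family cf; adjust the table, column family, and column names to match your data.

String catalog = "{"
    // HBase table to map
    + "\"table\":{\"namespace\":\"default\", \"name\":\"mytable\"},"
    + "\"rowkey\":\"key\","
    // Spark SQL column -> HBase column family/qualifier mapping
    + "\"columns\":{"
    + "  \"rowkey\":{\"cf\":\"rowkey\", \"col\":\"key\", \"type\":\"string\"},"
    + "  \"value\":{\"cf\":\"cf\", \"col\":\"value\", \"type\":\"string\"}"
    + "}}";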

Note: The shc-core jar that ships with HDP 3.1 is built for Spark 2.3, while IBM Analytics Engine ships with Spark 2.4. Hence shc-core needs to be rebuilt against Spark 2.4 to avoid runtime conflicts. The same applies to the Apache hbase-spark connector.

4. Prepare the Payload and Submit the Job:
Make the encoded token file available to the application that generates the IBM Analytics Engine Spark job payload. The conf section of a sample payload is listed below; the rest of the payload does not require anything specific for HBase. The key items to note in this payload in the context of HBase are:
  • The spark.executor.extraClassPath and spark.driver.extraClassPath need to include the necessary HBase jars copied from the HDP master server.
  • The base64-encoded delegation token needs to be set under ae.spark.remoteHadoop.delegationToken.
  • ae.spark.remoteHadoop.isSecure needs to be set to true to indicate that HDP is secure.

"conf": {
    "spark.executor.extraClassPath": "/zen-volume-home/*:/zen-volume-home/hdp/3.1.0.0-78/hbase/lib/*",
    "spark.driver.extraClassPath": "/zen-volume-home/*:/zen-volume-home/hdp/3.1.0.0-78/hbase/lib/*",
    "ae.spark.remoteHadoop.delegationToken": "SERUUwABBWhiYXNlMgAAAC4IABIYaGJhc2UtZmNpY2x1c3RlckBGQ0kuSUJNGCUg0MbPn9kuKNDOgcDbLjAuFIJ1jbFNxC2FKEqVxc+ihXcxpqVLEEhCQVNFX0FVVEhfVE9LRU4kZWExNGZkZjEtMTU0NS00Yjg1LTk5YTAtNmEwNDc1Mjg3M2IxAA==",
    "ae.spark.remoteHadoop.isSecure": "true",
    "ae.spark.remoteHadoop.services": "HBase"
}

Once the payload is prepared, the job can be submitted using the IBM Analytics Engine job API. Behind the scenes, IBM Analytics Engine copies the passed delegation token to a file and sets the location of that file in the HADOOP_TOKEN_FILE_LOCATION variable in the Spark environment. Spark leverages the passed-in token to authenticate on behalf of the client.
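For illustration only, the sketch below shows how a job could verify that the token file supplied by IBM Analytics Engine is in place, using the standard Hadoop Credentials API. This step is not required by the product; the method name is hypothetical and intended purely for debugging.

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.token.Token;

// List the tokens in the file referenced by HADOOP_TOKEN_FILE_LOCATION (debugging only)
public static void listDelegationTokens() throws IOException {
    String tokenFileLocation = System.getenv("HADOOP_TOKEN_FILE_LOCATION");
    if (tokenFileLocation == null) {
        return; // not running with a pre-supplied token file
    }
    Credentials creds = Credentials.readTokenStorageFile(new File(tokenFileLocation), new Configuration());
    for (Token<?> t : creds.getAllTokens()) {
        // An HBASE_AUTH_TOKEN entry indicates the HBase delegation token is available
        System.out.println("Token kind: " + t.getKind() + ", service: " + t.getService());
    }
}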

Note: Delegation tokens require periodic renewal (every 24 hours by default), and the tokens themselves have a maximum lifetime of 7 days by default. Renewal and refresh of the tokens have to be handled by the application owning the Spark job.

The full listing of the payload for job submission is as follows:

{
  "engine": {
    "type": "spark",
    "conf": {
      "spark.executor.extraClassPath": "/zen-volume-home/*:/zen-volume-home/site/:/zen-volume-home/krb/lib/hbase/usr/hdp/3.1.0.0-78/hbase/lib/*",
      "spark.driver.extraClassPath": "/zen-volume-home/*:/zen-volume-home/site/:/zen-volume-home/krb/lib/hbase/usr/hdp/3.1.0.0-78/hbase/lib/*",
      "ae.spark.remoteHadoop.delegationToken": "SERUUwABBWhiYXNlMgAAAC4IABIYaGJhc2UtZmNpY2x1c3RlckBGQ0kuSUJNGCUg0MbPn9kuKNDOgcDbLjAuFIJ1jbFNxC2FKEqVxc+ihXcxpqVLEEhCQVNFX0FVVEhfVE9LRU4kZWExNGZkZjEtMTU0NS00Yjg1LTk5YTAtNmEwNDc1Mjg3M2IxAA==",
      "ae.spark.remoteHadoop.isSecure": "true",
      "ae.spark.remoteHadoop.services": "HBase"
    },
    "size": {
      "num_workers": 1,
      "worker_size": {
        "cpu": 1,
        "memory": "1g"
      },
      "driver_size": {
        "cpu": 1,
        "memory": "1g"
      }
    },
    "volumes": [
      {
        "volume_name": "fciiappvolume",
        "source_path": null,
        "mount_path": "/zen-volume-home/"
      }
    ]
  },
  "application_jar": "/zen-volume-home/Hbase-0.0.1-SNAPSHOT.jar",
  "application_arguments": [
    "10", "from xml hummingbird"
  ],
  "main_class": "com.ibm.hbaseshc.HBaseApp"
}

The following is the complete code listing for the token generation. Use the token generated by this code in the payload listed above, after base64-encoding it (see the readHBaseDelegationToken method in the Encode the Token section of this document).

public static void main(String[] args) {
    String principalValue = args[0];
    String keytabFile = args[1];

    Configuration config = HBaseConfiguration.create();
    config.addResource(new Path("/usr/hdp/current/hadoop-client/conf/core-site.xml"));
    config.addResource(new Path("/usr/hdp/current/hbase-client/conf/hbase-site.xml"));

    User user = null;
    try {
        UserGroupInformation.setConfiguration(config);
        // Get the logged-in user from the Kerberos ticket cache created by kinit
        user = User.create(UserGroupInformation.getUGIFromTicketCache("/tmp/krb5cc_1006", principalValue));
    } catch (IOException e1) {
        e1.printStackTrace();
    }

    try {
        Connection conn = ConnectionFactory.createConnection(config);
        // Generate the delegation token and persist it to a file
        generateToken(conn, user);
    } catch (IOException e) {
        System.out.println("IO Exception:" + e.getMessage());
        e.printStackTrace();
    }
}

public static void generateToken(Connection conn, User user) {
    try {
        Token<AuthenticationTokenIdentifier> tokenHbase = TokenUtil.obtainToken(conn, user);

        // Location on the file system where the delegation token will be persisted
        Path tokenFile = new Path("tokenFile.dt");
        FileSystem fs = FileSystem.get(conn.getConfiguration());

        Credentials creds = new Credentials();
        // If a token file already exists, load it so that existing tokens are preserved
        if (fs.exists(tokenFile)) {
            creds = Credentials.readTokenStorageFile(tokenFile, conn.getConfiguration());
        }

        creds.addToken(new Text("hbase"), tokenHbase);
        creds.writeTokenStorageFile(tokenFile, conn.getConfiguration());
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Conclusion:
This document showed how to use delegation tokens to access secure HBase from IBM Analytics Engine.

Rishi is an Application Architect at IBM Cloud and Cognitive Software.