Connecting your Visualization Software to Hadoop on Google Cloud

David G. Cueva Tello
Google Cloud - Community
16 min read · Apr 30, 2020

Part 2 — Hands-on

Overview

This document is the second part of a pair that helps you build an end-to-end solution to give data analysts secure access to data using business intelligence (BI) tools. The solution uses familiar open source tools from the Hadoop ecosystem and Tableau as the BI tool.

In the first part, you focused on the architecture definition, its components, and their interactions. In this article, you step through the process of setting up the components of the architecture that make up the end-to-end Hive topology on Google Cloud.

This article is intended for operators and IT administrators setting up the environment that provides data and processing capabilities to the BI tools used by data analysts. Throughout the article, you will use “Mary” as the fictitious user identity of a data analyst. This user identity is centralized in the LDAP directory used by both Apache Knox and Apache Ranger. You can also configure LDAP groups, but that task falls outside of the scope of this article.

Objectives

  • Create an end-to-end setup that allows a BI tool to use data from a Hadoop environment.
  • Authenticate and authorize user requests.
  • Set up and use secure communication channels between the BI tool and the cluster.

Costs

This article uses billable components of Google Cloud, including:

  • Dataproc
  • Cloud SQL
  • Cloud Key Management Service (KMS)
  • Cloud Storage

Before you begin

Set up the project

1. In the Cloud Console, on the project selector page, select or create a Cloud project.

Note: If you don’t plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

Go to the project selector page

2. Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.

3. In the Cloud Console, activate Cloud Shell.

Activate Cloud Shell

At the bottom of the Cloud Console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Cloud SDK already installed, including the gcloud command-line tool, and with values already set for your current project. It can take a few seconds for the session to initialize.

4. Enable the Cloud APIs for Dataproc, Cloud SQL, and Cloud Key Management Service (KMS):
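For example, you can enable the three APIs with a single gcloud command; these are the standard API service names, but verify them against your project:

    gcloud services enable \
      dataproc.googleapis.com \
      sqladmin.googleapis.com \
      cloudkms.googleapis.com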

5. In Cloud Shell, set environment variables with the ID of your project and the region and zone where the Dataproc clusters will be located:
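A minimal sketch of these exports; the project ID is read from your gcloud configuration, and the region and zone values are just examples:

    export PROJECT_ID=$(gcloud info --format='value(config.project)')
    export REGION=us-central1
    export ZONE=us-central1-b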

Feel free to choose a different region and zone, but keep them constant throughout the scripts in this article.

Set up a service account

1. In Cloud Shell, create a service account that will be used by the different products involved with visualization security.
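For instance, the following creates a service account; the name cluster-service-account is an assumption reused throughout the remaining examples:

    gcloud iam service-accounts create cluster-service-account \
      --display-name="Cluster service account" \
      --description="Service account for the visualization solution clusters"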

2. Add the following roles to the service account:
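The role list itself is not reproduced here; the only role the article later confirms is roles/cloudkms.cryptoKeyDecrypter. The binding pattern might look like the following sketch; add the remaining roles that your clusters need (for example, Cloud SQL and Cloud Storage access) in the same way:

    # Grant the KMS decrypter role mentioned later in this article.
    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
      --member="serviceAccount:cluster-service-account@${PROJECT_ID}.iam.gserviceaccount.com" \
      --role=roles/cloudkms.cryptoKeyDecrypter
    # Repeat the command for each additional role required by the solution.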

Creating the backend cluster

Create the Ranger database instance

1. Run the following command to create a MySQL instance to store the Apache Ranger policies. The command creates an instance called cloudsql-mysql with the db-n1-standard-1 machine type, located in the region specified by the ${REGION} variable. See the Cloud SQL documentation for more information.
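A sketch of the command, assuming the instance name and machine type given above:

    gcloud sql instances create cloudsql-mysql \
      --tier=db-n1-standard-1 \
      --region=${REGION}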

2. Set the instance password for the user root connecting from any host ('root'@'%'). You can use your own password or the one provided as an example below. Make sure you use a minimum of eight characters including at least a letter and a number.
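For example, using the sample password mysql-root-password-99 (an assumption; substitute your own):

    gcloud sql users set-password root \
      --host=% \
      --instance=cloudsql-mysql \
      --password=mysql-root-password-99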

Encrypt the passwords

In this section you create a cryptographic key to encrypt the passwords for Ranger and MySQL. To prevent exfiltration, the key lives within the Key Management Service (KMS) and you cannot view, extract, or export the key bits themselves.

You use the key to encrypt the passwords and write them into files. Then you upload these files into a Cloud Storage bucket so they are accessible to the service account acting on behalf of the clusters.

The service account can decrypt those files because it has the cloudkms.cryptoKeyDecrypter role and access to the files and the cryptographic key. In the unlikely case a file is exfiltrated, an attacker would not be able to decrypt it without the role and key.

As an additional security measure you create separate password files for each service to minimize the blast radius in case a password itself is exfiltrated.

For more information about key management, see the KMS documentation.

1. In Cloud Shell, create a Key Management Service (KMS) keyring to hold your keys:
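A sketch, assuming a keyring named my-keyring in the global location:

    gcloud kms keyrings create my-keyring --location=global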

2. Create a Key Management Service (KMS) cryptographic key to encrypt your passwords:
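For example, assuming a key named my-key on the keyring created above:

    gcloud kms keys create my-key \
      --location=global \
      --keyring=my-keyring \
      --purpose=encryption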

3. Encrypt your Ranger admin user’s password using the key. You can use your own password or the one provided as an example below. Make sure you use a minimum of eight characters including at least a letter and a number.

4. Encrypt your Ranger database admin user’s password using the key:

5. Encrypt your MySQL root password using the key:
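Steps 3 through 5 follow the same pattern, sketched below. The passwords and output file names are assumptions; only the keyring and key names must match what you created above:

    # Step 3: Ranger admin password
    echo "ranger-admin-password-99" | \
      gcloud kms encrypt \
        --location=global --keyring=my-keyring --key=my-key \
        --plaintext-file=- \
        --ciphertext-file=ranger-admin-password.encrypted

    # Step 4: Ranger database admin password
    echo "ranger-db-admin-password-99" | \
      gcloud kms encrypt \
        --location=global --keyring=my-keyring --key=my-key \
        --plaintext-file=- \
        --ciphertext-file=ranger-db-admin-password.encrypted

    # Step 5: MySQL root password (the one you set earlier)
    echo "mysql-root-password-99" | \
      gcloud kms encrypt \
        --location=global --keyring=my-keyring --key=my-key \
        --plaintext-file=- \
        --ciphertext-file=mysql-root-password.encrypted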

6. Create a Cloud Storage bucket to store encrypted password files:

7. Upload the encrypted password files to the Cloud Storage bucket.
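Steps 6 and 7 might look like this; the bucket name ${PROJECT_ID}-ranger is an assumption reused in later examples:

    gsutil mb -l ${REGION} gs://${PROJECT_ID}-ranger
    gsutil -m cp *.encrypted gs://${PROJECT_ID}-ranger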

Create the cluster

In this section you create a backend cluster with Ranger support. For more information about the Ranger optional component in Dataproc, see the Dataproc Ranger Component documentation page.

1. In Cloud Shell, create a Cloud Storage bucket to store the Apache Solr audit logs:
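For example, assuming a bucket named ${PROJECT_ID}-ranger-audit-logs:

    gsutil mb -l ${REGION} gs://${PROJECT_ID}-ranger-audit-logs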

2. Export all the variables required to create the cluster. Some variables that were set before are repeated here for convenience, so you can modify them as needed.
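A sketch of the exports, assuming the resource names used in the earlier examples:

    export PROJECT_ID=$(gcloud info --format='value(config.project)')
    export REGION=us-central1
    export ZONE=us-central1-b

    export BACKEND_CLUSTER=backend-cluster

    export RANGER_KMS_KEY_URI=projects/${PROJECT_ID}/locations/global/keyRings/my-keyring/cryptoKeys/my-key
    export RANGER_ADMIN_PWD_URI=gs://${PROJECT_ID}-ranger/ranger-admin-password.encrypted
    export RANGER_DB_ADMIN_PWD_URI=gs://${PROJECT_ID}-ranger/ranger-db-admin-password.encrypted
    export MYSQL_ROOT_PWD_URI=gs://${PROJECT_ID}-ranger/mysql-root-password.encrypted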

The new variables are:

  • The name of the backend cluster
  • The URI of the cryptographic key so that the service account can decrypt the passwords
  • The URI of the files containing the encrypted passwords

If you have used a different keyring, key, or file names, replace them in the corresponding variables.

3. Run the following command to create the backend Dataproc cluster:
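A hedged sketch of the command. The dataproc: and hive: cluster properties follow the Dataproc Ranger component documentation, but treat the exact property set (the solr.gcs.path property in particular) as an assumption to verify; the three hive: properties at the end are the HTTP-mode settings discussed below:

    # Assemble the cluster properties; the last three are the Hive HTTP-mode settings.
    PROPERTIES="dataproc:ranger.kms.key.uri=${RANGER_KMS_KEY_URI}"
    PROPERTIES+=",dataproc:ranger.admin.password.uri=${RANGER_ADMIN_PWD_URI}"
    PROPERTIES+=",dataproc:ranger.db.admin.password.uri=${RANGER_DB_ADMIN_PWD_URI}"
    PROPERTIES+=",dataproc:ranger.cloud-sql.instance.connection.name=${PROJECT_ID}:${REGION}:cloudsql-mysql"
    PROPERTIES+=",dataproc:ranger.cloud-sql.root.password.uri=${MYSQL_ROOT_PWD_URI}"
    PROPERTIES+=",dataproc:solr.gcs.path=gs://${PROJECT_ID}-ranger-audit-logs"
    PROPERTIES+=",hive:hive.server2.thrift.http.port=10000"
    PROPERTIES+=",hive:hive.server2.thrift.http.path=cliservice"
    PROPERTIES+=",hive:hive.server2.transport.mode=http"

    gcloud dataproc clusters create ${BACKEND_CLUSTER} \
      --optional-components=SOLR,RANGER \
      --region=${REGION} \
      --zone=${ZONE} \
      --enable-component-gateway \
      --scopes=default,sql-admin \
      --service-account=cluster-service-account@${PROJECT_ID}.iam.gserviceaccount.com \
      --properties="${PROPERTIES}"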

The three hive: properties at the end configure HiveServer2 in HTTP mode, so that Apache Knox can call Apache Hive through HTTP.

These are other parameters that appear in the command:

  • --optional-components=SOLR,RANGER enables Apache Ranger and its Solr dependency.
  • --enable-component-gateway (optional) enables the Dataproc Component Gateway to make the Ranger and other Hadoop User Interfaces available directly from the cluster page in Cloud Console without the need for SSH tunneling to the backend master node.
  • --scopes=default,sql-admin authorizes Apache Ranger to access its Cloud SQL database.

It is possible to include additional parameters and properties to create an external Hive metastore that persists beyond the lifetime of any given cluster, and is usable across multiple clusters. See the Using Apache Hive on Cloud Dataproc documentation for more information.

Note that the gcloud dataproc jobs submit hive commands use Hive binary transport and are not compatible with HiveServer2 configured in HTTP mode. Therefore, you must run the table creation examples in that documentation directly in Beeline.

Create a sample Hive table

1. In Cloud Shell, create a Cloud Storage bucket to store a sample Apache Parquet file:

2. Copy a publicly available sample Parquet file into your bucket:
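Steps 1 and 2 might look like this. The bucket name ${PROJECT_ID}-hive and the public source file path are assumptions; substitute the file referenced in the original article if yours differs:

    gsutil mb -l ${REGION} gs://${PROJECT_ID}-hive
    # Assumed public sample Parquet file:
    gsutil cp gs://hive-solution/part-00000.parquet \
      gs://${PROJECT_ID}-hive/dataset/transactions/part-00000.parquet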

3. Connect using SSH to the master node of the backend cluster that you created in the previous section.
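A sketch, assuming the backend cluster name used earlier:

    gcloud compute ssh --zone=${ZONE} ${BACKEND_CLUSTER}-m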

The name of your cluster master node is the name of the cluster followed by -m. For HA clusters the name has an additional suffix.

If this is the first time connecting to your master node from Cloud Shell, you will be prompted to generate SSH keys.

Alternatively, you can connect using SSH to your master node from the Cloud Console under Dataproc / Cluster Details / VM Instances.

4. Once in the SSH command prompt, connect to the local HiveServer2 using Apache Beeline, which is pre-installed on the master node.
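A sketch of the connection, assuming HiveServer2 is in HTTP mode on port 10000 as configured when you created the cluster:

    beeline -u "jdbc:hive2://localhost:10000/;transportMode=http;httpPath=cliservice" \
      -n admin \
      --hivevar PROJECT_ID=$(gcloud info --format='value(config.project)')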

This command starts the Beeline CLI and passes the name of your Google Cloud project in a variable, so that later statements can reference it.

Note that Hive is not performing any user authentication, but it requires a user identity to perform most tasks. The admin user here is a default user configured in Hive. User authentication for requests coming from the BI tools is handled by the identity provider configured later in this article with Apache Knox.

5. In the Beeline CLI, create a table using the Parquet file previously copied in your Hive bucket.
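A sketch of the table definition, assuming the bucket from the previous steps and the column names used elsewhere in this article (submissionDate, transactionAmount, transactionType):

    CREATE EXTERNAL TABLE transactions
      (SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
      STORED AS PARQUET
      LOCATION 'gs://${PROJECT_ID}-hive/dataset/transactions';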

6. Verify that the table was created correctly:
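For example:

    SELECT *
      FROM transactions
      LIMIT 10;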

7. Exit the Beeline CLI:

8. Take note of the internal DNS name of the backend master. You use this name in the next section as <backend-master-internal-dns-name> to configure the Knox topology and later to configure a service in Ranger.
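One way to print it from the master node:

    hostname -f   # prints the fully qualified internal DNS name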

9. Exit the SSH command line:

Creating the proxy cluster

In this section you create the proxy cluster with the Apache Knox Initialization Action.

Create a topology

1. In Cloud Shell, clone the Dataproc initialization-actions GitHub repository.
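For example:

    git clone https://github.com/GoogleCloudDataproc/initialization-actions.git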

2. Create a topology for the backend cluster:
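A sketch of this step; the source descriptor name is an assumption, but the target name must be hive-us-transactions.xml as described below:

    cd ~/initialization-actions/knox
    # Copy one of the sample topology descriptors shipped with the initialization
    # action (the source file name here is an assumption) to the new topology name:
    cp topologies/example-hive-nonpii.xml topologies/hive-us-transactions.xml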

Apache Knox uses the name of the file as the URL path for the topology. Here you change the name to represent a topology called hive-us-transactions, which accesses the fictitious transaction data that you loaded into Hive in the previous section.

3. Edit the topology file:

Take a moment to observe the topology descriptor file. This file defines one topology that points to one or more backend services. Two services are configured with sample values: WebHDFS and HIVE. The file also defines the authentication provider for the services in this topology and authorization ACLs.

4. Apache Knox provides coarse authorization at the service level through ACLs. Add the data analyst sample LDAP user identity “mary” so that she can access the Hive backend service through Knox.

5. Change the HIVE url to point to the backend cluster Hive service. You can find the HIVE service definition at the very bottom of the file, right under the WebHDFS service.

Replace the <backend-master-internal-dns-name> placeholder with the internal DNS name of the backend cluster that you obtained in the previous section.
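After steps 4 and 5, the relevant parts of the descriptor might look like the following sketch. The ACL parameter name and format follow the standard Knox AclsAuthz provider (users;groups;IP ranges), and the HiveServer2 HTTP port of 10000 matches the cluster properties set earlier; treat both as assumptions to verify against your file:

    <param>
       <name>hive.acl</name>
       <value>admin,mary;*;*</value>
    </param>
    ...
    <service>
       <role>HIVE</role>
       <url>http://<backend-master-internal-dns-name>:10000/cliservice</url>
    </service>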

6. Save the file and close the editor.

At this point you can create additional topologies by repeating the steps in this section, and creating one independent XML descriptor per topology.

In a subsequent section you copy these files into a Cloud Storage bucket. To create new topologies or change them after you create the proxy cluster, modify the files and upload them again to the bucket. The initialization action creates a cron job that regularly copies changes from the bucket to the proxy cluster.

Configure the SSL/TLS certificate

When clients communicate with Apache Knox they will use an SSL/TLS certificate. The initialization action can generate a self-signed certificate or you can provide your CA-signed certificate.

To generate a self-signed certificate follow these instructions:

1. In Cloud Shell, edit the Apache Knox general configuration file:

2. Replace HOSTNAME with the external DNS name of your master node as the value of the certificate_hostname attribute. For demonstration purposes, use localhost. Using localhost requires you to create an SSH tunnel between your local machine and the proxy cluster, which is covered later in this article.
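A sketch of the edit, assuming the file name used by the Knox initialization action:

    # From ~/initialization-actions/knox (the assumed clone location):
    vi knox-config.yaml   # assumed name of the Knox general configuration file
    # Inside the file, set for example:
    #   certificate_hostname: localhost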

Note that this file also contains the master_key that is used to encrypt the certificates that BI tools will use to communicate with the proxy cluster. By default this key is the word secret.

3. Save the file and close the editor.

If you are providing your own certificate instead, you can specify it in the custom_cert_name property.

Spin up the proxy cluster

1. In Cloud Shell, create a Cloud Storage bucket to provide the configurations from the previous section to the Knox initialization action:

2. Copy all the files from the Knox initialization action folder into the bucket:
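Steps 1 and 2 might look like this; the bucket name is an assumption:

    gsutil mb -l ${REGION} gs://${PROJECT_ID}-knox-config
    # Assumes you cloned the repository into your Cloud Shell home directory:
    gsutil -m cp -r ~/initialization-actions/knox/* gs://${PROJECT_ID}-knox-config/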

3. Export all the variables required to create the cluster. Some variables that were set before are repeated here for convenience so you can modify them as needed.
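A sketch, assuming the resource names from the previous steps:

    export PROJECT_ID=$(gcloud info --format='value(config.project)')
    export REGION=us-central1
    export ZONE=us-central1-b

    export PROXY_CLUSTER=proxy-cluster
    # Assumed location of your Knox configuration from the previous section:
    export KNOX_GW_CONFIG=gs://${PROJECT_ID}-knox-config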

4. Create the proxy cluster:
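A hedged sketch of the command. The initialization-action path and the knox-gw-config metadata key follow the public Knox initialization action, but verify both against the repository you cloned:

    gcloud dataproc clusters create ${PROXY_CLUSTER} \
      --region=${REGION} \
      --zone=${ZONE} \
      --service-account=cluster-service-account@${PROJECT_ID}.iam.gserviceaccount.com \
      --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/knox/knox.sh \
      --metadata="knox-gw-config=${KNOX_GW_CONFIG}"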

Verify connection through proxy

1. After the proxy cluster is created, connect to its master node using SSH from Cloud Shell:
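For example:

    gcloud compute ssh --zone=${ZONE} ${PROXY_CLUSTER}-m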

2. Run a query from the proxy cluster master node SSH command prompt:
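A sketch of the command that the list below analyzes. The trust store path and its password are assumptions based on the Knox initialization action defaults (the password is the master_key you saw earlier); adjust them to your installation:

    # Build the JDBC URL piece by piece so each element is easy to see.
    JDBC_URL="jdbc:hive2://localhost:8443/"
    JDBC_URL+=";ssl=true;sslTrustStore=/usr/lib/knox/data/security/keystores/gateway-client.jks;trustStorePassword=secret"
    JDBC_URL+=";transportMode=http;httpPath=gateway/hive-us-transactions/hive"

    beeline -u "${JDBC_URL}" \
      -e "SELECT SubmissionDate, TransactionType FROM transactions LIMIT 10;" \
      -n admin -p admin-password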

Let’s analyze the command:

  • The beeline command uses localhost instead of the DNS internal name because the certificate that you generated when you configured Knox specifies localhost as the host name. If you are using your own DNS name or certificate, use the corresponding host name.
  • The port is 8443, which corresponds to the Apache Knox default SSL port.
  • The next line enables SSL, and provides the path and password for the SSL Trust Store to be used by client applications such as Beeline.
  • The transportMode line should look familiar. It indicates that the request should be sent over HTTP, and provides the path for the HiveServer2 service. Note that the path is composed of the keyword gateway, followed by the topology name that you defined in a previous section, followed by the service name configured in said topology, in this case hive.
  • With the -e parameter you provide the query to run on Hive. If omitted, you will open an interactive session in the Beeline CLI.
  • With the -n and -p parameters you provide a user identity and password. In this case you are using the default Hive admin user. In the next sections you create an analyst user identity and set up credentials and authorization policies for this user.

Add user to authentication store

By default Knox includes an authentication provider based on Apache Shiro configured with BASIC authentication against an ApacheDS LDAP store. In this section you will add a sample data analyst user identity “mary” to this authentication store.

  1. In the proxy master node SSH command prompt, install the LDAP utils:
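For example, on the Debian-based Dataproc image:

    sudo apt-get install -y ldap-utils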

2. Create an LDAP Data Interchange Format (LDIF) file for the new user “mary”:
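A sketch of the file, assuming the base DN of the demo LDAP directory that ships with Apache Knox (dc=hadoop,dc=apache,dc=org) and a sample password; verify the DN against your directory:

    export USER_ID=mary

    printf '%s\n' \
      "# entry for user ${USER_ID}" \
      "dn: uid=${USER_ID},ou=people,dc=hadoop,dc=apache,dc=org" \
      "objectclass: top" \
      "objectclass: person" \
      "objectclass: organizationalPerson" \
      "objectclass: inetOrgPerson" \
      "cn: ${USER_ID}" \
      "sn: ${USER_ID}" \
      "uid: ${USER_ID}" \
      "userPassword: ${USER_ID}-password" \
      > new-user.ldif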

3. Add the user ID to the LDAP directory:
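A sketch, assuming the demo LDAP listens on port 33389 and its admin bind password is admin-password (both assumptions to verify):

    ldapadd -f new-user.ldif \
      -D 'uid=admin,ou=people,dc=hadoop,dc=apache,dc=org' \
      -w 'admin-password' \
      -H ldap://localhost:33389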

The -D parameter specifies the distinguished name (DN) to bind when accessing the directory, which must be a user identity already in the directory, in this case the user admin.

4. Verify that the new user was added:
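For example:

    ldapsearch -b dc=hadoop,dc=apache,dc=org \
      -D 'uid=admin,ou=people,dc=hadoop,dc=apache,dc=org' \
      -w 'admin-password' \
      -H ldap://localhost:33389 \
      '(uid=mary)'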

5. Take note of the internal DNS name of the proxy master. You use this name in the next section as <proxy-master-internal-dns-name> to configure the LDAP synchronization.

6. Exit the SSH command line

Setting up authorization

Sync user identities into Ranger

To make sure that Ranger policies apply to the same user identities used by Knox, configure the Ranger UserSync daemon to sync the identities from the same directory that Knox is using.

In this example you connect to the local LDAP that is available by default with Apache Knox, but in a production environment you should set up an external identity directory. You can find more information in the Apache Knox user guide and in the Cloud Identity, Managed Active Directory and Federated AD documentation from Google Cloud.

1. Connect using SSH to the master node of the backend cluster that you created previously.

2. Edit the UserSync configuration file:

3. Set the values of the following LDAP properties. Make sure you are modifying the user properties and not the group properties, which have similar names.
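The file is typically /etc/ranger/usersync/conf/ranger-ugsync-site.xml on the backend master node (an assumption). The edited values might look like the following sketch; the property names are the standard Ranger UserSync LDAP settings, but confirm them against your file, which already contains the full list:

    <property>
      <name>ranger.usersync.sync.source</name>
      <value>ldap</value>
    </property>
    <property>
      <name>ranger.usersync.ldap.url</name>
      <value>ldap://<proxy-master-internal-dns-name>:33389</value>
    </property>
    <property>
      <name>ranger.usersync.ldap.binddn</name>
      <value>uid=admin,ou=people,dc=hadoop,dc=apache,dc=org</value>
    </property>
    <property>
      <name>ranger.usersync.ldap.ldapbindpassword</name>
      <value>admin-password</value>
    </property>
    <property>
      <name>ranger.usersync.ldap.user.searchbase</name>
      <value>dc=hadoop,dc=apache,dc=org</value>
    </property>
    <property>
      <name>ranger.usersync.source.impl.class</name>
      <value>org.apache.ranger.ldapusersync.process.LdapUserGroupBuilder</value>
    </property>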

Replace the <proxy-master-internal-dns-name> placeholder with the internal DNS name of the proxy server, which you obtained in the last section.

Note that this is a subset of a full LDAP configuration that syncs both users and groups. See this documentation for more information.

4. Save the file and close the editor.

5. Restart the Ranger UserSync daemon:
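For example (the service name is an assumption to verify on your cluster):

    sudo systemctl restart ranger-usersync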

You can verify that the user identities, including the one for the data analyst Mary, are synced correctly in the log files under /var/log/ranger-usersync/.

Create Ranger policies

Configure the Ranger service

1. In the master node SSH command prompt, edit the Ranger Hive configuration:

2. Edit the <value> of the ranger.plugin.hive.service.name property:
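The configuration file is likely /etc/hive/conf/ranger-hive-security.xml (an assumption consistent with the file name mentioned later in this article). The edited property might look like:

    <property>
      <name>ranger.plugin.hive.service.name</name>
      <value>ranger-hive-service-01</value>
    </property>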

3. Save the file and close the editor.

4. Restart the HiveServer2 Admin service. You are ready to create Ranger policies.
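For example (the systemd unit name is an assumption to verify on your cluster):

    sudo systemctl restart hive-server2.service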

Set up the service in the Ranger Admin UI

1. In a browser, navigate to the Dataproc page in Cloud Console.

Click your backend cluster name and then click Web Interfaces.

Because you created your cluster with Component Gateway, you should then see a list of the Hadoop components that are installed in your cluster.

Click the Ranger link to open the Ranger UI.

If you prefer not to create your cluster with Component Gateway, you can create an SSH tunnel from your workstation into the backend master node. The Ranger UI is available on port 6080. You will need to create a firewall rule to open ingress to that port.

2. Log in to Ranger with the user admin and the Ranger admin password that you previously defined. The Ranger UI shows the Service Manager screen with a list of services.

3. Click the plus sign in the HIVE group to create a new Hive service.

4. Fill out the form with the following values:

  • Service name: ranger-hive-service-01. This is the name you previously defined in the ranger-hive-security.xml configuration file.
  • Username: admin
  • Password: admin-password
  • jdbc.driverClassName: leave the default org.apache.hive.jdbc.HiveDriver
  • jdbc.url: jdbc:hive2://<backend-master-internal-dns-name>:10000/;transportMode=http;httpPath=cliservice

Replace the <backend-master-internal-dns-name> placeholder with the name you obtained in a previous section.

5. Click the Add button.

Keep in mind that each Ranger plugin installation supports only one Hive service. An easy way to configure additional Hive services is to spin up additional backend clusters, each with its own Ranger plugin. These clusters can share the same Ranger DB, so that you have a unified view of all the services whenever you access the Ranger Admin UI from any of those clusters.

Set up a Ranger policy

This policy allows the sample analyst LDAP user “mary” access to specific columns of the Hive table:

1. Back on the Service Manager screen, click the name of the service you just created.

Ranger Admin shows the Policies screen.

2. Click the Add New Policy button.

With this policy you give Mary permission to see only the submissionDate and transactionType columns of the transactions table.

3. Fill out the form with the following values:

  • Policy name: any name, for example allow-tx-columns
  • Database: default
  • Table: transactions
  • Hive column: submissionDate, transactionType

Allow conditions:

  • Select user: mary
  • Permissions: select

4. Click Add at the bottom of the screen.

Test the policy with Beeline

1. Back in the master node SSH command prompt, enter the Beeline CLI with the user “mary”. Note that the password is not enforced.
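For example:

    beeline -u "jdbc:hive2://localhost:10000/;transportMode=http;httpPath=cliservice" \
      -n mary -p mary-password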

2. Run the following query to verify that Ranger blocks it. The query includes the transactionAmount column, which Mary is not allowed to select:
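For example:

    SELECT *
      FROM transactions
      LIMIT 10;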

You should get a Permission denied error.

3. Run the following query to verify it is allowed by Ranger:
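For example:

    SELECT submissionDate, transactionType
      FROM transactions
      LIMIT 10;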

4. Exit the Beeline CLI.

5. Exit the SSH command line.

6. Back on the Ranger UI, click the Audit tab on top. You should see the Denied and Allowed events. You can filter the events by the service name you previously defined, for example: ranger-hive-service-01.

Connecting from a BI tool

The final step in this article is to query the Hive data from BI tools such as Tableau and Looker. This article uses Tableau as the example BI tool, and assumes that you have Tableau Desktop running on your workstation.

Create a firewall rule

  1. Take note of your public IP address.
  2. In Cloud Shell, create a firewall rule that opens TCP port 8443 for ingress from your workstation.
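A sketch; the rule and tag names are assumptions:

    gcloud compute firewall-rules create allow-knox \
      --project=${PROJECT_ID} \
      --direction=INGRESS \
      --action=ALLOW \
      --rules=tcp:8443 \
      --target-tags=knox-gateway \
      --source-ranges=<your-public-ip>/32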

Replace the <your-public-ip> placeholder with your public IP.

3. Apply the network tag from the firewall rule to the proxy cluster master node:
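For example, using the tag from the firewall rule above:

    gcloud compute instances add-tags ${PROXY_CLUSTER}-m \
      --zone=${ZONE} \
      --tags=knox-gateway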

4. Take note of the proxy cluster master node external IP address; you will use it in the next steps:
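One way to print it:

    gcloud compute instances describe ${PROXY_CLUSTER}-m \
      --zone=${ZONE} \
      --format='value(networkInterfaces[0].accessConfigs[0].natIP)'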

Create an SSH tunnel

This step is only necessary if you are using a self-signed certificate valid for localhost. If you are using your own certificate or your master node has its own external DNS name, you can skip this step.

1. In Cloud Shell, generate the command to create the tunnel:
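A sketch that prints a command you can copy to your workstation; the ssh options forward local port 8443 to the proxy master node:

    echo "gcloud compute ssh ${PROXY_CLUSTER}-m \
      --project ${PROJECT_ID} \
      --zone ${ZONE} \
      -- -L 8443:localhost:8443 -N -f"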

2. Install the Google Cloud SDK on your workstation.

3. Run gcloud init to authenticate your user account and grant access permissions.

4. Open a terminal in your workstation.

5. Create an SSH tunnel to forward port 8443.

6. Copy the command generated in the first step and paste it into the workstation terminal. Run the command.

7. Leave the terminal open so the tunnel remains active.

Connect to Hive

1. Install the Hive ODBC driver on your workstation.

2. Open Tableau Desktop, or restart it if it was open.

3. On the home screen under Connect / To a Server, select More…

4. Search for and select Cloudera Hadoop.

5. Fill out the fields as follows:

  • Server: localhost if you created a tunnel, or your master node external DNS name if not.
  • Port: 8443
  • Type: HiveServer2
  • Authentication: Username and Password
  • Username: mary
  • Password: mary-password
  • HTTP Path: gateway/hive-us-transactions/hive
  • Require SSL: yes

Note that you are using the sample data analyst LDAP user mary as the user identity.

6. Click Sign In.

Query Hive data

1. On the Data Source screen, click Select Schema and search for default.

2. Double-click the default schema name.

The Table panel loads.

3. In the Table panel, double-click New Custom SQL.

The Edit Custom SQL window opens.

4. Enter the following query:
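For example, the two columns that Mary is allowed to see:

    SELECT submissionDate, transactionType FROM transactions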

5. Click OK.

The metadata for the query is retrieved from Hive.

6. Click the Update Now button.

Tableau retrieves the data from Hive.

7. Now try to select all columns from the transactions table. In the Table panel, double-click New Custom SQL again. The Edit Custom SQL window opens.

8. Enter the following query:
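For example:

    SELECT * FROM transactions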

9. Click OK. You get an error message:

Permission denied: user [mary] does not have [SELECT] privilege on [default/transactions/*].

This is expected because Mary is not authorized by Ranger to read the transactionAmount column. This way you can limit what data your Tableau users can access.

To be able to see all the columns, repeat the steps with the user admin.

Conclusion

In these articles you have seen how you can access Hive data from a BI tool. You have configured Tableau to communicate through a secure channel to Apache Knox running on a Dataproc cluster, and you have authenticated a user with an LDAP directory. Finally, you have set up a Hive service that can process the request in Dataproc backend clusters, further protecting the data by means of authorization policies defined in Apache Ranger.
