Connecting your Visualization Software to Hadoop on Google Cloud

David G. Cueva Tello
Google Cloud - Community
9 min read · Apr 30, 2020

Part 1 — Architecture

Overview

As an IT admin, you want to give your data analysts secure access to familiar Business Intelligence (BI) tools so they can derive insights from the data efficiently. You also want to leverage the computing power and flexibility that the cloud offers, and you want to minimize costs. This article shows how to fulfill these requirements using familiar open source tools from the Hadoop ecosystem, such as Apache Hive, Apache Ranger, and Apache Knox. You’ll see how to deploy these tools on Dataproc to provide fast, easy, and more secure data analytics processing in the cloud.

This article is intended for operators and IT administrators who are interested in setting up secure data access for data analysts using BI tools such as Tableau and Looker, while leveraging familiar open source tools and Google Cloud. This article does not offer guidance for data analysts on how to use those BI tools or guidance for developers who want to interact with Dataproc APIs.

This document is the first part of a pair that helps you build an end-to-end solution to give data analysts secure access to data using BI tools. It includes:

  • An architecture of the components
  • A high-level view of the boundaries, interactions, and networking
  • A high-level view of authentication and authorization in the architecture

The second part walks you through a step-by-step process to set up the architecture on Google Cloud.

Architecture

Here is a diagram of the architecture:

Architecture diagram

At a high level, the architecture works as follows:

  1. Client applications connect through JDBC to a single entry point on a Dataproc cluster. The entry point is provided by Apache Knox, which is installed on the cluster master node. The communication with Apache Knox is secured by TLS.
  2. Apache Knox delegates authentication through a provider to a system such as an LDAP directory.
  3. After authentication, Apache Knox routes the user request to one of possibly multiple backend clusters. The routes and configuration are defined as custom topologies.
  4. A service such as Apache Hive, listening on the selected backend cluster, takes the request.
  5. Apache Ranger intercepts the request, validates the user’s authorization, and determines whether processing should proceed.
  6. If the validation succeeds, the backend service processes the request and returns the results.
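
To make this flow concrete, the following shows the general shape of the URL that a client application calls. The host and topology name are placeholders, and Knox’s default gateway port of 8443 and gateway path of gateway are assumed:

```
https://<proxy-master-host>:8443/gateway/<topology-name>/hive
```

The path segment after gateway/ selects the topology, and the topology determines which backend service ultimately receives the request.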

Components

The most important components of the architecture are:

  • Dataproc: Dataproc is Google Cloud’s managed Apache Spark and Apache Hadoop service, which lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc is the platform that underpins the solution described in this article.
  • Apache Knox: Apache Knox acts as a single HTTP access point for all the underlying services in a Hadoop cluster. Apache Knox is designed as a reverse proxy with pluggable providers for authentication, authorization, audit, and other services. Clients send requests to Knox. Based on the request URL and parameters, Knox routes the request to the appropriate Hadoop service. Knox is the centerpiece of the architecture, because it is the entry point that transparently handles client requests and hides complexity.
  • Apache Ranger: Apache Ranger provides fine-grained authorization for users to perform specific actions on Hadoop services. It also implements auditing of user access and administrative actions.
  • Apache Hive: Apache Hive is data warehouse software that enables access and management of large datasets residing in distributed storage using SQL. Apache Hive parses the SQL queries, performs semantic analysis, and builds a Directed Acyclic Graph (DAG) of stages to be executed by a processing engine. In the architecture in this article, Hive acts as the translation point between the user requests and one of possibly multiple processing engines. Apache Hive is ubiquitous in the Hadoop ecosystem, and it opens the door for practitioners familiar with standard SQL to perform data analysis.
  • Apache Tez: Apache Tez is the processing engine in charge of executing the DAGs prepared by Hive and subsequently returning the results.
  • Apache Spark: Apache Spark is a unified analytics engine for large-scale data processing that supports general execution of DAGs and can be used by Hive. The architecture shows the Spark SQL component of Apache Spark to demonstrate the flexibility of the approach presented in this article. Spark SQL’s JDBC/ODBC server was originally derived from HiveServer2 and enables the execution of SQL queries on Apache Spark. One caveat, however, is that Spark SQL does not have official Ranger plugin support, so authorization must be done through coarse-grained ACLs in Apache Knox.

Deep dive

The following sections take a deeper look into each one of the components that participate in the flow and their interactions.

Client applications

These applications include tools that can send requests to an HTTPS REST endpoint, but don’t necessarily support the Dataproc Jobs API. BI tools such as Tableau and Looker have HiveServer2 and Spark SQL JDBC drivers that can send requests through HTTP.

In this article we assume that client applications are external to Google Cloud, executing, for example, on an analyst workstation, on-premises, or on another cloud. Therefore, the communication between the client applications and Apache Knox must be secured with an SSL/TLS certificate, either CA-signed or self-signed.
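
As a sketch of what such a connection looks like from the client side, the following Java snippet opens a JDBC session to HiveServer2 through Knox using the Apache Hive JDBC driver. The host name, topology name, truststore, and credentials are placeholders, not values from this article:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class KnoxHiveClient {
  public static void main(String[] args) throws Exception {
    // Explicit driver load for older setups; JDBC 4+ auto-registers it.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Hypothetical host, topology, and truststore; replace with your own.
    // transportMode=http and httpPath route the request through Knox
    // instead of directly to the HiveServer2 Thrift port.
    String url = "jdbc:hive2://knox.example.com:8443/;"
        + "ssl=true;sslTrustStore=/path/to/gateway-client.jks;"
        + "trustStorePassword=changeit;"
        + "transportMode=http;httpPath=gateway/hive-us-transactions/hive";

    try (Connection conn = DriverManager.getConnection(url, "analyst", "password");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT 1")) {
      while (rs.next()) {
        System.out.println(rs.getInt(1));
      }
    }
  }
}
```

BI tools such as Tableau and Looker build an equivalent URL from their connection dialogs; the transportMode and httpPath parameters are what send the traffic over HTTP through the Knox gateway.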

Entry point and user authentication

The proxy clusters are one or more long-lived Dataproc clusters with the main purpose of hosting the Apache Knox Gateway.

Apache Knox acts as the single entry point for client requests. It is installed on the proxy cluster master node. Knox performs SSL termination, delegates user authentication, and forwards the request to one of the backend services.

Each backend service is configured in what Knox calls a topology. The topology descriptor defines how authentication is delegated for a service, the URI for the backend service to forward requests to, and simple per-service authorization Access Control Lists (ACLs).

Apache Knox enables authentication integration with enterprise and cloud identity management systems. User authentication is configurable per topology using authentication providers. By default, Knox leverages Apache Shiro to authenticate against a local demo ApacheDS LDAP server, but Knox can also use Kerberos. In the diagram, you can see an Active Directory server hosted on Google Cloud outside of the cluster as an example.
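
As an illustration only, here is a minimal sketch of a topology descriptor that uses the Shiro provider with Knox’s demo LDAP settings and forwards requests to a backend HiveServer2. The file name, backend host, and LDAP parameters are assumptions to be replaced with your own, and a production topology would typically also configure identity assertion and authorization providers:

```xml
<!-- Hypothetical descriptor, e.g. /etc/knox/conf/topologies/hive-us-transactions.xml -->
<topology>
  <gateway>
    <!-- Delegate authentication to an LDAP directory through Apache Shiro.
         These parameters match the demo ApacheDS server that ships with Knox. -->
    <provider>
      <role>authentication</role>
      <name>ShiroProvider</name>
      <enabled>true</enabled>
      <param>
        <name>main.ldapRealm</name>
        <value>org.apache.knox.gateway.shirorealm.KnoxLdapRealm</value>
      </param>
      <param>
        <name>main.ldapRealm.userDnTemplate</name>
        <value>uid={0},ou=people,dc=hadoop,dc=apache,dc=org</value>
      </param>
      <param>
        <name>main.ldapRealm.contextFactory.url</name>
        <value>ldap://localhost:33389</value>
      </param>
      <param>
        <name>main.ldapRealm.contextFactory.authenticationMechanism</name>
        <value>simple</value>
      </param>
    </provider>
  </gateway>

  <!-- Forward authenticated requests to HiveServer2 on a backend cluster,
       assumed here to run in HTTP transport mode on its default port. -->
  <service>
    <role>HIVE</role>
    <url>http://backend-cluster-m:10001/cliservice</url>
  </service>
</topology>
```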

Connecting Knox to an enterprise authentication service such as an external ApacheDS server or Microsoft Active Directory (AD) is outside of the scope of these articles. You can find more information in the Apache Knox user guide and in the Managed Active Directory or Federated AD documentation from Google Cloud.

For the use case in this article, you don’t have to use Kerberos as long as Apache Knox acts as the single gatekeeper to the proxy and backend clusters.

Processing engines

The backend clusters are the Dataproc clusters hosting the services that perform the processing of user requests. Dataproc clusters can automatically scale the number of workers to meet the demand from your analyst team; no manual reconfiguration is required.

The backend clusters should preferably be long-lived, so that they can serve ad-hoc requests from data analysts without interruption. However, these clusters can also be job-specific, also known as ephemeral, if they only need to serve requests for a brief time or as a cost-saving strategy. In that case, to avoid modifying the topology configuration, make sure that ephemeral clusters are re-created in the same zone and under the same name, so that Knox can route requests transparently using the master node’s internal DNS name when the ephemeral cluster is brought back.
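
For example, re-creating an ephemeral backend cluster could look like the following sketch, where the cluster name, region, and zone are placeholders. Keeping them stable preserves the master’s internal DNS name (of the form <cluster-name>-m) that the Knox topologies point to:

```
gcloud dataproc clusters create backend-cluster \
    --region=us-central1 \
    --zone=us-central1-a
```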

HiveServer2 (HS2) is responsible for servicing user queries made to Apache Hive. It is built using the Apache Thrift framework, so it is sometimes called the Hive Thrift Server. It can be configured to use various execution engines such as the Hadoop MapReduce engine, Apache Tez, and Apache Spark. In this article, HS2 is configured to use the Apache Tez engine.
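
The engine that HS2 uses is selected through the hive.execution.engine property in hive-site.xml. As a minimal illustration (the property accepts mr, tez, or spark):

```xml
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>
```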

Spark SQL is a module of Apache Spark, which includes a JDBC/ODBC interface to execute SQL queries on Apache Spark. In the architecture diagram it is shown as an alternative for servicing user queries. Spark SQL is derived from the work done on HS2, and is also built using the Thrift framework. It is sometimes called the Spark Thrift Server.

The processing engines, either Tez or Spark, call the YARN Resource Manager to execute their DAGs on the cluster worker machines. Finally, these worker machines access the data. For storing and accessing the data in a Dataproc cluster, use the Cloud Storage connector, not HDFS. See the connector documentation for more information about the benefits.
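
With the connector in place, Hive tables can point directly at Cloud Storage through gs:// URIs. The following HiveQL is a sketch with a hypothetical bucket and schema:

```sql
-- The bucket, path, and columns are placeholders for illustration.
CREATE EXTERNAL TABLE transactions (
  submission_date DATE,
  transaction_amount DOUBLE,
  transaction_type STRING
)
STORED AS PARQUET
LOCATION 'gs://my-warehouse-bucket/datasets/transactions';
```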

The architecture diagram illustrates one Apache Knox topology to forward requests to Apache Hive, another to forward requests to Spark SQL, and other topologies that can forward requests to services in the same or different backend clusters. The backend services can process different datasets. For example, one Hive instance can offer access to personally identifiable information (PII) for a restricted set of users, while another Hive instance can offer access to non-PII data for broader consumption.

User authorization

Apache Ranger can be installed on the backend clusters to provide fine-grained authorization for Hadoop services. In the architecture, a Ranger plugin for Hive intercepts the user requests and determines if a user is allowed to perform an action over Hive data, based on Ranger policies.

As an administrator, you define the Ranger policies using the Ranger Admin UI. These policies are stored in an external Cloud SQL database. Externalizing the policies has two advantages: the policies persist even if any of the backend clusters are deleted, and they can be managed centrally for all backend clusters or for custom groups of them.
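
Besides the Admin UI, Ranger also exposes a REST API for managing policies. As a rough sketch, a policy that lets an analysts group run SELECT queries against one Hive table might look like the following JSON body, where the service name, database, table, and group are hypothetical:

```json
{
  "service": "hive-backend",
  "name": "analysts-read-transactions",
  "resources": {
    "database": { "values": ["default"] },
    "table":    { "values": ["transactions"] },
    "column":   { "values": ["*"] }
  },
  "policyItems": [
    {
      "groups": ["analysts"],
      "accesses": [ { "type": "select", "isAllowed": true } ]
    }
  ]
}
```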

To assign Ranger policies to the correct user identities or groups, you must configure Ranger to sync the identities from the same directory that Knox is connected to. By default, Ranger takes its user identities from the operating system.

Apache Ranger can also externalize its audit logs to Cloud Storage to make them persistent. These audits are searchable by leveraging Apache Solr.

Note that, unlike HiveServer2, Spark SQL does not have official Ranger plugin support, so its authorization must be managed through the coarse-grained ACLs available in Apache Knox. To use these ACLs, add the LDAP identities that are allowed to use each specific service, such as Spark SQL or Hive, in the corresponding topology descriptor for the service.
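
As a hedged sketch, these ACLs are expressed through Knox’s AclsAuthz provider inside the topology descriptor. The parameter name follows the pattern <service-role>.acl, and the value has the format users;groups;IP-addresses. In a Spark SQL topology where the HIVE service role points at the Spark Thrift Server, allowing any member of a hypothetical analysts LDAP group could look like this:

```xml
<provider>
  <role>authorization</role>
  <name>AclsAuthz</name>
  <enabled>true</enabled>
  <param>
    <!-- any user name; must belong to the analysts group; from any IP address -->
    <name>hive.acl</name>
    <value>*;analysts;*</value>
  </param>
</provider>
```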

High availability

Dataproc provides a high availability (HA) mode. In this mode, several machines are configured as master nodes, one of which is in the active state. HA mode allows uninterrupted YARN and HDFS operations despite single-node failures or reboots.

However, if the master node fails, the external IP of the single entry point changes, and you would need to reconfigure the BI tool connections. When you run Dataproc in HA mode, you should therefore configure an external HTTP load balancer as the entry point. The load balancer routes requests to an unmanaged instance group that bundles your cluster master nodes. As an alternative to a fully fledged load balancer, you can apply a round-robin DNS technique instead, but consider its drawbacks before doing so. These configurations are outside of the scope of these articles.

Cloud SQL also provides a high availability mode, with data redundancy made possible by synchronous replication between a primary instance and a standby instance located in different zones.

Cloud Storage’s inherent availability, accessibility, latency, durability, and redundancy are described in its storage class documentation.

Networking

In a layered network architecture, the proxy clusters live in a perimeter network (also known as Demilitarized Zone, or DMZ), isolated from the other clusters, because they are exposed to external requests. Firewall rules give ingress access to the proxy clusters only from a restricted set of source IP addresses, which correspond to the BI tools.

The backend clusters, on the other hand, live in an internal network protected by firewall rules that only let through incoming traffic from the proxy clusters.

The configuration of layered networks is outside of the scope of these articles. The hands-on instructions use the default network only. For more information on layered network setups, see the best practices for VPC Network Security, and the overview and examples on how to configure multiple network interfaces.

Next steps

Continue reading the second part of this article, which walks you step by step through the process to set up this architecture on Google Cloud.
