Apache Druid Production Setup in Google Cloud Platform with Dataproc cluster — Part 1


Apache Druid (incubating) is a real-time analytics database designed for fast slice-and-dice analytics (“OLAP” queries) on large data sets. Druid is most often used as a database for powering use cases where real-time ingest, fast query performance, and high uptime are important.

To learn how Druid is used at Zeotap, please refer to this blog from Imply: https://imply.io/post/apache-druid-helps-zeotap-master-multi-channel-attribution

In this two-part series, we will demonstrate how to set up Apache Druid for production on Google Cloud Platform and migrate Druid segments from AWS to GCP.

In Part 1, let’s see how to install and configure Apache Druid in GCP.

Infra Requirements:

1. GCE instances (4) for Druid

  • 1 for broker and router (n2-standard-8)
  • 1 for coordinator and overlord (n2-standard-8)
  • 2 for historical and middlemanager (n2-standard-32)

2. Cloud SQL instance (PostgreSQL/MySQL) for Metadata storage

3. GCE instance with Zookeeper installed

4. GCS bucket for Deep Storage

5. Dataproc Cluster
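
The GCE instances listed above could be provisioned with gcloud along these lines; this is only a rough sketch, and the instance names and zone below are placeholders, not part of the original setup.

# Query node: broker + router
$ gcloud compute instances create druid-query-1 --zone=us-central1-a --machine-type=n2-standard-8
# Master node: coordinator + overlord
$ gcloud compute instances create druid-master-1 --zone=us-central1-a --machine-type=n2-standard-8
# Data nodes: historical + middlemanager
$ gcloud compute instances create druid-data-1 druid-data-2 --zone=us-central1-a --machine-type=n2-standard-32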

Architecture:

Druid has a multi-process, distributed architecture that makes it easy to operate; each process can be configured and scaled independently.

For ease of deployment, Druid organizes processes into the following types.

Master: Coordinator and Overlord

Query: Broker and Router

Data: Historical and Middlemanager

In addition to these processes, druid also has external dependencies.

  • Deep Storage
  • Metadata Storage
  • Zookeeper
  • Hadoop Cluster

Information about these dependencies is configured in the common.runtime.properties file.

Installation:

Download the Druid binary from the Apache website and extract it on all 4 machines. In this example, I'm using the druid-0.14.1-incubating version.

# Druid
$ wget https://archive.apache.org/dist/incubator/druid/0.14.1-incubating/apache-druid-0.14.1-incubating-bin.tar.gz
$ sudo tar -xvf apache-druid-0.14.1-incubating-bin.tar.gz
$ sudo mv apache-druid-0.14.1-incubating /opt/
$ sudo ln -sfn /opt/apache-druid-0.14.1-incubating /opt/druid

Configuration:

Deep Storage:

Druid doesn't have a built-in storage mechanism, so we have to configure deep storage where Druid will store the segments.

Deep storage can be a local filesystem, NFS, HDFS, S3, GCS, or Azure Blob storage.

In this setup, we are going to use GCS as our deep storage.

# Deep storage — gcs
druid.storage.type=google
druid.google.bucket=$BUCKET
druid.google.prefix=druid/segments
# Indexing job logs
druid.indexer.logs.type=google
druid.indexer.logs.bucket=$BUCKET
druid.indexer.logs.prefix=druid/indexing-logs

The druid-google-extensions jars are required to use GCS as deep storage. You can pull the dependencies from the Apache Maven repository for the respective Druid version.

$ sudo java -cp "/opt/druid/lib/*" -Ddruid.extensions.directory="/opt/druid/extensions" -Ddruid.extensions.hadoopDependenciesDir="/opt/druid/hadoop-dependencies" org.apache.druid.cli.Main tools pull-deps --no-default-hadoop -c "org.apache.druid.extensions.contrib:druid-google-extensions:0.14.1-incubating"

Metadata storage:

Druid uses the metadata store to keep various metadata about the system, such as task history and datasource schemas.

It is recommended to use MySQL or PostgreSQL in a production setup.

# Metadata storage — PostgreSQL
druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://$DB_HOST:5432/$DBName
druid.metadata.storage.connector.user=$DBUSER
druid.metadata.storage.connector.password=$DBPASSWORD
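
As a hedged sketch, the Cloud SQL instance and the Druid metadata database could be created with gcloud roughly as follows; the instance name, tier, and region below are placeholders.

$ gcloud sql instances create druid-metadata --database-version=POSTGRES_9_6 --tier=db-custom-2-7680 --region=us-central1
$ gcloud sql databases create druid --instance=druid-metadata
$ gcloud sql users create $DBUSER --instance=druid-metadata --password=$DBPASSWORD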

Zookeeper:

Apache Druid uses Apache ZooKeeper (ZK) to manage the current cluster state. The operations that happen over ZK are:

  1. Coordinator, Overlord leader election
  2. Overlord and MiddleManager task management

It’s recommended to have a separate Zookeeper cluster in the production setup. For development purposes, we can install Zookeeper in one of the druid nodes.

# Zookeeper
druid.zk.service.host=$ZKHOST
druid.zk.paths.base=/$PATH
druid.discovery.curator.path=/$PATH/discovery
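
For a development setup, a minimal sketch of running a standalone ZooKeeper on one of the Druid nodes could look like the following; the version and paths are illustrative, and a production setup should use a dedicated ZooKeeper ensemble.

# Standalone Zookeeper (illustrative version)
$ wget https://archive.apache.org/dist/zookeeper/zookeeper-3.4.14/zookeeper-3.4.14.tar.gz
$ sudo tar -xvf zookeeper-3.4.14.tar.gz -C /opt/
$ sudo ln -sfn /opt/zookeeper-3.4.14 /opt/zookeeper
$ cp /opt/zookeeper/conf/zoo_sample.cfg /opt/zookeeper/conf/zoo.cfg
$ /opt/zookeeper/bin/zkServer.sh start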

Dataproc Cluster:

All data in Druid is organized into segments. Loading data into Druid is called ingestion or indexing; it consists of reading data from a source system, creating segments based on that data, and storing them in deep storage.

Google Cloud provides Dataproc, a fully managed cloud service for running Apache Spark and Apache Hadoop clusters.

In this setup, we are going to configure Druid to run Hadoop Batch ingestion on the Dataproc cluster.
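
If you don't already have a Dataproc cluster, a hedged example of creating one with gcloud is shown below; the cluster name, region, machine types, worker count, and image version are placeholders, and the image version should be chosen so that its Hadoop version matches the client jars pulled in the next step.

$ gcloud dataproc clusters create druid-indexing --region=us-central1 --master-machine-type=n1-standard-8 --worker-machine-type=n1-standard-8 --num-workers=2 --image-version=1.3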

Hadoop and Google extensions jar

hadoop-client:2.8.3 is the default version of the Hadoop client bundled with Druid. This works with many Hadoop distributions, but if you run into issues, you can instead have Druid load libraries that exactly match your distribution.

Our jobs have a dependency on hadoop-client:2.7.3, so I'm pulling the jars for that version.

$ sudo java -cp "/opt/druid/lib/*" -Ddruid.extensions.directory="/opt/druid/extensions" -Ddruid.extensions.hadoopDependenciesDir="/opt/druid/hadoop-dependencies" org.apache.druid.cli.Main tools pull-deps --no-default-hadoop -h "org.apache.hadoop:hadoop-client:2.7.3"

Cloud storage connector jar

This is required because the batch ingestion job will run on the Dataproc cluster, and Dataproc needs to access GCS to store the segments.

$ wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar
$ cp gcs-connector-hadoop2-latest.jar /opt/druid/lib/

Place Hadoop XMLs on Druid classpath

To run Hadoop-based ingestion, we have to copy the XML files from /etc/hadoop/conf on the Dataproc cluster and place them under the /opt/druid/conf/druid/_common/ directory on the Druid machines.

This allows Druid to find the Dataproc Hadoop cluster and submit ingestion jobs to it.
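
As a sketch, assuming SSH access to the Dataproc master node (the host name and staging directory below are placeholders), the configs can be copied over with scp:

$ mkdir -p /tmp/hadoop-conf
$ scp $DATAPROC_MASTER:/etc/hadoop/conf/*-site.xml /tmp/hadoop-conf/
$ sudo cp /tmp/hadoop-conf/*-site.xml /opt/druid/conf/druid/_common/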

Common.runtime.properties

This is an important piece of our setup, as information about all the external dependencies (Deep Storage, Metadata storage, Zookeeper) is provided here.

Let's add all the dependency properties we have seen so far.

# Extensions to be loaded
druid.extensions.loadList=["druid-histogram", "druid-datasketches", "druid-lookups-cached-global", "druid-google-extensions", "druid-avro-extensions", "postgresql-metadata-storage"]
# Extensions dir and hadoop dependencies dir
druid.extensions.directory=/opt/druid/extensions
druid.extensions.hadoopDependenciesDir=/opt/druid/hadoop-dependencies
# Zookeeper
druid.zk.service.host=$ZKHOST
druid.zk.paths.base=/$PATH
druid.discovery.curator.path=/$PATH/discovery
# Metadata storage — PostgreSQL
druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://$DB_HOST:5432/$DBName
druid.metadata.storage.connector.user=$DBUSER
druid.metadata.storage.connector.password=$DBPASSWORD
# Deep storage — gcs
druid.storage.type=google
druid.google.bucket=$BUCKET
druid.google.prefix=druid/segments
# Indexing service logs — gcs
druid.indexer.logs.type=google
druid.indexer.logs.bucket=$BUCKET
druid.indexer.logs.prefix=druid/indexing-logs
# Service discovery
druid.selectors.indexing.serviceName=druid/overlord
druid.selectors.coordinator.serviceName=druid/coordinator

Once the configuration is done, we are good to start the Druid processes on the respective instances.

sudo /opt/druid/bin/broker.sh start
sudo /opt/druid/bin/router.sh start
sudo /opt/druid/bin/overlord.sh start
sudo /opt/druid/bin/coordinator.sh start
sudo /opt/druid/bin/historical.sh start
sudo /opt/druid/bin/middleManager.sh start
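
Once the processes are up, a quick sanity check is to hit the status endpoint of each service; the host names below are placeholders and the ports assume the Druid defaults.

$ curl http://$COORDINATOR_HOST:8081/status
$ curl http://$BROKER_HOST:8082/status
$ curl http://$ROUTER_HOST:8888/status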

Common issues

1) Hadoop dependency [/opt/druid/hadoop-dependencies/hadoop-client/2.7.3] didn't exist!?

Error:

org.apache.druid.java.util.common.ISE: Hadoop dependency [/opt/druid/hadoop-dependencies/hadoop-client/2.7.3] didn't exist!?
    at org.apache.druid.initialization.Initialization.getHadoopDependencyFilesToLoad(Initialization.java:281) ~[druid-server-0.14.0-incubating.jar:0.14.0-incubating]

Solution:

This indicates that the job has a dependency on hadoop-client 2.7.3, which is not available under druid.extensions.hadoopDependenciesDir.

Add the dependencies to the Druid classpath:

sudo java -cp "/opt/druid/lib/*" -Ddruid.extensions.directory="/opt/druid/extensions" -Ddruid.extensions.hadoopDependenciesDir="/opt/druid/hadoop-dependencies" org.apache.druid.cli.Main tools pull-deps --no-default-hadoop -h "org.apache.hadoop:hadoop-client:2.7.3"

2) gcs.GoogleHadoopFileSystem not found

Error:

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195) ~[?:?]
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654) ~[?:?]
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667) ~[?:?]
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94) ~[?:?]
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703) ~[?:?]

Solution:

This indicates that the Dataproc cluster can't access GCS. Copy the gcs-hadoop-connector jar to the Druid lib path and restart the Druid services.

cp gcs-connector-hadoop2-latest.jar /opt/druid/lib/

3) Unknown provider google

Error:

Unknown provider[google] of Key[type=org.apache.druid.segment.loading.DataSegmentPusher, annotation=[none]], known options[[local]]

Solution:

This error says that the DataSegmentPusher class, which pushes segments to GCS, is only able to find "local" as a configured option, meaning it can only push data to the local disk. This is an indication that the Google Druid extension is not loaded or configured correctly, or that there is an issue with the extension itself.

Verify the jars under the druid-google-extensions folder. Ensure the jars are not corrupt and match the version of Druid.
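
A quick check, assuming the default extensions directory:

$ ls -l /opt/druid/extensions/druid-google-extensions/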

Maven repository for druid-google-extensions: https://mvnrepository.com/artifact/org.apache.druid.extensions.contrib/druid-google-extensions

$ sudo java -cp "/opt/druid/lib/*" -Ddruid.extensions.directory="/opt/druid/extensions" -Ddruid.extensions.hadoopDependenciesDir="/opt/druid/hadoop-dependencies" org.apache.druid.cli.Main tools pull-deps --no-default-hadoop -c "org.apache.druid.extensions.contrib:druid-google-extensions:0.14.1-incubating"

4) Jobs running in the Overlord don't show logs

Error:

No log was found for this task. The task may not exist, or it may not have begun running yet.

Solution:

The Overlord stores the logs of ingestion tasks in deep storage. The error indicates that the druid.indexer.logs properties are not configured properly.

Ensure the Overlord machine has access to the bucket configured in druid.indexer.logs.bucket.

# Indexing service logs — gcs
druid.indexer.logs.type=google
druid.indexer.logs.bucket=$BUCKET
druid.indexer.logs.prefix=druid/indexing-logs

5) The MR jobs triggered successfully. However, there was no matching data after completion of the MapReduce job.

Solution:

After finishing the determine_partitions_hashed MapReduce job and successfully determining the number of partitions, the Druid ingestion task fails and reports the error "No data found in bucket".

The typical cause for this issue is that no data was available in the time range mentioned in the ingestion task spec.

However, we found another cause in our case. A careful look at the MapReduce job logs showed that the job could not actually write the partitioning info to the default directory (var/druid/hadoop-tmp) determined by the druid.indexer.task.hadoopWorkingPath config in the MiddleManager's runtime properties.
Changing this path to something like /tmp/druid-ingestion solved the issue.
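
For reference, this is the property we changed in the MiddleManager's runtime.properties (the exact path is up to you):

druid.indexer.task.hadoopWorkingPath=/tmp/druid-ingestion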

Co-Author:

Nipun Jain is a Software Engineer at Zeotap. He has an undergraduate degree in Computer Science from BITS Pilani. He has a keen interest in databases and big data pipelines.
