Query data in Google Cloud Storage with SQL using Apache Drill

Sushil Kumar · Published in The Startup · May 4, 2020

Google Cloud users are no strangers to BigQuery. It's a petabyte-scale serverless data warehouse with a SQL interface, blazing-fast queries and per-query pricing, and it's one of the most differentiating tools in Google's offerings. It can read data from external storage systems like Google Cloud Storage or even Google Sheets, giving more flexibility to end users. It's based on Google's Dremel paper, which was published in 2010.

However, for organizations looking to avoid cloud vendor lock-in or trying to keep their workloads portable, BigQuery is not an option, which is where Apache Drill, an open-source implementation of Google's Dremel paper, comes into the picture.

Drill can also query other data stores, such as HBase, MongoDB, Kafka and object storage (S3, GCS, Azure Blob), within the same query, allowing you to join data from different systems without creating a unified data storage layer. Drill also has a web-based query UI and allows clients to connect via JDBC drivers, making it dead easy to integrate with visualization tools such as Tableau.

In this post, I'm going to share how to set up an Apache Drill cluster on GCP and query data stored in GCS using a SQL interface.

Let's get started.

Table of Contents

  1. Prerequisites
  2. Deploying cluster using Terraform
  3. Configuring GCS storage plugin
  4. Query the data
  5. Limitations
  6. Conclusion

1. Prerequisites

You'll need the following tools to follow along with the post.

  1. Terraform, to deploy our cluster using Infrastructure as Code (IaC). Get an introduction to Terraform here.
  2. VS Code (or IDE/Editor of your preference)
  3. Git client installed, to clone the Terraform code repository
  4. Google Cloud Platform account
  5. Service Account (and its JSON Key) with Project Owner role, to be used by Terraform to deploy the resources.
  6. GCS Hadoop Connector JAR file; hit the download icon and select the shaded version of the JAR.
  7. Service Account (and its JSON Key) with a Storage Owner role, to be used by the Apache Drill cluster to query the data stored in GCS buckets (a sketch of the gcloud commands for creating both service accounts follows this list).
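If you prefer the command line, here is a minimal sketch of how both service accounts and their JSON keys could be created with gcloud. The account names (terraform-deployer, drill-gcs-reader), the key file names and the exact role bindings (roles/owner for Terraform and roles/storage.admin for the Drill account, since there is no GCP role literally called "Storage Owner") are my assumptions; adjust them to your project.

# Assumes gcloud is authenticated; PROJECT_ID is a placeholder for your GCP project ID.
PROJECT_ID="<GCP-PROJECT-ID>"

# Service account for Terraform (Step 5) with the Project Owner role.
gcloud iam service-accounts create terraform-deployer --project "${PROJECT_ID}"
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member "serviceAccount:terraform-deployer@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role "roles/owner"
gcloud iam service-accounts keys create terraform-deployer.json \
  --iam-account "terraform-deployer@${PROJECT_ID}.iam.gserviceaccount.com"

# Service account for Drill (Step 7) with storage access.
gcloud iam service-accounts create drill-gcs-reader --project "${PROJECT_ID}"
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member "serviceAccount:drill-gcs-reader@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role "roles/storage.admin"
gcloud iam service-accounts keys create drill-gcs-reader.json \
  --iam-account "drill-gcs-reader@${PROJECT_ID}.iam.gserviceaccount.com"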

2. Deploying cluster using Terraform

Once you are done with the prerequisites, go ahead and clone the GitHub repository where I've pushed the Terraform code for the deployment.

git clone https://github.com/kaysush/apache-drill-terraform.git

Before you can create the infrastructure, there are a few manual steps and changes that you need to make.

Setting up external dependencies

  1. Create a bucket in your GCP project.
  2. Copy the GCS Hadoop Connector JAR and the Service Account JSON key (Steps 6 and 7 of the prerequisites) to the bucket (see the gsutil sketch just after this list).
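Here is a rough gsutil sketch of these two steps, assuming gsutil is authenticated. The bucket name, location, file names and the dependencies/ subpath are placeholders of mine; whatever subpaths you choose here become the JSON-KEY-SUBPATH and CONNECTOR-JAR-SUBPATH values in the next section.

# Create the bucket (name and location are placeholders).
gsutil mb -l us-central1 gs://<BUCKET-NAME>

# Copy the connector JAR and the Drill service account key (Steps 6 and 7 of the prerequisites).
gsutil cp gcs-connector-shaded.jar gs://<BUCKET-NAME>/dependencies/
gsutil cp drill-gcs-reader.json gs://<BUCKET-NAME>/dependencies/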

Open the code that you have cloned in your editor.

Setting up Terraform service account

Go to the provider.tf file and change the following section.

provider "google" {
region = "us-central1"
project = "groovy-momentum-269710"
credentials = file("<PATH-TO-SERVICE-ACCOUNT-JSON-FILE>")
}

Replace <PATH-TO-SERVICE-ACCOUNT-JSON-FILE> with the path to the file where you have kept the Service Account JSON Key (Step 5 of the prerequisites), and set project to your own GCP project ID.

Configuring external dependencies in Terraform

Go to the main.tf file and change the following section.

data "template_file" "core_site_xml" {
template = file("${path.module}/templates/core-site.xml.tpl")
vars = {
gcp_project = "<GCP-PROEJCT-ID>"
}
}
data "template_file" "start_drill" {
template = file("${path.module}/templates/start_drill.sh.tpl")
vars = {
zookeeper_hostname = module.zookeeper.zookeeper_hostname
core_site_content = data.template_file.core_site_xml.rendered
bucket_name = "<BUCKET-NAME>"
sa_key_subpath = "<JSON-KEY-SUBPATH>"
connector_jar_subpath = "<CONNECTOR-JAR-SUBPATH>"
}
}

Make changes to the following placeholders.

GCP-PROJECT-ID: ID of the GCP project you are deploying to.

BUCKET-NAME: Name of the bucket created while setting up the external dependencies.

JSON-KEY-SUBPATH: Path within the bucket where you placed the JSON key for the service account.

CONNECTOR-JAR-SUBPATH: Path within the bucket where you placed the GCS Hadoop connector JAR.

Now we are ready to deploy our cluster.

Let's initialize our module:

terraform init

You should see the following output.

Terraform initialization output

Your provider versions may vary depending on when you are reading this post.

Next, let's see what changes our Terraform code will make. Run the plan command to see the changes.

terraform plan

You should see a long output with a summary at the end.

Plan: 5 to add, 0 to change, 0 to destroy.

This means that our Terraform code will create 5 resources.

Next, let's apply these changes. Run the apply command.

terraform apply --auto-approve

Wait for it to finish, and then go to your GCP dashboard to see your VMs there.

Apply complete! Resources: 5 added, 0 changed, 0 destroyed.

The Apache Drill machines are up, but it will take 3–4 minutes for Apache Drill itself to become available. You can access the Drill Web UI using the external IP address of any of the Drill VMs.

http://<EXTERNAL-IP-OF-ANY-DRILL-VM>:8047
Apache Drill Web UI

If you see the web UI listing 3 drillbits, your Apache Drill cluster has started successfully.
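If you prefer checking from a terminal, Drill also exposes cluster information over its REST API. This is just a quick sketch; the /cluster.json endpoint and the shape of its response may vary with the Drill version you are running.

# List the drillbits registered with the cluster via Drill's REST API.
curl -s http://<EXTERNAL-IP-OF-ANY-DRILL-VM>:8047/cluster.json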

3. Configuring GCS storage plugin

Now that our cluster is up and running, let's configure it to read data from GCS.

  1. In the Web UI, go to the Storage tab and, under Plugin Management, hit Create.
  2. Give the plugin whatever name you want. I'm using gcs as the plugin name.
  3. In the configuration box, paste the following JSON.
{
  "type": "file",
  "connection": "gs://<BUCKET-NAME>",
  "config": null,
  "workspaces": {
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    },
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": ["tbl"],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": ["csv"],
      "extractHeader": true,
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": ["tsv"],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json",
      "extensions": ["json"]
    },
    "avro": {
      "type": "avro"
    },
    "sequencefile": {
      "type": "sequencefile",
      "extensions": ["seq"]
    },
    "csvh": {
      "type": "text",
      "extensions": ["csvh"],
      "extractHeader": true,
      "delimiter": ","
    }
  },
  "enabled": true
}

This configures the bucket from which the plugin can read, the workspaces within that bucket (essentially paths inside the bucket) and the file formats the plugin can read.

Hit Create to save your plugin.

We have now configured our Apache Drill cluster to read from and write data to GCS.
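If you would rather script this step than click through the Web UI, Drill's REST API can also register the storage plugin. Here's a rough sketch, assuming you have saved the configuration above in a local file named gcs-plugin.json, wrapped as {"name": "gcs", "config": { ... }}; the exact request shape can differ between Drill versions, so check the REST docs for yours.

# Register the GCS storage plugin on one of the drillbits via the REST API.
curl -X POST -H "Content-Type: application/json" \
  -d @gcs-plugin.json \
  http://<EXTERNAL-IP-OF-ANY-DRILL-VM>:8047/storage/gcs.json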

4. Query the data

In this section we'll query data stored in GCS. Before we begin, let's put some data in the bucket that we wish to query.

CSV

Download a sample CSV file from here. Place the FL_insurance_sample.csv file in the root of the bucket configured in the plugin.

Next, go to the Query tab and run the following query.

SELECT * from gcs.`root`.`FL_insurance_sample.csv` LIMIT 100;

The format of this query is given below.

SELECT * from `plugin`.`workspace`.`file-path`

You should see the results.

Projection query result

You can try other, more complex queries, such as aggregations.

SELECT County, count(*) as PerCountyCount from gcs.`root`.`FL_insurance_sample.csv` group by County
Aggregation Query

Parquet

Download the sample Parquet files from here. Place all 5 files in a folder named userdata in the root of your bucket.

Go to the Query tab and run the following queries. We'll start by querying one file and then query all the files together.

SELECT * from gcs.`root`.`userdata/userdata1.parquet`
Query result for 1 file — 1000 entries

Next, we'll query all 5 files in the same query.

SELECT count(*) as Total_Count from gcs.`root`.`userdata`
Query across all files — 5000 entries

Writing data to GCS

Not only can you query the data, you can also write the results back, giving you the option to build lightweight SQL-based ETL pipelines.

One gotcha while working with non-root workspaces is that the folder the workspace points to must already exist. So if you are using the tmp workspace, which points to the /tmp path, you'll have to create that folder first.

In GCP you can use gsutil to upload a dummy file to create the folder structure.

touch dummy.txt
gsutil cp dummy.txt gs://<BUCKET-NAME>/tmp/

Note the trailing /; it's very important.

CREATE TABLE gcs.`tmp`.`/etl_out/aggregates`
AS
SELECT County, count(*) as PerCountyCount
from gcs.`root`.`FL_insurance_sample.csv`
group by County

You should see the following output.

Create Table success output

You can also verify that the files were indeed written by going to the Storage Browser.

File in Storage Browser

You can also verify by querying the results back.

SELECT * FROM gcs.`tmp`.`/etl_out/aggregates`

You should see the following output.

Querying a table created within Drill.

With the new power vested in you by the Drill gods, you can now implement SQL-based ETL workflows by scheduling jobs, for example with the Apache Airflow JDBC Operator.
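If you don't want to wire up JDBC, the same statements can also be submitted over Drill's REST query endpoint, which any scheduler capable of making an HTTP call can drive. A minimal sketch using curl and the aggregation CTAS from above, writing to a new output path so it doesn't clash with the table created earlier (the drillbit IP is a placeholder):

# Submit the aggregation CTAS to Drill over REST; a scheduled job could run exactly this.
curl -X POST -H "Content-Type: application/json" \
  -d '{"queryType": "SQL", "query": "CREATE TABLE gcs.`tmp`.`/etl_out/aggregates_daily` AS SELECT County, count(*) as PerCountyCount FROM gcs.`root`.`FL_insurance_sample.csv` GROUP BY County"}' \
  http://<EXTERNAL-IP-OF-ANY-DRILL-VM>:8047/query.json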

5. Limitations

If you compare Apache Drill to BigQuery, you'll find that although they share the same underlying principles, BigQuery as a product has some features that Apache Drill does not.

  1. Elastic compute: With Apache Drill you do not have elastic compute capabilities; you are limited by the number of compute nodes running. You can scale the cluster at any time, but already-running queries cannot take advantage of the newly added nodes.
  2. Limited read bandwidth: With Apache Drill on GCS you are limited by the data-retrieval speeds of GCS, as opposed to BigQuery's internal Colossus storage and the petabit-bandwidth Jupiter network that connects it.
  3. No IAM integration: With Apache Drill there is no integration with IAM to control access to data and views, whereas BigQuery supports column-level ACLs on tables and views.
  4. Pricing: You pay for compute instances running 24x7, as opposed to BigQuery's per-query pricing.

6. Conclusion

Apache Drill gives you the flexibility to own your infrastructure and keep it portable across clouds and data centers.

In some ways it's also more flexible than BigQuery, in the sense that it can query multiple data sources and combine data from all of them. Getting data out of BigQuery can also be pricey, as every read query has a cost attached to it.

Choosing between BigQuery and Apache Drill is not an easy decision; there are trade-offs to be made, and the final call really depends on which trade-offs are acceptable in the final system.

If you find any discrepancies in any part of the post or have any questions in general, feel free to drop a comment.

Till then, Happy Coding :)
