Google Cloud Platform POC Part 4 — Google BigQuery
We are very heavy users of Impala at work and have been using it in production for the last 4 years, pretty much ever since it went into GA. Cloudera is Impala’s main contributor, and Impala ships with Cloudera’s distribution, so it was a natural fit for us: we were already an enterprise customer of Cloudera when we went looking for a SQL-on-Hadoop solution. I did a POC on it and, within half a year, led the BI team in fully migrating our Netezza data warehouse to Impala.
Impala’s performance has been excellent, and our users are quite satisfied with its speed and ease of use when running their often complicated, very long SQL queries. We use Microstrategy as our reporting tool; it talks directly to our Impala data warehouse and serves reports to hundreds of users every day, and that, too, has been working great.
So why look into other solutions when Impala has been working so well for us? I am now tasked with moving our stack to the cloud, and while a lift-and-shift strategy is probably the easiest route, it won’t necessarily reduce our costs: we would need to maintain our cluster in the cloud 24x7, and we wouldn’t be able to take advantage of what native cloud solutions might bring. Google BigQuery, being a serverless, fully managed data warehouse, comes with all the features we need by default, at no extra charge — features that I have had to implement manually in our on-prem cluster. For example, I recently had to install Cloudera Key Trustee Server and Key Trustee KMS in order to implement encryption of data at rest. I also maintain an extra cluster that I replicate data to for the sake of disaster recovery. Wherever possible, we also configure high availability (NameNode, HAProxy, HA YARN ResourceManager). All of this overhead is taken care of by Google. Another big advantage is that we no longer have to worry about capacity planning (or unnecessarily wasting resources due to over-capacity), and there are hardly any administrative tasks such as software upgrades or hardware monitoring.
For this BigQuery POC, I did two jobs:
The first is to distcp our Impala parquet files as is to GCS, then load them into BigQuery and create BQ tables. The purpose of the first job is to see how easy it is to push existing data to BigQuery to make it queryable immediately in BQ, should that be our strategy.
The second is to run a query in BigQuery that reads the tables created in the first step and inserts the results into another table — in other words, a sort of ETL job. The purpose of the second job is to:
a) see if we can do ETL jobs in BigQuery
b) test the performance of BigQuery and benchmark it against running the same query in our production Impala, and see how many changes I have to make and how complicated the query can be. While I cannot post the SQL here, I can tell you it’s a 145-line query with 15 table joins; it uses some window analytic functions, a lot of left joins, and the common SQL functions. This SQL is taken from an actual ETL script used in production.
I won’t go over how to distcp files from your on-prem cluster, as I covered that in part 1 of this series.
This is the code for job 1, which I would run on my on-premise server:
#!/bin/bash
##Tenny Susanto, 2018-04-04
##push parquet files to GCS and load into BigQuery tables

##1) distcp files to GCS
hadoop distcp -overwrite hdfs://my_on_premise_node:8020/user/hive/warehouse/cdw_dwh.db/time_dim_gcp/ gs://our_gcs_bucket/warehouse/cdw_dwh.db/time_dim

##2) load into destination table in BQ
bq load --source_format=PARQUET --replace quotient-bi:cdw_dwh.time_dim gs://our_gcs_bucket/warehouse/cdw_dwh.db/time_dim/*.parq
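Since we have hundreds of tables, the two steps above would need to be batched. Here is a minimal sketch of how that could look; the table list is illustrative (only time_dim comes from the example above), and the commands are printed rather than executed so they can be reviewed before piping the output to sh.

```shell
#!/bin/bash
# Sketch: generate the distcp + bq load commands for each table in a list.
# The extra table names below are illustrative placeholders.
gen_migrate_cmds() {
  local tbl="$1"
  echo "hadoop distcp -overwrite hdfs://my_on_premise_node:8020/user/hive/warehouse/cdw_dwh.db/${tbl}/ gs://our_gcs_bucket/warehouse/cdw_dwh.db/${tbl}"
  echo "bq load --source_format=PARQUET --replace quotient-bi:cdw_dwh.${tbl} gs://our_gcs_bucket/warehouse/cdw_dwh.db/${tbl}/*.parq"
}

# print the migration commands for each table; pipe to sh to run for real
for tbl in time_dim store_dim print_fact; do
  gen_migrate_cmds "$tbl"
done
```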
This is the code for job 2, which I would run on my on-premise server:
#!/bin/bash
##Tenny Susanto, 2018-04-04
##run ETL sql script for print_fact

##1) (if you choose to store your sql scripts in GCS) download the etl sql script from GCS to /tmp
/bin/gsutil cp gs://our_gcs_bucket/Scripts/print_fact.sql /tmp/

##2) execute the query - insert results into this table - print_fact_demo
bq query --use_legacy_sql=false --replace --destination_table='cdw_dwh.print_fact_demo' --max_rows=1 --flagfile=/tmp/print_fact.sql
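As an aside, if you would rather not use --flagfile, bq query can also read the statement from stdin. The sketch below builds the same invocation as a reviewable command string (same illustrative paths and table name as above) instead of executing it.

```shell
#!/bin/bash
# Sketch: build the ETL invocation as a command string, feeding the SQL
# file to bq on stdin instead of via --flagfile. Paths are illustrative.
build_etl_cmd() {
  local sql_file="$1" dest_table="$2"
  echo "bq query --use_legacy_sql=false --replace --destination_table='${dest_table}' --max_rows=1 < ${sql_file}"
}

# print the command for review; run it with sh or eval once it looks right
build_etl_cmd /tmp/print_fact.sql cdw_dwh.print_fact_demo
```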
Here are my summarized findings:
- In terms of performance, the query took an average of 105 seconds on our production ETL cluster (40 data nodes, powerful machines with 256GB of RAM). In BQ, performance varied rather widely, sometimes as fast as 62 seconds, sometimes as slow as 110 seconds. In other words, it performs faster than, or at least as well as, Impala.
- We use a lot of STRING datatypes in our Impala tables. By default, Impala writes string data to Parquet as an unannotated binary field, so when I pushed the data as is to BQ, it couldn’t recognize the data and displayed it as gibberish. I had to change my session settings in Impala with “set PARQUET_ANNOTATE_STRINGS_UTF8=true;”, rewrite the data out to another table, then distcp that new file to BigQuery. We have hundreds of tables, some of them very big, so it would be a huge hassle to rewrite all the data (although I could easily make the session setting permanent, so going forward it won’t be a problem).
- BQ does not yet support a DECIMAL datatype; we have to convert to FLOAT first. So again, another hassle for us when pushing the data to BQ.
- Most of our Impala tables have two partition columns, while BQ only supports partitioning a table on a single column, which has to be a date/time datatype. That would require us to redesign and rewrite our tables before we can migrate. Not only that, existing queries and Microstrategy reports would need to change along with the schema, making the migration a bit more involved.
- ~~BQ does not support SQL statements like create table x as select * from y. This is probably one of the biggest disadvantages, as almost everyone here is used to writing that type of statement.~~ Edit (May 2018): it’s supported now.
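To illustrate the last point: now that CTAS works in BigQuery standard SQL, a statement of the familiar shape can be submitted through bq. As with the earlier sketches, this one just prints the command for review, and the source and destination table names are illustrative.

```shell
#!/bin/bash
# Sketch: a create-table-as-select submitted through bq in standard SQL.
# Source and destination table names are illustrative placeholders.
build_ctas_cmd() {
  local dest="$1" src="$2"
  echo "bq query --use_legacy_sql=false \"create table ${dest} as select * from ${src}\""
}

# print the command for review before running it
build_ctas_cmd cdw_dwh.time_dim_copy cdw_dwh.time_dim
```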
I still have a lot to investigate and document about what would need to happen if we were to go with BigQuery, but overall the experience has been relatively positive and, personally, quite exciting. I have done a demo of BQ for my team, and the reception has been quite positive as well. My next test is to see how Microstrategy works with BigQuery, and I will report on that soon.





