Export BigQuery to Google Datastore with Apache Beam/Google Dataflow

Yu Ishikawa
Google Cloud - Community
Sep 17, 2017


What if you could export a BigQuery table to Google Datastore without thinking about its schema? It is always annoying to write one-off functions to transfer data from a big data warehouse to a distributed KVS, such as from Apache Hive to Apache Cassandra or from BigQuery to Google Datastore. With such a function available, we could transfer data much more casually.

So I made a schema-agnostic exporter from BigQuery to Google Datastore with Apache Beam, running on top of Google Dataflow. It automatically parses the schema of a BigQuery table and converts the values to the appropriate Google Datastore data types.

GitHub repository

This is the repository for the exporter I made, bigquery-to-datastore.

BigQuery for Data Analysis

As someone who used to maintain Apache Hadoop and Apache Hive, I find BigQuery’s convenience staggering. It overwhelmingly beats hiring more data engineers and infrastructure engineers, not only in processing speed but also in stability and cost. Recently, many companies seem to be putting BigQuery at the center of their analysis platforms.

By enabling large-scale data manipulation with SQL, BigQuery has opened the door to analyzing web-scale data for many people, not just engineers familiar with “big data”.

BigQuery for Data Storage

The low price of BigQuery’s streaming inserts also makes it very suitable as a place to store structured data. For example, you can send logs from your servers to Google Pub/Sub, pull the data with a Google Dataflow / Apache Beam streaming application, and stream it into BigQuery. In other words, BigQuery can be used as a log collection destination. The attraction of storing logs in BigQuery is that you can analyze them immediately by running a query.
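Just to illustrate the pattern, here is a minimal Apache Beam pipeline sketch in Java. The Pub/Sub topic, output table, and one-column schema are hypothetical placeholders; a real pipeline would parse each log payload into proper columns.

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class LogsToBigQuery {
  public static void main(String[] args) {
    // Pass --runner=DataflowRunner and GCP options through args.
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical one-column schema; a real pipeline would define proper log fields.
    TableSchema schema = new TableSchema().setFields(Collections.singletonList(
        new TableFieldSchema().setName("raw").setType("STRING")));

    pipeline
        // Read raw log messages from a (hypothetical) Pub/Sub topic.
        .apply(PubsubIO.readStrings().fromTopic("projects/my-project/topics/server-logs"))
        // Wrap each message in a TableRow.
        .apply(MapElements.into(TypeDescriptor.of(TableRow.class))
            .via((String line) -> new TableRow().set("raw", line)))
        // Stream the rows into a BigQuery table.
        .apply(BigQueryIO.writeTableRows()
            .to("my-project:logs.server_logs")
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    pipeline.run();
  }
}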

BigQuery for Data Products

Traditionally, in order to create a data product, you had to build an application with a framework for large-scale data such as Apache Hadoop MapReduce or Apache Spark. Yet the essence does not change: you get, process, and store data.
What raises the difficulty is that the data we handle is large. We are bothered by how to process it and where to store it. This difficulty constantly holds back ideas for data products and delays their release.

Also, in terms of team development, it is hard to hire engineers who can build dedicated large-scale data products. Properly maintaining the systems behind such products is not easy, and labor costs increase as well.

As mentioned earlier, BigQuery’s strength is that large data can be manipulated with SQL.
What if you could store BigQuery summary results in a new table and easily export that table to a distributed KVS such as Google Datastore? Not only does BigQuery itself democratize large-scale data analysis, it also drastically lowers the barriers to entry for data products. There are many things that can be done with SQL alone, and since BigQuery also supports UDFs, we can extend what is possible beyond plain SQL. The sketch below shows this summarize-then-export flow.
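To make this concrete, here is a minimal sketch using the google-cloud-bigquery Java client that stores the result of a summary query into a destination table. The query and the source table name are hypothetical placeholders.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class BuildSummaryTable {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Hypothetical summary query; my_dataset.events is a placeholder source table.
    QueryJobConfiguration config = QueryJobConfiguration
        .newBuilder("SELECT uuid, COUNT(*) AS purchases FROM `my_dataset.events` GROUP BY uuid")
        .setDestinationTable(TableId.of("test_dataset", "test_table"))
        .setUseLegacySql(false)
        .build();

    bigquery.query(config); // blocks until the query job finishes
  }
}

Once the summary table exists, exporting it to Google Datastore is a single run of bigquery-to-datastore, as shown in the next section.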

How to Use

Here is an example that exports a BigQuery table test_dataset.test_table to a Google Datastore kind TestKind whose parents are KEY('Parent1', 'p1', 'Parent2', 'p2') and whose namespace is test_namespace. We can also specify a BigQuery column to use as the Google Datastore key; here, the key column is uuid.

./bigquery-to-datastore.sh \
--project=${GCP_PROJECT_ID} \
--runner=DataflowRunner \
--inputBigQueryDataset=test_dataset \
--inputBigQueryTable=test_table \
--outputDatastoreNamespace=test_namespace \
--outputDatastoreKind=TestKind \
--parentPaths=Parent1:p1,Parent2:p2 \
--keyColumn=uuid \
--tempLocation=gs://test_yu/test-log/ \
--gcpTempLocation=gs://test_yu/test-log/
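
After the Dataflow job finishes, you can spot-check an exported entity with the google-cloud-datastore Java client. This is a minimal sketch; "some-uuid" is a placeholder for an actual value from the uuid column.

import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Key;
import com.google.cloud.datastore.PathElement;

public class CheckExport {
  public static void main(String[] args) {
    Datastore datastore = DatastoreOptions.getDefaultInstance().getService();

    // Rebuild the key the exporter wrote: namespace, ancestors, kind, and key name.
    Key key = datastore.newKeyFactory()
        .setNamespace("test_namespace")
        .addAncestors(PathElement.of("Parent1", "p1"), PathElement.of("Parent2", "p2"))
        .setKind("TestKind")
        .newKey("some-uuid"); // placeholder for a real uuid value

    Entity entity = datastore.get(key);
    System.out.println(entity);
  }
}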

If you would like to know more details, please see the documentation on GitHub.

Type conversions between BigQuery and Google Datastore

bigquery-to-datastore automatically converts each BigQuery data type to the corresponding Google Datastore data type; the full conversion table is in the documentation on GitHub. Since Datastore unfortunately doesn’t have any data type for time, bigquery-to-datastore ignores BigQuery columns whose data type is TIME.
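To make the conversion concrete, here is a hypothetical Java sketch of how a single BigQuery value might be mapped to a Datastore Value based on the field’s declared type. This is illustrative only, not the tool’s actual code, and it covers just a few scalar types.

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.datastore.v1.Value;

public class TypeConversionSketch {
  // Illustrative mapping from a BigQuery field type to a Datastore Value.
  // Returns null for TIME, which has no Datastore counterpart and is skipped.
  static Value convert(TableFieldSchema field, Object raw) {
    Value.Builder value = Value.newBuilder();
    switch (field.getType()) {
      case "STRING":
        return value.setStringValue(raw.toString()).build();
      case "INTEGER":
        return value.setIntegerValue(Long.parseLong(raw.toString())).build();
      case "FLOAT":
        return value.setDoubleValue(Double.parseDouble(raw.toString())).build();
      case "BOOLEAN":
        return value.setBooleanValue(Boolean.parseBoolean(raw.toString())).build();
      case "TIME":
        return null; // Datastore has no time-of-day type, so the column is ignored
      default:
        return value.setStringValue(raw.toString()).build(); // fallback for brevity
    }
  }
}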

Conclusion

I have described BigQuery not only as a data analysis platform but also as a starting point for data products. I built this setup on Google Cloud Platform because, fundamentally, I do not want to maintain the data infrastructure myself. Of course, you can take the same approach elsewhere: process large data with SQL, create a new table, and store it in a distributed KVS such as HBase or Cassandra. This article focused on the value of having tools like bigquery-to-datastore.
