How to Load Data from Google BigQuery to Vertica?

Onur Taşhan
Published in Trendyol Tech · May 31, 2021

Hello :)

As the Trendyol DWH Team, we are moving our cloud-suitable ETL streams from our on-prem Vertica database to Google BigQuery, but in some cases we need to load the data we have processed in Google BigQuery back into Vertica.

For such cases, we have prepared a small Python script to quickly load the data from BigQuery into Vertica.

Let’s examine the Python code together. Our code contains two functions, get_data_bigquery and load_data_vertica, and it takes three input parameters when run:

  1. SQL file containing our query in BigQuery. (Line 104)
  2. YAML file where we define how the code should work. (Line 110)
  3. Connection file that will allow us to connect to Vertica. (Line 119)
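For illustration only, the three inputs might be wired up like this. The file names and variable names here are hypothetical (not copied from the original gist), and I am assuming the connection file is also YAML:

```python
# Hypothetical invocation:
#   python bq_to_vertica.py extract.sql params.yaml vertica_conn.yaml
import sys

import yaml

sql_file = sys.argv[1]                   # 1. SQL file with the BigQuery query
with open(sys.argv[2]) as f:
    params = yaml.safe_load(f)           # 2. YAML file describing how the load should run
with open(sys.argv[3]) as f:
    connection_info = yaml.safe_load(f)  # 3. Vertica connection file (assumed to be YAML)
```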

Let’s start with the main function:

  • We set up basic configuration for the logging library.
  • We take the first input parameter as the SQL file.
  • We take the second input parameter as the YAML file and assign its contents to a variable called params.
  • We take the third input parameter as the Vertica Connection file and assign its contents to a variable called connection_info.
  • We parse the YAML values in the params variable and assign them to their respective variables.
  • We give our SQL file as input to the get_data_bigquery function.
    The function reads the SQL file, runs the query in BigQuery, assigns the result to a variable named df and returns it.
  • We call the load_data_vertica function with the connection_info, df, target_table, load_type and origin_snapshot variables as inputs.

    The function iteratively writes the data in the df variable to a newline-delimited JSON file. When it completes the writing process, it connects to Vertica and loads the JSON data into Vertica. When the load is complete, it compares the number of lines in the file with the number of rows loaded into the database, deletes the file it created, and finishes the process.
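The original gist is not reproduced here, so below is a minimal sketch of how these two functions could look, assuming the google-cloud-bigquery and vertica-python client libraries and Vertica’s built-in fjsonparser. The actual implementation in the gist may differ in details such as file names, error handling and logging.

```python
import logging
import os

import vertica_python
from google.cloud import bigquery


def get_data_bigquery(sql_file):
    # Read the query and run it in BigQuery. Each returned row is expected to
    # hold a single TO_JSON_STRING(...) column.
    with open(sql_file) as f:
        query = f.read()
    client = bigquery.Client()
    df = client.query(query).result().to_dataframe()
    return df


def load_data_vertica(connection_info, df, target_table, load_type, origin_snapshot):
    json_file = "bq_export.json"  # hypothetical temporary file name

    # Write every row of df into a newline-delimited JSON file.
    with open(json_file, "w") as f:
        for value in df.iloc[:, 0]:
            f.write(value + "\n")

    with vertica_python.connect(**connection_info) as conn:
        cur = conn.cursor()

        if origin_snapshot:
            # Snapshot mode: create the target table with the same structure
            # as the origin table before loading into it.
            cur.execute(
                f"CREATE TABLE IF NOT EXISTS {target_table} "
                f"LIKE {origin_snapshot} INCLUDING PROJECTIONS"
            )

        if load_type == "WRITE_TRUNCATE":
            cur.execute(f"TRUNCATE TABLE {target_table}")

        # Load the newline-delimited JSON with Vertica's built-in JSON parser.
        with open(json_file, "rb") as f:
            cur.copy(f"COPY {target_table} FROM STDIN PARSER fjsonparser()", f)
        conn.commit()

        # Compare the file's line count with the number of rows Vertica accepted.
        with open(json_file) as f:
            file_lines = sum(1 for _ in f)
        cur.execute("SELECT GET_NUM_ACCEPTED_ROWS()")
        loaded_rows = cur.fetchone()[0]
        logging.info("file lines: %s, loaded rows: %s", file_lines, loaded_rows)

    os.remove(json_file)
```

One nice property of COPY … PARSER fjsonparser() is that Vertica matches the JSON keys produced by TO_JSON_STRING to column names, so no explicit column mapping is needed as long as the source and target columns share names.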

Before going into the details of the two functions, let me talk about the contents of SQL, YAML and Connection files.

Let’s start with the SQL file. In the example SQL file I shared in the gist below, our BigQuery query goes between lines 4 and 10, and we convert the query result to JSON with BigQuery’s built-in TO_JSON_STRING function.
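Since the gist itself is not embedded here, a rough illustration of such a SQL file might look like the following. The project, dataset, table and column names are hypothetical:

```sql
-- Hypothetical example; replace project, dataset, table and columns with your own.
SELECT TO_JSON_STRING(t)
FROM (
  SELECT
    order_id,
    customer_id,
    order_date
  FROM `my-project.my_dataset.orders`
) AS t
```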

As for our YAML file (I am sharing an example in the gist below), it has load and snapshot sections. The load section is mandatory, while the snapshot section is optional.

In the subgroup of the load section, we specify the table to load into in Vertica and how it should be loaded (WRITE_TRUNCATE or APPEND_ONLY).

In the subgroup of the snapshot section, we give an existing table in the origin field; the code then creates the table named in target_table with exactly the same structure (one-to-one) as the origin table and writes the data into this new table.
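A hypothetical params file following this description might look like the snippet below; the key names are illustrative, and the real gist may use different ones:

```yaml
load:
  target_table: dwh.orders_from_bq   # table to load into in Vertica
  load_type: WRITE_TRUNCATE          # or APPEND_ONLY
snapshot:                            # optional section
  origin: dwh.orders                 # target_table is created one-to-one in this table's format
```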

The content of our Connection file is as in the gist below.
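As a placeholder for that gist, a connection file holding the usual vertica-python connection parameters might look like this; the values are purely illustrative, and the assumption that the file is YAML is mine:

```yaml
host: vertica.example.com
port: 5433
user: dwh_loader
password: "********"
database: dwh
```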

In this way, we have met our small need with a Python script. There were alternative options besides the method we used, such as a Parquet export combined with an external table, but this is how we solved the need at this first stage.

Have a nice coding! :)
