Hive to BigQuery: Move Data efficiently using GCP Dataproc Serverless
One of the main requirements of any cloud migration is to move data from on-premises or other cloud servers to a modern data warehouse like BigQuery. Migrating data from Hadoop to BigQuery is a fairly common use case in the industry. It usually involves a multistep process: migrating data to GCS using DistCp or other methods, and then building a pipeline from GCS to BigQuery.
Dataproc Templates, used from a Vertex AI notebook with Dataproc Serverless, provide a one-stop solution to migrate data directly from Hive tables in any Hadoop environment to BigQuery.
Set up a Vertex AI Notebook:
1. Enable the following services in the GCP project from the API Console:
- Compute Engine API
- Dataproc API
- Vertex AI API
- Vertex Notebooks API
2. Create a User-Managed Notebook in Vertex AI Workbench.
3. Clone the Dataproc Templates GitHub repo using the Git tab, or open a terminal from the Launcher window and clone it with git clone:
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
4. From the folders tab, open the HiveToBigquery_notebook.ipynb notebook located at dataproc-templates/notebooks/hive2bq.
5. For convenience, the following steps are marked sequentially and implemented in the Python notebook.
Step 1: Install required packages
Some of the packages needed for the migration, such as PySpark and the JDK, need to be installed separately because they are not available in the notebook by default.
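As an illustration (the exact install cell in the notebook may differ), installing the extra packages from a notebook cell looks roughly like this:
# Illustrative only: install PySpark and a JDK in the notebook environment.
! pip3 install --quiet pyspark
! sudo apt-get -y -qq install default-jdk
# The notebook may install additional packages as well.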
Step 2: Set Google Cloud properties
The following configurations need to be set before running the notebook:
- REGION: GCP region used to unload the Hive tables into BigQuery.
- GCS_STAGING_LOCATION: GCS bucket to store artifacts.
- SUBNET: VPC subnet.
- INPUT_HIVE_DATABASE: Hive database for the input tables.
- INPUT_HIVE_TABLES: Comma-separated Hive table names to move; use "*" to move all the tables in the database.
- OUTPUT_BIGQUERY_DATASET: BigQuery dataset for the output tables.
- TEMP_BUCKET: Temporary GCS bucket to store intermediate files.
- HIVE_METASTORE: Hive metastore URI.
- MAX_PARALLELISM: Number of parallel Dataproc jobs to run (default = 10).
- HIVE_OUTPUT_MODE: One of overwrite/append/ignore/errorifexists.
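For reference, a minimal sketch of this configuration cell, with placeholder values in place of real project, bucket, and network settings:
# Placeholder values only; replace with your own project, buckets, and network settings.
REGION = "us-central1"
GCS_STAGING_LOCATION = "gs://<staging-bucket>"
SUBNET = "projects/<project-id>/regions/us-central1/subnetworks/<subnet-name>"
INPUT_HIVE_DATABASE = "sales_db"
INPUT_HIVE_TABLES = "*"                        # or "table1,table2"
OUTPUT_BIGQUERY_DATASET = "sales_dataset"
TEMP_BUCKET = "<temp-bucket-name>"
HIVE_METASTORE = "thrift://<metastore-host>:9083"
MAX_PARALLELISM = 10
HIVE_OUTPUT_MODE = "overwrite"                 # overwrite/append/ignore/errorifexists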
Step 3: Import needed libraries like pandas and SparkSession.
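For example, the imports this step refers to are along these lines (the notebook imports additional modules as well):
import pandas as pd
from pyspark.sql import SparkSession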
Step 4: Set the current working directory to the python folder, as all the required artifacts are available there.
Step 5: Build python egg package
Behind the scenes, the HiveToBigQuery notebook makes use of the existing Hive-to-BigQuery Dataproc template, which requires the Python package in the form of an egg file.
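A minimal sketch of Steps 4 and 5, assuming the standard setuptools workflow of the repo's python folder (the exact commands in the notebook may differ):
# Move into the python folder of the cloned repo and build the egg package.
%cd dataproc-templates/python
! python3 setup.py bdist_egg     # by default the egg file is written under dist/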
Step 6: Copy the artifacts to GCS bucket
Copy the package file built above, along with the existing main.py, to the user-provided GCS bucket.
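For illustration, the copy can be done with gsutil from the notebook; the local paths below are assumptions based on the previous steps:
# Copy the egg package and main.py to the staging bucket set in Step 2.
! gsutil cp dist/*.egg main.py $GCS_STAGING_LOCATION/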
Step 7: Get HIVE tables
If the list of Hive tables to migrate was not provided in Step 2, this cell connects to the Hive thrift URI provided and retrieves the list of all tables present in the database using a SparkSession.
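A minimal sketch of how such a listing can be done with a Hive-enabled SparkSession (illustrative; the notebook's actual cell may differ):
# Connect to the Hive metastore and list the tables in the input database.
spark = (
    SparkSession.builder
    .appName("hive2bq-list-tables")
    .config("hive.metastore.uris", HIVE_METASTORE)
    .enableHiveSupport()
    .getOrCreate()
)
tables_df = spark.sql(f"SHOW TABLES IN {INPUT_HIVE_DATABASE}")
hive_tables = [row.tableName for row in tables_df.collect()]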
Step 8: Create Job Set
Based on the MAX_PARALLELISM value provided in Step 2, multiple Dataproc Serverless batch jobs run in parallel to migrate Hive tables to BigQuery. Run this cell to group the list of tables into sets, which will then be processed in parallel.
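As an illustration of the grouping logic (a sketch, not the notebook's exact code), the table list can be split into at most MAX_PARALLELISM sets:
import math
# Split the table list into at most MAX_PARALLELISM roughly equal-sized sets.
set_size = max(1, math.ceil(len(hive_tables) / MAX_PARALLELISM))
job_sets = [hive_tables[i:i + set_size] for i in range(0, len(hive_tables), set_size)]
# Each inner list becomes the table set handled by one Dataproc Serverless batch job.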
Step 9: Set Dataproc Template properties, such as the JAR paths and the egg package path, to be passed to the Vertex AI pipeline.
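For illustration only (the variable names below are assumptions, not the template's exact property keys), this step amounts to collecting the locations of the artifacts prepared earlier:
# Hypothetical names for illustration; the notebook defines its own properties.
MAIN_PYTHON_FILE = f"{GCS_STAGING_LOCATION}/main.py"
PYTHON_FILE_URIS = [f"{GCS_STAGING_LOCATION}/<built-egg-file>.egg"]    # egg built in Step 5
JARS = ["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"]      # Spark BigQuery connector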
Step 10: Build Vertex AI Pipelines
With the help of the Dataproc Serverless components for Vertex AI Pipelines, a Dataproc Serverless batch job can be triggered which, behind the scenes, runs the Hive-to-BigQuery template and loads the table sets (see Step 8) into BigQuery in parallel.
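A minimal sketch of such a pipeline, assuming the DataprocPySparkBatchOp component from google-cloud-pipeline-components; the real notebook builds one operation per table set and passes the template's full argument list:
from kfp import dsl, compiler
from google_cloud_pipeline_components.v1.dataproc import DataprocPySparkBatchOp

PROJECT_ID = "<project-id>"    # placeholder

@dsl.pipeline(name="hive2bq-pipeline")
def hive2bq_pipeline():
    # One batch operation per table set; a single set is shown for brevity.
    DataprocPySparkBatchOp(
        project=PROJECT_ID,
        location=REGION,
        batch_id="hive2bq-set-0",            # must be unique for every batch job
        main_python_file_uri=MAIN_PYTHON_FILE,
        python_file_uris=PYTHON_FILE_URIS,
        jar_file_uris=JARS,
        subnetwork_uri=SUBNET,
        args=["--template=HIVETOBIGQUERY"],  # plus the Hive/BigQuery arguments for this table set
    )

compiler.Compiler().compile(hive2bq_pipeline, "hive2bq_pipeline.json")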
Step 11: Trigger job and update Audit table
The final cell triggers the Vertex AI pipeline built in the previous step and updates the status of each table set in the audit table.
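For illustration, submitting the compiled pipeline and writing an audit record could look roughly like this (a sketch assuming the pipeline JSON from the previous step; the notebook's own audit logic may differ):
from datetime import datetime
from google.cloud import aiplatform

# Run the compiled pipeline for one table set and wait for completion.
job_start = datetime.now()
pipeline_job = aiplatform.PipelineJob(
    display_name="hive2bq-set-0",
    template_path="hive2bq_pipeline.json",
    pipeline_root=f"{GCS_STAGING_LOCATION}/pipeline_root",
    location=REGION,
)
pipeline_job.run()
job_end = datetime.now()

# Append a CSV audit record to the bucket read by the external audit table.
audit_row = ",".join([
    INPUT_HIVE_DATABASE, "set-0", OUTPUT_BIGQUERY_DATASET, "set-0",
    str(job_start), str(job_end), pipeline_job.state.name,
])
with open("audit_set0.csv", "w") as f:
    f.write(audit_row + "\n")
! gsutil cp audit_set0.csv gs://<bucket-name>/audit/    # placeholder bucket; must match the audit table's uris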
6. Run the cells one by one, or all at once using the option in the menu bar. At the end of Step 11, a link is generated for each table set, which can be used to monitor the jobs. Alternatively, detailed logs can be viewed in the Dataproc Batches UI.
Audit Table
The template stores audit data for each load in CSV format in the GCS bucket provided.
To view the data, create an external table pointing to the GCS bucket as shown below.
CREATE EXTERNAL TABLE `<project-id>.<dataset-name>.hive_bq_audit`
(
Source_DB_Name STRING,
Source_Table_Set STRING,
Target_DB_Name STRING,
Target_Table_Set STRING,
Job_Start_Time STRING,
Job_End_Time STRING,
Job_Status STRING
)
OPTIONS(
format="CSV",
uris=["gs://<bucket-name>/audit/*"]
);
Limitations:
- Partitioned tables: The current version does not automatically create partitions in BigQuery. To replicate Hive partitioning in BigQuery, DDL with partition and cluster columns has to be created manually. Once the table structure exists in BigQuery, the process above takes care of inserting data into the partitions. The BigQuery Translation API can be used to convert Hive DDL to BigQuery syntax. Note: read through the BigQuery partition limitations before designing the target tables.
- Incremental load: The current version supports overwrite and append modes only. Change HIVE_OUTPUT_MODE in Step 2 based on the requirement.
- Kerberos authentication: Users have to implement Kerberos authentication themselves if needed.
For any queries or suggestions, reach out to:
dataproc-templates-support-external@googlegroups.com