Teradata to BigQuery Migration Guide

Darshan Barapatre
Google Cloud - Community
12 min read · Dec 27, 2022

Migrating data from Teradata to Google BigQuery is a common use case for businesses that want to leverage the scalability and cost-effectiveness of BigQuery for their data analysis needs.

There are a few different approaches that can be taken when migrating data from Teradata to BigQuery. The approach that is best for your organization will depend on your specific needs and requirements.

Some common approaches include:

  1. Direct export and import: This approach involves exporting the data from Teradata using a tool like the Teradata Parallel Transporter (TPT) or Teradata Studio, and then importing the data into BigQuery using the BigQuery web UI, CLI, or API. This is a straightforward method that can be suitable for small to medium-sized data sets.
  2. Data transfer using the BigQuery Data Transfer Service: The BigQuery Data Transfer Service is a fully managed, cloud-native data integration service that allows you to schedule and automate the transfer of data from a variety of sources, including on-premises databases like Teradata, to BigQuery. This approach can be suitable for one-time or periodic data transfers, and is particularly useful when you want to automate the migration, including carrying over partitioning and clustering to the destination tables in BigQuery.
  3. ETL (extract, transform, load) using Cloud Dataflow: Google Cloud Dataflow is a fully-managed, cloud-native data processing service that allows you to build and execute data pipelines for ETL, batch processing, and stream processing. Dataflow can be a useful tool for data migration, particularly for larger data sets or for cases where more complex transformations are required. Overall, using Cloud Dataflow for a Teradata to BigQuery migration allows you to build and execute a fully-managed data pipeline that can handle the ETL or ELT requirements of your migration project.

Overall, the process of migrating data from Teradata to BigQuery requires careful planning and execution to ensure a successful transfer of data.

Approach-1: BigQuery Data Transfer Service

Teradata Instance Setup

It is possible to set up a Teradata instance on Google Cloud for testing purposes. If you have Teradata running on an on-premises server or in another cloud environment, you may need to set up network and firewall rules for the instance. This may involve creating a virtual private cloud (VPC) and configuring the necessary firewall rules to allow incoming traffic to the instance.
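For example, assuming the instance is created with the ve network tag used later in this guide, a hedged sketch of a firewall rule that opens SSH and the forwarded Vantage Express ports might look like this (adjust the network and source ranges to your environment):

# Illustrative only: allow SSH (22), the forwarded Vantage Express SSH port (4422),
# and the Teradata database port (1025) to instances tagged "ve".
# Restrict --source-ranges to your own IP range rather than 0.0.0.0/0 in practice.
gcloud compute firewall-rules create allow-teradata-ve \
  --network=default \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:22,tcp:1025,tcp:4422 \
  --target-tags=ve \
  --source-ranges=YOUR_IP_RANGE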

Create a Service Account for the Data Transfer

# Set the PROJECT variable
export PROJECT=$(gcloud config get-value project)

# Create a service account
gcloud iam service-accounts create td2bq-transfer

BigQuery and Cloud Storage admin role binding

# Set TD2BQ_SVC_ACCOUNT = service account email
export TD2BQ_SVC_ACCOUNT=td2bq-transfer@${PROJECT}.iam.gserviceaccount.com

# Bind the service account to the BigQuery Admin role
gcloud projects add-iam-policy-binding ${PROJECT} \
--member serviceAccount:${TD2BQ_SVC_ACCOUNT} \
--role roles/bigquery.admin

# Bind the service account to the Storage Admin role
gcloud projects add-iam-policy-binding ${PROJECT} \
--member serviceAccount:${TD2BQ_SVC_ACCOUNT} \
--role roles/storage.admin

Create the Teradata GCE instance

# Create the Teradata GCE instance
gcloud compute instances create teradata-vantage-express \
--zone=us-central1-a \
--machine-type=n2-custom-4-8192 \
--create-disk=boot=yes,device-name=ve-disk,image-project=ubuntu-os-cloud,image-family=ubuntu-2004-lts,size=70,type=pd-balanced \
--enable-nested-virtualization \
--tags=ve \
--service-account=${TD2BQ_SVC_ACCOUNT}

Follow the Teradata official guide to run Vantage Express on Google Cloud

https://quickstarts.teradata.com/vantage.express.gcp.html

Start Teradata VM on Compute Engine Instance

# ssh to the VM
gcloud compute ssh teradata-vantage-express --zone=us-central1-a

# switch to root user
sudo -i
export VM_IMAGE_DIR="/opt/downloads/VantageExpress17.20_Sles12"
DEFAULT_VM_NAME="vantage-express"
VM_NAME="${VM_NAME:-$DEFAULT_VM_NAME}"
vboxmanage createvm --name "$VM_NAME" --register --ostype openSUSE_64
vboxmanage modifyvm "$VM_NAME" --ioapic on --memory 6000 --vram 128 --nic1 nat --cpus 4
vboxmanage storagectl "$VM_NAME" --name "SATA Controller" --add sata --controller IntelAhci
vboxmanage storageattach "$VM_NAME" --storagectl "SATA Controller" --port 0 --device 0 --type hdd --medium "$(find $VM_IMAGE_DIR -name '*disk1*')"
vboxmanage storageattach "$VM_NAME" --storagectl "SATA Controller" --port 1 --device 0 --type hdd --medium "$(find $VM_IMAGE_DIR -name '*disk2*')"
vboxmanage storageattach "$VM_NAME" --storagectl "SATA Controller" --port 2 --device 0 --type hdd --medium "$(find $VM_IMAGE_DIR -name '*disk3*')"
vboxmanage modifyvm "$VM_NAME" --natpf1 "tdssh,tcp,,4422,,22"
vboxmanage modifyvm "$VM_NAME" --natpf1 "tddb,tcp,,1025,,1025"
vboxmanage startvm "$VM_NAME" --type headless
vboxmanage controlvm "$VM_NAME" keyboardputscancode 1c 1c

Download the Teradata client and jar

# Download file
curl -o FILE_NAME "URL"

# Extract the file
tar -xvf TeradataToolsAndUtilitiesBase__ubuntu_x8664.17.20.08.00.tar.gz

After extracting the utilities archive, install tptbase to use the tbuild utility:

# Change directory
cd TeradataToolsAndUtilitiesBase

# Install
./setup.sh tptbase

Follow the same steps to download the Teradata JDBC driver from the Teradata JDBC driver download page.

  • Unzip the file in the home directory to get terajdbc4.jar (see the sketch below)
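A hedged sketch of the unzip step; the archive name is illustrative and depends on the driver version you downloaded:

# Filename is illustrative; use the archive you downloaded from the Teradata site
unzip TeraJDBC__indep_indep.17.20.00.15.zip -d ~/
ls ~/terajdbc4.jar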

Install the Transfer Agent

The agent is part of the BigQuery Data Transfer Service, which automates the movement of data between Teradata and BigQuery.

# Install Java
sudo apt-get install openjdk-8-jre-headless

# Download transfer agent
sudo wget -P /root/ \
https://storage.googleapis.com/data_transfer_agent/latest/mirroring-agent.jar

ssh to VM and query database

# ssh to Vantage Express VM. Use root as password
ssh -p 4422 root@localhost

# Validate if the DB is up
pdestate -a

Once Vantage Express is up and running, start the BTEQ command-line client. When asked for the password, enter dbc.

>> bteq
Enter your logon or BTEQ command:
.logon localhost/dbc
Password: dbc

Run Test Queries

Initially, there won’t be any databases or tables present. You need to import tables from local files, or create tables using SQL and insert data manually. Create the environment structure as per your requirements in this step, for example as sketched below.
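For instance, a minimal sketch that creates a small HR.Employees table matching the sample custom schema shown later in this guide (database name, column sizes, and PERM space are illustrative):

CREATE DATABASE HR AS PERM = 50e6;

CREATE TABLE HR.Employees (
  ID INTEGER,
  NAME VARCHAR(50),
  UPDATE_DATE DATE,
  DEPARTMENT_CODE CHAR(2)
);

INSERT INTO HR.Employees VALUES (1, 'Jane Doe', DATE '2022-12-01', '10');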

SELECT * FROM HR.Employees;

# Close BTEQ
.QUIT

Create Transfer using BigQuery Console

In the Google Cloud console,

  • Go to the BigQuery page
  • Click Data transfers
  • Click Create transfer

Provide the appropriate information as per the requirements.

Data Transfer Configuration Details

Provide the destination and source details:

  • Destination dataset
  • Source database type. Select Teradata.
  • Provide Cloud Storage bucket path. This bucket will be used to store intermediate data during the migration process.
  • Source database name
  • Table name patterns
    - To select specific tables, provide pipe-separated table names, such as table1|table2
    - Specify .* to process all the tables in the source database
  • For Service account email, enter the email attached to the IAM service account.
  • For the Schema file path option, provide the path to the custom schema file. If you don’t want to use one, leave it empty and BigQuery will automatically detect the table schema.
Data Transfer Source Details
  • Click Save

This will create the transfer. You can view details such as Display Name, Schedule, Destination Dataset, etc. on the transfer details page.

  • On the Transfer details page, click the Configuration tab.
    - Keep a note of the resource name, as we’ll use it to run the migration agent in the next steps.
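Alternatively, the same transfer configuration can be created from the command line with bq mk. The sketch below is illustrative and unverified; check the current BigQuery Data Transfer Service documentation for the exact params keys expected by the Teradata (on_premises) data source:

# Hedged sketch only: verify the --params keys against the current docs
bq mk --transfer_config \
  --project_id=${PROJECT} \
  --target_dataset=td2bq \
  --display_name="Teradata migration" \
  --data_source=on_premises \
  --params='{
    "database_type": "Teradata",
    "database_name": "hr",
    "table_name_patterns": ".*",
    "bucket": "your-staging-bucket",
    "agent_service_account": "td2bq-transfer@project-id.iam.gserviceaccount.com"
  }'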

Custom Schema File

Refer to the JSON below as a sample schema file. You can specify the source database name under originalName and the target database name (after migration) under name. The same applies to all the columns in a table. In addition, you can specify the target data type for a column and a usageType. Specify DEFAULT as the usage type for columns that have no special use in the source system; this is the default value.

{
  "databases": [
    {
      "name": "hr",
      "originalName": "hr",
      "tables": [
        {
          "name": "employees",
          "originalName": "employees",
          "columns": [
            {
              "name": "ID",
              "originalName": "ID",
              "type": "INT64",
              "originalType": "integer",
              "usageType": [
                "DEFAULT"
              ]
            },
            {
              "name": "NAME",
              "originalName": "NAME",
              "type": "STRING",
              "originalType": "character",
              "usageType": [
                "DEFAULT"
              ]
            },
            {
              "name": "UPDATE_DATE",
              "originalName": "UPDATE_DATE",
              "type": "DATE",
              "originalType": "date",
              "usageType": [
                "PARTITIONING"
              ]
            },
            {
              "name": "DEPARTMENT_CODE",
              "originalName": "DEPARTMENT_CODE",
              "type": "STRING",
              "originalType": "character",
              "usageType": [
                "DEFAULT"
              ]
            }
          ]
        }
      ]
    }
  ]
}

Incremental Updates

  • To handle incremental updates, you can specify COMMIT_TIMESTAMP as the usageType; the supported data types for this are TIMESTAMP and DATE. As of today, only one column per table can be annotated with this usage type. This column is used to extract rows created or updated since the last transfer run (see the example below).
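For example, to use the UPDATE_DATE column from the sample schema above as the incremental column, its definition would look roughly like this:

{
  "name": "UPDATE_DATE",
  "originalName": "UPDATE_DATE",
  "type": "DATE",
  "originalType": "date",
  "usageType": [
    "COMMIT_TIMESTAMP"
  ]
}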

Partitioned Table

  • If you have a partitioned table to migrate, you can specify PARTITIONING as the usageType; the supported data types for this are TIMESTAMP and DATE. This column is used in the partitioned table definition for the containing tables object.

Clustering

  • To cluster the destination table, you can specify CLUSTERING as the usageType. Up to four columns can be annotated with this usage type; the column order for clustering is determined by the order in which the columns appear in the custom schema (see the example below).
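Similarly, to cluster the destination table on DEPARTMENT_CODE from the sample schema, its usageType would be:

{
  "name": "DEPARTMENT_CODE",
  "originalName": "DEPARTMENT_CODE",
  "type": "STRING",
  "originalType": "character",
  "usageType": [
    "CLUSTERING"
  ]
}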

Initialize the agent

Switch back to the Teradata terminal. Initialize the BigQuery migration agent with a command of the following form:

java -cp terajdbc4.jar:mirroring-agent.jar com.google.cloud.bigquery.dms.Agent --initialize

When prompted, configure the following options:

  • Provide a directory for locally extracted files
  • localhost as the database hostname
  • Provide the connection port, or press enter to use the default port
  • Type yes to use TPT to unload data
  • Provide the path to a database credentials file, or press enter to provide credentials at runtime. Sample contents of the credentials file:
username=dbc
password=dbc
  • Provide the resource name copied from the BigQuery transfer details page as the config name.
  • Finally, provide the configuration file path.
root@teradata-vantage-express:~/export_data# java -cp ~/mirroring-agent.jar com.google.cloud.bigquery.dms.Agent --initialize
Reading data from gs://data_transfer_agent/latest/version

Would you like to write the default TPT template to disk? (yes/no): yes
Enter path to write default TPT script template: tpt_template
Enter directory path for locally extracted files: /tmp
Enter Database hostname: localhost
Enter Database connection port (press enter to use default port):
Would you like to use Teradata Parallel Transporter (TPT) to unload data? Recommended - yes. (yes/no): yes
Enter database credentials file path:
Enter BigQuery Data Transfer Service config name (press enter to create new transfer config): projects/project-id/locations/us/transferConfigs/639641d3-0000-2061-9b3a-240588725d64
Enter configuration file path: config.json

Configuration file successfully created!

With the above details, the initialization process will create two files: tpt_template and config.json.

Run the agent

java -cp ~/mirroring-agent.jar:terajdbc4.jar com.google.cloud.bigquery.dms.Agent --configuration-file=config.json

If prompted, provide Teradata username and password.

If all the provided configurations are correct, the agent will list the specified tables and transfer them to the destination dataset in BigQuery.

Approach-2: Using TPT Export Utility + gsutil

In general, any data transfer can be categorized as either an ETL or an ELT process. If we don’t want to perform any transformations on the source data, the BigQuery Data Transfer Service is the go-to option.

If your use case needs transformations or pre-processing applied to the source data, using Cloud Storage as a staging area and performing the transformations via Dataflow can be a good solution.

Data transfer flow using TPT Export and Dataflow

Exporting Data out of Teradata

Teradata provides two data export utilities capable of sending data from a Teradata platform:

  • BTEQ and
  • FastExport

Teradata Parallel Transporter (TPT) merges the functionalities of FastLoad, MultiLoad, TPump, BTEQ, and FastExport into one utility.

Process

Define a TPT script

There are three components in the script:

  • Define the EXPORT_DATA job. This includes the type of operator (Consumer_Opp in our case), the schema, and output attributes such as the output filename, delimiter, mode, etc.
  • Then comes the schema for the table that needs to be exported.
  • Lastly, the Producer_Export operator. Here we define the source database credentials, the Teradata system identifier (Tdpid), session info, etc. You have to provide the SelectStmt here, which can be modified for a full load as well as an incremental load.

Sample td_export.tpt file

DEFINE JOB EXPORT_DATA
(
  DEFINE OPERATOR Consumer_Opp
  DESCRIPTION 'Consumer Operator'
  TYPE DATACONNECTOR CONSUMER
  SCHEMA *
  ATTRIBUTES
  (
    VARCHAR FileName = 'employee.csv',
    VARCHAR FORMAT = 'DELIMITED',
    VARCHAR OpenMode = 'Write',
    VARCHAR TextDelimiter = ','
  );

  DEFINE SCHEMA Employee_Schema
  (
    GlobalID INTEGER,
    FirstName VARCHAR(30),
    LastName VARCHAR(30),
    DateOfBirth CHAR(10),
    JoinedDate CHAR(10),
    DepartmentCode BYTEINT
  );

  DEFINE OPERATOR Producer_Export
  TYPE EXPORT
  SCHEMA Employee_Schema
  ATTRIBUTES
  (
    VARCHAR UserName = 'dbc',
    VARCHAR UserPassword = 'dbc',
    VARCHAR SelectStmt = 'select * from HR.EMPLOYEES;',
    VARCHAR Tdpid = 'localhost',
    INTEGER MaxSessions = 5,
    INTEGER MinSessions = 1,
    VARCHAR DateForm = 'ANSIDATE'
  );

  APPLY
  TO OPERATOR( Consumer_Opp )
  SELECT * FROM OPERATOR( Producer_Export[2] );
);

Run the script:

tbuild -f td_export.tpt
  • This outputs a CSV file (employee.csv) on the local server

Use gsutil to export local files to GCS:

gsutil cp *.csv gs://bucket-name/dest-folder

After this step, there can be two ways to load data into BigQuery.

Route-1: Load directly into BQ

To load data directly from a Cloud Storage data source into BigQuery, use the bq load command, providing the source data format, the GCS file path, and a schema file.

bq load \
--source_format=CSV \
td2bq.employees_gcs \
gs://bucket-name/dest-folder/employee.csv \
./employee_schema.json
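The employee_schema.json file referenced above is not shown in the original post; a minimal sketch in BigQuery’s JSON schema format, assuming column names in line with the Dataflow example further below, might look like this:

[
  {"name": "id", "type": "INTEGER", "mode": "NULLABLE"},
  {"name": "first_name", "type": "STRING", "mode": "NULLABLE"},
  {"name": "last_name", "type": "STRING", "mode": "NULLABLE"},
  {"name": "date_of_birth", "type": "STRING", "mode": "NULLABLE"},
  {"name": "date_of_joining", "type": "STRING", "mode": "NULLABLE"},
  {"name": "department_code", "type": "INTEGER", "mode": "NULLABLE"}
]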

Route-2: Using Dataflow

If you need to perform any data cleaning or transformation on source data before loading into BigQuery, you can use Dataflow. Dataflow is a serverless, fast and cost-effective service that supports both stream and batch processing.

Data transfer flow for data quality checks using Dataflow

Process:

  • Enable the Dataflow API
  • Dataflow uses a Cloud Storage bucket as a staging location to store temporary files. Create GCS folders for staging and temp data: gs://bucket-name/GCS_to_BQ_Dataflow/temp and gs://bucket-name/GCS_to_BQ_Dataflow/staging
  • Go to the Google Cloud console
  • Click on Activate Cloud Shell
  • Install the Apache Beam dependency: sudo pip3 install apache_beam[gcp]
  • Create quality_check.py as shown below:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

FILEPATH = 'gs://bucket-name/employee.csv'


def calculate_age(data):
    from datetime import date, datetime
    today = date.today()
    dob = datetime.strptime(data['date_of_birth'], '%Y-%m-%d')
    data['age'] = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return data


def run():
    with beam.Pipeline(options=PipelineOptions()) as pipeline:
        (pipeline
         | 'Read Data' >> beam.io.ReadFromText(FILEPATH)
         | beam.Map(lambda x: x.split(','))
         | 'Convert to Dictionary' >> beam.Map(
             lambda x: {"id": x[0], "first_name": x[1], "last_name": x[2], "date_of_birth": x[3],
                        "date_of_joining": x[4], "department_code": x[5]})
         | 'CalculateAge' >> beam.Map(calculate_age)
         | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
             project='project-id',
             dataset='td2bq',
             table='emp_data',
             schema='id:INTEGER,first_name:STRING,last_name:STRING,date_of_birth:STRING,'
                    'date_of_joining:STRING,department_code:INTEGER,age:INTEGER',
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
         )
        )


if __name__ == '__main__':
    run()

The above is a sample Python script that reads data from the GCS bucket, populates an age field from date_of_birth, and writes the data to the destination BigQuery table with the new schema.

You can modify the above script to apply data cleansing, transformations, validations, and other data quality checks.

  • Run the Dataflow Job
python3 quality_check.py \
--runner DataflowRunner \
--project jarvis-365810 \
--temp_location gs://bucket-name/GCS_to_BQ_Dataflow/temp \
--staging_location gs://bucket-name/GCS_to_BQ_Dataflow/staging \
--region us-central1 \
--job_name gcs2bq

In the Google Cloud console:

  • Go to Dataflow
  • Click on Jobs
  • Click on the job name to open the Job details page and track the progress.
Dataflow Job

On successful completion of the job, the table emp_data will be created under the dataset td2bq in BigQuery. Check the schema and preview the table to verify the correctness of the data, for example as sketched below.
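A quick check can be done from Cloud Shell with the bq CLI (dataset and table names follow the example above):

# Inspect the schema of the table created by the Dataflow job
bq show --schema --format=prettyjson td2bq.emp_data

# Preview the first rows to eyeball the data
bq head -n 10 td2bq.emp_data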

Translate queries

With BigQuery, you can translate SQL queries in two different ways:

1. Using Batch SQL Translation

In the Google Cloud console,

  • Go to the BigQuery page
  • Under Migration click on SQL Translation
  • Click Start Translation

Specify the translation configuration as follows:

SQL Translation Configuration

In the Source details section, provide the location of the GCS folder that contains all the Teradata-compatible SQL scripts that need to be translated to BigQuery SQL.

Source Details for SQL Translation

Similarly, in the Target details section, provide the destination path for the output scripts.

In the Optional settings,

  • You can provide a default database name. When a source query does not specify a database, this name is used.
  • You can also provide source-to-destination name mappings via a JSON file
  • Or, with manual input, you can provide name mapping pairs
Optional Settings

Once you click Create, a translation job will be created and triggered at the same time. You can check the progress and translation details by clicking on the job name.

SQL Translation Details page

Once the job is completed, check the destination GCS path provided in the Target details section. If the job ran successfully for all the files, you’ll find the BigQuery-compatible SQL scripts in the output folder.

Output Bucket Contents

The batch SQL translator outputs the following files to the specified destination:

  • The translated files.
  • The translation summary report in CSV format.
  • The consumed output name mapping in JSON format.

2. Interactive SQL Translation

In the Google Cloud console,

  • Go to the BigQuery page
  • Go to SQL Workspace
  • In the query editor, click MORE and select Enable SQL Translation
  • The window will split into two. In the left editor window, click Translating from and select the source data warehouse, Teradata
  • Paste the Teradata SQL query in the editor and click Translate to get the equivalent BigQuery Standard SQL query (see the example below).
Ad-hoc query translation
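As a rough illustration (hand-written here, not actual translator output), a Teradata query using the SEL shorthand and TOP would be expected to translate along these lines:

-- Teradata input
SEL TOP 10 * FROM HR.Employees ORDER BY UPDATE_DATE DESC;

-- Expected BigQuery Standard SQL output
SELECT * FROM HR.Employees ORDER BY UPDATE_DATE DESC LIMIT 10;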

Further Steps

  • Migrate business applications
  • Migrate data pipelines
  • Optimize performance
  • Verify and validate

If you enjoyed reading this post or found it useful in any way, please drop some claps and comments. I would love to hear your feedback and encouragement to write more on Data Engineering.

Feel free to engage and connect with me on LinkedIn
