Teradata to BigQuery Migration Guide
Migrating data from Teradata to Google BigQuery is a common use case for businesses that want to leverage the scalability and cost-effectiveness of BigQuery for their data analysis needs.
There are a few different approaches that can be taken when migrating data from Teradata to BigQuery. The approach that is best for your organization will depend on your specific needs and requirements.
Some common approaches include:
- Direct export and import: This approach involves exporting the data from Teradata using a tool like the Teradata Parallel Transporter (TPT) or Teradata Studio, and then importing the data into BigQuery using the BigQuery web UI, CLI, or API. This is a straightforward method that can be suitable for small to medium-sized data sets.
- Data transfer using the BigQuery Data Transfer Service: The BigQuery Data Transfer Service is a fully managed, cloud-native data integration service that lets you schedule and automate the transfer of data from a variety of sources, including on-premises databases like Teradata, into BigQuery. This approach suits one-time or periodic transfers, and is particularly useful when you want to automate the migration and carry partitioning and clustering definitions over to BigQuery.
- ETL (extract, transform, load) using Cloud Dataflow: Google Cloud Dataflow is a fully managed, cloud-native data processing service for building and executing pipelines for ETL, batch processing, and stream processing. Dataflow is particularly useful for larger data sets or when more complex transformations are required, and it can handle the ETL or ELT requirements of your migration project end to end.
Overall, the process of migrating data from Teradata to BigQuery requires careful planning and execution to ensure a successful transfer of data.
Approach-1: BigQuery Data Transfer Service
Teradata Instance Setup
It is possible to set up a Teradata instance on Google Cloud for testing purposes. If you have Teradata running on an on-prem server or in another cloud environment, you may need to set up network and firewall rules for the instance. This may involve creating a Virtual Private Cloud (VPC) and configuring the necessary firewall rules to allow incoming traffic to the instance.
Create a Service Account for the Data Transfer
# Set the PROJECT variable
export PROJECT=$(gcloud config get-value project)
# Create a service account
gcloud iam service-accounts create td2bq-transfer
BigQuery and Cloud Storage admin role binding
# Set TD2BQ_SVC_ACCOUNT = service account email
export TD2BQ_SVC_ACCOUNT=td2bq-transfer@${PROJECT}.iam.gserviceaccount.com
# Bind the service account to the BigQuery Admin role
gcloud projects add-iam-policy-binding ${PROJECT} \
--member serviceAccount:${TD2BQ_SVC_ACCOUNT} \
--role roles/bigquery.admin
# Bind the service account to the Storage Admin role
gcloud projects add-iam-policy-binding ${PROJECT} \
--member serviceAccount:${TD2BQ_SVC_ACCOUNT} \
--role roles/storage.admin
Create the Teradata GCE instance
# Create the Teradata GCE instance
gcloud compute instances create teradata-vantage-express \
--zone=us-central1-a \
--machine-type=n2-custom-4-8192 \
--create-disk=boot=yes,device-name=ve-disk,image-project=ubuntu-os-cloud,image-family=ubuntu-2004-lts,size=70,type=pd-balanced \
--enable-nested-virtualization \
--tags=ve \
--service-account=${TD2BQ_SVC_ACCOUNT}
Follow the official Teradata guide to run Vantage Express on Google Cloud:
https://quickstarts.teradata.com/vantage.express.gcp.html
Start Teradata VM on Compute Engine Instance
# ssh to the VM
gcloud compute ssh teradata-vantage-express --zone=us-central1-a
# switch to root user
sudo -i
export VM_IMAGE_DIR="/opt/downloads/VantageExpress17.20_Sles12"
DEFAULT_VM_NAME="vantage-express"
VM_NAME="${VM_NAME:-$DEFAULT_VM_NAME}"
vboxmanage createvm --name "$VM_NAME" --register --ostype openSUSE_64
vboxmanage modifyvm "$VM_NAME" --ioapic on --memory 6000 --vram 128 --nic1 nat --cpus 4
vboxmanage storagectl "$VM_NAME" --name "SATA Controller" --add sata --controller IntelAhci
vboxmanage storageattach "$VM_NAME" --storagectl "SATA Controller" --port 0 --device 0 --type hdd --medium "$(find $VM_IMAGE_DIR -name '*disk1*')"
vboxmanage storageattach "$VM_NAME" --storagectl "SATA Controller" --port 1 --device 0 --type hdd --medium "$(find $VM_IMAGE_DIR -name '*disk2*')"
vboxmanage storageattach "$VM_NAME" --storagectl "SATA Controller" --port 2 --device 0 --type hdd --medium "$(find $VM_IMAGE_DIR -name '*disk3*')"
vboxmanage modifyvm "$VM_NAME" --natpf1 "tdssh,tcp,,4422,,22"
vboxmanage modifyvm "$VM_NAME" --natpf1 "tddb,tcp,,1025,,1025"
vboxmanage startvm "$VM_NAME" --type headless
vboxmanage controlvm "$VM_NAME" keyboardputscancode 1c 1c
Download the Teradata client and jar
- Navigate to the Teradata Tools and Utilities downloads page
- Click on the file you want to download
- Copy the download URL from the browser and run curl as:
# Download the file
curl -o FILE_NAME "URL"
# Extract the file
tar -xvf TeradataToolsAndUtilitiesBase__ubuntu_x8664.17.20.08.00.tar.gz
After extracting the utility archive, install tptbase to use the tbuild utility
# Change directory
cd TeradataToolsAndUtilitiesBase
# Install
./setup.sh tptbase
Follow the same steps to download Teradata JDBC Drivers from the Teradata JDBC driver download page
- Unzip the file in the home directory to get terajdbc4.jar
Install the Transfer Agent
The agent is part of the BigQuery Data Transfer Service, which automates the movement of data between Teradata and BigQuery.
# Install Java
sudo apt-get install openjdk-8-jre-headless
# Download transfer agent
sudo wget -P /root/ \
https://storage.googleapis.com/data_transfer_agent/latest/mirroring-agent.jar
SSH to the VM and query the database
# ssh to Vantage Express VM. Use root as password
ssh -p 4422 root@localhost
# Validate if the DB is up
pdestate -a
Once Vantage Express is up and running, start the bteq command-line client. When asked for the password, enter dbc
>> bteq
Enter your logon or BTEQ command:
.logon localhost/dbc
Password: dbc
Run Test Queries
Initially, there won’t be any databases or tables present. You need to import tables from local files or create tables using SQL and insert data manually. Create the environment structure as per your requirements in this step.
SELECT * FROM HR.Employees;
# Close BTEQ
.QUIT
Create Transfer using BigQuery Console
In the Google Cloud console:
- Go to the BigQuery page
- Click Data transfers
- Click Create transfer
Provide the appropriate information as per the requirements.
Provide the destination and source details:
- Destination dataset
- Source database type: select Teradata
- Provide Cloud Storage bucket path. This bucket will be used to store intermediate data during the migration process.
- Source database name
- Table name patterns
- To select specific tables, provide pipe-separated table names as table1|table2
- Specify .* to process all the tables in the source database
- For Service account email, enter the email attached to the IAM service account
- For the Schema file path option, provide the path to the custom schema file. If you don’t want to use one, leave it empty and BigQuery will automatically detect the table schema.
- Click Save
This creates the transfer. You can view details such as Display Name, Schedule, and Destination Dataset on the transfer details page.
- On the Transfer details page, click the Configuration tab.
- Keep a note of the resource name, as we’ll use it to run the migration agent in later steps.
Custom Schema File
Refer to the JSON below as a sample schema file. You specify the source database name under originalName and the target database name (after migration) under name; the same applies to every column in a table. In addition, you can specify a target data type and a usageType for each column. Specify DEFAULT as the usage type for ordinary columns: it indicates that the column has no special use in the source system, and it is the default value.
{
"databases": [
{
"name": "hr",
"originalName": "hr",
"tables": [
{
"name": "employees",
"originalName": "employees",
"columns": [
{
"name": "ID",
"originalName": "ID",
"type": "INT64",
"originalType": "integer",
"usageType": [
"DEFAULT"
]
},
{
"name": "NAME",
"originalName": "NAME",
"type": "STRING",
"originalType": "character",
"usageType": [
"DEFAULT"
]
},
{
"name": "UPDATE_DATE",
"originalName": "UPDATE_DATE",
"type": "DATE",
"originalType": "date",
"usageType": [
"PARTITIONING"
]
},
{
"name": "DEPARTMENT_CODE",
"originalName": "DEPARTMENT_CODE",
"type": "STRING",
"originalType": "character",
"usageType": [
"DEFAULT"
]
}
]
}
]
}
]
}
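The schema file above can also be generated programmatically, which helps when migrating many tables. A minimal sketch (the database, table, and column names match the sample above; adapt them to your source schema):

```python
import json

def make_column(name, bq_type, original_type, usage=("DEFAULT",)):
    """Build one column entry for the custom schema file."""
    return {
        "name": name,
        "originalName": name,
        "type": bq_type,
        "originalType": original_type,
        "usageType": list(usage),
    }

schema = {
    "databases": [{
        "name": "hr",
        "originalName": "hr",
        "tables": [{
            "name": "employees",
            "originalName": "employees",
            "columns": [
                make_column("ID", "INT64", "integer"),
                make_column("NAME", "STRING", "character"),
                make_column("UPDATE_DATE", "DATE", "date", usage=("PARTITIONING",)),
                make_column("DEPARTMENT_CODE", "STRING", "character"),
            ],
        }],
    }]
}

with open("schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```

Here the migrated names are kept identical to the source names; set name and originalName differently if you want to rename objects during migration.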
Incremental Updates
- To handle incremental updates, specify COMMIT_TIMESTAMP as the usageType; the supported data types for it are TIMESTAMP and DATE. Currently, only one column per table can carry this usage type. This column is used to extract rows created or updated since the last transfer run.
Partitioned Table
- If you have a partitioned table to migrate, specify PARTITIONING as the usageType; the supported data types for it are TIMESTAMP and DATE. This column is used in the partitioned table definition for the containing tables object.
Clustering
- To cluster the migrated table, specify CLUSTERING as the usageType. Up to four columns can be annotated with this usage type; the clustering order follows the order in which the columns appear in the custom schema.
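The limits above (one COMMIT_TIMESTAMP column per table, at most four CLUSTERING columns) can be checked before handing the file to the transfer service. A hypothetical validator sketch over one table entry of the custom schema:

```python
def check_usage_limits(table):
    """Count special usageType values in one table entry of the custom
    schema and enforce the limits described above (assumed from the text)."""
    counts = {}
    for col in table["columns"]:
        for usage in col["usageType"]:
            counts[usage] = counts.get(usage, 0) + 1
    errors = []
    if counts.get("COMMIT_TIMESTAMP", 0) > 1:
        errors.append("only one COMMIT_TIMESTAMP column is allowed per table")
    if counts.get("CLUSTERING", 0) > 4:
        errors.append("at most four CLUSTERING columns are allowed per table")
    return errors

table = {
    "columns": [
        {"name": "ID", "usageType": ["DEFAULT"]},
        {"name": "UPDATE_DATE", "usageType": ["PARTITIONING"]},
        {"name": "DEPARTMENT_CODE", "usageType": ["CLUSTERING"]},
    ]
}
print(check_usage_limits(table))  # [] means the table passes
```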
Initialize the agent
Switch back to the Teradata terminal. Initialize the BigQuery migration agent, which follows the form:
java -cp terajdbc4.jar:mirroring-agent.jar com.google.cloud.bigquery.dms.Agent --initialize
When prompted, configure the following options:
- Provide the directory for locally extracted files
- Provide localhost as the database hostname
- Provide the connection port, or press Enter to use the default port
- Type yes to use TPT to unload data
- Provide the path to a database credentials file, or press Enter to provide credentials at runtime. Sample contents of the credentials file:
username=dbc
password=dbc
- Provide the resource name copied from the BigQuery transfer details page as the config name
- Finally, provide the configuration file path
root@teradata-vantage-express:~/export_data# java -cp ~/mirroring-agent.jar com.google.cloud.bigquery.dms.Agent --initialize
Reading data from gs://data_transfer_agent/latest/version
Would you like to write the default TPT template to disk? (yes/no): yes
Enter path to write default TPT script template: tpt_template
Enter directory path for locally extracted files: /tmp
Enter Database hostname: localhost
Enter Database connection port (press enter to use default port):
Would you like to use Teradata Parallel Transporter (TPT) to unload data? Recommended - yes. (yes/no): yes
Enter database credentials file path:
Enter BigQuery Data Transfer Service config name (press enter to create new transfer config): projects/project-id/locations/us/transferConfigs/639641d3-0000-2061-9b3a-240588725d64
Enter configuration file path: config.json
Configuration file successfully created!
With the above details, the initialization process creates two files: tpt_template and config.json
Run the agent
java -cp ~/mirroring-agent.jar:terajdbc4.jar com.google.cloud.bigquery.dms.Agent --configuration-file=config.json
If prompted, provide the Teradata username and password.
If all the provided configurations are correct, the agent lists the specified tables and transfers them to the destination dataset in BigQuery.
Approach-2: Using TPT Export Utility + gsutil
In general, any data transfer can be categorized as an ETL or an ELT process. If you don’t want to perform any transformations on the source data, the BigQuery Data Transfer Service is the go-to option.
If your use case requires transformations or pre-processing of the source data, using Cloud Storage as a staging area and performing transformations via Dataflow is one possible solution.
Exporting Data out of Teradata
Teradata provides two data export utilities capable of sending data from a Teradata platform:
- BTEQ and
- FastExport
The Teradata Parallel Transporter (TPT) utility merges the functionality of FastLoad, MultiLoad, TPump, BTEQ, and FastExport into one utility.
Process
Define a TPT script
There are three components in the script:
- The EXPORT_DATA job definition. This includes the type of operator (Consumer_Opp in our case), the schema, and output attributes such as the output filename, delimiter, and mode.
- The schema for the table that needs to be exported.
- The Producer_Export operator. Here we define the source database credentials, Teradata process id, session info, etc. You provide the SelectStmt here, which can be modified for full-load as well as incremental-load exports.
Sample td_export.tpt file:
DEFINE JOB EXPORT_DATA
(
DEFINE OPERATOR Consumer_Opp
DESCRIPTION 'Consumer Operator'
TYPE DATACONNECTOR CONSUMER
SCHEMA *
ATTRIBUTES
(
VARCHAR FileName='employee.csv',
VARCHAR FORMAT= 'DELIMITED',
VARCHAR OpenMode='Write',
VARCHAR TextDelimiter = ','
);
DEFINE SCHEMA Employee_Schema
(
GlobalID INTEGER,
FirstName VARCHAR(30),
LastName VARCHAR(30),
DateOfBirth CHAR(10),
JoinedDate CHAR(10),
DepartmentCode BYTEINT
);
DEFINE OPERATOR Producer_Export
TYPE EXPORT
SCHEMA Employee_Schema
ATTRIBUTES
(
VARCHAR UserName='dbc',
VARCHAR UserPassword='dbc',
VARCHAR SelectStmt = 'select * from HR.EMPLOYEES;',
VARCHAR Tdpid='localhost',
INTEGER MaxSessions=5,
INTEGER minsessions=1,
VARCHAR DateForm='ANSIDATE'
);
APPLY
TO OPERATOR( Consumer_Opp )
SELECT * FROM OPERATOR( Producer_Export[2] );
);
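The SelectStmt attribute in the TPT script above is what distinguishes a full load from an incremental load. A small, hypothetical Python sketch of how you might render that statement from a watermark timestamp (the column and table names are just the samples used in this guide):

```python
def render_select(table, watermark_col=None, since=None):
    """Render the SelectStmt for the TPT script: a full load by default,
    an incremental load when a watermark column and timestamp are given."""
    stmt = f"select * from {table}"
    if watermark_col and since:
        # Only rows changed after the last successful export run.
        stmt += f" where {watermark_col} > TIMESTAMP '{since}'"
    return stmt + ";"

print(render_select("HR.EMPLOYEES"))
# select * from HR.EMPLOYEES;
print(render_select("HR.EMPLOYEES", "UPDATE_DATE", "2023-01-01 00:00:00"))
```

Storing the high-water mark of each run (for example, the max of the watermark column) is left to your orchestration layer.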
Run the script:
tbuild -f td_export.tpt
This outputs a CSV file on the local server.
Use gsutil to copy the local files to GCS:
gsutil cp *.csv gs://bucket-name/dest-folder
After this step, there can be two ways to load data into BigQuery.
Route-1: Load directly into BQ
To load data from a Cloud Storage source directly into BigQuery, use the bq load command, providing the source data format, the GCS file path, and an output schema file.
bq load \
--source_format=CSV \
td2bq.employees_gcs \
gs://bucket-name/dest-folder/employee.csv \
./employee_schema.json
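The employee_schema.json referenced above is a standard BigQuery JSON schema file. A sketch that writes one matching the Employee_Schema in the TPT script (the BigQuery types are assumptions; the two date fields are kept as STRING here for consistency with the Dataflow example later, but could be loaded as DATE if the values are in YYYY-MM-DD form):

```python
import json

# Assumed mapping from the TPT Employee_Schema to BigQuery column types.
fields = [
    ("GlobalID", "INTEGER"),
    ("FirstName", "STRING"),
    ("LastName", "STRING"),
    ("DateOfBirth", "STRING"),
    ("JoinedDate", "STRING"),
    ("DepartmentCode", "INTEGER"),
]

schema = [{"name": name, "type": bq_type, "mode": "NULLABLE"}
          for name, bq_type in fields]

with open("employee_schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```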
Route-2: Using Dataflow
If you need to perform any data cleaning or transformation on source data before loading into BigQuery, you can use Dataflow. Dataflow is a serverless, fast and cost-effective service that supports both stream and batch processing.
Process:
- Enable the Dataflow API
- Dataflow uses a Cloud Storage bucket as a staging location for temporary files. Create GCS folders for staging and temp data: gs://bucket-name/GCS_to_BQ_Dataflow/temp and gs://bucket-name/GCS_to_BQ_Dataflow/staging
- Go to the Google Cloud console and click Activate Cloud Shell
- Install the Apache Beam dependency:
sudo pip3 install apache_beam[gcp]
- Create quality_check.py as shown below:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
FILEPATH = 'gs://bucket-name/employee.csv'
def calculate_age(data):
from datetime import date, datetime
today = date.today()
dob = datetime.strptime(data['date_of_birth'], '%Y-%m-%d')
data['age'] = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
return data
def run():
with beam.Pipeline(options=PipelineOptions()) as pipeline:
(pipeline | 'Read Data' >> beam.io.ReadFromText(FILEPATH)
| beam.Map(lambda x: x.split(','))
| 'Convert to Dictionary' >> beam.Map(
lambda x: {"id": x[0], "first_name": x[1], "last_name": x[2], "date_of_birth": x[3],
"date_of_joining": x[4], "department_code": x[5]})
| 'CalculateAge' >> beam.Map(calculate_age)
| 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
project='project-id',
dataset='td2bq',
table='emp_data',
schema='id:INTEGER,first_name:STRING,last_name:STRING,date_of_birth:STRING,'
'date_of_joining:STRING,' 'department_code:INTEGER,age:INTEGER',
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
)
)
if __name__ == '__main__':
run()
The above is a sample Python script that reads data from the GCS bucket, populates an age field from date_of_birth, and writes the data to the destination BigQuery table with the new schema.
You can modify the script to apply data cleansing, transformations, validations, and other data quality checks.
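The age derivation inside calculate_age is easy to get wrong around birthdays, so it is worth checking in isolation before running the pipeline. The same logic, with an injectable "today" so it can be tested deterministically:

```python
from datetime import date, datetime

def calculate_age(data, today=None):
    """Same age logic as in quality_check.py, with 'today' injectable
    so the birthday edge case can be tested deterministically."""
    today = today or date.today()
    dob = datetime.strptime(data["date_of_birth"], "%Y-%m-%d")
    # Subtract one year when this year's birthday hasn't happened yet.
    data["age"] = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return data

# Birthday not yet reached this year: age should be 29, not 30.
row = calculate_age({"date_of_birth": "1990-06-15"}, today=date(2020, 6, 14))
print(row["age"])  # 29
```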
- Run the Dataflow Job
python3 quality_check.py \
--runner DataFlowRunner \
--project project-id \
--temp_location gs://bucket-name/GCS_to_BQ_Dataflow/temp \
--staging_location gs://bucket-name/GCS_to_BQ_Dataflow/staging \
--region us-central1 \
--job_name gcs2bq
In the Google Cloud console:
- Go to Dataflow
- Click Jobs
- Click the job name to open the Job details page and track the progress
On successful completion of the job, the table emp_data will be created under the dataset td2bq in BigQuery. Check the schema and preview the table to verify the correctness of the data.
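One quick sanity check is comparing the row count of the exported CSV with the row count of the destination table (e.g. via SELECT COUNT(*) in BigQuery). Counting the local file is simple; the BigQuery side needs live credentials, so only the local half is sketched here, with a tiny file standing in for employee.csv:

```python
import csv

def csv_row_count(path, has_header=False):
    """Count data rows in an exported CSV; the TPT export above writes no header."""
    with open(path, newline="") as f:
        rows = sum(1 for _ in csv.reader(f))
    return rows - 1 if has_header else rows

# A small local file standing in for the real employee.csv export:
with open("employee.csv", "w") as f:
    f.write("1,John,Doe,1990-06-15,2015-01-02,3\n")
    f.write("2,Jane,Roe,1988-03-01,2016-07-11,2\n")

print(csv_row_count("employee.csv"))  # 2
```

If the two counts diverge, inspect the Dataflow job logs for rows that failed parsing or insertion.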
Translate queries
With BigQuery, you can translate SQL queries in two different ways:
- Using batch SQL translation to migrate your SQL scripts in bulk
- Using interactive SQL translation to translate ad hoc queries
1. Using Batch SQL Translation
In the Google Cloud console:
- Go to the BigQuery page
- Under Migration, click SQL Translation
- Click Start Translation
Specify the translation configuration as follows:
In the Source details section, provide the location of the GCS folder containing the Teradata-compatible SQL scripts that need to be translated to BigQuery SQL.
Similarly, in the Target details section, provide the destination path for the output scripts.
In the Optional settings:
- You can provide a default database name, used when a source query does not specify one.
- You can also provide source-to-destination name mappings via a JSON file.
- Or, with manual input, you can provide name-mapping pairs.
Once you click Create, a translation job is created and triggered immediately. You can check the progress and translation details by clicking the job name.
Once the job completes, check the destination GCS path provided in the Target details section. If the job ran successfully for all the files, you’ll find the BigQuery-compatible SQL scripts in the output folder.
The batch SQL translator outputs the following files to the specified destination:
- The translated files.
- The translation summary report in CSV format.
- The consumed output name mapping in JSON format.
2. Interactive SQL Translation
In the Google Cloud console,
- Go to the BigQuery page
- Go to SQL Workspace
- In the query editor, click MORE and select Enable SQL Translation
- The window splits in two. In the left editor pane, click Translating from and select Teradata as the source data warehouse
- Paste the Teradata SQL query into the editor and click Translate to get the equivalent BigQuery Standard SQL query
Further Steps
- Migrate business applications
- Migrate data pipelines
- Optimize performance
- Verify and validate
If you enjoyed reading this post or found it useful in any way, please drop some claps and comments. I would love to hear your feedback and encouragement to write more on Data Engineering.
Feel free to engage and connect with me on LinkedIn