Enabling Teradata Connector for Hadoop (TDCH) on Google Cloud Dataproc Cluster
Mansi Maharana, Dinesh Hiripitiyage (Teradata)
Google Cloud Dataproc is a fully managed Google Cloud service that lets you take advantage of open source tools such as Apache Hadoop, Apache Spark, Presto, and many others for batch processing, querying, streaming, and machine learning. A common requirement is to connect Dataproc clusters to a Teradata database and move anywhere from small to very large amounts of data, depending on the workload. Teradata provides multiple connectivity options to Dataproc under the Teradata QueryGrid suite of data connectivity capabilities, including Teradata Connector for Hadoop (TDCH), which is the focus of this blog post.
TDCH is a Hadoop-based data integration tool that supports high-performance, parallel, bi-directional data movement and transformations between Teradata Vantage systems and various Hadoop ecosystem components. It provides a rich set of scalable built-in processing capabilities that let customers execute complex data integration processes at peak data volumes.
In this blog post we will learn how to install, configure, and use the TDCH software on Dataproc, and highlight compatibility considerations and best practices.
Versions
As a prerequisite, it’s important to know the supported versions of both TDCH and Dataproc. To support multiple varieties of Hadoop distributions, TDCH has five active branches: 1.5.x, 1.6.x, 1.7.x, 1.8.x, and 1.9.x.
Only two of these branches apply to Dataproc, as follows.
TDCH 1.5.12 is currently tested and certified to work on:
- Google Cloud Dataproc 1.4.x (Debian 10, Hadoop 2.9)
- Google Cloud Dataproc 1.4.x (Ubuntu 18.04 LTS, Hadoop 2.9)
- Google Cloud Dataproc 1.5.x (CentOS 8, Hadoop 2.10)
- Google Cloud Dataproc 1.5.x (Debian 10, Hadoop 2.10)
- Google Cloud Dataproc 1.5.x (Ubuntu 18.04 LTS, Hadoop 2.10)
TDCH 1.9.0 is currently tested and certified to work on:
- Google Cloud Dataproc 2.0.x (CentOS 8, Hadoop 3.2.2)
- Google Cloud Dataproc 2.0.x (Debian 10, Hadoop 3.2.2)
- Google Cloud Dataproc 2.0.x (Ubuntu 18.04 LTS, Hadoop 3.2.2)
The best way to determine compatibility is to read the “Google Cloud Dataproc Recommendations” section in the README file of the version you want to install, available here; you can also refer to the SUPPORTLIST link for more information.
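Before choosing a branch, it can also help to confirm which Dataproc image and Hadoop versions you are actually running. A minimal check, assuming the gcloud SDK is configured and using placeholder cluster and region names:
# On any cluster node: check the Hadoop version bundled with the image
hadoop version
# From any machine with gcloud configured: check the Dataproc image version
gcloud dataproc clusters describe <cluster_name> \
--region=<region> \
--format="value(config.softwareConfig.imageVersion)"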
Install instructions
NOTE: TDCH needs to be installed on all nodes of a Dataproc 1.x cluster. This requirement is unique to Dataproc 1.x only.
1. Download the Teradata connector rpm file for the version you want to install from here.
2. For Debian/Ubuntu,
> Use the following commands to convert the TDCH rpm file to a .deb package on Debian/Ubuntu platforms:
$ sudo apt-get install alien
$ sudo alien teradata-connector-<version>….rpm
> Copy the above generated .deb file (TERADATA_CONNECTOR_FILE) to a GCS bucket location (value of DOWNLOAD_LOCATION)
3. For CentOS,
> Copy the above rpm file (TERADATA_CONNECTOR_FILE) to a GCS bucket location (value of DOWNLOAD_LOCATION)
4. Copy the initialization action script from here to a GCS location. This is just a sample; modify it to meet your needs.
5. Use the following command to create the Dataproc cluster:
For Dataproc 1.x
REGION=<region>
CLUSTER=<cluster_name>
gcloud dataproc clusters create ${CLUSTER} \
--region=${REGION} \
--initialization-actions=gs://<bucket-name-with-path>/tdch.sh \
--enable-component-gateway \
--metadata=TERADATA_CONNECTOR_FILE="<name-of-deb-or-rpm-file>" \
--metadata=DOWNLOAD_LOCATION="<gcs-bucket-path-download-location>" \
--metadata=LIB_TO_BE_ADDED="file:/usr/lib/tdch/<version>/lib" \
--metadata=TEZ_XML_FILE="/etc/tez/conf/tez-site.xml"
For Dataproc 2.x and above
REGION=<region>
CLUSTER=<cluster_name>
gcloud dataproc clusters create ${CLUSTER} \
--region=${REGION} \
--initialization-actions=gs://<bucket-name-with-path>/tdch.sh \
--enable-component-gateway \
--metadata=TERADATA_CONNECTOR_FILE="<name-of-deb-or-rpm-file>" \
--metadata=DOWNLOAD_LOCATION="<gcs-bucket-path-download-location>"
To learn about using initialization actions in Dataproc, click here.
The above steps are mainly focused on installing TDCH during creation of the Dataproc cluster using the initialization action. You can also run steps 1 and 2 (or 3, depending on the OS) to install TDCH manually after the cluster is created, as shown below.
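For reference, a minimal sketch of that manual path on an existing cluster node follows, using placeholder file names; the .deb file generated by alien may be named slightly differently, so adjust accordingly. On Dataproc 1.x, repeat this on every node.
# Debian/Ubuntu nodes: convert the rpm and install the resulting .deb
sudo apt-get install -y alien
sudo alien teradata-connector-<version>.rpm
sudo dpkg -i teradata-connector-<version>.deb
# CentOS nodes: install the rpm directly
sudo rpm -ivh teradata-connector-<version>.rpm
# Either way, the TDCH jar should land under /usr/lib/tdch
ls /usr/lib/tdch/<version>/lib/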
Note:
- Step#2 of converting rpm to deb using alien can be done externally (from a remote host). The host can be Windows (WSL), Mac, Linux, etc.
- The alien package may not be available on all platforms, so use your discretion when selecting an appropriate external host platform.
- If more automation is desired for the conversion, instead of converting the rpm to deb and uploading the deb file to the GCS bucket, you can upload the rpm directly to the GCS bucket and modify the initialization script to take care of the rpm-to-deb conversion based on the OS, as sketched below.
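To illustrate that note, here is a rough sketch of what such an initialization action could look like. It is not the official sample script (which also handles the Dataproc 1.x-specific metadata such as LIB_TO_BE_ADDED and TEZ_XML_FILE); it only shows download-and-install logic driven by the TERADATA_CONNECTOR_FILE and DOWNLOAD_LOCATION metadata keys, and assumes DOWNLOAD_LOCATION does not end with a slash.
#!/bin/bash
# Illustrative sketch only; adapt to your environment before use.
set -euxo pipefail
# Read the values passed via --metadata at cluster creation time.
CONNECTOR_FILE=$(/usr/share/google/get_metadata_value attributes/TERADATA_CONNECTOR_FILE)
DOWNLOAD_LOCATION=$(/usr/share/google/get_metadata_value attributes/DOWNLOAD_LOCATION)
# Pull the package from the GCS bucket onto the node.
gsutil cp "${DOWNLOAD_LOCATION}/${CONNECTOR_FILE}" /tmp/
# Install according to the package type and the OS of the image.
source /etc/os-release
if [[ "${CONNECTOR_FILE}" == *.deb ]]; then
  dpkg -i "/tmp/${CONNECTOR_FILE}"
elif [[ "${ID}" == "centos" ]]; then
  rpm -ivh "/tmp/${CONNECTOR_FILE}"
else
  # Debian/Ubuntu image given an rpm: convert and install in one step.
  apt-get install -y alien
  alien -i "/tmp/${CONNECTOR_FILE}"
fi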
Executing a TDCH Job on Dataproc
1. The first step is to set up the environment variables on the job submitter node. The dependent jar versions should align with the versions available on the Dataproc cluster.
Note: [file://] prefix is only needed when submitting the job from a remote host.
export TDCH_JAR=/usr/lib/tdch/<version>/lib/teradata-connector-<version>.jar
export HADOOP_HOME=[file://]/usr/lib/hadoop
export HIVE_HOME=[file://]/usr/lib/hive
export HCAT_HOME=[file://]/usr/lib/hive-hcatalog
export HADOOP_CLASSPATH=$HIVE_HOME/conf:/etc/tez/conf:$HIVE_HOME/lib/antlr-runtime-<version-on-cluster>.jar:$HIVE_HOME/lib/commons-dbcp-<version-on-cluster>.jar:$HIVE_HOME/lib/commons-pool-<version-on-cluster>.jar:$HIVE_HOME/lib/datanucleus-core-<version-on-cluster>.jar:$HIVE_HOME/lib/datanucleus-rdbms-<version-on-cluster>.jar:$HIVE_HOME/lib/hive-cli-<version-on-cluster>.jar:$HIVE_HOME/lib/hive-exec-<version-on-cluster>.jar:$HIVE_HOME/lib/hive-metastore-<version-on-cluster>.jar:$HIVE_HOME/lib/jdo-api-<version-on-cluster>.jar:$HIVE_HOME/lib/libfb303-<version-on-cluster>.jar:$HIVE_HOME/lib/libthrift-<version-on-cluster>.jar:$HIVE_HOME/lib/lz4-<version-on-cluster>.jar:$HCAT_HOME/share/hcatalog/hive-hcatalog-core-<version-on-cluster>.jar:$HADOOP_HOME/lib/hadoop-lzo-<version-on-cluster>.jar:[file://]/usr/lib/tdch/<version>/lib/tdgssconfig.jar:[file://]/usr/lib/tdch/<version>/lib/terajdbc4.jar:[file://]/usr/lib/spark/jars/avro-mapred-<version-on-cluster>-hadoop2.jar:[file://]/usr/lib/spark/jars/paranamer-<version-on-cluster>.jar:[file://]/usr/lib/tez/tez-api-<version-on-cluster>.jar
export LIB_JARS=$HIVE_HOME/lib/hive-cli-<version-on-cluster>.jar,$HIVE_HOME/lib/hive-exec-<version-on-cluster>.jar,$HIVE_HOME/lib/hive-metastore-<version-on-cluster>.jar,$HIVE_HOME/lib/libfb303-<version-on-cluster>.jar,$HIVE_HOME/lib/libthrift-<version-on-cluster>.jar,$HIVE_HOME/lib/jdo2-api-<version-on-cluster>.jar,[file://]/usr/lib/tdch/<version>/lib/tdgssconfig.jar,[file://]/usr/lib/tdch/<version>/lib/terajdbc4.jar
Note that other dependent libraries may be needed based on your requirements. You can find those on the cluster and add them to HADOOP_CLASSPATH or LIB_JARS if they are runtime dependencies.
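One way to fill in the <version-on-cluster> placeholders is to list the jars that ship with the image on any cluster node, for example:
# Hive jars referenced in HADOOP_CLASSPATH and LIB_JARS
ls /usr/lib/hive/lib/ | grep -E 'hive-(cli|exec|metastore)|libthrift|libfb303'
# Tez, Spark, and Hadoop jars referenced above
ls /usr/lib/tez/ | grep tez-api
ls /usr/lib/spark/jars/ | grep -E 'avro-mapred|paranamer'
ls /usr/lib/hadoop/lib/ | grep hadoop-lzo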
2. Sample command to submit a TDCH job
Teradata Connector for Hadoop supports the following methods of data transfers:
- ConnectorImportTool (Teradata to HDFS/Hive/HCat)
- ConnectorExportTool (HDFS/Hive/HCat to Teradata)
To run it within a Dataproc VM,
hadoop jar $TDCH_JAR \
com.teradata.connector.common.tool.ConnectorImportTool \
-classname com.teradata.jdbc.TeraDriver \
-url jdbc:teradata://sampledbserver/DATABASE=sampledb \
-username sampleuser \
-password samplepwd \
-jobtype hive \
-fileformat parquet \
-sourcetable db.customer \
-sourcefieldnames "id,acctnum,…" \
-targettable db.cust \
-targetfieldnames "id,acctnum,…"
To run it using gcloud or from another remote node,
gcloud dataproc jobs submit hadoop \
--cluster=<cluster-name> \
--region <region> \
--class=com.teradata.connector.common.tool.ConnectorImportTool \
--jars=$TDCH_JAR,$HADOOP_CLASSPATH \
-- -libjars $LIB_JARS \
-classname com.teradata.jdbc.TeraDriver \
-url jdbc:teradata://sampledbserver/DATABASE=sampledb \
-username sampleuser \
-password samplepwd \
-jobtype hive \
-fileformat parquet \
-sourcetable db.customer \
-sourcefieldnames "id,acctnum,.." \
-targettable db.cust \
-targetfieldnames "id,acctnum,…"
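The commands above import from Teradata into Hive. The export direction (Hive/HDFS to Teradata) is analogous; below is a sketch using ConnectorExportTool from within a Dataproc VM, with the same placeholder connection details and hypothetical source and target tables.
hadoop jar $TDCH_JAR \
com.teradata.connector.common.tool.ConnectorExportTool \
-classname com.teradata.jdbc.TeraDriver \
-url jdbc:teradata://sampledbserver/DATABASE=sampledb \
-username sampleuser \
-password samplepwd \
-jobtype hive \
-fileformat parquet \
-sourcetable db.cust \
-sourcefieldnames "id,acctnum" \
-targettable db.customer \
-targetfieldnames "id,acctnum"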
To learn more about the connector and available features and functionalities refer to the connector README available at this link as well as the TDCH Tutorial available at this link.
Considerations for Running TDCH in Google Cloud
A more comprehensive list of considerations is available in section#9 of the README at this link. Here we highlight a few key items to be aware of:
- Specifically for Google Cloud,
> The firewall rules applied to the cluster’s VM instances must allow both ingress and egress traffic on port 1025 to the Teradata database (see the example after this list)
> If the internal.fastexport and/or internal.fastload protocols are used, their default ports 8678 and 65535 also need to be open for ingress and egress
- TDCH needs to be installed on all nodes of a Dataproc 1.x cluster. This requirement is unique to only Dataproc 1.x.
- For production usage, before creating clusters, it is strongly recommended that you copy initialization actions to your own Cloud Storage bucket to guarantee consistent use of the same initialization action code across all Dataproc cluster nodes and to prevent unintended upgrades from upstream in the cluster. Refer here.
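As an illustration of the firewall point above, an egress rule along the following lines could be added to the cluster’s VPC network; the rule name, network, target tag, and Teradata address range are placeholders for your environment.
# Hypothetical example: allow egress from tagged Dataproc VMs to Teradata on port 1025
gcloud compute firewall-rules create allow-dataproc-to-teradata \
--network=<vpc-network> \
--direction=EGRESS \
--action=ALLOW \
--rules=tcp:1025 \
--destination-ranges=<teradata-ip-range> \
--target-tags=<dataproc-network-tag>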
Wrap up
While most of the above steps can be performed manually, using the Dataproc initialization action lowers the maintenance overhead, especially for ephemeral Dataproc clusters.
Special thanks to Mohan Talla from Teradata for all the help and support.
Hope you find this technical guidance helpful!