Batch Processing of Data from MySQL to HDFS Using an Oozie Workflow

Rathnaguru Venkatachalapathy
3 min read · Apr 6, 2020


I was faced with the challenge of automating data collection from a MySQL server into HDFS using an Oozie workflow, and I was puzzled about which method to use. After referring to multiple sources on the Internet, I still could not find the right procedure: the information was scattered across many places rather than in a single source. So I wrote this blog on how to fetch data from MySQL into HDFS through an Oozie workflow, using a Sqoop action with incremental append.

Steps:

1. Install the Sqoop metastore using the below command in a terminal.

a. Command: yum install sqoop-metastore

2. Once it is installed, check the following path to confirm the sqoop-metastore binary is present.

a. Path: /usr/bin/sqoop-metastore

3. Then start the Sqoop metastore service.

a. Command: sqoop-metastore start

4. Now modify sqoop-site.xml so that the centralized metastore is used to store the last incremental value.

a. Path: /etc/sqoop/conf/sqoop-site.xml

5. Modify the metastore-related properties in this file accordingly.
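A minimal sketch of the relevant sqoop-site.xml properties (the property names come from the stock sqoop-site.xml template; the host and port are assumptions matching the default metastore on port 16000, so adjust them for your environment):

<configuration>
  <!-- Automatically connect clients to the shared metastore -->
  <property>
    <name>sqoop.metastore.client.enable.autoconnect</name>
    <value>true</value>
  </property>
  <!-- JDBC URL of the shared HSQLDB metastore -->
  <property>
    <name>sqoop.metastore.client.autoconnect.url</name>
    <value>jdbc:hsqldb:hsql://localhost:16000/sqoop</value>
  </property>
</configuration>

There is also a sqoop.metastore.client.record.password property that, when set to true, saves passwords in the metastore; this post instead uses a password file in HDFS, as shown in step 6.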

6. Create a sqoop job now with incremental import

a. Command:

sqoop job --meta-connect jdbc:hsqldb:hsql://localhost:16000/sqoop \
  --create hiveincr \
  -- import \
  --connect jdbc:mysql://xxx/indschema \
  --username testdev --password testdev \
  --table salesdata \
  --check-column recordid \
  --incremental append --last-value 0 \
  --target-dir /user/cloudera/salesdata \
  --fields-terminated-by ','

b. The above command will execute perfectly the first time, but it will fail on subsequent runs because the password is not stored in the metastore.

c. Error: SQL Connection: Communication failure

d. So we will create a password file in HDFS (containing only the password) and point the job at it instead.
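One way to create that file (the local filename and permissions here are illustrative; printf is used because a trailing newline from echo would be read as part of the password):

printf 'testdev' > pwd.password
hdfs dfs -put pwd.password /user/cloudera/pwd.password
hdfs dfs -chmod 400 /user/cloudera/pwd.password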

e. Command:

sqoop job --meta-connect jdbc:hsqldb:hsql://localhost:16000/sqoop \
  --create hiveincr \
  -- import \
  --connect jdbc:mysql://xxx/indschema \
  --username testdev --password-file hdfs:///user/cloudera/pwd.password \
  --table salesdata \
  --check-column recordid \
  --incremental append --last-value 0 \
  --target-dir /user/cloudera/salesdata \
  --fields-terminated-by ','

xxx -> the hostname of the machine where the MySQL server is installed
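With the job saved in the metastore, it can be listed and executed as follows; each --exec run updates the stored last-value, so the next run imports only new rows:

sqoop job --meta-connect jdbc:hsqldb:hsql://localhost:16000/sqoop --list
sqoop job --meta-connect jdbc:hsqldb:hsql://localhost:16000/sqoop --exec hiveincr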

7. Then create an Oozie workflow with a Sqoop action that passes the saved job as its arguments.
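A minimal workflow.xml sketch for that step (the workflow and action names are illustrative; ${jobTracker} and ${nameNode} are supplied through job.properties):

<workflow-app name="mysql-to-hdfs-incr" xmlns="uri:oozie:workflow:0.4">
  <start to="sqoop-incr"/>
  <action name="sqoop-incr">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- Equivalent to: sqoop job --meta-connect <url> --exec hiveincr -->
      <arg>job</arg>
      <arg>--meta-connect</arg>
      <arg>jdbc:hsqldb:hsql://localhost:16000/sqoop</arg>
      <arg>--exec</arg>
      <arg>hiveincr</arg>
    </sqoop>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Sqoop job failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>

The password file in HDFS is what makes this unattended execution possible, since no interactive password prompt is available inside an Oozie launcher.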

References:

1. Apache Oozie by Mohammad Kamrul Islam and Aravind Srinivasan

2. Hadoop Real World Solutions Cookbook by Tanmay Deshpande

3. https://issues.apache.org/jira/projects/SQOOP/issues/SQOOP-2851

Thanks for reading!!
