Batch Processing of data from MYSQL to HDFS using Oozie workflow
Encountered with a challenge of automating the process of data collection from MySQL server to HDFS using Oozie workflow. I was puzzled as to which method to implement for achieving this goal. After referring to multiple sources on the Internet, I was unsuccessful in finding the right procedure as the information was sprinkled all over and not in a single source. So, I came up writing this blog on how to fetch data from MYSQL to HDFS using Oozie workflow with sqoop action using incremental append.
Steps:
1. Install Sqoop metastore using the below command in command prompt.
a. Command: yum install sqoop-metastore
2. Once you installed, please check the following folder to find the sqoop-metastore
a. Path: /usr/bin/sqoop-metastore
3. Then start the sqoop metastore service
a. Command: sqoop-metastore start
4. Now you have to modify the sqoop-site.xml to enable as centralized meta store to store the last incremental value.
a. Path: /etc/sqoop/conf/sqoop-site.xml
5. Modify the entire structure based on the following
6. Create a sqoop job now with incremental import
a. Command:
sqoop job — meta-connect jdbc:hsqldb:hsql://localhost:16000/sqoop
— create hiveincr — import — connect
jdbc:mysql://xxx/indschema — username testdev — password testdev
— table salesdata — check-column
recordid — incremental append — last-value 0 — target-dir
/user/cloudera/salesdata — fields-terminated-by “,”
b. The above command will execute perfectly for the first time but however it will throw an error for concurrent times since the password won’t be stored.
c. Error: SQL Connection: Communication failure
d. So we will create a password file (password in it) in hdfs and configure the file in the job
e. Command:
sqoop job — meta-connect jdbc:hsqldb:hsql://localhost:16000/sqoop
— create hiveincr — import — connect
jdbc:mysql://xxx/indschema — username testdev — password-file
hdfs:///user/cloudera/pwd.password — table salesdata — check-column
recordid — incremental append — last-value 0 — target-dir
/user/cloudera/salesdata — fields-terminated-by “,”
xxx-> Indicate the localhost where MYSQL server installed
7. Then create a sqoop workflow with arguments
References:
1. Apache Oozie By Mohammad Kamrul Islam and Aravind Srinivasan
2. Hadoop Real World Solutions Cookbook by Tanmay Deshpande
3. https://issues.apache.org/jira/projects/SQOOP/issues/SQOOP-2851
Thanks for reading!!