Sqoop scenarios and options

Ankit Gupta
8 min read · Apr 21, 2020


As part of modern-day big data architectures, it has become imperative to move data from an RDBMS to the Hadoop Distributed File System (HDFS) and vice versa.

RDBMS and Data Lake data transfers via Sqoop

As part of the data lake creation, we will look at getting data from various source systems like Oracle, Teradata, etc.
We will also look at major issues faced while getting data from an RDBMS, and how to handle them while writing your first Sqoop program.

This article will also discuss writing data to HDFS and Hive.

1. Getting data from Oracle and writing in HDFS

sqoop import \
--connect "jdbc:oracle:thin:@<hostname>:<port number>/<service id>"\
--username "<username>" \
--password "<password>" \
--table "<schema name>.<table name>" \
--fields-terminated-by '<delimiter>' \
--lines-terminated-by "\n" \
--target-dir "<hdfs path>" \
--verbose

connect : JDBC Connection string
username : DB username
password : Supply the password as part of the Sqoop command
Note: Not advisable, as the password will be visible in the job execution logs
Probable Solution: Save the password in a file on the edge node or on HDFS and read it at runtime with --password-file (see the sketch after this list)
table : Table to be imported
fields-terminated-by : Specify the field delimiter, e.g. , | or :. By default data is imported into HDFS as comma-separated values. This is one of the output formatting arguments that describe how the data is written to HDFS
lines-terminated-by : Delimiter to distinguish lines
target-dir : HDFS destination directory
verbose : to enable detailed logging
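
A minimal sketch of the password-file approach, assuming a hypothetical HDFS path /user/<username>/.oracle_password that only the job user can read; --password-file tells Sqoop to read the password from that file at runtime:

# store the password once (echo -n avoids a trailing newline) and restrict access to the owner
echo -n "<password>" | hdfs dfs -put - /user/<username>/.oracle_password
hdfs dfs -chmod 400 /user/<username>/.oracle_password

sqoop import \
--connect "jdbc:oracle:thin:@<hostname>:<port number>/<service id>" \
--username "<username>" \
--password-file /user/<username>/.oracle_password \
--table "<schema name>.<table name>" \
--target-dir "<hdfs path>" \
--verbose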

2. Getting data from Oracle and writing in HDFS with the password entered at a prompt and for specific columns

sqoop import \
--connect "jdbc:oracle:thin:@<hostname>:<port number>/<service id>"\
--username "<username>" \
-P \
--table "<schema name>.<table name>" \
--columns "<<column1>>,<<column2>>" \
--fields-terminated-by '<delimiter>' \
--lines-terminated-by "\n" \
--target-dir "<hdfs path>" \
--verbose
Enter Password : ...

P : Prompt for the password on the console
columns : To get only specific columns from a table

3. Getting data from Oracle and writing in HDFS while removing special characters like \n, \r, and \01

sqoop import \
--connect "jdbc:oracle:thin:@<hostname>:<port number>/<service id>"\
--username "<username>" \
-P \
--table "<schema name>.<table name>" \
--columns "<<column1>>,<<column2>>,<<column3>>,<<column4>>" \
--fields-terminated-by '<delimiter>' \
--lines-terminated-by "\n" \
--target-dir "<hdfs path>" \
--hive-delims-replacement '|' \
--map-column-java <<column3>>=String \
--verbose

hive-delims-replacement : Replace Hive's special characters \n, \r, and \01 with a user-defined string (| in the above example)
map-column-java : Override the default mapping from SQL type to Java type for a column; here the column that needs delimiter replacement is mapped to String
hive-drop-import-delims : Use this option if you want to drop the delimiters instead of replacing them (see the variant after this list)
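
For comparison, a minimal variant of the command above that drops the delimiter characters instead of replacing them:

sqoop import \
--connect "jdbc:oracle:thin:@<hostname>:<port number>/<service id>" \
--username "<username>" \
-P \
--table "<schema name>.<table name>" \
--columns "<<column1>>,<<column2>>,<<column3>>,<<column4>>" \
--fields-terminated-by '<delimiter>' \
--lines-terminated-by "\n" \
--target-dir "<hdfs path>" \
--hive-drop-import-delims \
--map-column-java <<column3>>=String \
--verbose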

4. Getting data from Oracle and writing in HDFS with no primary key defined on source table

Note: Sqoop expects a primary key (or an explicit split-by column) on the table being imported when more than one mapper is used

Issue: The error “No primary key could be found for table” occurs during import if no primary key is defined

Workaround : Use the option below to overcome the error if no key is defined on the source table

-m 1

Note: This sets the number of mappers to 1, so there is no need to define a primary key on the table (the import runs as a single, unsplit task)

sqoop import \
--connect "jdbc:oracle:thin:@<hostname>:<port number>/<service id>"\
--username "<username>" \
-P \
--table "<schema name>.<table name>" \
--columns "<<column1>>,<<column2>>,<<column3>>,<<column4>>" \
--fields-terminated-by '<delimiter>' \
--lines-terminated-by "\n" \
--target-dir "<hdfs path>" \
--delete-target-dir \
--hive-delims-replacement '|' \
--map-column-java <<column3>>=String \
--optionally-enclosed-by '\"' \
--escaped-by '\' \
-m 1 \
--verbose

optionally-enclosed-by : Fields written to HDFS are enclosed in double quotes when required (for example, when a field contains the delimiter)
escaped-by : If a field contains double quotes in the data, those quotes are escaped with a backslash
m, num-mappers : Number of mappers
delete-target-dir : Delete the import target directory if it exists
append : Append data to an existing dataset in HDFS. Use this instead of delete-target-dir when previously imported data must be kept (see the sketch after this list)
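
A minimal sketch of the append alternative, which keeps the files already in the target directory and adds the newly imported files alongside them:

sqoop import \
--connect "jdbc:oracle:thin:@<hostname>:<port number>/<service id>" \
--username "<username>" \
-P \
--table "<schema name>.<table name>" \
--fields-terminated-by '<delimiter>' \
--lines-terminated-by "\n" \
--target-dir "<hdfs path>" \
--append \
-m 1 \
--verbose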

5. Getting data from Oracle and writing in HDFS with a custom query

sqoop import \
--connect "jdbc:oracle:thin:@<hostname>:<port number>/<service id>"\
--username "<username>" \
--password "<password>" \
--query "select * from sample_data where \$CONDITIONS and id>20" \
--fields-terminated-by '<delimiter>' \
--lines-terminated-by "\n" \
--target-dir "<hdfs path>" \
--delete-target-dir \
--fetch-size "50000" \
--split-by "id" \
-m 10 \
--verbose

query : Provide a free-form query to the database to get the relevant data. The \$CONDITIONS placeholder is mandatory in the WHERE clause
split-by : Mandatory when importing the results of a free-form query with more than one mapper (here the id column from the query)
fetch-size : Number of records to read from the database at once. This improves performance for batch loads

Justification : Sqoop needs to access the table metadata, for example column type information. The placeholder $CONDITIONS is initially set to ‘1 = 0’ to ensure that Sqoop receives only type information. So, after executing the Sqoop command, you will see that the first query fired uses this default $CONDITIONS. Later on, it is substituted with different values defining different ranges based on the number of mappers (-m), the --split-by column, or the --boundary-query, so that the entire data set can be divided into slices (chunks) and the chunks can be imported in parallel with as much concurrency as is available. Sqoop automatically substitutes this placeholder with generated conditions specifying which slice of data should be transferred by each individual task. A sketch of the queries this produces follows.
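
For illustration only (the exact SQL Sqoop generates depends on the connector, the split column, and the data), assume the query above is run with --split-by "id" and -m 4, and that id ranges from 21 to 100 in sample_data. The sequence of queries would look roughly like this:

-- metadata-only run: $CONDITIONS is replaced with (1 = 0), so no rows are returned
select * from sample_data where (1 = 0) and id>20

-- boundary query to find the range of the split column
SELECT MIN(id), MAX(id) FROM sample_data WHERE id>20

-- each mapper then runs the query with $CONDITIONS replaced by its own slice of that range
select * from sample_data where ( id >= 21 ) AND ( id < 41 ) and id>20
select * from sample_data where ( id >= 41 ) AND ( id < 61 ) and id>20
select * from sample_data where ( id >= 61 ) AND ( id < 81 ) and id>20
select * from sample_data where ( id >= 81 ) AND ( id <= 100 ) and id>20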

6. Getting data from Oracle and writing in HDFS with parallelism and specific queue

Note: split-by specifies the column of the table used to split work units. It cannot be used together with the --autoreset-to-one-mapper option.

Additional points for the split-by column:

  1. The column should be indexed
  2. If there is no index, each thread has to do a full table scan
  3. If it is indexed, Sqoop will search the index and find the data
  4. It should not have null values
  5. If there are null values, Sqoop ignores those rows
  6. split-by can also be used on non-PK columns (for example, order_status in the orders table). To split data on a non-PK text column, we should use the additional property below (used in the script that follows):
    "-Dorg.apache.sqoop.splitter.allow_text_splitter=true"

Disadvantage : When splitting on a non-PK column, the system sometimes generates a huge number of files

The script below imports the “order_items_nopk” table into the warehouse-dir using 4 files (the default is --num-mappers 4) that end up with different file sizes (skewed), because there is no primary key on this table and hence no index on the column “order_item_order_id”. To avoid this, --split-by can be used, so all the data is copied into 4 files of even sizes.

sqoop import \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
–-connect "jdbc:oracle:thin:@<hostname>:<port number>/<service id>" \
--username "<username>" \
--password "<password>" \
–-table "order_items_nopk" \
–-warehouse-dir /user/sqoop/sqoop_import/retail_db \
–-split-by order_item_order_id

Dorg.apache.sqoop.splitter.allow_text_splitter : This property must be set to avoid job failure when the split column is a text (character) column, for example when a non-PK text column is used in split-by, or when no split-by is mentioned and the table has no numeric PK column to split on

Another example

sqoop import \
-Dmapred.job.queue.name=<Queue Name> \
--connect "jdbc:oracle:thin:@<hostname>:<port number>/<service id>"\
--username "<username>" \
-P \
--table "<schema name>.<table name>" \
--columns "<<column1>>,<<column2>>,<<column3>>,<<column4>>" \
--fields-terminated-by '<delimiter>' \
--lines-terminated-by "\n" \
--target-dir "<hdfs path>" \
--delete-target-dir \
--hive-delims-replacement '|' \
--map-column-java <<column3>>=String \
--optionally-enclosed-by '\"' \
--escaped-by '\' \
--split-by "PK Column(e.g. ITEM_NBR)" \
--boundary-query "SELECT MIN(ITEM_NBR), MAX(ITEM_NBR) FROM <schema name>.<table name>" \
--num-mappers 10 \
--verbose

Dmapred.job.queue.name : Specify the queue to be used to execute the job
split-by : Usually choose a column which is a primary key of a table, indexed and not null
boundary-query : To choose the minimum and maximum value for split-by column and divide the data equally into mappers. Used to create splits

7. Handling data issues when importing null values from an Oracle table

sqoop import \
-Dmapred.job.queue.name=<Queue Name> \
--connect "jdbc:oracle:thin:@<hostname>:<port number>/<service id>"\
--username "<username>" \
-P \
--table "<schema name>.<table name>" \
--columns "<<column1>>,<<column2>>,<<column3>>,<<column4>>" \
--fields-terminated-by '<delimiter>' \
--lines-terminated-by "\n" \
--target-dir "<hdfs path>" \
--delete-target-dir \
--hive-delims-replacement '|' \
--map-column-java <<column3>>=String \
--optionally-enclosed-by '\"' \
--escaped-by '\' \
--split-by "PK Column(e.g. ITEM_NBR)" \
--boundary-query "SELECT MIN(ITEM_NBR), MAX(ITEM_NBR) FROM <schema name>.<table name>"
--null-string '\\\\N' \
--null-non-string '\\\\N' \
--num-mappers 10 \
--verbose

null-string : Hive stores a NULL value as \N in HDFS, and it is read back properly as NULL when the table is queried. If this option is not provided, NULL is stored as the string “null” and queries do not return correct results
null-non-string : Same as above, but for non-string columns

8. Saving data in HDFS in different file formats and with compression

Note: By default Sqoop imports data as plain text; alternatively, you can explicitly mention the option as-textfile

sqoop import \
-Dmapred.job.queue.name=<Queue Name> \
--connect "jdbc:oracle:thin:@<hostname>:<port number>/<service id>"\
--username "<username>" \
-P \
--table "<schema name>.<table name>" \
--columns "<<column1>>,<<column2>>,<<column3>>,<<column4>>" \
--fields-terminated-by '<delimiter>' \
--lines-terminated-by "\n" \
--target-dir "<hdfs path>" \
--delete-target-dir \
--as-parquetfile \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--hive-delims-replacement '|' \
--map-column-java <<column3>>=String \
--optionally-enclosed-by '\"' \
--escaped-by '\' \
--split-by "PK Column(e.g. ITEM_NBR)" \
--boundary-query "SELECT MIN(ITEM_NBR), MAX(ITEM_NBR) FROM <schema name>.<table name>"
--null-string '\\\\N' \
--null-non-string '\\\\N' \
--num-mappers 10 \
--verbose

as-parquetfile : Save the data as parquet file
as-avrodatafile : Save the data as avro file
as-sequencefile : Save the data as sequence file
as-textfile : Save as plain text(default)
compress : To enable compression of data
compression-codec : Use a Hadoop codec (defaults to gzip)

9. Save imported data directly as a hive table

sqoop import \
--connect "jdbc:oracle:thin:@<hostname>:<port number>/<service id>"\
--username "<username>" \
--password "<password>" \
--table "<<source schema name>>.<<source database table>>" \
--hive-import \
--hive-table <<target schema name>>.<<target database table>> \
-m 1

hive-import : Import table into hive (Uses Hive’s default delimiters if none are set.)
hive-table : Specify schema name and table name
hive-overwrite : Overwrite existing data in the Hive table
create-hive-table : The job will fail if the target Hive table already exists. By default it is set to false
hive-drop-import-delims : Drops \n, \r, and \01 from string fields when importing to Hive
hive-delims-replacement : Replace \n, \r, and \01 from string fields with user defined string when importing to Hive
hive-partition-key : Define the partition key of the Hive table (see the sketch after this list)
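
A minimal sketch of a partitioned Hive import, assuming a hypothetical static partition column named load_dt in the target table; hive-partition-key names the partition column and hive-partition-value supplies its value for this load:

sqoop import \
--connect "jdbc:oracle:thin:@<hostname>:<port number>/<service id>" \
--username "<username>" \
--password "<password>" \
--table "<<source schema name>>.<<source database table>>" \
--hive-import \
--hive-table <<target schema name>>.<<target database table>> \
--hive-partition-key load_dt \
--hive-partition-value "2020-04-21" \
-m 1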

10. Getting data from Teradata and writing in HDFS

sqoop import \
-libjars /usr/lib/sqoop/lib/tdgssconfig.jar,/usr/lib/sqoop/lib/terajdbc4.jar \
-Dmapred.job.queue.name=xxxxxx \
--connect jdbc:teradata://xxx.xx.xxx.xx/DATABASE=$db,LOGMECH=LDAP \
--connection-manager org.apache.sqoop.teradata.TeradataConnManager \
--username "<username>" \
--password "<password>" \
--query "<query>"

libjars : JARs needed to connect to Teradata
connection-manager : Connection manager class to be used for Teradata

That's all for now, fellow learners.

This is my first post on any blogging site. I hope everyone enjoys it.
Happy learning!!
