Data Ingestion with Apache SQOOP

Shaharyar Azam
2 min read · Aug 14, 2020


Data Movement from RDBMS to Hadoop Distributed File System (HDFS) and vice versa

OVERVIEW

Moving enormous amounts of data is an integral part of building a data lake or a data warehouse (DWH). Numerous tools are available to transfer data from one source to another. While there are tools for real-time data transfer, migrating historical and legacy data calls for a tool that can easily move terabytes of data.

Apache SQOOP clearly excels at ingesting terabytes of data from an RDBMS into the Hadoop Distributed File System (HDFS) and vice versa.

ARCHITECTURE

Apache SQOOP is built on the MapReduce architecture, which distributes the data-movement task across multiple worker nodes and speeds up ingestion. SQOOP writes the data as files in HDFS.

Example of distributing a task across 4 mappers, producing 4 files in the Hadoop Distributed File System (HDFS).
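As a sketch of what this looks like on disk: a map-only SQOOP job with 4 mappers writes one output file per mapper in the target directory, following Hadoop's `part-m-NNNNN` naming convention. The loop below simply prints those four file names:

```shell
# Each mapper writes its own output file; with -m 4 the target directory
# ends up holding four part files. This loop prints the naming convention:
for i in $(seq 0 3); do
  printf 'part-m-%05d\n' "$i"
done
```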

Let's do some practice

SQOOP import command to ingest data from an RDBMS (Oracle) into the Hadoop Distributed File System (HDFS)

sqoop import \

This part is straightforward: the sqoop keyword followed by the import tool.

--connect jdbc:oracle:thin:@server:port/service \

Here we give the JDBC connection string of the source: server IP, port, service name, etc.

-m 4 \

Here, we specify the number of mappers (4 by default), which splits our task into that many parallel map tasks.

--username your_user \
--password your_password \

Source credentials (username and password). To avoid a plain-text password on the command line, SQOOP also supports -P (prompt at runtime) and --password-file.

--split-by primary_key \

Here, we specify the column of the source table (typically the primary key) on which the data is split into ranges, one per mapper. Remember that by default the split column should be a numeric type.
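To make the splitting concrete, here is a minimal sketch with assumed boundary values (1 to 1000): SQOOP first finds the minimum and maximum of the split column, then hands each of the m mappers roughly one m-th of that range.

```shell
# Assumed boundaries for illustration; SQOOP obtains the real ones by
# running SELECT MIN(primary_key), MAX(primary_key) against the table.
min=1; max=1000; mappers=4
size=$(( (max - min + 1) / mappers ))
for i in $(seq 0 $(( mappers - 1 ))); do
  lo=$(( min + i * size ))
  hi=$(( lo + size - 1 ))
  # The last mapper takes any remainder up to the maximum value.
  if [ "$i" -eq $(( mappers - 1 )) ]; then hi=$max; fi
  echo "mapper $i: rows $lo to $hi"
done
```

Each mapper then imports only the rows whose split-column value falls in its range, which is why the four output files can be written in parallel.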

--as-textfile \

It is optional and controls the format in which your data is saved; alternatives include --as-sequencefile, --as-avrodatafile, and --as-parquetfile.

--target-dir /your/HDFS/Path \

It is the target HDFS directory where the files will be created.

--append \

It is the append mode, which adds the imported files to an existing dataset in HDFS. To overwrite instead, use --delete-target-dir, which removes the target directory before the import runs.

--fields-terminated-by '\021'

It is the field delimiter; you can specify your own, such as '\t' or ','. (The record delimiter is set separately with --lines-terminated-by.)
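Putting the pieces together, the full import command looks like this. Note that each long option takes a double dash; the server, port, service, credentials, and paths are placeholders to replace with your own:

```shell
sqoop import \
  --connect jdbc:oracle:thin:@server:port/service \
  --username your_user \
  --password your_password \
  -m 4 \
  --split-by primary_key \
  --as-textfile \
  --target-dir /your/HDFS/Path \
  --append \
  --fields-terminated-by '\021'
```

On success, the target directory holds one part-m-* file per mapper, i.e., four files with -m 4.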

CONCLUSION

Apache SQOOP really helps out in data ingestion, especially when you are dealing with huge amounts of data.

The official documentation can really help you out. Here is the link below.

https://sqoop.apache.org/docs/1.4.6/index.html
