Data Ingestion with Apache SQOOP
OVERVIEW
Moving enormous amounts of data is an integral part of building a data lake or a data warehouse (DWH). There are numerous tools available to transfer data from one source to another. Although we have tools for transferring real-time data, for historical and legacy data migration we need a tool that can easily move terabytes of data.
Apache SQOOP clearly shines at ingesting terabytes of data from an RDBMS into the Hadoop Distributed File System (HDFS) and vice versa.
ARCHITECTURE
Apache SQOOP works on the MapReduce architecture, which distributes the data movement task across multiple worker nodes and speeds up the ingestion. SQOOP writes the data as files in HDFS.
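Because a SQOOP import is a map-only job, each mapper writes its own output file into the target directory. With four mappers, the target directory will contain files roughly like the following (the path is just a placeholder):
/your/HDFS/Path/_SUCCESS
/your/HDFS/Path/part-m-00000
/your/HDFS/Path/part-m-00001
/your/HDFS/Path/part-m-00002
/your/HDFS/Path/part-m-00003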
Let's do some practice
sqoop import \
This part is straightforward: the sqoop keyword followed by the import tool, which imports data into HDFS. We will go through the command option by option; the complete command is assembled after the walkthrough.
--connect jdbc:oracle:thin:@server:port/service \
Here we give the JDBC connection string for the source: server IP, port, service name, etc.
-m 4 \
Here we specify the number of mappers (the default is 4); the import is split into that many parallel map tasks running on the worker nodes.
--username your_user \
--password your_password \
Source credentials (username and password).
--split-by primary_key \
Here we specify the column of the source table (typically the primary key) on which the data is split into parts. SQOOP runs a boundary query to find the minimum and maximum values of this column and divides that range evenly among the mappers, so a numeric column works best.
--as-textfile \
This is optional; it controls the file format in which the data is saved (plain text here). Other choices include --as-sequencefile, --as-avrodatafile, and --as-parquetfile.
--target-dir /your/HDFS/Path \
This is the target directory in HDFS where the output files will be created.
--append \
This enables append mode, so new files are added to an existing target directory. Without it, the import fails if the directory already exists; to overwrite instead, use --delete-target-dir, which removes the directory before the import runs.
--fields-terminated-by '\021'
This is the field delimiter; you can specify your own, such as '\t' or '\001'.
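Putting it all together, the full command looks like the one below. Note that SQOOP also needs to know which table (or free-form query) to import; the --table value here, along with the server, credentials, and paths, is a placeholder you would replace with your own values.
sqoop import \
--connect jdbc:oracle:thin:@server:port/service \
--username your_user \
--password your_password \
--table YOUR_TABLE \
-m 4 \
--split-by primary_key \
--as-textfile \
--target-dir /your/HDFS/Path \
--append \
--fields-terminated-by '\021'
Once the job finishes, the part-m-* files shown earlier appear under /your/HDFS/Path.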
CONCLUSION
Apache SQOOP really helps out with data ingestion, especially when the volume of data is huge.
The official documentation can really help you out. Here is the link below.