Performance Tuning Apache Sqoop
Six definite ways to improve efficiency and reduce load times
--
Sqoop is an Apache Software Foundation tool commonly used in the Big Data world to import and export millions of records between heterogeneous relational databases (RDBMS) and the Hadoop Distributed File System (HDFS). These transfers can take anywhere from a couple of minutes to multiple hours, which is when data engineers look under the hood to fine-tune settings. The goal of performance tuning is to load more data in less time, increasing efficiency and reducing the chance of data loss from network timeouts.
In general, performance tuning in Sqoop can be achieved by:
- Controlling Parallelism
- Controlling Data Transfer Process
Controlling Parallelism
Sqoop is built on the MapReduce programming model implemented in Hadoop and imports and exports data from most relational databases in parallel. The number of map tasks per job determines its parallelism. By controlling the parallelism, we control the load placed on the database and, in turn, the job's performance. Here are a couple of ways to exploit parallelism in Sqoop jobs:
Changing the number of mappers
Sqoop jobs launch four mappers by default. Increasing the number of map tasks (parallel processes) to 8 or 16 can noticeably improve performance on some databases.
By using the -m or --num-mappers parameter, we can set the degree of parallelism in Sqoop. For example, changing the number of mappers to 10:
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--num-mappers 10
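The same flag controls parallelism when pushing data back out of HDFS. A minimal sketch of the export counterpart, assuming the HDFS directory /user/hive/warehouse/cities (hypothetical) holds the data to write back into the cities table:
# export HDFS data back to MySQL using 10 parallel map tasks
sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--export-dir /user/hive/warehouse/cities \
--num-mappers 10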
A few things to keep in mind: the number of map tasks should be less than the maximum number of parallel connections the database can support. The increase in the degree of parallelism should also be lesser than that which is available within…