Performance Tuning Apache Sqoop

Six definite ways to improve efficiency and reduce load times

Thomas George Thomas
Published in The Startup
5 min read · Sep 6, 2020

Sqoop is a tool from the Apache Software Foundation that is commonly used in the Big Data world to import and export millions of records between heterogeneous relational databases (RDBMS) and the Hadoop Distributed File System (HDFS). Load times for these transfers can range from a couple of minutes to multiple hours. This is when data engineers look under the hood to fine-tune settings. The goal of performance tuning is to load more data in less time, which increases efficiency and lessens the chance of data loss from network timeouts.

In general, performance tuning in Sqoop can be achieved by:

  • Controlling Parallelism
  • Controlling Data Transfer Process

Controlling Parallelism

Sqoop works on the MapReduce programming model implemented in Hadoop. It imports and exports data from most relational databases in parallel, and the number of map tasks per job determines its parallelism. By controlling the parallelism, we can manage the load on our databases and hence their performance. Here are a couple of ways to exploit parallelism in Sqoop jobs:

Changing the number of mappers

Typical Sqoop jobs launch four mappers by default. Increasing the number of map tasks (parallel processes) to 8 or 16 can improve performance in some databases.

The -m or --num-mappers parameter sets the degree of parallelism in Sqoop. For example, to change the number of mappers to 10:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--num-mappers 10
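
Whether more mappers actually help depends on the database, so it can be worth timing the same import at a few settings. The following is a minimal sketch of that idea, not a definitive benchmark: the helper build_import_cmd, the DRY_RUN switch and the scratch path /tmp/sqoop_bench are hypothetical names introduced here for illustration, and the connection details simply reuse the example above. By default it only prints the commands, so it can be tried without a cluster.

```shell
#!/bin/sh
# Sketch: time the same import at several mapper counts.
# DRY_RUN=1 (the default here) only prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: build the import command for a given mapper count ($1).
# Each run writes to its own scratch directory (an assumed path) so runs do not collide.
build_import_cmd() {
  echo "sqoop import --connect jdbc:mysql://mysql.example.com/sqoop --username sqoop --password sqoop --table cities --target-dir /tmp/sqoop_bench/cities_m$1 --num-mappers $1"
}

for m in 4 8 16; do
  cmd=$(build_import_cmd "$m")
  if [ "$DRY_RUN" = "1" ]; then
    echo "$cmd"
  else
    start=$(date +%s)            # wall-clock seconds before the import
    sh -c "$cmd"
    end=$(date +%s)              # wall-clock seconds after the import
    echo "mappers=$m seconds=$((end - start))"
  fi
done
```

Running it with DRY_RUN=0 on a machine with Sqoop installed would execute the three imports in turn and print the elapsed seconds for each.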

Keep in mind that the number of map tasks should be less than the maximum number of parallel database connections possible. The increase in the degree of parallelism should be less than that which is available within…
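
The first constraint above, fewer map tasks than available database connections, can be sketched as a small guard in front of the import. This is only an illustration under stated assumptions: cap_mappers is a hypothetical helper, not part of Sqoop, and the connection limit of 10 is an assumed value that would have to come from the database administrator or the database's own settings.

```shell
#!/bin/sh
# Hypothetical helper: keep the mapper count strictly below the database
# connection limit. $1 = requested mappers, $2 = maximum parallel connections.
cap_mappers() {
  requested=$1
  max_connections=$2
  limit=$((max_connections - 1))      # strictly fewer mappers than connections
  if [ "$limit" -lt 1 ]; then
    limit=1                           # an import needs at least one mapper
  fi
  if [ "$requested" -gt "$limit" ]; then
    echo "$limit"
  else
    echo "$requested"
  fi
}

# Assumed values for illustration: 16 mappers requested, 10 connections allowed.
mappers=$(cap_mappers 16 10)

# Print the resulting command rather than run it, so the sketch works without a cluster.
echo "sqoop import --connect jdbc:mysql://mysql.example.com/sqoop --username sqoop --password sqoop --table cities --num-mappers $mappers"
```

In practice the limit should also leave room for other clients of the same database, so a more conservative margin than one connection may be appropriate.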

Thomas George Thomas
The Startup

Data Analytics Engineering Graduate Student at Northeastern. Ex Senior Data Engineer & IBM Certified Data Scientist. https://thomasgeorgethomas.com