A Step-by-Step Guide to Loading Oracle Datasets into Hadoop Using Docker Containers

Achir Youssef
10 min read · May 11, 2020


Introduction

Hadoop is a powerful Big Data ecosystem that offers many tools and services for storing, managing and processing large volumes of data.

In production, data engineers usually run a Hadoop cluster on powerful servers. This begs the following question: how can a student, or any avid user eager to get their hands on this technology, run it in a testing environment on commodity hardware?

One of the immediate answers that comes to mind is virtual machines. While this can be a decent solution in the right environment, it remains a challenging hurdle on low-end hardware because of its demanding requirements, particularly for heavier tasks that involve running several instances at once, such as ensuring communication between Hadoop and the services of a relational database like Oracle while each resides in a different VM.

This is where the magic of Docker containers kicks in. This article will guide you thoroughly through setting up your Hadoop and Oracle containers and transferring your relational database datasets into HDFS using Sqoop.

Setting up the Hadoop Docker Container

Before getting started, we will need to install Docker on our machine. This can be done easily by visiting the official Docker download page at the following link.

https://docs.docker.com/get-docker/

Next, to set up any Docker container, we need the appropriate Docker image, which acts as a blueprint for the container we wish to run. We can either create our own Docker images or use pre-existing ones shared publicly by the community on Docker Hub, which you can visit via this link (note that a registered account is necessary).

https://hub.docker.com/

For this case study, I'll be using the cloudera/quickstart image. It offers a pre-configured single-node deployment of Hadoop, which is ideal for testing and learning purposes. The image can be found at the following URL.

https://hub.docker.com/r/cloudera/quickstart

Now we should be ready to start setting up the container. First of all, we will download the cloudera/quickstart image: make sure that Docker is up and running on your machine and execute the following command.

docker pull cloudera/quickstart:latest

The docker pull command looks for the image with that name on Docker Hub and downloads it to your local machine. As for the latest tag, it indicates that we want to retrieve the latest version of this image (the available tags can be checked on Docker Hub).

Then, to make sure that the image was downloaded successfully, we can list all available images with the following command.

docker images

In the results, you should see the cloudera/quickstart image among the listed images, as shown below:

Overview of the docker images command

In the following step we are going to create the container. This can be done by running the following command.

docker run --name cloudera --hostname=quickstart.cloudera --privileged=true -t -i -p 8888:8888 -p 80:80 -p 8020:8020 -p 7180:7180 cloudera/quickstart /usr/bin/docker-quickstart

Some of the most crucial flags and options used in the above command can be explained as follows :

  • --name : Sets the container's name, which is very helpful when referring to this container in later docker commands.
  • --hostname : Defines the container's hostname.
  • -p : Maps a container port to a host port (e.g. -p 8888:8888 maps the Hue UI's port in the container to port 8888 on the host, so to visit the Hue interface you simply go to http://localhost:8888 on your host).
  • cloudera/quickstart : Refers to the image's name.

After running the above command, we can list all of the running containers.

docker ps
Overview of the docker ps command

Note : To start an existing container, we can either use the Docker Desktop application on Windows or the following command.

docker start -ai [CONTAINER_NAME_OR_ID]
Overview of the Docker Desktop Application

Well done, you have successfully set up your Hadoop container using the cloudera/quickstart image, and you can now use Hadoop's powerful services and commands. Let us check which Hadoop processes are running by using the jps command inside our newly set up container (jps stands for Java Virtual Machine Process Status Tool).

Overview of the JPS Command
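For reference, assuming the container was named cloudera as above and that the quickstart image's JDK puts jps on the PATH (otherwise call it via its full path under /usr/java), you can also run it from the host in one go:

docker exec -it cloudera jps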

Setting up the Oracle Database Container

In this part, I will briefly present how to set up the other container we will need for this case study: an Oracle container based on the official Oracle Database Enterprise Edition 12c image, which can be found at the following link.

https://hub.docker.com/_/oracle-database-enterprise-edition

Following the exact steps previously mentioned, we are going to start by downloading the image with the following command.

docker pull store/oracle/database-enterprise:12.2.0.1

Then we use the docker run command to create the container (We don’t need to map any ports in this case).

docker run -d -it --name Oracle-DB store/oracle/database-enterprise:12.2.0.1

Again, we can check if these commands were executed successfully.

Overview of the above commands
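Note that the database itself takes a few minutes to initialise after the container starts. You can follow the startup logs with docker logs until the image reports that the database is ready (the exact message depends on the image version):

docker logs -f Oracle-DB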

Congratulations, the necessary containers are set up. We can now move on to our main objective, which consists of transferring the SQL datasets from an Oracle database into Hadoop using the tool known as Sqoop. But before that, we will need the actual data, so let us prepare an Oracle database with a single table.

Preparing the Oracle datasets

To begin, we will start the Oracle container, then open a shell terminal by running the following command.

docker exec -it --user root Oracle-DB /bin/bash

A short description of this command :

  • docker exec : Runs a command inside a running container
  • user : This flag specifies the user executing the command
  • Oracle-DB : Refers to the container’s name we specified while setting it up.

Afterwards, we will create a directory called scripts inside the home folder. It will be used for storing the SQL scripts we write later to create our table. Directly after, we will give full permissions to this directory (a sketch of both steps follows the screenshot below).

Overview of the Scripts directory creation
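A minimal sketch of those two steps, assuming the scripts folder lives under the oracle user's home directory /home/oracle:

mkdir /home/oracle/scripts
chmod -R 777 /home/oracle/scripts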

Next, we are going to start Oracle's SQL*Plus command-line interface by using docker exec, then we are going to log on to the default system administrator's account (you can check the image's official page on Docker Hub for more details; a sketch of the logon follows the screenshot below).

docker exec -it [CONTAINER_NAME_OR_ID] bash -c "source /home/oracle/.bashrc; sqlplus /nolog"
Overview of starting SQL*PLUS
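For reference, a minimal sketch of the logon step, assuming the default SYS password documented on the image's Docker Hub page (Oradoc_db1 at the time of writing); double-check that page if the logon is rejected:

SQL> connect sys/Oradoc_db1 as sysdba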

Then, we will create a new user and grant it the permissions needed to start creating tables and populating them with data (a sketch of the statements follows the screenshots below).

Overview of New User creation
Overview of granting permissions
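A minimal sketch of those statements, assuming you are still connected as SYS to the container database; the password is a placeholder, and on 12c creating a non-C## user in the CDB root may require the alter session shown (otherwise prefix the name with C## or connect to a pluggable database instead):

alter session set "_ORACLE_SCRIPT"=true;
create user tutorial_user identified by tutorial_pwd;
grant connect, resource, unlimited tablespace to tutorial_user;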

In this next step, we are going to create a simple SQL table. First we need to write its creation script; for this, I will be using the edit command to create a new file in the previously set up scripts directory.

Note : To use an improved version of vi that offers syntax highlighting, you have to install it yourself, as it is not pre-installed in the container; you will also need to change the default editor of SQL*Plus (a sketch follows below).
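A minimal sketch of that note, assuming the container's base image uses yum and that the vim-enhanced package is available in its repositories (run as root inside the Oracle container):

yum install -y vim-enhanced

Then, inside SQL*Plus, point the edit command at it with define _editor=vim.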

Overview of making the table’s creation file

Right below is the table’s creation script.

create table PRODUCTS(
    id NUMBER PRIMARY KEY,
    name VARCHAR(15) NOT NULL,
    price NUMBER NOT NULL
);
commit;

Let us insert some records into this table; again, I start by creating a new file in the scripts folder.

Overview of making the data insertion file
insert into PRODUCTS values(1, 'product1', 100);
insert into PRODUCTS values(2, 'product2', 200);
insert into PRODUCTS values(3, 'product3', 300);
commit;

Finally, we will execute both of these scripts in SQL*Plus (a sketch of the run commands follows the screenshots below).

Overview of executing the table’s creation script
Overview of executing the data insertion script
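For reference, a minimal sketch of running the scripts from the SQL*Plus prompt; the file names are placeholders, so substitute whatever you named your scripts:

SQL> @/home/oracle/scripts/create_products.sql
SQL> @/home/oracle/scripts/insert_products.sql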

Voilà! The data is ready and we can jump into our Hadoop container, inside of which we will use Sqoop to transfer these datasets directly into HDFS.

Importing Oracle’s SQL datasets into Hadoop using Sqoop

As you may have guessed by following along, the tool we are going to use to transfer our data is Sqoop (SQL to Hadoop).

As its name suggests, Sqoop is a tool designed to transfer data between Hadoop and relational databases such as MySQL, SQL Server or, in our case, Oracle. In fact, it allows bidirectional transfers; in other words, you may also export data from Hadoop into a relational database.

Overview of Sqoop Operations

Internally, Sqoop runs a MapReduce job: each command is translated into map tasks that retrieve slices of data from the external relational database and write them directly into Hadoop (imports are map-only, so no reduce phase is involved).

To connect to external databases, Sqoop uses various connector APIs; the most commonly used are JDBC drivers.

Before moving on to the task at hand, I will be giving a short overview on how to use Sqoop.

There are mainly two commands offered by Sqoop to run the transfer operations.

  • sqoop import or sqoop-import : Allows transferring data from a relational database to Hadoop.
  • sqoop export or sqoop-export : Allows transferring data from Hadoop to an existing table in a relational database.

Next are the most crucial flags and options to pass as arguments to sqoop :

  • --connect : Specifies the JDBC URI string.
  • --username : Provides the database username.
  • -P : Prompts the user to specify the password (recommended).
  • --password : Provides the password directly with the command.
  • --target-dir : Specifies the target output directory.
  • --table : Specifies the SQL table we wish to transfer.
  • -m : Specifies the number of mappers (defaults to 4).

There are even more options we can specify when running sqoop commands to further customize the operations we wish to run (transferring data directly into Hive or HBase, etc.); however, for this case study I am going to stick to the ones above.
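Since only sqoop import is used below, here is a hedged sketch of what the reverse operation could look like, reusing this article's connection details and assuming a destination table (here a hypothetical PRODUCTS_COPY with the same columns) already exists in Oracle:

sqoop export --connect jdbc:oracle:thin:@172.17.0.2:1521:ORCLCDB --username tutorial_user -P --table PRODUCTS_COPY --export-dir /user/sqoop/output_tutorial -m 1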

Now, we move on to the task at hand, so make sure that both your Hadoop and Oracle containers are running and let us start the transferring process.

As I have explained previously, Sqoop needs JDBC drivers for its operations. Therefore, just before running the sqoop-import command, we need to add the Oracle database connector inside our container, specifically in the /var/lib/sqoop directory.

If you need to download the connector, you have to visit the official Oracle download page via the link below.

https://www.oracle.com/database/technologies/jdbc-upc-downloads.html

In my case, I have stored the jar file on my host machine, then copied it into my Hadoop container using the docker command that transfers files or directories between the host machine and guest containers.

docker cp [SRC_FILE] [CONTAINER_ID_OR_NAME]:[DEST_FILE]
Overview of copying the driver file to the container
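For example, assuming the driver was downloaded as ojdbc6.jar into the current directory on the host (the exact file name depends on the driver version you picked) and that the container is named cloudera as above:

docker cp ojdbc6.jar cloudera:/var/lib/sqoop/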

You may have noticed, by checking the /var/lib/sqoop/ directory, that a MySQL connector is already available by default.

Alright, all of the prerequisites are ready. I will simply create a new directory in HDFS that will be used to store the output of my sqoop import commands.

hdfs dfs -mkdir /user/sqoop

Moving on, we can finally execute the import operation with the following command.

sqoop import --connect jdbc:oracle:thin:@172.17.0.2:1521:ORCLCDB --table PRODUCTS --username tutorial_user -P --m 1 --target-dir /user/sqoop/output_tutorial

All of the flags I've used were explained above; however, I would like you to pay close attention to the driver's URI string jdbc:oracle:thin:@172.17.0.2:1521:ORCLCDB.

I am going to explain each of its individual fragments :

  • jdbc:oracle:thin : This is the driver's name (this varies greatly from one database system to another).
  • 172.17.0.2 : This is the Oracle container's IP address (by default, every newly created Docker container is connected to the same bridge network; see the sketch after this list for a way to look it up).
  • 1521 : The default Oracle listener port.
  • ORCLCDB : The database SID registered with the listener (this varies depending on the Oracle version used; to see which services the listener on your container is currently handling, use the following command : docker exec -it [CONTAINER_NAME_OR_ID] bash -c "source /home/oracle/.bashrc; lsnrctl status").
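A quick way to look up the Oracle container's IP address from the host is docker inspect with a Go-template format string, assuming the container name used earlier:

docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' Oracle-DB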

Viewing the results of the sqoop-import command

Last but not least, we can view the datasets imported into the output directory we passed as an argument during our sqoop import. In this case, we can see the results directly in HDFS by running the following command.

hdfs dfs -ls /user/sqoop/output_tutorial
Overview of the results inside the target directory

We can view the actual file’s content by executing a simple cat command.

hdfs dfs -cat /user/sqoop/output_tutorial/part-m-00000
Overview of the results file content
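Given the three rows we inserted and Sqoop's default comma-delimited text output, the file should look roughly like this (the exact numeric formatting depends on how Sqoop maps Oracle's NUMBER type):

1,product1,100
2,product2,200
3,product3,300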

Finally, we will take a look directly at our results by visiting the Hue UI. If your port mappings are correctly set up, you can simply open it in your browser via :

http://localhost:8888/

Log in using the default cloudera account with these credentials :

  • username : cloudera
  • password : cloudera

Then, hop into the file browser and look for your output directory; you should be able to see your results clearly, as shown below.

Overview of the results in Hue interface

Conclusion

In this article, I've explained in detail how to set up Hadoop and Oracle Docker containers, then how to create a simple Oracle table that we used to test Sqoop import commands.

This guide shows how effective Docker containers are as an alternative to virtual machines, giving us a readily available testing environment for powerful tools like Hadoop and its services.

I hope you have enjoyed your read and found this article helpful, feel free to leave any feedback or questions.
