Some tips to run a multi-node Hadoop cluster in Docker

Ruben
3 min read · Apr 22, 2018


Hadoop supports two installation modes: the single-node setup, which installs a pseudo-distributed cluster where all the daemons run on the same host, and the so-called multi-node setup, much more interesting, where each node is assumed to run on a separate network host.

Docker and Docker networks

The great majority of Internet tutorials cover Hadoop on a single computer with a single-node setup. The real deal comes of course with the multi-node setup, and thanks to Docker it’s possible to run a network where each container runs its own Hadoop node, something like this:

Hadoop multi-node setup inside a Docker network

In the following sections I provide some tips to achieve the goal of getting Hadoop running seamlessly on Docker. If you prefer to skip the details and check the installation script with some Docker images to get it up and running, have a look here.

The installation can be divided into three stages:

  1. Docker network setup
  2. Docker containers configuration
  3. Hadoop installation in the nodes

1. Docker Network Setup

First things first: a Docker network needs to be created so the containers running in it can see each other.

docker network create --subnet=172.18.0.0/16 hadoopnet
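
Once the network exists, a quick sanity check confirms that the subnet is the one we asked for:

# List the Docker networks and inspect the new one
docker network ls
docker network inspect hadoopnet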

Running Docker containers has its own tricky details; one of them is the lack of decent support for setting up /etc/hosts in the images. One workaround is to define the hostnames at runtime using the --add-host parameter.

As an example, the following line defines the nodemaster node with its own IP and hostname. Notice that the rest of the nodes in the network have to be provided as part of the parameters:

docker run -d --net hadoopnet --ip 172.18.0.2 --hostname nodemaster --add-host node2:172.18.0.3 --add-host node3:172.18.0.4 --name nodemaster -it hadoopbase
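
The worker containers are launched the same way; a sketch, assuming the same hadoopbase image and that each --ip falls inside the 172.18.0.0/16 subnet declared for hadoopnet:

# node2 and node3 get their own IP/hostname and a host entry for every other node
docker run -d --net hadoopnet --ip 172.18.0.3 --hostname node2 --add-host nodemaster:172.18.0.2 --add-host node3:172.18.0.4 --name node2 -it hadoopbase
docker run -d --net hadoopnet --ip 172.18.0.4 --hostname node3 --add-host nodemaster:172.18.0.2 --add-host node2:172.18.0.3 --name node3 -it hadoopbase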

2. Docker Containers Configuration

Hadoop requires Java 8, so the next step is to get it installed in the box. Once that is done, it’s time to set up SSH in the so-called password-less configuration so that communication from the nodemaster to the workers can happen automatically, without a password prompt.
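
Installing Java itself is a one-liner; a sketch, assuming the hadoopbase image is built on a Debian/Ubuntu base (package names differ on other distributions):

# Install the OpenJDK 8 runtime plus the SSH server used in the next step
apt-get update && apt-get install -y openjdk-8-jdk openssh-server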

One option here is to leave the SSH accounts without a password, but a much better approach is to use key-based authentication with SSH key pairs. There is an insane number of posts about that.

You can check the build script here to see how the key is generated and then copied into the worker images when the Docker image is created.

The SSH authentication has to be set up to enable master-to-worker login, not the other way around.
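
A minimal sketch of that key setup, assuming the same user exists in every image, so the public key generated on the master can simply be appended to authorized_keys on the workers:

# Generate a password-less RSA key pair on the master image
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# Authorize the public key (this file is copied into the worker images at build time)
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys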

3. Hadoop Installation in the Nodes

Once the SSH connectivity between the containers is working, the last step is to set up Hadoop itself by copying the configuration files into the master and worker boxes.

The list of .xml files to be modified can be checked here. If the configuration and the hostnames in the containers are correct, the same configuration files can be copied across all the containers, master and workers alike. Each Hadoop instance will figure out its own role and use the configuration parameters it requires.
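
As an illustration, a minimal core-site.xml pointing every node at the master, plus the list of worker hostnames; the property value, port and file locations here are assumptions based on the hostnames used above, so adjust them to your own setup:

# Point HDFS at the NameNode running on nodemaster
cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nodemaster:9000</value>
  </property>
</configuration>
EOF

# Hadoop 3.x reads the worker hostnames from the workers file (called slaves in Hadoop 2.x)
cat > $HADOOP_HOME/etc/hadoop/workers <<'EOF'
node2
node3
EOF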

As proof that everything is working fine, after running start-dfs.sh on the master node, the jps command will display the NameNode and SecondaryNameNode processes on the master and a DataNode process on each worker node.
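
The check looks roughly like this, assuming the Hadoop scripts are on the PATH of the default user inside the containers:

# Start HDFS from the master node, then list the Java processes on each node
docker exec nodemaster start-dfs.sh
docker exec nodemaster jps
docker exec node2 jps
docker exec node3 jps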

Conclusion

By using Docker and a Docker network we can bridge the gap between a local setup and a real multi-node cluster installation at zero cost. This type of setup can be used to check how Hadoop works and to simulate specific cluster behaviour by adding and removing workers from the network.

Ruben

Data analytics, data management in financial services. Solutions Architect @ AWS. http://rubenafo.com