An exercise in Discovery, Building Docker Images, using Makefiles & Docker Compose. — Part 5a

George Leonard
4 min read · Aug 25, 2024


Let’s build an Apache Hadoop HDFS cluster.

(See: Part 4)

(25 August 2024)

Overview

OK… so we now have a base Ubuntu 20.04 image, to which we’ve added some useful tooling.

We then used that as a base and installed OpenJDK 11.

After that we installed the Apache Hadoop 3.3.5 distribution. This is the base image from which we now derive the different HDFS servers that make up our cluster.

At this point we want to be in the build-hadoop-openjdk11-hdfs directory, where we will see the following subdirectories.

  • namenode
  • datanode
  • resourcemanager
  • nodemanager
  • historyserver

To build the various servers we can issue make build from our Makefile. You will notice in the Makefile that this basically executes a docker build for each of the above using the following commands.

build:
	sudo docker build -t hadoop-namenode-$(HADOOP_VERSION):$(VERSION) ./namenode
	sudo docker build -t hadoop-datanode-$(HADOOP_VERSION):$(VERSION) ./datanode
	sudo docker build -t hadoop-resourcemanager-$(HADOOP_VERSION):$(VERSION) ./resourcemanager
	sudo docker build -t hadoop-nodemanager-$(HADOOP_VERSION):$(VERSION) ./nodemanager
	sudo docker build -t hadoop-historyserver-$(HADOOP_VERSION):$(VERSION) ./historyserver

Each of the above lines tells Docker to build the Dockerfile located in the corresponding ./<directory name> directory and tag the resulting image.
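As a quick sanity check after running make build (a minimal sketch; the exact repository names and tags depend on the HADOOP_VERSION and VERSION values defined at the top of the Makefile), you can list the freshly tagged images:

# Build all five service images via the Makefile target shown above.
# HADOOP_VERSION and VERSION are assumed to be set inside the Makefile itself.
make build

# List only the Hadoop-derived images we just tagged.
sudo docker image ls "hadoop-*"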

Let’s take a peek into the namenode Dockerfile as a start.

FROM hadoop-base-3.3.5-java11:1.0.1

RUN echo "--> Build Hadoop HDFS Namenode"

HEALTHCHECK CMD curl -f http://localhost:9870/ || exit 1

ENV HDFS_CONF_dfs_namenode_name_dir=file:///hadoop/dfs/warehouse
RUN mkdir -p /hadoop/dfs/warehouse
VOLUME /hadoop/dfs/warehouse

ADD bin/run.sh /run.sh
RUN chmod a+x /run.sh

EXPOSE 9870

CMD ["/run.sh"]

Nothing too special in there, except that we define a HEALTHCHECK CMD, expose the web UI port via the EXPOSE 9870 instruction, and then, as before, copy in a run.sh file.
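Once a container built from this image is running (for example the namenode container defined in the compose file further down), the HEALTHCHECK can be inspected from the host. A rough sketch, assuming the container is simply named namenode:

# Ask Docker what the health probe currently reports: starting, healthy or unhealthy.
sudo docker inspect --format '{{.State.Health.Status}}' namenode

# Or hit the exposed NameNode Web UI directly, the same URL the HEALTHCHECK curls.
curl -f http://localhost:9870/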

Let’s look at another node, say the nodemanager Dockerfile.

FROM hadoop-base-3.3.5-java11:1.0.1

RUN echo "--> Build Hadoop HDFS Nodemanager"

HEALTHCHECK CMD curl -f http://localhost:8042/ || exit 1

ADD bin/run.sh /run.sh
RUN chmod a+x /run.sh

EXPOSE 8042

CMD ["/run.sh"]

As can be seen, very similar… pretty much the only differences are that for the nodemanager we don’t define a data directory and volume, and the port we EXPOSE is different.

One more, let’s look at the datanode Dockerfile.

FROM hadoop-base-3.3.5-java11:1.0.1

RUN echo "--> Build Hadoop HDFS Datanode"

HEALTHCHECK CMD curl -f http://localhost:9864/ || exit 1

ENV HDFS_CONF_dfs_datanode_data_dir=file:///hadoop/dfs/warehouse
RUN mkdir -p /hadoop/dfs/warehouse
VOLUME /hadoop/dfs/warehouse

ADD bin/run.sh /run.sh
RUN chmod a+x /run.sh

EXPOSE 9864

CMD ["/run.sh"]

OK, so this is starting to look like rinse and repeat. Surely we could do this with some fancy environment/build variables (and I’m sure someone with more experience does it that way)… so why are we building them separately, as above? Well, the real difference for each lies in the run.sh file copied into the image.
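If you did want to collapse that repetition, one option is a small shell loop over the service directories. This is only a sketch, not how the Makefile in the repo is written, and the version values below are placeholders:

# Hypothetical alternative to the per-service Makefile commands:
# build every service image in one loop, tagging each the same way.
HADOOP_VERSION=3.3.5
VERSION=1.0.1
for svc in namenode datanode resourcemanager nodemanager historyserver; do
  sudo docker build -t "hadoop-${svc}-${HADOOP_VERSION}:${VERSION}" "./${svc}"
done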

NOTE: for a real production cluster the /hadoop/dfs/warehouse directory would be located on dedicated, highly available external storage.

Now, let’s think back a bit… remember the base image had an entrypoint.sh? What was the significance of that?

For that, let’s look at the docker-compose.yml file used to stand up the cluster.

# docker-compose -p my-project up -d
#
services:

  #### Hadoop / HDFS ####
  #
  # The Namenode UI can be accessed at http://localhost:9870/ and
  # the ResourceManager UI can be accessed at http://localhost:8089/

  namenode:
    image: hadoop-namenode-3.3.5-java11:1.0.0
    container_name: namenode
    volumes:
      - ./data/hdfs/namenode:/hadoop/dfs/name
    environment:
      - CLUSTER_NAME=${CLUSTER_NAME}
    env_file:
      - ./hdfs/hadoop.env
    ports:
      - "9870:9870"   # NameNode Web UI

  resourcemanager:
    image: hadoop-resourcemanager-3.3.5-java11:1.0.0
    container_name: resourcemanager
    restart: on-failure
    depends_on:
      - namenode
      - datanode1
      - datanode2
      - datanode3
      - datanode4
      - datanode5
    environment:
      - CLUSTER_NAME=${CLUSTER_NAME}
    env_file:
      - ./hdfs/hadoop.env
    ports:
      - "8089:8088"   # Resource Manager Web UI

  historyserver:
    image: hadoop-historyserver-3.3.5-java11:1.0.0
    container_name: historyserver
    depends_on:
      - namenode
      - datanode1
      - datanode2
    environment:
      - CLUSTER_NAME=${CLUSTER_NAME}
    volumes:
      - ./data/hdfs/historyserver:/hadoop/yarn/timeline
    env_file:
      - ./hdfs/hadoop.env
    ports:
      - "8188:8188"

  nodemanager1:
    image: hadoop-nodemanager-3.3.5-java11:1.0.0
    container_name: nodemanager1
    depends_on:
      - namenode
      - datanode1
      - datanode2
    environment:
      - CLUSTER_NAME=${CLUSTER_NAME}
    env_file:
      - ./hdfs/hadoop.env
    ports:
      - "8042:8042"   # NodeManager Web UI

  datanode1:
    image: hadoop-datanode-3.3.5-java11:1.0.0
    container_name: datanode1
    depends_on:
      - namenode
    environment:
      - CLUSTER_NAME=${CLUSTER_NAME}
    volumes:
      - ./data/hdfs/datanode1:/hadoop/dfs/warehouse
    env_file:
      - ./hdfs/hadoop.env

  datanode2:
    image: hadoop-datanode-3.3.5-java11:1.0.0
    container_name: datanode2
    depends_on:
      - namenode
    environment:
      - CLUSTER_NAME=${CLUSTER_NAME}
    volumes:
      - ./data/hdfs/datanode2:/hadoop/dfs/warehouse
    env_file:
      - ./hdfs/hadoop.env

  datanode3:
    image: hadoop-datanode-3.3.5-java11:1.0.0
    container_name: datanode3
    depends_on:
      - namenode
    environment:
      - CLUSTER_NAME=${CLUSTER_NAME}
    volumes:
      - ./data/hdfs/datanode3:/hadoop/dfs/warehouse
    env_file:
      - ./hdfs/hadoop.env

  datanode4:
    image: hadoop-datanode-3.3.5-java11:1.0.0
    container_name: datanode4
    depends_on:
      - namenode
    environment:
      - CLUSTER_NAME=${CLUSTER_NAME}
    volumes:
      - ./data/hdfs/datanode4:/hadoop/dfs/warehouse
    env_file:
      - ./hdfs/hadoop.env

  datanode5:
    image: hadoop-datanode-3.3.5-java11:1.0.0
    container_name: datanode5
    depends_on:
      - namenode
    environment:
      - CLUSTER_NAME=${CLUSTER_NAME}
    volumes:
      - ./data/hdfs/datanode5:/hadoop/dfs/warehouse
    env_file:
      - ./hdfs/hadoop.env

# Without a network explicitly defined, you hit this Hive/Thrift error
# java.net.URISyntaxException Illegal character in hostname
# https://github.com/TrivadisPF/platys-modern-data-platform/issues/231
networks:
  default:
    name: ${COMPOSE_PROJECT_NAME}
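With the images built and the compose file in place, standing the cluster up is a single command, as hinted at in the comment at the top of the file (run it from the directory containing docker-compose.yml):

# Start the whole cluster in the background; -p sets the compose project name.
sudo docker-compose -p my-project up -d

# Watch the containers come up and their status flip to healthy.
sudo docker-compose -p my-project ps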

Note: The project root has a .env file which populates the COMPOSE_PROJECT_NAME and CLUSTER_NAME variables.
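For reference, that .env file is just two key=value lines. A sketch with placeholder values (the actual values in the repo may differ):

# .env in the project root, picked up automatically by docker-compose.
COMPOSE_PROJECT_NAME=my-project
CLUSTER_NAME=devlab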

First, notice that each service defined has an env_file: entry, pointing to a local ./hdfs/hadoop.env file.

Ok, let’s take a peek into this file (hadoop.env):

CORE_CONF_fs_defaultFS=hdfs://namenode:9000
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_hadoop_proxyuser_hue_hosts=*
CORE_CONF_hadoop_proxyuser_hue_groups=*
CORE_CONF_io_compression_codecs=org.apache.hadoop.io.compress.SnappyCodec
CORE_CONF_dfs_replication=2

HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false
HDFS_CONF_dfs_namenode_datanode_registration_ip___hostname___check=true

YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_scheduler_class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
YARN_CONF_yarn_scheduler_capacity_root_default_maximum___allocation___mb=8192
YARN_CONF_yarn_scheduler_capacity_root_default_maximum___allocation___vcores=4
YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032
YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030
YARN_CONF_yarn_resourcemanager_resource__tracker_address=resourcemanager:8031
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_timeline___service_hostname=historyserver
YARN_CONF_mapreduce_map_output_compress=true
YARN_CONF_mapred_map_output_compress_codec=org.apache.hadoop.io.compress.SnappyCodec
YARN_CONF_yarn_nodemanager_resource_memory___mb=16384
YARN_CONF_yarn_nodemanager_resource_cpu___vcores=8
YARN_CONF_yarn_nodemanager_disk___health___checker_max___disk___utilization___per___disk___percentage=98.5
YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
YARN_CONF_yarn_nodemanager_aux___services=mapreduce_shuffle

MAPRED_CONF_mapreduce_framework_name=yarn
MAPRED_CONF_mapred_child_java_opts=-Xmx4096m
MAPRED_CONF_mapreduce_map_memory_mb=4096
MAPRED_CONF_mapreduce_reduce_memory_mb=8192
MAPRED_CONF_mapreduce_map_java_opts=-Xmx3072m
MAPRED_CONF_mapreduce_reduce_java_opts=-Xmx6144m
MAPRED_CONF_yarn_app_mapreduce_am_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.3.5/
MAPRED_CONF_mapreduce_map_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.3.5/
MAPRED_CONF_mapreduce_reduce_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.3.5/

So… what happens here? When the image is started as a container, these values are pushed into the running container and set as environment variables… OK… nice… but why is that special? Well, next up is that entrypoint.sh file. See the local base/bin directory for that.
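You can see this for yourself once the cluster is up; a quick check, assuming the namenode container is running:

# Every line from hadoop.env surfaces as an environment variable inside the container.
sudo docker exec namenode env | grep CORE_CONF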

To be Continued… in Part 5b.

My Repos

All the code used during this article is available in the Git repo below.

Building Docker Images

About Me

I’m a techie, a technologist, always curious, and I love data. For as long as I can remember I’ve worked with data in one form or another: database admin, database product lead, data platforms architect, infrastructure architect hosting databases, backing them up, optimizing performance, accessing them. Data, data, data… it makes the world go round.

In recent years I’ve pivoted into a more generic Technology Architect role, capable of full-stack architecture.

George Leonard

georgelza@gmail.com
