How to integrate LVM with Hadoop and provide elasticity to Datanode storage?

Durvesh Palkar
LinuxWorld Informatics Pvt. Ltd.
11 min read · Feb 11, 2021

What is Hadoop?

Hadoop is an open-source framework that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

What is Hadoop Cluster?

A Hadoop cluster is nothing but a group of computers connected together via LAN. We use it for storing and processing large data sets. A Hadoop cluster consists of a number of commodity machines connected together, which communicate with a high-end machine that acts as the master. The master and slaves together implement distributed computing over distributed data storage, running open-source software to provide this functionality.

What is the Basic Architecture of Hadoop Cluster?

A Hadoop cluster has a master-slave architecture.

i. Master in Hadoop Cluster

It is a machine with a good configuration of memory and CPU. It is also referred to as the NameNode.

Functions of NameNode

  • Manages file system namespace
  • Regulates access to files by clients
  • Stores metadata of actual data. For example — file path, number of blocks, block id, the location of blocks etc.
  • Executes file system namespace operations like opening, closing, renaming files and directories

The NameNode stores the metadata in the memory for fast retrieval. Hence we should configure it on a high-end machine.

ii. Slaves in the Hadoop Cluster

It is a machine with a normal configuration. It is also referred to as the DataNode.

Functions of DataNode

  • It stores the business data.
  • It does read, write and data processing operations
  • Upon instruction from a master, it does creation, deletion, and replication of data blocks.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

What is LVM?

Managing disk space has always been a significant task for sysadmins. Running out of disk space used to be the start of a long and complex series of tasks to increase the space available to a disk partition. It also required taking the system off-line. This usually involved installing a new hard drive, booting to recovery or single-user mode, creating a partition and a filesystem on the new hard drive, using temporary mount points to move the data from the too-small filesystem to the new, larger one, changing the content of the /etc/fstab file to reflect the correct device name for the new partition, and rebooting to remount the new filesystem on the correct mount point.

LVM allows for very flexible disk space management. It can add disk space to a logical volume and its filesystem while that filesystem is mounted and active (dynamically, i.e. ONLINE), and it can collect multiple physical hard drives and partitions into a single volume group, which can then be divided into logical volumes.

LET’S INTEGRATE THE ABOVE TWO CONCEPTS TO ACHIEVE SOMETHING GREAT!

I’ll be demonstrating this practical using Oracle VirtualBox on a RHEL8 Virtual Machine.

STEP 1: Attach two (or more) new hard disks to our VM

Navigate to your VM’s Settings tab -> Storage -> click on Controller: SATA -> then click on the ➕ to add a new hard disk.

Give your hard disk a name (say newHD), specify the size (in our case I’m going for 50 GB), and then click on Create.

Similarly, create and attach one more hard disk (I’ve attached a 30 GB one).

Log in to your VM and list these new hard disks using the command fdisk -l.

STEP 2: Create a Physical volume (PV) from these drives

Use the command pvcreate <device-name> to create a PV from these drives.

Check the created PVs using the command pvdisplay [<device-name>]
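As a minimal sketch, assuming the two new disks showed up as /dev/sdb and /dev/sdc (check the fdisk -l output first; your device names may differ), the step looks like:

```shell
# Run as root. Initialize both new disks as LVM physical volumes.
pvcreate /dev/sdb /dev/sdc

# Detailed view of one PV, or of all PVs when no device is given
pvdisplay /dev/sdb
pvdisplay
```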

STEP 3: Create a VOLUME GROUP (VG) from these PVs

Now we’ll combine these two PVs to create a single volume group of size 80 GB (50 GB + 30 GB). We had to create PVs first because only PVs can contribute to a VG. To create a VG from PVs, use the command vgcreate <vg-name> <pv1-name> <pv2-name> . A VG can draw its volume from multiple PVs.

Check the created VG using the command vgdisplay [<vgname>]
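With the names from my setup (the VG name myVG is my own choice; your device names may differ), the whole step is:

```shell
# Run as root. Combine both PVs into a single ~80 GiB volume group.
vgcreate myVG /dev/sdb /dev/sdc

# Verify: check the "VG Size", "Cur PV" and "Free  PE / Size" fields
vgdisplay myVG
```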

In the above image, it is clearly visible that the VG named “myVG” has a total VG Size of 79.99, i.e. roughly 80 GiB (50 GiB from PV /dev/sdb + 30 GiB from PV /dev/sdc). This VG is currently drawing volume from 2 PVs: Cur PV 2

Also notice that initially the pvdisplay command displayed the PVs with the status Allocatable NO, because they weren’t associated with any VG. Run the same command again after creating a VG from these PVs, and it will now show the status Allocatable YES.

STEP 4: Creating Logical Volume (LV) from the created VG

Now we can create partitions, called Logical Volumes (LVs), from the VG we’ve created. On a physical disk we can create only 4 primary partitions; hence we generally create 3 primary partitions and 1 extended partition, and under this extended partition we can create multiple logical partitions. This is the case for real physical volumes.

In the case of a VG, however, we can create as many partitions as we want. We don’t have to create any kind of primary or extended partitions. We can carve any number of partitions from this VG, and each such partition is called a Logical Volume (LV).

In the previous vgdisplay image, notice the allocated size (Alloc Size 0) and the free size (Free Size 79.99 GiB).

To create an LV from this VG, use the command lvcreate --size <LV-size> --name <LV-name> <vg-name> . Here you can give any size for your LV out of the total size available in VG (i.e. 80 GiB).

I’ve used the command lvdisplay <vg-name>/<lv-name> to see the created LV. Notice that here you also have to give the VG name along with the LV name to display it.

I’ve created a partition (LV) of size 60 GB spanning two hard disks of size 50 GB and 30 GB respectively. This is the magic of the concept named Logical Volume Management (LVM).

Now if you use vgdisplay myVG you’ll see that the allocated size is 60 GiB and the free size is 20 GiB, since we’ve used 60 GiB out of 80 GiB from this VG to create the LV.
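Using the names from my setup (myVG and myLV; substitute your own), this step is a short sketch:

```shell
# Run as root. Carve a 60 GiB logical volume out of the 80 GiB VG.
lvcreate --size 60G --name myLV myVG

# Display it -- the VG name is given together with the LV name
lvdisplay myVG/myLV
```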

STEP 5: Formatting the LV and mounting it

To use this LV (which is a partition) we first have to format it with some file system. I’ll be using the ext4 file system. For this, use the command mkfs.ext4 <LV-path>, where the LV Path is shown in the first line of the lvdisplay output.

And then we’ll mount it on some directory using mount <LV-path> <mount-dir>. Check whether the mount was successful using the df -h command.
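A sketch of this step, assuming the LV path /dev/myVG/myLV from my setup and the mount directory /nn that the DataNode will contribute later:

```shell
# Run as root. Format the LV with ext4 (LV Path comes from lvdisplay).
mkfs.ext4 /dev/myVG/myLV

# Mount it on the directory that the DataNode will contribute later
mkdir -p /nn
mount /dev/myVG/myLV /nn

# Confirm the mount and its ~60 GiB size
df -h /nn
```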

STEP 6: It’s time to configure HADOOP MASTER

The LVM setup above must be done on the system that you want to act as a slave node of the Hadoop cluster. Now log in to another system, which we’ll configure as the MASTER node of the cluster.

You must have Java and Hadoop installed on both systems. I’ll be using the rpm command for installing them, so you’ll need the rpm packages of Hadoop and the JDK.

First install Java using rpm -ivh <jdk-pkg-name>.rpm and then Hadoop using rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force . Check for a successful installation:

Create a directory with any name, which the master will use later, and then navigate to the /etc/hadoop/ directory.

Here we have to configure 2 files: hdfs-site.xml, which specifies the directory path we’ve created above (i.e. “/mast” in my case), and core-site.xml, which specifies the master IP and port number.

Write the configuration details in the above files as shown in the screenshot. You can use any unused port number of your choice (I’m using port no. 9001).
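As a sketch, the two master-side files might look like the following (this uses the Hadoop 1.x property names dfs.name.dir and fs.default.name; the IP 192.168.1.10 is an assumption, substitute your master's IP; /mast and port 9001 are from my setup; I write into a local master-conf/ directory for illustration, while the real files live in /etc/hadoop/):

```shell
#!/bin/sh
# Illustrative local directory; the real files live in /etc/hadoop/.
mkdir -p master-conf

# hdfs-site.xml: tells the NameNode where to keep its metadata.
cat > master-conf/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/mast</value>
  </property>
</configuration>
EOF

# core-site.xml: tells every node where the master listens.
# 192.168.1.10 is an assumed IP -- use your master's actual IP.
cat > master-conf/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.10:9001</value>
  </property>
</configuration>
EOF
```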

After making changes to the Hadoop configuration files, the next step is to format the directory/storage we’ve created according to the Hadoop file system. For this, use the command hadoop namenode -format

The last step is to start the NameNode (MASTER NODE) service using the command hadoop-daemon.sh start namenode . Check whether the service has started successfully using the jps command.
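Put together, the master-side commands from this step are:

```shell
# One-time format of the NameNode's metadata directory
hadoop namenode -format

# Start the NameNode daemon and confirm it is running
hadoop-daemon.sh start namenode
jps   # should list a NameNode process
```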

STEP 7: Configuring HADOOP SLAVE NODE

Now let’s get back to the system on which we did the LVM setup, so that we can configure it as the slave node/DataNode of Hadoop. A slave node always shares its storage with the master node from one of its directories. In the previous steps we already created an LV (partition) and mounted it on a directory (named “/nn”), and that is the directory we’ll contribute to the master node.

As on the master node, install the Java and Hadoop packages on the slave node. After a successful installation, navigate to the /etc/hadoop/ directory. Here the core-site.xml file should be configured in exactly the same way as on the master node. There are only minor changes to hdfs-site.xml compared to the master’s, because in the slave node’s file we specify the directory which the slave is going to contribute to the master.
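A sketch of the slave-side hdfs-site.xml (using the Hadoop 1.x property name dfs.data.dir and the /nn mount directory from the earlier steps; I write into a local slave-conf/ directory for illustration, while the real file lives in /etc/hadoop/):

```shell
#!/bin/sh
# Illustrative local directory; the real file lives in /etc/hadoop/.
mkdir -p slave-conf

# dfs.data.dir points at the LV-backed directory this DataNode contributes.
cat > slave-conf/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/nn</value>
  </property>
</configuration>
EOF
```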

The next step is to start the DataNode service using the command hadoop-daemon.sh start datanode , and check it using the jps command.

STEP 8: Check HADOOP CLUSTER REPORT

Having configured the above systems as the master and slave of the Hadoop cluster, it’s time to check whether they’re connected correctly to form a CLUSTER. To do this we’ll use the hadoop dfsadmin -report command.

It looks like everything is done perfectly. In the above screenshot, it is visible that one slave node is connected to the master. Also notice that this slave is sharing the LV-mounted directory, and hence is contributing almost 60 GiB of volume (the size of the LV we created) to the master node.

STEP 9: Dynamically increasing the volume of DATANODE

Now let’s witness the power of LVM, which allows us to dynamically increase/decrease the partition size, i.e. make changes ONLINE. (OFFLINE means we first have to unmount the partition’s directory and then make changes, while ONLINE means we can make the changes directly and they reflect immediately.)

Let’s try extending the LV size by 10 GiB. Remember, though, that this LV “myLV” was created from the VG named “myVG”, i.e. it gets its space from this VG. Therefore we can extend the size of this LV by ’n’ bytes only if we have ’n’ bytes available/free in its associated VG.

We can have multiple VGs in our system; however, an LV can get space only from the VG from which it was created. It can’t get space from any other VG.

In our case, our VG has a total size of 80 GiB, of which we’ve allocated 60 GiB to the LV named “myLV”. Therefore our VG is left with only 20 GiB of free space, so we are allowed to extend our LV by another 10 GiB. To extend an LV by ’n’ GiB, use the command lvextend --size +<n>G <lv-path> . This increases the size of your partition dynamically/online, without any downtime to the system.
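With the LV path from my setup (/dev/myVG/myLV) and the 10 GiB growth chosen above, the step is:

```shell
# Run as root. Grow the LV by 10 GiB while it stays mounted and in use.
lvextend --size +10G /dev/myVG/myLV

# The LV is now 70 GiB, but df -h still reports the old 60 GiB filesystem
df -h /nn
```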

However, if you use the df -h command, you will still see the older size, i.e. 60 GiB.

This is because whenever we format a partition, either the whole partition is formatted or none of it is. In our case, the original 60 GiB is formatted with the ext4 filesystem, while the newly added 10 GiB is unformatted, and the df -h command only reports the formatted part. Since the remaining 10 GiB is not formatted, we cannot use it yet; we can store data only in the formatted part.

Therefore we have to extend the filesystem over the remaining part. If we used the mkfs command on this LV, it would re-format the entire 70 GiB, and we would lose all the data stored in the already-formatted 60 GiB. For such use-cases we have another interesting command called resize2fs . It does not touch the already-formatted part; it only grows the filesystem over the new, unformatted region. Use resize2fs <lv-path> and check using df -h
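Again with the LV path from my setup, this final step is:

```shell
# Run as root. Grow the ext4 filesystem to fill the enlarged LV;
# existing data in the original 60 GiB is left untouched.
resize2fs /dev/myVG/myLV

# df -h now reports roughly 70 GiB on /nn
df -h /nn
```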

Let’s also check the Hadoop cluster report to see how much the DataNode is now contributing to the cluster.

From the above image we can see that our idea of integrating LVM with Hadoop is successful and is working perfectly. Note that we can also extend the VG size in case there is no free space left in the VG.
