Integration of LVM with Hadoop Cluster and providing Elasticity to DataNode Storage

Dipaditya Das · Published in Geek Culture · May 30, 2021

In Linux, Logical Volume Manager (LVM) is a device mapper framework that provides logical volume management for the Linux kernel. Most modern Linux distributions are LVM-aware to the point of being able to have their root file systems on a logical volume.

Heinz Mauelshagen wrote the original LVM code in 1998 while working at Sistina Software, taking its primary design guidelines from HP-UX's volume manager.

Volume management creates a layer of abstraction over physical storage, allowing you to create logical storage volumes. This provides much greater flexibility in a number of ways than using physical storage directly. In addition, the hardware storage configuration is hidden from the software so it can be resized and moved without stopping applications or unmounting file systems. This can reduce operational costs.

Logical volumes provide the following advantages over using physical storage directly:

  • Flexible capacity

When using logical volumes, file systems can extend across multiple disks, since you can aggregate disks and partitions into a single logical volume.

  • Resizeable storage pools

You can extend logical volumes or reduce logical volumes in size with simple software commands, without reformatting and repartitioning the underlying disk devices.

  • Online data relocation

To deploy newer, faster, or more resilient storage subsystems, you can move data while your system is active. Data can be rearranged on disks while the disks are in use. For example, you can empty a hot-swappable disk before removing it.

  • Convenient device naming

Logical storage volumes can be managed in user-defined and custom-named groups.

  • Disk striping

You can create a logical volume that stripes data across two or more disks. This can dramatically increase throughput.

  • Mirroring volumes

Logical volumes provide a convenient way to configure a mirror for your data.

  • Volume snapshots

Using logical volumes, you can take device snapshots for consistent backups or test the effect of changes without affecting the real data.

  • Thin volumes

Logical volumes can be thinly provisioned. This allows you to create logical volumes that are larger than the available extents.

  • Cache volumes

A cache logical volume uses a small logical volume consisting of fast block devices (such as SSD drives) to improve the performance of a larger and slower logical volume by storing the frequently used blocks on the smaller, faster logical volume.

LVM ARCHITECTURE OVERVIEW

The underlying physical storage unit of an LVM logical volume is a block device such as a partition or whole disk. This device is initialized as an LVM physical volume (PV).

To create an LVM logical volume, the physical volumes are combined into a volume group (VG). This creates a pool of disk space out of which LVM logical volumes (LVs) can be allocated. This process is analogous to the way in which disks are divided into partitions. A logical volume is used by file systems and applications (such as databases).

[Figure: LVM architecture diagram. Source: Wikipedia]
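Schematically, the layering looks like this; the device, group, and volume names shown are the ones used later in this article:

/dev/xvdf   /dev/xvdg     <- block devices initialized as physical volumes (PVs)
       \       /
        hadoop            <- volume group (VG): a pool of disk space
           |
        Volume1           <- logical volume (LV) allocated from the pool
           |
  ext4 mounted on /dn1    <- file system used by the application (here, the DataNode)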

APACHE HADOOP

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

In this article, I have used the AWS cloud, where I launched my Hadoop cluster with one NameNode and one DataNode.

After the launch, we have to set up the NameNode of our Hadoop cluster. To verify it, we use the jps command, which lists the Java processes currently running in the JVM.
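A minimal sketch of that setup, assuming a Hadoop 1.x installation (the port 50070 used later in this article belongs to that generation); the RPC port 9001 and the /nn metadata directory are illustrative assumptions, not values from this setup:

# In core-site.xml, set fs.default.name to hdfs://0.0.0.0:9001 (assumed port)
# In hdfs-site.xml, set dfs.name.dir to a metadata directory such as /nn
hadoop namenode -format          # one-time format of the metadata directory
hadoop-daemon.sh start namenode  # start the NameNode daemon
jps                              # should now list a NameNode process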

Similarly, in the DataNode EC2 instance we have to configure and start the DataNode process.
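A matching sketch for the DataNode, under the same assumptions; /dn1 is the storage directory that we mount the logical volume on later in this article:

# In core-site.xml, point fs.default.name at hdfs://<IPv4_of_NameNode>:9001
# In hdfs-site.xml, set dfs.data.dir to /dn1
hadoop-daemon.sh start datanode  # start the DataNode daemon
jps                              # should now list a DataNode process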

After the processes above have been configured and started successfully, we can check the cluster with the following command.

hadoop dfsadmin -report

This command prints the full cluster report on the CLI, covering the NameNode as well as the DataNode.

We can also view a web console UI report of the cluster by opening the following URL in any compatible browser, such as Chrome or Firefox.

http://<IPv4_of_NameNode>:50070/dfshealth.jsp

Note: The security group attached to both EC2 instances must allow inbound traffic on port 50070, or else we won't be able to see anything.

This is the basic setup we need in order to follow the rest of the practical in detail.

Step — 1: Creating and Attaching Two EBS Volumes

We will create two EBS General Purpose SSD volumes of 15 GiB each, which will serve as additional storage when scaling up the DataNode's storage on the go.

As we can see in the AWS console, the two volumes are tagged as Physical Drive 1 and Physical Drive 2.

Inside the DataNode, however, the devices are mapped under different names, /dev/xvdf and /dev/xvdg, which correspond to /dev/sdf and /dev/sdg.
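For reference, a rough AWS CLI equivalent of those console steps; the availability zone, volume ID, and instance ID below are placeholders, not values from this setup:

aws ec2 create-volume --volume-type gp2 --size 15 \
    --availability-zone ap-south-1a \
    --tag-specifications 'ResourceType=volume,Tags=[{Key=Name,Value=Physical Drive 1}]'
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 --device /dev/sdf
lsblk   # inside the instance, the new disks appear as xvdf and xvdg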

Step — 2: Install Necessary Packages

Now that we have attached our volumes, we will install the lvm2 package on our DataNode system, which runs a RHEL 8 cloud image (AMI).

We will use the yum package manager to install the lvm2 package with its dependencies.
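Assuming the instance has repository access, the install is a one-liner:

yum install lvm2 -y   # -y auto-confirms; device-mapper dependencies come along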

Step — 3: Creating Physical Volume From Physical Partitions

/dev/xvdf and /dev/xvdg are the physical partitions that we will use to create physical volumes with the pvcreate command.

pvcreate /dev/xvdf /dev/xvdg

There are three commands we can use to display properties of LVM physical volumes: pvs, pvdisplay, and pvscan.

pvs 

The pvs command provides physical volume information in a configurable form, displaying one line per physical volume. The pvs command provides a great deal of format control and is useful for scripting.

pvscan 

The pvscan command scans all supported LVM block devices in the system for physical volumes.

pvdisplay

The pvdisplay command provides a verbose multi-line output for each physical volume. It displays physical properties (size, extents, volume group, and so on) in a fixed format.
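For instance, right after the pvcreate above and before any volume group exists, pvs output looks roughly like this; the exact sizes and attribute flags are illustrative:

$ pvs
  PV         VG   Fmt  Attr PSize  PFree
  /dev/xvdf       lvm2 ---  15.00g 15.00g
  /dev/xvdg       lvm2 ---  15.00g 15.00g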

Step — 4: Creating Volume Group of Physical Volumes

Physical volumes are combined into volume groups (VGs). This creates a pool of disk space out of which logical volumes can be allocated.

Within a volume group, the disk space available for allocation is divided into units of a fixed-size called extents. An extent is the smallest unit of space that can be allocated. Within a physical volume, extents are referred to as physical extents.

A logical volume is allocated into logical extents of the same size as the physical extents. The extent size is thus the same for all logical volumes in the volume group. The volume group maps the logical extents to physical extents.
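For example, with LVM's default extent size of 4 MiB, each 15 GiB physical volume contributes roughly 15 × 1024 / 4 = 3840 physical extents (minus a small amount reserved for LVM metadata), and the 12 GiB logical volume created below consumes 3072 logical extents. vgdisplay reports the exact counts under Total PE and Free PE.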

To create a volume group, use the following command:

vgcreate hadoop /dev/xvdf /dev/xvdg

This creates a VG named hadoop. The PVs /dev/xvdf and /dev/xvdg are the base storage level for the VG hadoop.

It is possible to extend the above VG with the PVs later. To extend a VG, use the following command:

vgextend hadoop /dev/xvdh

There are three commands you can use to display properties of LVM volume groups: vgscan, vgs, and vgdisplay.

vgscan

The vgscan command, which scans all supported LVM block devices in the system for volume groups, can also be used to display the existing volume groups.

vgs

The vgs command provides volume group information in a configurable form, displaying one line per volume group. The vgs command provides a great deal of format control and is useful for scripting.

vgdisplay

The vgdisplay command displays volume group properties (such as size, extents, number of physical volumes, and so on) in a fixed form.

Step — 5: Creating Logical Volume from Volume Group

To create a logical volume, use the following command:

lvcreate -L 12G -n Volume1 hadoop

The -n option allows the user to set the LV name, in this case Volume1. The -L option allows the user to set the size of the LV, given here in gibibytes (12G), although other units are possible. The LV type is linear by default, but the user can specify the desired type by using the --type option.

There are three commands you can use to display properties of LVM logical volumes: lvs, lvdisplay, and lvscan.

lvs

The lvs command provides logical volume information in a configurable form, displaying one line per logical volume. The lvs command provides a great deal of format control and is useful for scripting.

lvdisplay

The lvdisplay command displays logical volume properties (such as size, layout, and mapping) in a fixed format.

lvscan

The lvscan command scans for all logical volumes in the system and lists them.

Step — 6: Format the Logical Volume

Now we have to format the logical volume with the ext4 file system (a one-time step) in order to use the storage space.
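The LV's device node follows the /dev/<vg-name>/<lv-name> pattern, so:

mkfs.ext4 /dev/hadoop/Volume1   # one-time format; destroys any existing data on the LV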

Step — 7: Create a Mount Point for Volume1 (LV)

After formatting, we will mount the logical volume on the /dn1 directory, which we already created while configuring the DataNode system.

Note that before mounting we need to stop the DataNode Process.
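A sketch of that sequence, assuming the Hadoop daemon scripts are on the PATH:

hadoop-daemon.sh stop datanode   # stop HDFS I/O before changing its storage
mount /dev/hadoop/Volume1 /dn1   # mount the LV on the DataNode's storage directory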

df -h

This command shows that Volume1's 12 GiB of space is mounted on /dn1.

After starting the DataNode again, we can see that our Hadoop DataNode now has 12 GiB of storage space.

GROWING LOGICAL VOLUMES

To increase the size of a logical volume, we use the lvextend command.

When we extend the logical volume, we can indicate how much we want to extend the volume, or how large we want it to be after we extend it.

After we have extended the logical volume it is necessary to increase the file system size to match.

By default, most file system resizing tools will grow the file system to match the size of the underlying logical volume, so we do not need to worry about specifying the same size for each of the two commands. That's why we use the --resizefs option of lvextend.

The following command extends the logical volume /dev/hadoop/Volume1 to 16 gibibytes (GiB):
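lvextend --resizefs -L 16G /dev/hadoop/Volume1   # grow the LV and its ext4 file system together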

The following command adds 8 gibibytes (GiB) to the logical volume /dev/hadoop/Volume1, extending it to 24 GiB:
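lvextend --resizefs -L +8G /dev/hadoop/Volume1   # relative size: add 8 GiB on top of the current size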

SHRINKING LOGICAL VOLUMES

We can reduce the size of a logical volume with the lvreduce command.

If the logical volume we are reducing contains a file system, then to prevent data loss we must ensure that the file system is not using the space being removed. For this reason, it is recommended to use the --resizefs option of the lvreduce command whenever the logical volume contains a file system. With this option, lvreduce attempts to shrink the file system before shrinking the logical volume. If shrinking the file system fails, as can occur if the file system is full or does not support shrinking, then lvreduce fails and does not attempt to shrink the logical volume.

Note that we need to stop the DataNode process first: shrinking the storage is not a usual operation, and a running DataNode may treat the sudden drop in capacity as data loss.

The following command shrinks the logical volume /dev/hadoop/Volume1 to 7 gibibytes (GiB):
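lvreduce --resizefs -L 7G /dev/hadoop/Volume1   # shrink the ext4 file system first, then the LV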

The following command removes 1 gibibyte (GiB) from the logical volume /dev/hadoop/Volume1:
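lvreduce --resizefs -L -1G /dev/hadoop/Volume1   # relative size: remove 1 GiB from the current size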

I would like to thank Vimal Daga Sir for providing an awesome topic to research; it was fun.

“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”
George R.R. Martin, A Dance with Dragons
