Integrating Logical Volume Manager (LVM) with Hadoop | Automating LVM with Python

Prithviraj Singh · Published in Analytics Vidhya · Nov 22, 2020 · 9 min read

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

Hadoop’s data handling capacity comes from its own file system, HDFS (Hadoop Distributed File System), which lets you store data of various formats across a cluster. In HDFS, the NameNode is the master node and the DataNodes are the slaves. The NameNode holds the metadata about the data stored in the DataNodes, such as which block lives on which DataNode and where the replicas of each block are kept. The actual data is stored on the DataNodes.

HDFS replicates the data blocks stored on the DataNodes, and the default replication factor is 3. Since we are using commodity hardware, whose failure rate is pretty high, HDFS will still have copies of the lost blocks even if one of the DataNodes fails. You can also configure the replication factor based on your requirements.
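For instance, the replication factor is set in hdfs-site.xml. The property name below is the Hadoop 1.x one, and the value is just an illustrative choice:

```xml
<configuration>
  <!-- Keep two copies of every block instead of the default three -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```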

HDFS provides a distributed way to store Big Data. Your data is stored in blocks across the DataNodes, and you can specify the block size. For example, if you have 512 MB of data and have configured HDFS with a block size of 128 MB, HDFS will divide the data into 512/128 = 4 blocks, store them across different DataNodes, and also replicate each block on other DataNodes.

As you might guess, a real-world cluster can contain a huge number (sometimes millions) of such machines so that companies can dump their Big Data into it. But what happens when one of those DataNodes (or, say, one percent of them) has its storage completely filled? The DataNode still holds essential data, so it must be kept running until our operations are complete: the data on it may be needed at virtually any instant, switching it off and on again and again wastes time, and switching it off permanently wastes the resource entirely. Yet keeping it alive consumes a lot of energy, and hence a lot of money, while the node sits idle most of the time if all we can do is read from it. Nobody wants a system that burns energy and money for a node that is operated on once in a millennium. So how should we handle such a situation? Is it even manageable? And if so, is the solution any good?

The answer to our questions lies in the clever usage of our DataNode’s storage. One might ask, what is this clever usage of our DataNode’s storage? Well well well… I present to you Logical Volume Manager(LVM).

In Linux, LVM is a device mapper framework that provides logical volume management for the Linux kernel. It lets you work with Logical Volumes (LVs): volumes that sit between the physical hard drives and the filesystems that bring these LVs to life.

Logical volumes are groups of information located on physical volumes, and a hierarchy of structures is used to manage disk storage. Each individual disk (or partition), called a Physical Volume (PV), has a name such as /dev/sdb. Every PV in use belongs to a Volume Group (VG), which, as its name suggests, is a group of physical volumes.

Put simply, LVM helps us sort out this problem by making our storage extremely flexible. Flexibility can also be attained with a static volume, but as you might guess it is quite tricky to do so, and one bad decision can ruin our precious data. LVM, by contrast, presents a much cleaner way of doing things: it lets us change the size of a volume on the fly, i.e. resize a partition while it is online (still mounted). Integrating such a tool with Hadoop removes our problem at its root.

So let’s try and integrate both these tools…

Task Description: Integrating LVM with Hadoop and providing Elasticity to DataNode Storage.

All the steps below were done on RHEL 8 while logged in as the root user.

Let’s get started…

First we are going to need a Hadoop cluster. To configure one, install JDK (version 8 or above) and Hadoop (preferably version 1) on at least two systems on the same network (two systems that can ping each other). Then, on one of those systems, go to /etc/hadoop and configure two files:
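Assuming you have the JDK and Hadoop 1 rpm packages downloaded (the file names below are examples, not a prescription — use whatever versions you actually have), the installation on RHEL 8 looks roughly like:

```shell
# Install the JDK first, then Hadoop 1 (package file names are examples)
rpm -ivh jdk-8u171-linux-x64.rpm
rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force
```

The `--force` flag is commonly needed here because the old Hadoop 1 package's dependency checks do not match modern RHEL package names.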

1. hdfs-site.xml: where you simply make the changes shown below

2. core-site.xml: where, again, you make the changes shown below
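The original post showed these changes as screenshots. With Hadoop 1.x property names, the NameNode-side files typically look like the following — the port number is an assumption, and /arth is the metadata folder created in the next step:

```xml
<!-- hdfs-site.xml on the NameNode: where the cluster metadata lives -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/arth</value>
  </property>
</configuration>
```

```xml
<!-- core-site.xml on the NameNode: listen on all interfaces -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>
```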

Next, use the command mkdir /arth to make a new folder where the NameNode will store the cluster’s metadata.

Next you’re gonna have to run two commands:

hadoop namenode -format and hadoop-daemon.sh start namenode — this will format the metadata directory and start your NameNode. Next, on the other system, you will again have to make changes to the same files as shown below:

1. hdfs-site.xml

2. core-site.xml
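Again, the screenshots are not reproduced here; with Hadoop 1.x property names the DataNode-side files typically look like this, where `<namenode-ip>` is a placeholder for your NameNode’s address and the port is an assumption:

```xml
<!-- hdfs-site.xml on the DataNode: where the blocks are stored -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>
```

```xml
<!-- core-site.xml on the DataNode: point at the NameNode -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<namenode-ip>:9001</value>
  </property>
</configuration>
```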

Next, make a folder named dn in the / directory using the command mkdir /dn — this is the folder where all the data will be stored.

After this is done, run the command hadoop-daemon.sh start datanode to bring the DataNode daemon up.

Next, check whether the cluster is ready with the command hadoop dfsadmin -report on either of the two systems. If you can see a connected node, you’re ready to work on LVM; otherwise repeat the steps once again. Or try our tool, which simplifies this step for you; there is also an article explaining the tool in more detail.

Once our cluster is up and running, we need to make the DataNode capable of using LVM…

To do so, plug a new storage device into the system you configured as the DataNode. Then see what name the system has given it by using the command fdisk -l.

Here, in my case, the new device shows up as /dev/sdb with size 10 GiB. Next, let’s create a Physical Volume on this new device with the command pvcreate /dev/<name of the device>.

To see all the PVs in your system, use the command pvdisplay.
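Using the device name stated above (/dev/sdb), the two steps look like this as a sketch (run as root on the DataNode):

```shell
# Initialize the new 10 GiB disk as an LVM Physical Volume
pvcreate /dev/sdb

# List all Physical Volumes and their attributes
pvdisplay
```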

Next we are going to make a VG and add this new PV to it. To do so, use the command vgcreate <name of the VG> /dev/<name of the device>.

You can use the command vgdisplay to show the properties of all VGs, or vgdisplay <name of the VG> to see the properties of one specific VG.
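Concretely, assuming a VG name of myvg (borrowed from the tool demo later in this article; pick any name you like), this looks like:

```shell
# Create a Volume Group named myvg containing the new PV
vgcreate myvg /dev/sdb

# Inspect that specific Volume Group
vgdisplay myvg
```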

Next, let’s create an LV from our VG. Use the command lvcreate --size <size> --name <name for the LV> <name of the VG>.

Now the LV is ready to be used just like a normal static partition. Use the command lvdisplay to find the device path of your newly created LV. It still has to be formatted and mounted manually.
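A sketch of the whole sequence, assuming an illustrative LV name of mylv and a 5 GiB size (the ~5 GiB figure matches the DataNode size reported below; /dn is the DataNode directory from the cluster setup):

```shell
# Carve a 5 GiB Logical Volume named mylv out of myvg
lvcreate --size 5G --name mylv myvg

# Format it with ext4 and mount it over the DataNode's storage directory
mkfs.ext4 /dev/myvg/mylv
mount /dev/myvg/mylv /dn
```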

At this point we have made the DataNode capable of LVM. Now let’s see what we can do with our LVM-capable DataNode by trying to manipulate its volume. Currently my Hadoop report shows

three DataNodes, one with size ~5 GiB and another visible DataNode with size ~4 GiB. I’ll be changing the size of the DataNode named 49.36.35.26 while my teammate works on the other DataNode, named 47.31.9.98.

Now let’s try increasing the size of the DataNodes

We would have to run just two commands to do so:

1. lvextend --size +<size in GiB>G /dev/<name of the VG>/<name of the LV>

2. resize2fs /dev/<name of the VG>/<name of the LV>
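Assuming the illustrative VG/LV names myvg and mylv, the 4 GiB increase reported next would look like:

```shell
# Grow the LV by 4 GiB (the VG must have enough free space)
lvextend --size +4G /dev/myvg/mylv

# Grow the ext4 filesystem to fill the enlarged LV -- works while mounted
resize2fs /dev/myvg/mylv
```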

And just like that, the volume of my DataNode is increased by 4 GiB. Let’s see what the Hadoop report shows now.

And yeah, as you can see, the size has increased from ~5 GiB to ~9 GiB, all while keeping the volume alive (still mounted).

Automation using Python script…

As you might have seen, Python scripting gives us a lot of power. So why don’t we combine our knowledge of Python and LVM to build a tool that automates this hectic task of creating PVs, adding them to a VG, and finally creating an LV — all without the burden of remembering every command…

Task description: Automating LVM Partition using Python-Script.

Here is a link to the code of our tool in which the newest added option is about LVM. Let’s see how it works…
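The tool’s actual source is linked above. As a rough sketch of how such a menu-driven wrapper can be built (all function names here are illustrative, not the tool’s real code), the core idea is assembling LVM command lines and handing them to the shell via subprocess:

```python
import subprocess

def build_lv_command(action, vg, lv=None, size=None):
    """Assemble an LVM command line for a few common actions."""
    if action == "lvcreate":
        return ["lvcreate", "--size", size, "--name", lv, vg]
    if action == "lvextend":
        # A size like "+1G" grows the LV by that amount
        return ["lvextend", "--size", size, f"/dev/{vg}/{lv}"]
    if action == "lvdisplay":
        return ["lvdisplay", f"/dev/{vg}/{lv}"] if lv else ["lvdisplay"]
    raise ValueError(f"unknown action: {action}")

def run_lv_command(action, **kwargs):
    """Run the assembled command; requires root on a real system."""
    return subprocess.run(build_lv_command(action, **kwargs),
                          capture_output=True, text=True)

if __name__ == "__main__":
    # Example: grow mylv1 in myvg by 1 GiB, as demonstrated below
    print(build_lv_command("lvextend", vg="myvg", lv="mylv1", size="+1G"))
```

On a real DataNode you would follow the lvextend with resize2fs, exactly as in the manual steps above; separating command construction from execution also makes the wrapper easy to test without root access.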

As you can see, there is an option number 5, so just press 5 when prompted and you’ll land on a page with a menu dedicated to LVM.

Just like this… As you can see, it has all the options essential for working with LVM. Let’s try some of them…

As you can see, on pressing option 4 the tool asks whether you’d like to see any particular LV; if not, it prints all the LVs and their details for you. One of them, visible here, is /dev/myvg/mylv1, whose size is 1 GiB. Let’s look at its VG, named myvg.

Here you can see all the details of the VG myvg; the free size is just under 7 GiB (shown as <7 GiB). Let’s try to add some space to /dev/myvg/mylv1 and then see what happens to the VG.

Here I added 1 GiB of space to the LV. Let’s see whether we did this successfully.

Yeah, as you can see, the size has increased from 1 GiB to 2 GiB.

And correspondingly, the free size on the VG has been reduced to just under 6 GiB.

So that’s it… Task completed successfully!!!✔🎉

Thank you for reading. See you later!!! 😊👋
