Hadoop Performance Evaluation by Benchmarking and Stress Testing with TeraSort and TestDFSIO
Nowadays, Big Data is a hot topic in industry. The importance of Big Data doesn’t revolve around how much data you have, but what you do with it. You can take data from any source and analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and optimized offerings, and 4) smart decision making. When you combine big data with high-powered analytics, you can accomplish many business-related tasks.
Hadoop has been developed as a solution to Big Data. It provides a reliable, scalable, fault-tolerant and efficient service for large-scale data processing based on HDFS and MapReduce. HDFS stands for the Hadoop Distributed File System and provides the distributed storage for the system. MapReduce provides the distributed processing for Hadoop.
In this article, we will analyse different stats such as running time, I/O rate and throughput by running the TestDFSIO and TeraSort benchmarks on a Hadoop cluster.
The experimental cluster I used consists of 6 nodes. One of them serves as the master node and the other 5 serve as slave nodes (core nodes). The master node manages the cluster and typically runs the master components of distributed applications. Core nodes run the DataNode daemon to coordinate data storage as part of the Hadoop Distributed File System (HDFS). I used AWS EMR to create the cluster. The hardware information of each node is as follows:
- 8 vCore
- 15 GiB memory
- 80 GB SSD storage
This configuration corresponds to the m3.xlarge instance type.
I ssh-ed into the master node to run the following benchmarks from the command line interface.
The TestDFSIO benchmark is a read and write test for HDFS. It is helpful for stress testing HDFS, discovering performance bottlenecks in your network, shaking out the hardware, OS and Hadoop setup of your cluster machines (particularly the NameNode and the DataNodes), and getting a first impression of how fast your cluster is in terms of I/O.
TestDFSIO is designed to use 1 map task per file, i.e. it is a 1:1 mapping from files to map tasks. Splits are defined so that each map gets only one filename, which it creates (-write) or reads (-read).
The command to run a test:
hadoop jar hadoop-*test*.jar TestDFSIO -write|-read -nrFiles <no. of output files> -fileSize <size of one file>
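To make the placeholders concrete, here is a minimal sketch that only prints the command for a given mode, number of files and file size, so it can be checked before it is run. The helper name `dfsio_cmd` and the values (5 files of 2 GB) are illustrative assumptions, and the sketch assumes that `-fileSize` is interpreted in MB.

```shell
# Hypothetical helper: print (do not run) a TestDFSIO command line.
# Assumption: -fileSize is given in MB.
dfsio_cmd() {
  local mode="$1"      # "write" or "read"
  local nr_files="$2"  # number of files, which is also the number of map tasks
  local size_gb="$3"   # size of each file in GB
  local size_mb=$(( size_gb * 1024 ))
  echo "hadoop jar hadoop-*test*.jar TestDFSIO -${mode} -nrFiles ${nr_files} -fileSize ${size_mb}"
}

# Illustrative values: 5 files of 2 GB each, written first and then read back.
dfsio_cmd write 5 2
dfsio_cmd read 5 2
```

The read test reads the files created by the write test, so the write test has to run first with the same number of files and file size.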
The TeraSort benchmark is used to test both MapReduce and HDFS by sorting some amount of data as quickly as possible, in order to measure how well the cluster distributes and processes files with MapReduce. This benchmark consists of 3 components:
- TeraGen - generates random data
- TeraSort - does the sorting using MapReduce
- TeraValidate - used to validate the output
To generate random data, the following command is used.
hadoop jar $HADOOP_HOME/hadoop-*examples*.jar teragen <number of 100-byte rows> <input dir>
Suppose you want to generate 10 GB of data; then the number of 100-byte rows will be 10⁸.
Explanation: 10 GB = 10¹⁰ bytes = 10⁸ × 100 bytes
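The row count is simply the target size in bytes divided by 100. A small sketch of that arithmetic, assuming decimal gigabytes (1 GB = 10⁹ bytes); the function name `teragen_rows` is hypothetical:

```shell
# Hypothetical helper: number of 100-byte rows TeraGen needs for a target size.
# Assumption: decimal gigabytes, i.e. 1 GB = 1,000,000,000 bytes.
teragen_rows() {
  local size_gb="$1"
  echo $(( size_gb * 1000000000 / 100 ))
}

# Row counts for the data sizes used in this article.
for gb in 10 20 30 40 50; do
  echo "${gb} GB -> $(teragen_rows "$gb") rows"
done
```

The printed value is what goes into the first argument of the teragen command.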
To sort the generated data, the following command is used.
hadoop jar $HADOOP_HOME/hadoop-*examples*.jar terasort <input dir> <output dir>
To ensure the data was sorted correctly, the following command is used.
hadoop jar $HADOOP_HOME/hadoop-*examples*.jar teravalidate <output dir> <terasort-validate dir>
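The three commands form a chain: the directory written by TeraGen is the input of TeraSort, and the directory written by TeraSort is the input of TeraValidate. The sketch below prints the three commands for one run so the chaining is visible. The function name `terasort_plan`, the sub-directory names and the base path `/benchmarks/terasort-10g` are assumptions made for illustration.

```shell
# Hypothetical helper: print the three TeraSort stages for one run, in order.
# Each stage reads the directory written by the previous one.
terasort_plan() {
  local rows="$1" base="$2"
  # Kept literal on purpose so the printed commands match the ones in the text.
  local jar='$HADOOP_HOME/hadoop-*examples*.jar'
  echo "hadoop jar ${jar} teragen ${rows} ${base}/input"
  echo "hadoop jar ${jar} terasort ${base}/input ${base}/output"
  echo "hadoop jar ${jar} teravalidate ${base}/output ${base}/validate"
}

# Illustrative values: 10^8 rows (10 GB) under an assumed HDFS base directory.
terasort_plan 100000000 /benchmarks/terasort-10g
```

Since the function only prints, the plan can be reviewed first and the commands run one by one afterwards.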
Challenges faced while running the benchmark
Since the instances I used were of the m3.xlarge type, memory and storage were limited. As a result, I faced a few challenges while benchmarking.
- After running each benchmark test, HDFS should be cleaned so that the next benchmark can be run without any storage issue.
- No two benchmark tests should be run simultaneously so as to avoid memory issues.
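Both points can be handled by a small wrapper that runs the steps strictly one after another and cleans up at the end. The following is a sketch under assumptions, not the exact script I used: the `run` and `dfsio_round` helpers, the `DRY_RUN` switch and the `hadoop-test.jar` placeholder are all hypothetical, while `-clean` is TestDFSIO's own option for removing its test data.

```shell
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 on a real cluster.
DRY_RUN="${DRY_RUN:-1}"
# Placeholder jar name (assumption); point this at the test jar of your installation.
TEST_JAR="${TEST_JAR:-hadoop-test.jar}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# One TestDFSIO round: write, then read, then clean up HDFS.
# Nothing is started in the background, so no two tests overlap.
dfsio_round() {
  local nr_files="$1" size_mb="$2"
  # The read test only runs if the write test succeeded.
  run hadoop jar "$TEST_JAR" TestDFSIO -write -nrFiles "$nr_files" -fileSize "$size_mb" &&
    run hadoop jar "$TEST_JAR" TestDFSIO -read -nrFiles "$nr_files" -fileSize "$size_mb"
  # Cleanup always runs, so the next benchmark starts with free storage.
  run hadoop jar "$TEST_JAR" TestDFSIO -clean
}

dfsio_round 5 2048
```

For TeraSort, the equivalent cleanup step would be removing the input, output and validation directories, for example with `hdfs dfs -rm -r -skipTrash <dir>`.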
I ran the TestDFSIO read and write tests to assess the Hadoop performance of the experimental cluster I created. The tests were run on 5 files, for data sizes of 10 GB, 20 GB, 30 GB, 40 GB and 50 GB.
Following are the different stats extracted from the TestDFSIO benchmark.
In the above figure, you can see that the read operation is a bit faster than the write operation. This is expected: on storage devices such as HDDs and SSDs, writes are generally slower than reads, and an HDFS write must also replicate each block to other DataNodes.
In Hadoop, a file is split into blocks, which are processed in parallel and independently of each other. Because of this parallel processing, HDFS achieves good throughput.
I/O rate is the speed at which data is transferred between the disk and RAM.
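TestDFSIO reports both a throughput and an average I/O rate, and the two are not the same number. As I understand the benchmark's report, throughput is the total data size divided by the summed time of all map tasks, while the average I/O rate is the mean of the per-task rates. The sketch below computes both from per-task lines of the form "size in MB, time in seconds"; this input format, the function name `dfsio_metrics` and the sample numbers are assumptions for illustration, not measurements from my cluster.

```shell
# Hypothetical helper: read "size_mb time_sec" lines (one per map task) from stdin
# and print both metrics in MB/s.
dfsio_metrics() {
  LC_ALL=C awk '
    { size += $1; time += $2; rate += $1 / $2; n++ }
    END {
      # throughput: total size over total task time
      # avg_io_rate: arithmetic mean of the per-task rates
      printf "throughput=%.2f avg_io_rate=%.2f\n", size / time, rate / n
    }'
}

# Made-up sample: two tasks, each writing 1000 MB, taking 10 s and 20 s.
printf '%s\n' "1000 10" "1000 20" | dfsio_metrics
```

When all tasks run at the same speed the two values coincide; a large gap between them suggests that some tasks were noticeably slower than others.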
I ran the TeraSort test to assess the Hadoop performance of the experimental cluster I created. The test was run on 5 files, for data sizes of 10 GB, 20 GB, 30 GB, 40 GB and 50 GB. First I used TeraGen to generate the data to be sorted, then TeraSort to sort the data, and finally TeraValidate to validate the sorted results.
Following are the different stats extracted from the TeraSort benchmark.
Several interesting details can be observed from the above stats:
- A larger amount of data to be sorted means a longer runtime, as expected.
- For intermediate data sizes such as 30 GB and 40 GB, a cluster of 1 master and 5 slaves is ideal.
- For larger data sizes, a cluster of 1 master and 5 slaves yields less gain.
Note: I made the bar plots using matplotlib.
Given that Hadoop is expected to handle terabytes of information, these results can be considered highly encouraging, since performance gains are possible even when handling small data sets.