
Hadoop 101: Getting Started.

testuser996
5 min read · Dec 16, 2017


Sometimes we try very hard to get something done but are stopped by a very small margin. It is only when we halt, reorganize, and look around for help that the problem gets solved.

I had a similar experience while getting familiar with Hadoop.

And here is my take: try, try till you succeed. Anyway, this post is about the correct way to install Hadoop and run the Word Count program on a Shakespeare literature text file.

System Specifications:

A 2 GB Lubuntu VM (Lubuntu is a lightweight version of Ubuntu; it can be found here)

I had found a similar script, but it was a bit old, so I modified it. Here it is.

After running it, most of the components will be in place.
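The script itself is not reproduced here, so below is a rough sketch of what such a setup script typically does. The Java package, download URL, and install path are assumptions chosen to match the rest of this post (Hadoop 2.8.2 under /usr/local/hadoop):

sudo apt-get update
sudo apt-get install -y openjdk-8-jdk ssh rsync   # Java and SSH are Hadoop prerequisites
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.8.2/hadoop-2.8.2.tar.gz
sudo mkdir -p /usr/local/hadoop
sudo tar -xzf hadoop-2.8.2.tar.gz -C /usr/local/hadoop   # yields /usr/local/hadoop/hadoop-2.8.2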

Follow this guide to complete the installation configurations:

Alternatively, you can use this guide as a reference for the installation.

Once all of the above is done, create a folder on the Desktop and create the Mapper, Reducer, and WordCount Java classes.

WordCount.java:

package org;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver class: configures and submits the word-count job.
public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        // The reducer is also used as a combiner to pre-aggregate counts on the map side.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

WordCountReducer.java

package org;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer: sums the counts emitted for each word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result); // emit (word, total count)
    }
}

WordCountMapper.java

package org;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper: splits each line into tokens and emits (word, 1) for every token.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one); // each occurrence counts as 1
        }
    }
}

Create a manifest file named Manifest.txt:

Main-Class: org.WordCount

Hit Enter after the word WordCount; the manifest must end with a newline, or the Main-Class entry will be ignored.

Notice here that org is a subfolder in which the compiled class files are stored. In my case I am in a folder called MR, and the folder input contains the text file input.txt.
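For reference, this is the folder layout I am assuming once everything is compiled (names as used above):

MR/
  Manifest.txt
  WordCount.java   WordCountMapper.java   WordCountReducer.java
  org/             # created by javac -d . ; holds the three .class files
  input/
    input.txt      # the Shakespeare text
  WordCount.jar    # created in the next step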

Now that we have all the required components, we create a JAR file and run Hadoop.

hadoop@hadoop:~/Desktop/MR$ export CLASSPATH="/usr/local/hadoop/hadoop-2.8.2/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.8.2.jar:/usr/local/hadoop/hadoop-2.8.2/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.8.2.jar:/usr/local/hadoop/hadoop-2.8.2/share/hadoop/common/hadoop-common-2.8.2.jar:$HOME/Desktop/MR/*:/usr/local/hadoop/hadoop-2.8.2/lib/*"
hadoop@hadoop:~/Desktop/MR$ javac -d . WordCount.java WordCountMapper.java WordCountReducer.java
hadoop@hadoop:~/Desktop/MR$ jar cfm WordCount.jar Manifest.txt org/*.class

This should create the class files and a JAR file called WordCount.jar.
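If the jar file names or versions under your installation differ, a simpler alternative (assuming the hadoop command is already on your PATH) is to let Hadoop print its own classpath instead of hard-coding jar paths:

export CLASSPATH=$(hadoop classpath)
javac -d . WordCount.java WordCountMapper.java WordCountReducer.java
jar cfm WordCount.jar Manifest.txt org/*.class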

Now we start Hadoop. I have installed Hadoop in

/usr/local/hadoop/hadoop-2.8.2

and the start scripts live in its sbin subdirectory.

If you have configured Hadoop correctly, the Hadoop sbin directory should already be on your PATH. For more info on installing it correctly, check here: http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_16_04_single_node_cluster.php
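If it is not on your PATH yet, adding something like the following to ~/.bashrc should do it; the HADOOP_HOME value is simply the install location used in this post:

export HADOOP_HOME=/usr/local/hadoop/hadoop-2.8.2
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin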

To start, we are going to use

./start-dfs.sh

And after that

./start-yarn.sh
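Before submitting the job, it is worth checking that the daemons came up and that the input text is actually in HDFS. A minimal sketch, assuming input.txt sits in the local input folder and using the HDFS path /input that the job command below expects:

jps                                   # should list NameNode, DataNode, ResourceManager and NodeManager
hdfs dfs -mkdir -p /input             # create the input directory in HDFS
hdfs dfs -put input/input.txt /input  # upload the Shakespeare text
hdfs dfs -ls /input                   # confirm the file is there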

Important: I am now in the MR folder on the Desktop with all the setup done. Run the following commands:

hadoop@hadoop:~/Desktop/MR$ hadoop jar WordCount.jar org.WordCount /input output
Java HotSpot(TM) Server VM warning: You have loaded library /usr/local/hadoop/hadoop-2.8.2/lib/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
17/11/18 19:48:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/18 19:48:25 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/11/18 19:48:28 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/11/18 19:48:28 INFO input.FileInputFormat: Total input files to process : 1
17/11/18 19:48:29 INFO mapreduce.JobSubmitter: number of splits:1
17/11/18 19:48:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1511060028477_0001
17/11/18 19:48:30 INFO impl.YarnClientImpl: Submitted application application_1511060028477_0001
17/11/18 19:48:30 INFO mapreduce.Job: The url to track the job: http://hadoop:8088/proxy/application_1511060028477_0001/
17/11/18 19:48:30 INFO mapreduce.Job: Running job: job_1511060028477_0001
17/11/18 19:48:55 INFO mapreduce.Job: Job job_1511060028477_0001 running in uber mode : false
17/11/18 19:48:55 INFO mapreduce.Job: map 0% reduce 0%
17/11/18 19:49:28 INFO mapreduce.Job: map 100% reduce 0%
17/11/18 19:49:50 INFO mapreduce.Job: map 100% reduce 100%
17/11/18 19:50:02 INFO mapreduce.Job: Job job_1511060028477_0001 completed successfully
17/11/18 19:50:02 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=983515
FILE: Number of bytes written=2243381
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=5465500
HDFS: Number of bytes written=721220
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=30132
Total time spent by all reduces in occupied slots (ms)=19102
Total time spent by all map tasks (ms)=30132
Total time spent by all reduce tasks (ms)=19102
Total vcore-milliseconds taken by all map tasks=30132
Total vcore-milliseconds taken by all reduce tasks=19102
Total megabyte-milliseconds taken by all map tasks=30855168
Total megabyte-milliseconds taken by all reduce tasks=19560448
Map-Reduce Framework
Map input records=124796
Map output records=904087
Map output bytes=8575070
Map output materialized bytes=983515
Input split bytes=103
Combine input records=904087
Combine output records=67799
Reduce input groups=67799
Reduce shuffle bytes=983515
Reduce input records=67799
Reduce output records=67799
Spilled Records=135598
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=957
CPU time spent (ms)=13940
Physical memory (bytes) snapshot=386445312
Virtual memory (bytes) snapshot=1198010368
Total committed heap usage (bytes)=287834112
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=5465397
File Output Format Counters
Bytes Written=721220

This is the output you should get.

The results can be browsed in your HDFS storage directory:

The _SUCCESS file indicates that the job succeeded. The part file holds the actual word-count output; it can be found here: part-r-00000
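To inspect the result directly from HDFS (the job above wrote to the relative HDFS directory output), something like this should work:

hdfs dfs -ls output                           # shows _SUCCESS and part-r-00000
hdfs dfs -cat output/part-r-00000 | head -20  # first few word/count pairs
hdfs dfs -get output ./wordcount-output       # copy the whole result folder to local disk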

The original Shakespeare Text file can be found here: input.txt

References:

Please check out my Git repository for a quick-start VMware disk of Lubuntu 32-bit with Hadoop pre-configured.

Please watch this video to see the execution of the above program:

Disclaimer:

This post is a collection of information from all across the internet. No copyright infringement intended.
