
Hadoop 101: Getting Started.

testuser996
5 min read · Dec 16, 2017


Sometimes we try very hard to get something done but are stopped by a very small margin. It is only when we halt, reorganize, and look around for help that the problem gets solved.

I had a similar experience while getting familiar with Hadoop.

And here is my take: try, try till you succeed. Anyway, this post is about the correct way to install Hadoop and run the Word Count program on a Shakespeare literature text file.

System Specifications:

A 2 GB Lubuntu VM (Lubuntu is a lightweight version of Ubuntu; it can be found here)

I had found a similar script, but it was a bit old, so I modified it. Here it is.

After running it, most of the components will be in place.
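The script itself is not reproduced here, so below is a rough sketch of what such a setup script typically does. The Java package, download URL, and install path are assumptions chosen to match the rest of this post (Hadoop 2.8.2 under /usr/local/hadoop):

sudo apt-get update
sudo apt-get install -y openjdk-8-jdk ssh rsync   # Java and SSH are Hadoop prerequisites
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.8.2/hadoop-2.8.2.tar.gz
sudo mkdir -p /usr/local/hadoop
sudo tar -xzf hadoop-2.8.2.tar.gz -C /usr/local/hadoop   # yields /usr/local/hadoop/hadoop-2.8.2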

Follow this guide to complete the installation configurations:

Alternatively, you can use this guide as a reference for the installation.

Once all of the above is done, create a folder on the Desktop and create the Mapper, Reducer, and WordCount Java classes.

WordCount.java:

package org;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver class: configures and submits the word-count job.
public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        // The reducer is also used as a combiner to pre-aggregate counts on the map side.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

WordCountReducer.java

package org;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer: sums the counts emitted for each word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result); // emit (word, total count)
    }
}

WordCountMapper.java

package org;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper: splits each line into tokens and emits (word, 1) for every token.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one); // each occurrence counts as 1
        }
    }
}

Create a manifest file named Manifest.txt:

Main-Class: org.WordCount

Hit Enter after the word WordCount; the manifest must end with a newline, or the Main-Class entry will be ignored.

Notice here that org is a subfolder in which the compiled class files are stored. In my case I am in a folder called MR, and the folder input contains the text file input.txt.
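For reference, this is the folder layout I am assuming once everything is compiled (names as used above):

MR/
  Manifest.txt
  WordCount.java   WordCountMapper.java   WordCountReducer.java
  org/             # created by javac -d . ; holds the three .class files
  input/
    input.txt      # the Shakespeare text
  WordCount.jar    # created in the next step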

Now that we have all the required components, we create a JAR file and run Hadoop.

hadoop@hadoop:~/Desktop/MR$ export CLASSPATH="/usr/local/hadoop/hadoop-2.8.2/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.8.2.jar:/usr/local/hadoop/hadoop-2.8.2/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.8.2.jar:/usr/local/hadoop/hadoop-2.8.2/share/hadoop/common/hadoop-common-2.8.2.jar:$HOME/Desktop/MR/*:/usr/local/hadoop/hadoop-2.8.2/lib/*"
hadoop@hadoop:~/Desktop/MR$ javac -d . WordCount.java WordCountMapper.java WordCountReducer.java
hadoop@hadoop:~/Desktop/MR$ jar cfm WordCount.jar Manifest.txt org/*.class

This should create the class files and a JAR file called WordCount.jar.
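If the jar file names or versions under your installation differ, a simpler alternative (assuming the hadoop command is already on your PATH) is to let Hadoop print its own classpath instead of hard-coding jar paths:

export CLASSPATH=$(hadoop classpath)
javac -d . WordCount.java WordCountMapper.java WordCountReducer.java
jar cfm WordCount.jar Manifest.txt org/*.class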

Now we start Hadoop. I have installed Hadoop in

/usr/local/hadoop/hadoop-2.8.2

and the start scripts live in its sbin subdirectory.

If you have configured Hadoop correctly, the Hadoop sbin directory should already be on your PATH. For more info on installing it correctly, check here: http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_16_04_single_node_cluster.php
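If it is not on your PATH yet, adding something like the following to ~/.bashrc should do it; the HADOOP_HOME value is simply the install location used in this post:

export HADOOP_HOME=/usr/local/hadoop/hadoop-2.8.2
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin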

To start, we are going to use

./start-dfs.sh

And after that

./start-yarn.sh
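Before submitting the job, it is worth checking that the daemons came up and that the input text is actually in HDFS. A minimal sketch, assuming input.txt sits in the local input folder and using the HDFS path /input that the job command below expects:

jps                                   # should list NameNode, DataNode, ResourceManager and NodeManager
hdfs dfs -mkdir -p /input             # create the input directory in HDFS
hdfs dfs -put input/input.txt /input  # upload the Shakespeare text
hdfs dfs -ls /input                   # confirm the file is there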

Important: I am now in the MR folder on the Desktop with all the setup done. Run the following commands:

hadoop@hadoop:~/Desktop/MR$ hadoop jar WordCount.jar org.WordCount /input output
Java HotSpot(TM) Server VM warning: You have loaded library /usr/local/hadoop/hadoop-2.8.2/lib/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
17/11/18 19:48:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/18 19:48:25 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/11/18 19:48:28 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/11/18 19:48:28 INFO input.FileInputFormat: Total input files to process : 1
17/11/18 19:48:29 INFO mapreduce.JobSubmitter: number of splits:1
17/11/18 19:48:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1511060028477_0001
17/11/18 19:48:30 INFO impl.YarnClientImpl: Submitted application application_1511060028477_0001
17/11/18 19:48:30 INFO mapreduce.Job: The url to track the job: http://hadoop:8088/proxy/application_1511060028477_0001/
17/11/18 19:48:30 INFO mapreduce.Job: Running job: job_1511060028477_0001
17/11/18 19:48:55 INFO mapreduce.Job: Job job_1511060028477_0001 running in uber mode : false
17/11/18 19:48:55 INFO mapreduce.Job: map 0% reduce 0%
17/11/18 19:49:28 INFO mapreduce.Job: map 100% reduce 0%
17/11/18 19:49:50 INFO mapreduce.Job: map 100% reduce 100%
17/11/18 19:50:02 INFO mapreduce.Job: Job job_1511060028477_0001 completed successfully
17/11/18 19:50:02 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=983515
FILE: Number of bytes written=2243381
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=5465500
HDFS: Number of bytes written=721220
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=30132
Total time spent by all reduces in occupied slots (ms)=19102
Total time spent by all map tasks (ms)=30132
Total time spent by all reduce tasks (ms)=19102
Total vcore-milliseconds taken by all map tasks=30132
Total vcore-milliseconds taken by all reduce tasks=19102
Total megabyte-milliseconds taken by all map tasks=30855168
Total megabyte-milliseconds taken by all reduce tasks=19560448
Map-Reduce Framework
Map input records=124796
Map output records=904087
Map output bytes=8575070
Map output materialized bytes=983515
Input split bytes=103
Combine input records=904087
Combine output records=67799
Reduce input groups=67799
Reduce shuffle bytes=983515
Reduce input records=67799
Reduce output records=67799
Spilled Records=135598
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=957
CPU time spent (ms)=13940
Physical memory (bytes) snapshot=386445312
Virtual memory (bytes) snapshot=1198010368
Total committed heap usage (bytes)=287834112
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=5465397
File Output Format Counters
Bytes Written=721220

This is the output you should get.

The results can be browsed in your HDFS storage directory:

The _SUCCESS file indicates that the job succeeded. The part file holds the actual word-count output; it can be found here: part-r-00000
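To inspect the result directly from HDFS (the job above wrote to the relative HDFS directory output), something like this should work:

hdfs dfs -ls output                           # shows _SUCCESS and part-r-00000
hdfs dfs -cat output/part-r-00000 | head -20  # first few word/count pairs
hdfs dfs -get output ./wordcount-output       # copy the whole result folder to local disk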

The original Shakespeare Text file can be found here: input.txt

References:

Please check out my Git repository for a quick-start VMware disk of Lubuntu 32-bit with Hadoop pre-configured.

Please watch this video to see the execution of the above program:

Disclaimer:

This post is a collection of information from all across the internet. No copyright infringement intended.
