Part I: Build Hadoop From Source on macOS

The first thing you should know is that the official Apache documentation is not particularly useful for macOS: it essentially just tells you to use a Docker container.
The major impediment is that Homebrew, by default, gives you the wrong version of Protocol Buffers (Google's data interchange format) for the Hadoop build.
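Before doing anything else, it's worth checking which protoc is currently on your PATH; assuming a Homebrew-managed protobuf, these two commands will tell you:
protoc --version
brew info protobuf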
Below I’ll outline the steps I took to build Apache Hadoop from source on macOS in August of 2017.
1. Create a GitHub fork/clone of github.com/apache/hadoop. Unfortunately there is no good README document associated with this repo.
2. From within the hadoop directory run the following command:
mvn clean install -DskipTests
That should result in console output that looks like this:

3. It’s this step in the build process that gave me trouble, namely:
mvn package -Pdist -Pnative -Dtar -DskipTests
Pay close attention to the ERROR message, because it tells you what you need to do. As they write on StackOverflow:

Be careful, because this isn’t the full story, not for Hadoop anyway. That post recommends the use of v2.4.1, but we in fact need v2.5.0.

Here you can see how I created the symbolic link to the correct version, and added it to my ~/.zshrc.
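In essence it boils down to something like the sketch below. The formula name and Cellar path are illustrative rather than definitive; adjust them to wherever protobuf 2.5.0 actually lives on your machine:
# install a 2.5.0 protobuf alongside the default one (formula name may differ; try `brew search protobuf`)
brew install protobuf@2.5
# point protoc at the 2.5.0 binary (path is illustrative)
ln -sf /usr/local/Cellar/protobuf@2.5/2.5.0/bin/protoc /usr/local/bin/protoc
# and make sure the 2.5.0 bin directory wins on the PATH, via ~/.zshrc
echo 'export PATH="/usr/local/Cellar/protobuf@2.5/2.5.0/bin:$PATH"' >> ~/.zshrc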

Finally it works like so:


Part II: Running Hadoop Examples
Apache maintains a guide to getting started on Hadoop:
According to the information therein our configuration files should live here:
etc/hadoop/
However, on macOS we don't have any such directory. Since we installed with Homebrew, our files in fact live here:
/usr/local/Cellar/hadoop/2.8.1/libexec/etc/hadoop
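If your Hadoop version differs from 2.8.1, you can ask Homebrew where it put things and build the path from that; this should point at the same directory:
echo "$(brew --prefix hadoop)/libexec/etc/hadoop"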
We need our core-site.xml to look like this:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Likewise, hdfs-site.xml needs a small edit, setting the replication factor to 1.
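For reference (the same file appears again in Part II below), it should look like this:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>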
Next, check whether you can actually ssh to localhost:
> ssh localhost
> ssh: connect to host localhost port 22: Connection refused
To make this work you need to enable Remote Login in System Preferences, under Sharing:

Subsequently ssh localhost should work.
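If you'd rather flip that switch from the terminal, macOS ships a systemsetup utility that toggles the same Remote Login setting:
sudo systemsetup -setremotelogin on
# confirm it took effect
sudo systemsetup -getremotelogin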


At first I was freaked out that the NameNode shut down right after I started it up, but this is in fact normal and does not indicate that an error has occurred.
The screenshot below does in fact indicate an error:

The following output from hadoop checknative is related to that NativeCodeLoader warning we keep getting:

But the fact that we don't have those native libraries is not a big deal; things can still function.
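If that repeated warning gets on your nerves, one commonly suggested workaround (I'll hedge here, since I haven't dug into it deeply) is to raise the log threshold for that particular class in log4j.properties:
# append to /usr/local/Cellar/hadoop/2.8.1/libexec/etc/hadoop/log4j.properties
log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR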
As things weren't working, I followed some of the advice[2] about what to add to my path, as you can see below:
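Roughly, the kind of additions that advice suggests for ~/.zshrc look like this; the paths assume the Homebrew 2.8.1 layout, so treat it as a sketch rather than gospel:
# ~/.zshrc
export HADOOP_HOME=/usr/local/Cellar/hadoop/2.8.1/libexec
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin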

That appeared to get something working, as I saw after paying a visit to:
http://localhost:50070/dfshealth.html#tab-overview
However, at this point, I’m not convinced that what I have there is enough to kick things off in the right way.
One of the most important things to grok about this whole process is hdfs dfs -put: it's how you load input files into the Hadoop file system so they can subsequently be processed. My failure to comprehend this was the cause of lots of subsequent trouble.
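In its simplest form the command just takes a local path and an HDFS destination; the file name below is made up, purely to show the shape of it:
# copy a local file into a directory that already exists in HDFS (file name is hypothetical)
hdfs dfs -put ./my-input.txt /user/s.matthew.english/input/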
Here's something I tried out; I can't remember exactly what I was intending to do. Probably I was under the impression that I had to add my Hadoop executable to the Hadoop file system, which in retrospect seems stupid, but I can imagine how at the time it seemed like a reasonable thing to try.
hdfs dfs -put /usr/local/Cellar/hadoop/2.8.1/libexec/etc/hadoop
Below you can see me attempting to run the `Pseudo-Distributed Operation`, albeit unsuccessfully.

The screenshot following this sentence is slightly closer to the mark, as I began to realize that I had to align my username, i.e. s.matthew.english, with the Hadoop file system.

Here's something useful: back when I had poorly configured Hadoop settings, the output of the hdfs dfsadmin -report command was full of zeros and generally depressing information:

In contrast, look at what it displays now that my files are configured correctly (at least I strongly suspect that is the reason for the rosier picture):

Just now I made reference to my configuration files. "What are they exactly, and what do they look like?" is the question you're likely asking yourself, so please allow me to show you.
Let's consider /usr/local/Cellar/hadoop/2.8.1/libexec/etc/hadoop, the directory that contains the important file core-site.xml, which should be structured as follows:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Additionally, we’ll need to change hdfs-site.xml to look like this:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Within /usr/local/Cellar/hadoop/2.8.1/libexec/etc/hadoop there are many other files, but for the purposes of the `Pseudo-Distributed Operation` demo, it seems they can be safely ignored.
When I got hip to the fact that hadoop fs -mkdir -p was actually meant to create a directory in the Hadoop file system for me to add input files to, I started playing around with some commands like the ones below. I'm still not 100% sure, but it seems like this kind of thing is useful.
> hadoop fs -mkdir -p /user/s.matthew.english
> hadoop fs -ls /
> 17/09/05 17:56:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Found 1 items
> drwxr-xr-x - s.matthew.english supergroup 0 2017-09-05 17:54 /user
At this point I've been playing around all day, so I'm not EXACTLY sure which set of changes and configurations led me to the result I have now, but it's some permutation of what I have recorded here.
One of the things I found most frustrating today was that I had to continually enter my password every time I ran the command start-all.sh. It was a big relief when I found out about this neat trick:
> eval `ssh-agent -s`
Agent pid 13674
> ssh-add
Identity added: /Users/s.matthew.english/.ssh/id_rsa (/Users/s.matthew.english/.ssh/id_rsa)
> ssh localhost
Last login: Tue Sep 5 18:34:26 2017 from 127.0.0.1
That allowed me to obviate the need to enter my password all the time, as you can see here:

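An alternative that avoids the agent altogether is the passphraseless-key setup from Apache's single-node guide, which goes roughly like this (skip the first command if you already have a key):
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys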
Not sure if this is useful information (to our current task) or not, currently trending towards not, but if you want to know the hostname of your machine on macOS, you can figure it out like this:
> hostname
> Matthews-MacBook-Pro.local
So then, it seems that after you configure your settings files as I've demonstrated above, the following sequence is what you need in order to execute the `Pseudo-Distributed Operation` demo.
hdfs dfs -mkdir /user/s.matthew.english
Output the contents of HDFS:
hadoop fs -ls /
This next command is unsettling, as it ends with the NameNode shutting down with much fanfare, but it seems to be doing the right thing:
hdfs namenode -format
This seems to be how we kick things off:
start-all.sh
Or maybe you can get away with just running this:
start-dfs.sh
At first I thought we needed to add the configuration files as input, arranged in such a way as to facilitate proper execution of our application. Now that I've read more about how this demo actually works, it seems the demo just runs a grep operation over those files, so what is written in them isn't important. Anyway, here's the first command I tried in order to load them into our Hadoop file system:
hdfs dfs -put /usr/local/Cellar/hadoop/2.8.1/libexec/bin/hadoop /Users/s.matthew.english/ConsenSys/PegaSys/hadoop_by_example/input
That didn't work, so I cut it down to this:
hdfs dfs -put /usr/local/Cellar/hadoop/2.8.1/libexec/bin/hadoop input
The above command is also wrong, because we need to put the input files; there's no need to reference the Hadoop executable just yet. So then, it should be like this:
hdfs dfs -put /Users/s.matthew.english/ConsenSys/PegaSys/hadoop_by_example/input input
Subsequently, it could be that we get some reasonable output from this command:
hadoop fs -ls /user/s.matthew.english/input
Finally, the coup de grâce: execute the demo with this one:
hadoop jar /usr/local/Cellar/hadoop/2.8.1/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar grep input output 'dfs[a-z.]+'
The partial output, cleaned up a bit to get rid of verbosity, looks like this:
17/09/05 18:55:40 INFO mapreduce.Job: Counters: 35
    File System Counters
        FILE: Number of bytes read=604352
        FILE: Number of bytes written=1271045
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=13028
        HDFS: Number of bytes written=137
        HDFS: Number of read operations=13
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Map-Reduce Framework
        Map input records=169
        Map output records=4
        Map output bytes=71
        Map output materialized bytes=45
        Input split bytes=115
        Combine input records=4
        Combine output records=2
        Reduce input groups=2
        Reduce shuffle bytes=45
        Reduce input records=2
        Reduce output records=2
        Spilled Records=4
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=0
        Total committed heap usage (bytes)=714080256
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=6514
    File Output Format Counters
        Bytes Written=137
Here’s some additional output that looked interesting:
File System Counters
    FILE: Number of bytes read=1208878
    FILE: Number of bytes written=2542231
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=13302
    HDFS: Number of bytes written=297
    HDFS: Number of read operations=39
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=16
Map-Reduce Framework
    Map input records=2
    Map output records=2
    Map output bytes=35
    Map output materialized bytes=45
    Input split bytes=142
    Combine input records=0
    Combine output records=0
    Reduce input groups=2
    Reduce shuffle bytes=45
    Reduce input records=2
    Reduce output records=2
    Spilled Records=4
    Shuffled Maps =1
    Failed Shuffles=0
    Merged Map outputs=1
    GC time elapsed (ms)=0
    Total committed heap usage (bytes)=969932800
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
File Input Format Counters
    Bytes Read=137
File Output Format Counters
    Bytes Written=23
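Counters are one thing, but to see the actual matches the grep job found you can read the result files straight out of HDFS; something like this should do it:
hdfs dfs -cat output/*
# or pull the whole output directory down first (local directory name is arbitrary)
hdfs dfs -get output local-output
cat local-output/*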
Here’s some information from the Hadoop dashboard:
http://localhost:50070/dfshealth.html#tab-datanode



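One last housekeeping note: when you're done experimenting, shut the daemons down again. Assuming you started everything with start-all.sh, this should suffice:
stop-all.sh
# or, if you only ran start-dfs.sh:
stop-dfs.sh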