Getting started with GridDB connector for Hadoop MapReduce
The GridDB Hadoop MapReduce Connector allows you to use GridDB as the data storage engine for Hadoop MapReduce applications with only a few small changes to their source code. In this post we’ll take a look at how to install and use GridDB’s Hadoop MapReduce Connector.
Starting from a fresh CentOS 6.8 image, we first need to install the Oracle JDK, Hadoop (from Bigtop), and GridDB.
Download the Linux x64 JDK RPM from Java SE Development Kit 8 — Downloads and install it using RPM.
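The install itself is a single rpm command. A minimal sketch — the exact filename depends on which JDK 8 update you downloaded, so jdk-8u121-linux-x64.rpm below is only an example:

```shell
# Install the downloaded JDK RPM (filename is an example; use the one you downloaded)
rpm -Uvh jdk-8u121-linux-x64.rpm
# Verify the installation
java -version
```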
curl http://www.apache.org/dist/bigtop/bigtop-1.1.0/repos/centos6/bigtop.repo > /etc/yum.repos.d/bigtop.repo
yum -y install hadoop-\*
Ensure your /etc/hosts has an entry for your machine’s hostname like so:
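A sketch of what the relevant /etc/hosts lines might look like — the hostname and IP address below are examples; use your VM's actual values (hostname -f will print the name Hadoop resolves):

```shell
# /etc/hosts (example entries -- substitute your machine's IP and hostname)
127.0.0.1   localhost localhost.localdomain
10.0.0.4    my-hadoop-vm
```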
Since we’re using an Azure VM to test, we also need to mount the local SSD as Hadoop’s data directory.
# mkdir /mnt/resource/hadoop-hdfs
# chown -R hdfs.hdfs /mnt/resource/hadoop-hdfs
# mount -o bind /mnt/resource/hadoop-hdfs/ /var/lib/hadoop-hdfs/
Now you can format the namenode and start the Hadoop HDFS services:
# sudo -u hdfs hadoop namenode -format
# service hadoop-hdfs-namenode start
# service hadoop-hdfs-datanode start
Now we need to create all of the HDFS directories as required (From How to install Hadoop distribution from Bigtop 0.5.0):
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -mkdir -p /user/history
sudo -u hdfs hadoop fs -chown mapred:mapred /user/history
sudo -u hdfs hadoop fs -chmod 770 /user/history
sudo -u hdfs hadoop fs -mkdir -p /tmp/hadoop-yarn/staging
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging
sudo -u hdfs hadoop fs -mkdir -p /tmp/hadoop-yarn/staging/history/done_intermediate
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging/history/done_intermediate
sudo -u hdfs hadoop fs -chown -R mapred:mapred /tmp/hadoop-yarn/staging
For each system user that will have access to HDFS, create an HDFS home directory:
sudo -u hdfs hadoop fs -mkdir -p /user/$USER
sudo -u hdfs hadoop fs -chown $USER:$USER /user/$USER
sudo -u hdfs hadoop fs -chmod 770 /user/$USER
Now we can start YARN and test the cluster.
# service hadoop-yarn-resourcemanager start
# service hadoop-yarn-nodemanager start
My favourite Hadoop test is a wordcount of Java’s documentation.
$ hdfs dfs -put /usr/share/doc/java-1.6.0-openjdk-*/ javadoc-input
$ time yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount javadoc-input wordcount-out
$ hdfs dfs -get wordcount-out
We will install GridDB via the instructions in the GridDB Community Edition RPM Install Guide. If you’re using GridDB SE or AE, the details may be slightly different. Please refer to the GridDB SE/AE Quick Start Guide for more information.
# rpm -Uvh https://github.com/griddb/griddb_nosql/releases/download/v3.0.0/griddb_nosql-3.0.0-1.linux.x86_64.rpm
We’re going to use a profile script to set the GridDB environment variables. As root, create /etc/profile.d/griddb_nosql.sh:
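A minimal sketch of the profile script, assuming the default RPM install layout where /var/lib/gridstore is the gsadm home directory:

```shell
# /etc/profile.d/griddb_nosql.sh
# GridDB home and log locations (defaults for the RPM install)
export GS_HOME=/var/lib/gridstore
export GS_LOG=/var/lib/gridstore/log
```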
You can log out and then back in and the settings will be applied.
Now create the GridDB password for the admin user:
# sudo su - gsadm -c "gs_passwd admin"
(input your_password)
The default gs_cluster.json and gs_node.json configuration files are fine for single-node usage, except that you will need to set a cluster name in gs_cluster.json.
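A sketch of the relevant fragment of gs_cluster.json — only the clusterName key needs to change, and "defaultCluster" matches the name used with gs_joincluster later in this post; the rest of the file stays at its defaults:

```shell
# In /var/lib/gridstore/conf/gs_cluster.json, set the cluster name:
#   "cluster": {
#       "clusterName": "defaultCluster",
#       ...
#   }
```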
Like with Hadoop, since we’re using Azure we need to mount the local SSD as GridDB’s data directory:
mkdir -p /mnt/resource/griddb_data
chown -R gsadm.gridstore /mnt/resource/griddb_data
mount -o bind /mnt/resource/griddb_data /var/lib/gridstore/data
Now start the cluster and confirm the node is ACTIVE in gs_stat:
sudo su - gsadm -c gs_startnode
sudo su - gsadm -c "gs_joincluster -c defaultCluster -u admin/admin -n 1"
sudo su - gsadm -c "gs_stat -u admin/admin"
You’ll need Maven to build the Connector, and the easiest way to get it on CentOS/RHEL 6.8 is by downloading the binaries from Apache’s website:
wget https://archive.apache.org/dist/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar zxvf apache-maven-3.3.9-bin.tar.gz
Since you only need it for the initial build, we just manually set the path rather than making a profile script.
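Setting the path for the current shell session can be sketched as follows — the /opt/apache-maven-3.3.9 location is an assumption; use wherever you extracted the tarball:

```shell
# Put Maven on the PATH for this session only (no profile script needed).
# /opt/apache-maven-3.3.9 is an assumed extraction directory.
export M2_HOME=/opt/apache-maven-3.3.9
export PATH=$M2_HOME/bin:$PATH
```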
Download and Build the GridDB Hadoop Connector
You can either use git to clone the GitHub repository with:
$ git clone https://github.com/griddb/griddb_hadoop_mapreduce.git
Or download the zip file and unzip it:
$ wget https://github.com/griddb/griddb_hadoop_mapreduce/archive/master.zip
$ unzip master.zip
Now build the Connector:
$ cd griddb_hadoop_mapreduce-master
$ cp /usr/share/java/gridstore.jar lib
$ mvn package
Using the Connector
One example (wordcount) is provided with the GridDB connector. It can be run with the following command:
$ cd gs-hadoop-mapreduce-examples/
$ ./exec-example.sh --job wordcount \
--define clusterName=$CLUSTER_NAME \
--define user=$GRIDDB_USER \
--define password=$GRIDDB_PASSWORD \
<list of files to count>
So how does performance compare? Well, for the above small example, HDFS typically takes 36 seconds to complete the job while GridDB takes 35 seconds. Of course, the test is so small that the results are meaningless, so look forward to a future blog post that does a proper performance comparison across a variety of configurations. Also coming soon is a look at how to take an existing MapReduce application and port it to use the GridDB Hadoop Connector.
Originally published at griddb.net.