Cloudera Hadoop installation and Sqoop “Hello World” first try.

Now, moving on to more interesting things in our planned Big Data and Machine Learning work.

Found some videos from Cloudera explaining what Hadoop is and its whole ecosystem; they are very well done and simple to understand.

In order to test Hadoop in a working environment, the first order of business was to install it.

Well, after some research, I found that Cloudera Manager 5.12.0 (https://www.cloudera.com/downloads/manager/5-12-0.html) is a simple and complete option. It includes the whole Hadoop ecosystem (all of it configurable). It comes in two versions: Cloudera Enterprise or Cloudera Express. Express is free but limited, while Enterprise has a 60-day free trial. Installation is almost a breeze.

But there are some minimum requirements, like machines with 8GB of memory (or more) and at least 16GB of disk space. I found this out the hard way, since installation was first attempted on AWS EC2 t2.micro instances, which of course did not meet the minimum requirements, and the installation crashed.

After many tries, we got it working on a t2.large instance. These are the steps we took:

  1. Download the Cloudera Manager installer via terminal (wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin)
  2. Change permissions (chmod u+x cloudera-manager-installer.bin)
  3. Execute! (sudo ./cloudera-manager-installer.bin)
  4. You will be taken through several windows and asked to confirm licensing and such… it's pretty basic.

Once installation is completed, give Cloudera Manager a few minutes to start up. You can easily check whether it's ready by pointing your web browser to http://yourhadoopmainnodeip-name:7180. If it's running, you get to see the login window:

Cloudera Manager login window
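If you prefer to check from the terminal instead of the browser (assuming curl is available on the node), a quick probe like this should return an HTTP status line once the manager is up:

curl -sI http://yourhadoopmainnodeip-name:7180 | head -n 1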

So now you are in business. There is more to set up and configure, but it is all done from the Cloudera Manager web interface. It's quite simple, and mostly a matter of accepting the default selections.

Once finished, you have the whole ecosystem (services) running:

The health issue shown is simply because Hadoop is running on a single server and cannot comply with HDFS's default of keeping three copies of each data block on three different servers.
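If the warning bothers you, a possible workaround on a single-node setup (my own assumption, not an official Cloudera recommendation) is to lower the HDFS replication factor to 1, either via the dfs.replication setting in Cloudera Manager or per path from the command line:

hdfs dfs -setrep -w 1 /some/hdfs/path

(/some/hdfs/path is just a placeholder for whichever HDFS directory you want to adjust.)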

Once we had all of this running, it was time to feed some information into our Hadoop system. This is done by ingesting data, and a simple tool for that is Sqoop: http://sqoop.apache.org

But, if only things were so simple…

In order to test Sqoop, a simple command, the 'Hello World' equivalent, would be to list the databases contained in my other VM, where TrVision's MySQL database is stored.

This is the basic command to list databases on the MySQL server "myservernameaddress":

sqoop list-databases --connect jdbc:mysql://myservernameaddress/ --username user -P

First issue. According to Cloudera documentation (https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_jdbc_driver_install.html) we have this piece of information:

Sqoop 1 does not ship with third party JDBC drivers. You must download them separately and save them to the /var/lib/sqoop/ directory on the server. The following sections show how to install the most common JDBC Drivers.

So, what to do? In this case, since our database system is MySQL, go to this page (https://dev.mysql.com/downloads/connector/j/5.1.html) and download the driver.

I downloaded it to my machine, then transferred the file to the server with this command: scp -i keyfile username@source:/location/to/file username@destination:/where/to/put

Next, decompress the file: tar -xvf mysql-connector-java-5.1.42.tar (check your version of the driver, it could be different!)

Moving on, now copy the .jar driver file to the following directories: /var/lib/sqoop/ and /opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/sqoop/lib/. Here are the commands I used:

sudo cp mysql-connector-java-5.1.42/mysql-connector-java-5.1.42-bin.jar /var/lib/sqoop/

sudo cp mysql-connector-java-5.1.42/mysql-connector-java-5.1.42-bin.jar /opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/sqoop/lib/
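A quick sanity check that the driver ended up where Sqoop expects it (adjust the file name to your driver version):

ls -l /var/lib/sqoop/mysql-connector-java-5.1.42-bin.jar

ls -l /opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/sqoop/lib/mysql-connector-java-5.1.42-bin.jar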

Finally!! Now we can run Sqoop. For me, only this worked:

sqoop list-databases --connect jdbc:mysql://my.ip.addr:3306 --connection-manager org.apache.sqoop.manager.MySQLManager --username USER --password PASSWORD

You will have to supply the server IP address/name, a valid MySQL user, and a valid MySQL password.
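As a side note, if the connection refuses to work, a quick sanity check outside of Sqoop (assuming the plain mysql client is installed on the Sqoop host and the MySQL server allows remote connections) is to try the same credentials directly:

mysql -h my.ip.addr -P 3306 -u USER -p -e "SHOW DATABASES;"

If that fails too, the problem is on the MySQL or networking side, not Sqoop.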

Voilà!

Ok, Sqoop responded with a list of the databases set up on my other VM in AWS. Moving on to more stuff.

Now to get some of that info into HDFS. This is the general form of the Sqoop command:

$ sqoop import (generic-args) (import-args)

This is what I’m doing:

$ sudo sqoop import --connect jdbc:mysql://ipaddress/trvision --connection-manager org.apache.sqoop.manager.MySQLManager --username user --password mypassword --table tweets -m 1 --target-dir /mytargetdir

This should send all the info from my MySQL database "trvision", table "tweets", into /mytargetdir on my Hadoop cluster. At least, that is the theory.

Well, the job hung, so after some research, this was pointed out to me:

Just make sure that yarn.scheduler.minimum-allocation-mb is not larger than yarn.scheduler.maximum-allocation-mb in your YARN configuration; otherwise YARN will not start.

Also, check that yarn.nodemanager.resource.memory-mb is set to 8 GiB and yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage is set to 95%.
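To see what values the node actually ended up with, you can grep the generated YARN client configuration. The path below is the usual client configuration location on a CDH node; in a Cloudera Manager-managed cluster the running services use their own generated copies, so treat this only as a rough check:

grep -E -A 1 "minimum-allocation-mb|maximum-allocation-mb|resource.memory-mb" /etc/hadoop/conf/yarn-site.xml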

Now, going back to my command, these were the last lines it returned:

17/07/26 21:30:27 INFO mapreduce.ImportJobBase: Transferred 35.2226 MB in 22.9407 seconds (1.5354 MB/sec)

17/07/26 21:30:27 INFO mapreduce.ImportJobBase: Retrieved 187092 records.

Which means that the info was transferred!
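To double-check on the HDFS side, you can list the target directory and peek at the output (part-m-00000 is the file name Sqoop normally produces for a single-mapper import; yours may differ):

hdfs dfs -ls /mytargetdir

hdfs dfs -cat /mytargetdir/part-m-00000 | head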

Success at last, yay !

So what did we learn?

  • Installing and using parts of the Hadoop ecosystem is not an easy task.
  • Cloudera Manager installation was quite simple and almost automatic. To some extent this is frustrating, because you have no idea what is going on.
  • One tip I would suggest to Cloudera: if I am installing on a machine whose hardware does not meet the requirements, it should tell me from the beginning, not after crashing because of them. Although there is a minimum requirements section where this is specified…
  • Some of my installation issues were basic, so a thorough help guide would be appreciated by many (like me).
  • Linux/Unix command knowledge. Some knowledge of Linux commands is necessary (cd, pwd, ls, mkdir, sudo, chmod, scp, tar, cp). So if you do not know these commands, well, things will not be easy.
  • Bumping into trouble always makes you learn better and deeper how things work. Nobody likes trouble, but nothing beats experience.

Wish list:

I found that the Hadoop ecosystem is, to a certain point, unstable. While trying to run a Sqoop command that had worked previously, now it did not. Why? Nothing had changed. A full restart of Cloudera Manager 'fixed' the issue. I did not dig deeper, but these types of problems are a big headache, so it would be better if they were fixed.
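For the record, the restart itself was nothing fancy; on CM 5.x the server runs as the cloudera-scm-server service (run this on the Cloudera Manager host, and restart cloudera-scm-agent too if needed):

sudo service cloudera-scm-server restart

sudo service cloudera-scm-agent restart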

Next, we will create a Hive table with the data we just imported. But that is another story!

All for now.