Importing data from MySQL to Hive using Sqoop
OK, in the previous story we learned how to extract information from MySQL using Sqoop, a tool from the Hadoop ecosystem. But that sent all the information from the database to a file inside HDFS. We need to move that information into Hive to be able to exploit it, to mine it.
We assume that you have a Hadoop cluster and another VM running MySQL.
We tried two options:
1) Manually, using the Sqoop CLI to download data from MySQL into HDFS and then the Beeline CLI to load the data into Hive.
2) Downloading data from MySQL directly into Hive through Sqoop.
First option, remember: Sqoop then Beeline, manually. We already had the data in a file in HDFS. Sqoop placed the information in a file named part-m-00000 inside the directory day3dir (well, actually Sqoop was told where to place it). The day3dir directory is inside /user/root. Here is the content of the directory:
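The original listing was a screenshot; you can reproduce something equivalent with the HDFS CLI (paths taken from the text above, your file sizes and timestamps will differ):

```shell
# List the directory where Sqoop placed its output
hdfs dfs -ls /user/root/day3dir

# Peek at the first few rows of the exported file
hdfs dfs -cat /user/root/day3dir/part-m-00000 | head -n 5
```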
To run the Hive CLI, you execute the “hive” command in a terminal. But the Hive CLI is soon to be deprecated, so we switched to “beeline”. So what is the difference between the two CLIs? Found this (https://community.hortonworks.com/questions/50547/connecting-hive-beeline-vs-hive.html):
Essentially beeline would use the JDBC/Thrift (or alternatively HTTP) protocol to communicate with HiveServer2.
HiveServer2 then handles the hive logic (finding the table definition in the metastore, reading the data from HDFS, etc).
On the other hand, the hive shell access directly the Hive metastore and HDFS data, and bypasses HiveServer2.
The big influence that this difference can have in your situation is security.
Hive security is implemented in HiveServer2, thus Hive shell bypasses any Hive access policies you might have set on specific databases using Ranger or SQL based authorization (only HDFS policies apply in this case).
The command line to run Beeline looks like this (-n is the username and -p the password; note that -w instead takes a file containing the password):
beeline -u jdbc:hive2://localhost:10000/default -n user -p password
This is the command we used to create the table in Hive, using Beeline:
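The statement itself was shown in a screenshot; here is a sketch of what it would look like, assuming a table named twext2 holding the tweet’s id_str and text (the column list is illustrative, your table will have more fields):

```shell
beeline -u jdbc:hive2://localhost:10000/default -n user -p password -e "
CREATE TABLE twext2 (
  id_str STRING,
  text   STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','  -- matches Sqoop's default comma-separated output
STORED AS TEXTFILE;"
```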
Next, of course, was to load the data into the Hive table:
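The load statement was also a screenshot; a sketch of it, using the file and table names from this story, would be:

```shell
# Moves the HDFS file into the table's warehouse directory
beeline -u jdbc:hive2://localhost:10000/default -n user -p password -e "
LOAD DATA INPATH '/user/root/day3dir/part-m-00000' INTO TABLE twext2;"
```

Note that LOAD DATA with INPATH moves (not copies) the file out of /user/root/day3dir into Hive’s warehouse directory.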
OK, so this worked without any major problems. But… while reviewing the data inside the Hive table (with a simple “SELECT id_str FROM twext2 LIMIT 10” for just a few rows), we noticed that the information was not properly arranged into its defined columns, i.e. the tweet id_str (which IS a big numeric value) had null values for some rows, and even the tweet text was the value for other rows! So what happened? Hive didn’t do anything wrong; the problem is with the data file itself. Sqoop outputted the data as a simple text file separated by “,” (comma). The tweet text in MySQL contained special characters (\r, \n, etc.) that collide with the row and field delimiters, so Hive split rows and fields in the wrong places.
So what to do? Well, Sqoop can format its output, but I could not find a way to fix this from Beeline itself.
So, back to the documentation, to find out how to do it with Sqoop (http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_syntax)
And of course, this was the end of the first option, with a big F.A.I.L. (First Attempt In Learning).
Move to second option.
Second, the straight way: let Sqoop import all the needed information directly into Hive. It is completely straightforward, including the option of filtering out the special characters so the information ends up properly placed in the Hive table:
$ sudo sqoop import --connect jdbc:mysql://ipaddress:3306/trvision --connection-manager org.apache.sqoop.manager.MySQLManager --username user --password password --table tweets --hive-import --hive-table twext --hive-drop-import-delims -m 1
This worked like a charm, and very fast. These were the options used in Sqoop:
--hive-import tells Sqoop to import into Hive.
--hive-table points to the table where the data is to be put.
--hive-drop-import-delims drops the special characters (\n, \r, \01) that Hive does not like.
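A quick sanity check after the import (assuming the same Beeline connection as in the first option) is to re-run the query that exposed the problem earlier:

```shell
beeline -u jdbc:hive2://localhost:10000/default -n user -p password -e "
SELECT id_str FROM twext LIMIT 10;"
```

This time every row should show a numeric id_str, with no nulls and no stray tweet text spilling into the column.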
Sqoop lives up to its description to the dot: “Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.”
Still, there is some background that might make your progress a bit simpler if read/learned beforehand:
Linux commands (lots of this on the web, google it!).
HDFS commands (this is a link to neat & simple instructions): https://hortonworks.com/hadoop-tutorial/using-commandline-manage-files-hdfs/