Spark, Scala & Hive SQL simple tests

Continuing our work on learning Big Data, we will now use Spark to explore the information we previously loaded into Hive.

The Cloudera installation does not include Spark by default, but adding the service is a breeze.

Just open the dropdown next to your cluster; the very first option is Add Service. Click it and it will take you to another page where you can select which services to add, and Spark will be among them.

OK, before going into Spark with Hive data, since this is our first try, it is important not to try to run before we are sure we can walk. So, let's try the equivalent of a “Hello world” (this should be an *always* when starting to code in a new or different language).

First things first: what are we going to analyze? We need a file for Spark to process. Create a text file whose contents, in this case words, will be counted. Here is my file content (inputfile.txt):

We need to place this file inside the Hadoop file system (HDFS):

$ hdfs dfs -put inputfile.txt

Now that we have information for Spark to analyze, we can create a program to do something with it. This is the example we followed:

Go to the ‘Self-Contained Applications’ section of the Spark Quick Start guide.

This is the sample Scala code:

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

Place this file at SPARKDIRECTORY/src/main/scala/SimpleApp.scala. Will you use git to control this project? It's up to you, but highly recommended. Of course, you can download this project from my GitHub: https://github.com/eday69/SparkWordCount
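For reference, sbt expects the standard directory layout, so the project ends up looking roughly like this (the top-level directory name is whatever you chose):

spark-application/
  sprkwrdcnt.sbt          (the sbt build file, created below)
  src/
    main/
      scala/
        SimpleApp.scala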

The “YOUR_SPARK_HOME” path in the code is resolved against the HDFS file system structure, so be careful there. When you put a file into HDFS, check where it ended up, or place it in a specific directory, as shown below.
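For example, you can put the file under an explicit directory, check that it landed there, and then point the program at that full path (the paths below are just examples):

$ hdfs dfs -mkdir -p /user/ubuntu/input
$ hdfs dfs -put inputfile.txt /user/ubuntu/input/
$ hdfs dfs -ls /user/ubuntu/input/

Then, in the Scala code, reference the full path, e.g. val logFile = "hdfs:///user/ubuntu/input/inputfile.txt".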

I also found that the sbt package is not installed either. What is sbt?

sbt (Simple Build Tool) is an open-source build tool for Scala and Java projects, similar to Java's Maven and Ant. Its main features are native support for compiling Scala code, integration with many Scala test frameworks, and build descriptions written in Scala using a DSL.

So we need to install it:

$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
$ sudo apt-get update
$ sudo apt-get install sbt

This takes a little while, but it works. Next we need to create a .sbt file located at the base of the project directory. Mine was named sprkwrdcnt.sbt:

ubuntu@ip-172-31-29-164:~/spark-application$ cat sprkwrdcnt.sbt
name := "Spark Word Count"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"

Issue the following command; it will create the project structure and the classes needed to run Spark.

$ sbt package

Now we have all the elements ready to call Spark. Just change the name of your class and of your application jar (sbt derives the jar name from the name, scalaVersion and version fields in the .sbt file).

$ spark-submit --class "SimpleApp" --master local[4] target/scala-2.10/spark-word-count_2.10-1.0.jar

I found that I had the wrong path to the text file, so of course Spark was not finding it. I changed it, but note that it is necessary to re-run $ sbt package to regenerate the files.

If you look closely at the stdout results, you will see that the job completed satisfactorily. I've copied only the lines with the results we were looking for:

17/08/06 23:08:42 INFO scheduler.DAGScheduler: Job 1 finished: collect at SparkWordCount.scala:28, took 0.168665 s
(p,3), (x,1), (t,2), (b,1), (h,2), (n,5), (f,1), (v,2), (r,4), (l,1), (s,3), (e,9), (a,5), (i,2), (u,2), (o,4), (g,1), (m,1), (c,1)
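For reference, a generic word-count along these lines, splitting on whitespace and using a placeholder HDFS path, looks roughly like this; the actual SparkWordCount.scala is in the GitHub repo linked above:

/* SparkWordCount.scala (sketch) */
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object SparkWordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Word Count")
    val sc = new SparkContext(conf)
    // Read the input file from HDFS; this path is a placeholder
    val input = sc.textFile("hdfs:///user/ubuntu/input/inputfile.txt")
    // Split each line into tokens and count the occurrences of each one
    val counts = input.flatMap(line => line.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    // Print the (token, count) pairs, as seen in the stdout above
    println(counts.collect().mkString(", "))
    sc.stop()
  }
}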

Now we move on. This time we will create another directory, called spark-application2, and another Scala program. Instead of reading from a file, we will try to read from a Hive SQL table. Again, we use git to control the project.

Next we compile the project: sbt package

And now run the job:

$ spark-submit --class "SpHiveApp" --master local[4] target/scala-2.10/sparkhive-sql-tester_2.10-1.0.jar

Did it run? No. Why?

There are many reasons, and after much research to track down the errors (and my mistakes), this is what I can explain:

  1. CDH 5.12 does not include Spark 2.0; it ships Spark 1.6. Why is this important, you may ask? Because Spark 1.6 does not support SparkSession, so the Scala program has to be written a different way (see the sketch after this list).
  2. Dependencies and versions are very important, and you must be consistent with them throughout.
  3. A bit obvious, but it did happen to me: make sure Hive and Spark ARE running on your server.
  4. Another point, obvious to some but not to me, was the .sbt config file. I had two config files in the same folder. I made the mistake of working with one and ‘thinking’ that sbt would not pay attention to the other one; another big mistake on my part.

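On point 1: in Spark 2.x you would build a SparkSession with Hive support, but on Spark 1.6 the entry point for Hive queries is a HiveContext built on top of the SparkContext. A minimal sketch of the difference (the object name HiveEntryPoint is just for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveEntryPoint {
  def main(args: Array[String]) {
    // Spark 2.x would use:
    //   val spark = SparkSession.builder().appName("SpHiveApp").enableHiveSupport().getOrCreate()
    // but SparkSession does not exist in Spark 1.6, so we build a HiveContext instead:
    val sc = new SparkContext(new SparkConf().setAppName("SpHiveApp"))
    val hiveContext = new HiveContext(sc)
    println("Hive tables visible to Spark: " + hiveContext.tableNames().mkString(", "))
    sc.stop()
  }
}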
So what did we change?

First, we eliminated the config file (.sbt file) that was not being used. Then we changed the remaining config file, as described next.

One change you can see I made was the creation of val sparkVersion = "1.6.0", which is then used with each library dependency. This way we can be sure that we are consistent across all dependencies.
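Assuming spark-core, spark-sql and spark-hive as the dependencies (spark-hive is what provides the HiveContext), a sketch of such a config file looks like this; the name field is guessed from the jar name used above:

name := "SparkHive SQL Tester"

version := "1.0"

scalaVersion := "2.10.4"

val sparkVersion = "1.6.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql"  % sparkVersion,
  "org.apache.spark" %% "spark-hive" % sparkVersion
)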

We also changed our Scala program.

The program is much simpler: it is only a simple query against a Hive table, selecting just 10 records since the table has many. A sketch of it is shown below.
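A minimal sketch of such a program, using Spark 1.6's HiveContext and a placeholder table name (my_hive_table), would be something like:

/* SpHiveApp.scala (sketch) */
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object SpHiveApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SpHiveApp")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    // my_hive_table is a placeholder; use the table you loaded into Hive earlier
    val rows = hiveContext.sql("SELECT * FROM my_hive_table LIMIT 10")
    rows.show()   // prints the 10 rows to stdout

    sc.stop()
  }
}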

This is part of the output, where you can see the results of the query:

You can find the code for this example on my GitHub:

What are we trying to do? Why go through all this? Good question. Remember, we are working with Big Data now, lots of information. Relational tables will not cut it, so we are using Hadoop to manage large volumes of data, and Hive & Spark will help us with this.

We will build a recommendation engine with Spark in Scala. That is the goal.

So far, the Scala we have used is simple and there is plenty of documentation on it, but the errors we hit were very confusing to solve, since the information about them was not clear, at least to me.

All for now, see you in the next chapter.
