Set up the elasticsearch-hadoop plugin for Apache Spark

Elastic has made a small but powerful library to read/write Elasticsearch data from Hadoop technologies such as:

  • Apache Spark
  • Apache Hive
  • Pig
  • MapReduce

Powerful, but (I found) very hard to set up correctly, because of the Java dependencies and the distributed nature of Hadoop. This small post shows how I deal with Spark + the elasticsearch-hadoop (elastic4Hadoop) library in order to run code against Spark on YARN.

Oh my cluster!

What is a worker cluster?

You run your program (Java code, a SQL query) via a “master” that dispatches/schedules the work to “workers”. So the workers must have all the dependencies!

With Java, you have 2 choices:

  1. Uber JAR: a JAR with your code + its dependencies
  2. Small JAR: a JAR with only your code; you provide the dependencies at runtime (via HDFS, for instance). Both options are illustrated just after this list.
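To make the two options concrete, here is roughly what the two submissions look like (the class name, JAR names, connector version and HDFS path are assumptions to adapt to your project):

    # 1. Uber JAR: everything travels inside a single artifact
    spark-submit \
      --master yarn --deploy-mode cluster \
      --class EsSparkJob \
      es-spark-job-with-deps.jar

    # 2. Small JAR: ship the connector separately, e.g. from HDFS
    spark-submit \
      --master yarn --deploy-mode cluster \
      --class EsSparkJob \
      --jars hdfs:///libs/elasticsearch-hadoop-7.10.2.jar \
      es-spark-job.jar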

Let’s build a small Uber JAR for Spark

See my pom.xml. With the Maven Shade plugin, we include only the Elasticsearch dependencies in the final JAR.

Pay attention not to “reduce” the final JAR (the Shade plugin’s minimizeJar option), otherwise some classes that are only needed at run time will not be found :/
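The interesting part of that pom.xml is the Shade plugin configuration; here is a minimal sketch of it (group/artifact ids, versions and Scala suffixes are assumptions to adapt to your stack):

    <dependencies>
      <!-- the ES <-> Spark connector: this is what must end up in the uber JAR -->
      <dependency>
        <groupId>org.elasticsearch</groupId>
        <artifactId>elasticsearch-spark-20_2.11</artifactId>
        <version>7.10.2</version>
      </dependency>
      <!-- Spark is already on the cluster, so keep it out of the uber JAR -->
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.8</version>
        <scope>provided</scope>
      </dependency>
    </dependencies>

    <build>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-shade-plugin</artifactId>
          <version>3.2.4</version>
          <executions>
            <execution>
              <phase>package</phase>
              <goals>
                <goal>shade</goal>
              </goals>
              <configuration>
                <!-- no <minimizeJar>: classes loaded by reflection would be dropped -->
                <artifactSet>
                  <includes>
                    <!-- only pull the Elasticsearch artifacts into the uber JAR -->
                    <include>org.elasticsearch:*</include>
                  </includes>
                </artifactSet>
              </configuration>
            </execution>
          </executions>
        </plugin>
      </plugins>
    </build>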

This way, you can submit your Spark job without having to ship the elasticsearch-hadoop JAR separately.
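For completeness, here is a minimal sketch of such a job using the connector’s Java API (the app name, the es.nodes host and the target index “demo” are assumptions):

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

    public class EsSparkJob {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("es-spark-demo")
                    // where the executors will find Elasticsearch (assumed host/port)
                    .set("es.nodes", "es-node-1")
                    .set("es.port", "9200")
                    .set("es.index.auto.create", "true");

            JavaSparkContext sc = new JavaSparkContext(conf);

            // two trivial documents, just to prove the round trip
            Map<String, Object> doc1 = new HashMap<>();
            doc1.put("title", "hello");
            Map<String, Object> doc2 = new HashMap<>();
            doc2.put("title", "world");
            JavaRDD<Map<String, Object>> docs = sc.parallelize(Arrays.asList(doc1, doc2));

            // every executor writes to the "demo" index directly, which is why
            // they all need the connector classes (hence the uber JAR)
            JavaEsSpark.saveToEs(docs, "demo");

            sc.stop();
        }
    }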