Set up the elasticsearch-hadoop plugin for Apache Spark
Elastic has made a small but powerful library, elasticsearch-hadoop, to read/write Elasticsearch data from Hadoop technologies such as:
- Apache Spark
- Apache Hive
- MapReduce
Powerful, but (I found) very hard to set up correctly, because of Java dependencies and the distributed nature of Hadoop work. This small post shows how I deal with Spark + the elasticsearch-hadoop library in order to run code against Spark on YARN.
Oh my cluster!
What is a worker cluster?
You run your program (Java code, a SQL query) via a “master” that dispatches/schedules work to “workers”. So the workers must have all the dependencies!
With Java, you have two choices:
- Uber JAR: a JAR with your code + all its dependencies
- Small JAR: a JAR with your code only; you provide the dependencies at runtime (via HDFS, for instance; see the spark-submit sketch below)
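Both choices translate directly into how you call spark-submit. A minimal sketch, assuming YARN cluster mode (the class name com.example.EsSparkDemo, the JAR names, and the HDFS path are hypothetical):

```sh
# Choice 1: uber JAR — everything the workers need is inside one JAR
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.EsSparkDemo \
  my-job-uber.jar

# Choice 2: small JAR — ship the dependency alongside at submit time
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.EsSparkDemo \
  --jars hdfs:///libs/elasticsearch-spark-20_2.11-6.8.0.jar \
  my-job.jar
```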
Let’s build a small Uber JAR for Spark
See my pom.xml:
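Here is a minimal sketch of the relevant parts (the group/artifact IDs are the real elasticsearch-hadoop coordinates; the Spark 2.x / Scala 2.11 combo and all version numbers are assumptions — adjust them to your cluster):

```xml
<project>
  <dependencies>
    <!-- Spark is provided by the cluster: do not bundle it -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.3.0</version>
      <scope>provided</scope>
    </dependency>
    <!-- the elasticsearch-hadoop Spark connector: this one gets shaded in -->
    <dependency>
      <groupId>org.elasticsearch</groupId>
      <artifactId>elasticsearch-spark-20_2.11</artifactId>
      <version>6.8.0</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.2.1</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals><goal>shade</goal></goals>
            <configuration>
              <!-- keep only the elasticsearch artifacts in the uber JAR -->
              <artifactSet>
                <includes>
                  <include>org.elasticsearch:*</include>
                </includes>
              </artifactSet>
              <!-- note: no <minimizeJar> here — see the warning below -->
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
```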
With the Maven Shade plugin, we include only the elasticsearch dependencies in the final JAR.
Pay attention not to minimize the final JAR (the Shade plugin's minimizeJar option), or else some classes needed at run time (e.g. ones loaded by reflection) will not be found :/
This way, you can submit the Spark job without shipping the elasticsearch-hadoop JAR separately (e.g. via --jars).
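To make this concrete, here is the kind of job I submit this way — a minimal sketch assuming Elasticsearch 6.x, with a hypothetical host name and index:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs / esRDD to RDDs and SparkContext

object EsSparkDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-spark-demo")
      // es.nodes points at your Elasticsearch cluster (hypothetical host)
      .set("es.nodes", "elasticsearch.example.com:9200")
    val sc = new SparkContext(conf)

    // write a couple of documents to the "demo/doc" index
    val docs = Seq(Map("title" -> "hello"), Map("title" -> "world"))
    sc.makeRDD(docs).saveToEs("demo/doc")

    // read them back as (id, document) pairs
    sc.esRDD("demo/doc").collect().foreach(println)

    sc.stop()
  }
}
```

The implicits from org.elasticsearch.spark are exactly what the workers need to find at run time, which is why those classes must end up in the uber JAR (or on the --jars path).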