Spark 2.x External Packages

Photo by Mika Baumeister on Unsplash

The bane of using bleeding edge technology is very less or hidden information of new features in the latest version. We at Unnati use bleeding edge releases of many data science tools for various research and production systems. In this post we explain how to add external jars to Apache Spark 2.x application.

Starting Spark 2.x, we can use the --package option to pass additional jars to spark-submit. Spark will look through the local ivy2 repository for the jar, if it is missing, it will pull the dependency from the central maven server.

$SPARK_HOME/bin/spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.10:2.0.0 <py-file>

In the above example, we are adding mongodb-spark connector. This works perfectly fine. However, there are scenarios where spark is used as part of the python application. In this case, we will use SparkContext to specify the configuration.

There is no way to set packages option using SparkConf

We need to use the spark-defaults.conf to specify the external jar. Add the following to the file

spark.jars.packages               org.mongodb.spark:mongo-spark-connector_2.10:2.0.0

Now run your pyspark application as usual

python <py-file>