Spark 2.x External Packages

Photo by Mika Baumeister on Unsplash

The bane of using bleeding-edge technology is that documentation for the newest features is scarce or hard to find. We at Unnati use bleeding-edge releases of many data science tools for various research and production systems. In this post we explain how to add external jars to an Apache Spark 2.x application.

Starting with Spark 2.x, we can use the --packages option to pass additional jars to spark-submit. Spark first looks for the jar in the local Ivy2 repository; if it is missing there, it pulls the dependency from the central Maven repository.

$SPARK_HOME/bin/spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.10:2.0.0 <py-file>

In the above example, we are adding the MongoDB Spark connector. This works perfectly fine. However, there are scenarios where Spark is embedded in a larger Python application that is launched with plain python rather than spark-submit. In this case, the configuration has to be supplied when the SparkContext is created.
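One option is to set the same property programmatically. Below is a minimal sketch (the app name is a placeholder); note that on some 2.x versions spark.jars.packages is only honoured if it reaches the JVM at launch time, which is why the spark-defaults.conf approach that follows is the safer bet.

from pyspark import SparkConf, SparkContext

# spark.jars.packages must be set before the SparkContext (and its JVM)
# starts; Spark then resolves the coordinates against the local Ivy2
# repository and Maven Central, just as --packages would.
conf = SparkConf() \
    .setAppName("mongo-example") \
    .set("spark.jars.packages",
         "org.mongodb.spark:mongo-spark-connector_2.10:2.0.0")
sc = SparkContext(conf=conf)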

The more dependable route is to specify the external jar in spark-defaults.conf. Add the following line to $SPARK_HOME/conf/spark-defaults.conf:

spark.jars.packages               org.mongodb.spark:mongo-spark-connector_2.10:2.0.0

Now run your PySpark application as usual:

python <py-file>
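As a quick sanity check, the py-file can read a collection through the freshly resolved connector. A minimal sketch, assuming a local MongoDB instance with a database test and collection coll; the format string and the spark.mongodb.input.uri option are the ones documented for the 2.0.x connector line.

from pyspark.sql import SparkSession

# The input URI tells the connector which database/collection to read.
# Adjust the URI for your own deployment.
spark = (SparkSession.builder
         .appName("mongo-check")
         .config("spark.mongodb.input.uri",
                 "mongodb://127.0.0.1/test.coll")
         .getOrCreate())

# The connector provides this data source; if the jar was not resolved,
# this line fails with a ClassNotFoundException.
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()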