MongoDB and PySpark 2.1.0

Image for post
Image for post
Photo by Thomas Kvistholt on Unsplash

The common problem with using the latest release of any framework is that there are no or very few adopters, docs are not updated or point to older versions. We encountered a similar problem while integrating MongoDB driver with Apache Spark 2.X. Majority of the library docs available as of today work only with spark 1.5+.

All we wanted to do was to create a dataframe by reading a mongodb collection. After a lot of googling, we figured out there are two libraries that support such operation:

We decided to use go ahead with the official Spark Mongo connector as it looked straightforward.

With spark 2.X, we can specify the third party package / library in the command line for spark to add it as a dependency using the packages option. Spark checks if the given dependency is resolved, else it pulls the library from the central maven repository.

Starting Spark 2.X, we do not need to create SparkContext and SQLContext to read a dataframe. We can use SparkSession API to easily read a dataframe or perform any SparkSQL operation.

To read a data frame from a mongodb collection, we need to specify a couple of options telling spark the data format and the collection details

This is now a standard spark data frame and we can use any of the dataframe operations or SQL operations on it.

You can find the link to the complete script here — sample demography mapper

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch

Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore

Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store