MongoDB and PySpark 2.1.0


A common problem with using the latest release of any framework is that there are few or no adopters, and the docs are either not updated or point to older versions. We ran into exactly this while integrating the MongoDB driver with Apache Spark 2.x. Most of the library docs available as of today work only with Spark 1.5+.

All we wanted to do was create a DataFrame by reading a MongoDB collection. After a lot of googling, we found two libraries that support this operation.

We decided to go ahead with the official Spark Mongo connector, as it looked straightforward.

With Spark 2.x, we can specify a third-party package or library on the command line using the --packages option, and Spark adds it as a dependency. Spark first checks whether the given dependency is already resolved; if not, it pulls the library from the central Maven repository.
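As a rough sketch of what that command line looks like, the snippet below assembles a spark-submit invocation with the --packages option. The Maven coordinates (in particular the Scala suffix and the version number) and the script name demography_mapper.py are assumptions for illustration only; pick the connector release that matches your Spark and Scala versions.

```python
# Assumed coordinates in group:artifact:version form. The Scala 2.11 build
# and the 2.0.0 version are illustrative, not taken from the original script.
MONGO_CONNECTOR = "org.mongodb.spark:mongo-spark-connector_2.11:2.0.0"


def spark_submit_command(script, packages):
    """Build a spark-submit argument list that pulls in the given packages."""
    # --packages expects a comma-separated list of Maven coordinates
    return ["spark-submit", "--packages", ",".join(packages), script]


# "demography_mapper.py" is a hypothetical script name
cmd = spark_submit_command("demography_mapper.py", [MONGO_CONNECTOR])
print(" ".join(cmd))
```

The same --packages option can be passed to the pyspark shell when experimenting interactively.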

Starting with Spark 2.x, we no longer need to create a SparkContext and a SQLContext to read a DataFrame. We can use the SparkSession API to read a DataFrame or perform any Spark SQL operation.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("demography mapper").getOrCreate()

To read a DataFrame from a MongoDB collection, we need to specify a couple of options that tell Spark the data format and the collection details.

df_user = (
    spark.read.format("com.mongodb.spark.sql.DefaultSource")
    .option("spark.mongodb.input.uri", "mongodb://localhost:27017/raw.user")
    .load()
)
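The URI passed above packs the collection details into a single string: host, port, database (raw) and collection (user). A minimal sketch of how such a URI could be assembled is shown below; the helper name mongo_input_uri and its default arguments are hypothetical and not part of the connector.

```python
def mongo_input_uri(database, collection, host="localhost", port=27017):
    """Return a URI of the form mongodb://host:port/database.collection."""
    return "mongodb://{}:{}/{}.{}".format(host, port, database, collection)


# Reproduces the URI used in the read call: database "raw", collection "user"
uri = mongo_input_uri("raw", "user")
print(uri)  # mongodb://localhost:27017/raw.user
```

Building the URI this way keeps the database and collection names in one place when several collections are read in the same job.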

The result is a standard Spark DataFrame, so we can use any of the DataFrame or SQL operations on it.

The complete script is available here: sample demography mapper