Indexing into Elasticsearch using Spark — code snippets

Here at Innoplexus, we recently took on the task of building the largest search over Life Sciences data. We have used Elasticsearch for search in many of our earlier products, but this time the challenge was to scale our existing capabilities to search over Terabytes, if not Petabytes, of raw data. We were hit hard when we tried using our traditional Python scripts to index this huge amount of data into ES. We realised that we needed to distribute the indexing job; otherwise it would take us weeks, or even months, to index even a single collection of our data sources.


When we decided to write our “scalable” Elasticsearch indexer, we chose Apache Spark as its core because of the flexibility and ease of task distribution it provides. I started searching for Spark-Elasticsearch connectors and soon realised that documentation in this area is quite poor and nothing was readily available. Now that I have shipped a few stable versions of our indexer, I would like to compile and share the minimal code required to stitch these technologies together and make them work!

Along the way I kept improving the stack we were using, and there are three snippets I would like to share:

  • Indexing data into Elasticsearch via Python through Spark RDDs
  • Indexing data into Elasticsearch via Python through Spark DataFrames
  • Indexing data into Elasticsearch via Scala through Spark DataFrames

These snippets can be used with various clients, including spark-shell, pyspark and spark-submit. One thing common to all of them is that they require the Elasticsearch-Hadoop jar file to run. ES-Hadoop connects Elasticsearch with various Hadoop services; the snippets below effectively use only its Spark-ES connector. For instance, to use it with pyspark the command would be:

pyspark --jars elasticsearch-hadoop-5.6.4.jar --driver-class-path elasticsearch-hadoop-5.6.4.jar

Indexing via PySpark RDDs

Open a PySpark shell or use spark-submit to run this code.
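A minimal sketch of the RDD approach is below. It assumes an Elasticsearch node at localhost:9200 and a hypothetical my_index/my_type resource; the pattern itself, saveAsNewAPIHadoopFile with ES-Hadoop's EsOutputFormat, is the standard way of writing an RDD to Elasticsearch from PySpark.

```python
import json
from pyspark import SparkContext

sc = SparkContext(appName="es-rdd-indexer")

# Hypothetical example data; in practice this would be an RDD built from your sources.
docs = [
    {"title": "Document one", "body": "hello"},
    {"title": "Document two", "body": "world"},
]

# ES-Hadoop expects (key, value) pairs; with es.input.json enabled the value is
# the document serialised as a JSON string and the key is ignored.
rdd = sc.parallelize(docs).map(lambda doc: ("ignored_key", json.dumps(doc)))

es_write_conf = {
    "es.nodes": "localhost",            # assumption: ES reachable on localhost
    "es.port": "9200",
    "es.resource": "my_index/my_type",  # placeholder index/type
    "es.input.json": "yes",
}

rdd.saveAsNewAPIHadoopFile(
    path="-",
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_write_conf,
)
```

Serialising each document to a JSON string and setting es.input.json lets ES-Hadoop pass the payload through unchanged, which avoids type-conversion surprises with nested Python dicts.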

Indexing via PySpark DataFrames

Open a PySpark shell or use spark-submit to run this code.
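A minimal sketch of the DataFrame approach, again assuming Elasticsearch on localhost:9200 and a placeholder my_index/my_type; the DataFrame writer uses the org.elasticsearch.spark.sql format that ships with ES-Hadoop.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-df-indexer").getOrCreate()

# Hypothetical DataFrame; in practice this could be read from JSON, Parquet, etc.
df = spark.createDataFrame(
    [("1", "Document one"), ("2", "Document two")],
    ["id", "title"],
)

(df.write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "localhost")            # assumption: ES on localhost
   .option("es.port", "9200")
   .option("es.resource", "my_index/my_type")  # placeholder index/type
   .option("es.mapping.id", "id")              # use the "id" column as the document _id
   .mode("append")
   .save())
```

Setting es.mapping.id is optional; it tells ES-Hadoop to use that column as the document _id instead of letting Elasticsearch generate one, which makes re-runs idempotent.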

Indexing via Scala DataFrames

Open a Spark shell or use spark-submit to run this code.
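The Scala equivalent is a minimal sketch under the same assumptions (localhost cluster, placeholder index); importing org.elasticsearch.spark.sql._ adds a saveToEs method to DataFrames.

```scala
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._   // brings saveToEs into scope

val spark = SparkSession.builder()
  .appName("es-df-indexer")
  .config("es.nodes", "localhost")     // assumption: ES on localhost
  .config("es.port", "9200")
  .getOrCreate()

import spark.implicits._

// Hypothetical DataFrame; in practice read from your data sources.
val df = Seq(("1", "Document one"), ("2", "Document two")).toDF("id", "title")

// "my_index/my_type" is a placeholder resource; es.mapping.id uses the "id" column as _id
df.saveToEs("my_index/my_type", Map("es.mapping.id" -> "id"))
```

In spark-shell the existing session is picked up by getOrCreate(), so the same lines can be pasted directly; for spark-submit, wrap them in an object with a main method.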

I would like to thank Gaurav Tripathi, Prashant Bhatwadekar and Manish Kumar Pal for their constant help while creating this module. :)

A few other resources that might be useful:
https://stackoverflow.com/questions/31410608/does-spark-not-support-arraylist-when-writing-to-elasticsearch/50942356#50942356
https://stackoverflow.com/questions/46762678/how-to-push-a-spark-dataframe-to-elastic-search-pyspark/52199097#52199097
