Using Cloud Dataflow to index documents into Elasticsearch

Sameer Abhyankar
Aug 17, 2018 · 3 min read

You have a lot of data sitting in your system, but your customers still complain that they can't find anything! Sound familiar? If so, perhaps it is time to add a scalable, distributed search engine such as Elasticsearch to your tech stack.

Elasticsearch is easy to deploy on Google Cloud Platform, and you can have a cluster up and running quickly. The next step is to figure out how to index, or make searchable, all the data that you are adding to your environment.

Enter Cloud Dataflow!

Cloud Dataflow is a fully managed service for executing data pipelines (batch or streaming) written in Apache Beam. Luckily, Apache Beam's Java SDK provides a handy connector for writing (and reading) documents to Elasticsearch. You could do things like validating the data that you want to index, or even enhancing it with external data sets, all as part of the same data pipeline.
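For example, indexing a PCollection of JSON document strings takes little more than one transform (a minimal sketch; the cluster address, index and type below are placeholders, not values from the sample code):

// ElasticsearchIO is part of Beam's Java SDK (org.apache.beam.sdk.io.elasticsearch).
ElasticsearchIO.ConnectionConfiguration esConnection =
    ElasticsearchIO.ConnectionConfiguration.create(
        new String[] {"http://my-es-host:9200"}, // placeholder cluster address
        "my-index",                              // placeholder index name
        "my-type");                              // placeholder document type

// jsonDocs is a PCollection<String> of JSON documents produced earlier in the pipeline.
jsonDocs.apply("IndexIntoES",
    ElasticsearchIO.write().withConnectionConfiguration(esConnection));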

This blog post and the accompanying sample code on GitHub put the pieces of the puzzle together and demonstrate how all of this works as part of a single pipeline running on Cloud Dataflow.

Dataflow pipeline to index documents into Elasticsearch

There is a lot going on in the diagram above, so the following steps drill down into how these components work together:

  1. Cloud Pub/Sub is an excellent way to stream data into the environment, and Cloud Dataflow can easily read from Cloud Pub/Sub, making it the data streaming tool of choice in this approach.
  2. Cloud Bigtable is ideal for ad hoc, low-latency lookups and is great for storing any external data needed to enhance documents in flight.
  3. Using Apache Beam’s ElasticsearchIO Java connector makes it very easy to write (index) documents into Elasticsearch.
  4. Documents that fail validation might need some manual (or automated) intervention before being reprocessed. In this approach, these documents are published to a separate Cloud Pub/Sub topic (see the pipeline sketch after this list).
  5. Optionally, the process of correcting the errors and making the failed documents available for reprocessing can be automated, with Cloud Dataflow of course!
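Here is a minimal sketch of how the streaming path could be wired together in Beam. The topic names are placeholders, isValid and enrich are hypothetical helpers (the enrichment step is where a Cloud Bigtable lookup would go), and esConnection is the connection configuration from the earlier snippet; the actual code in the repo differs in the details.

// Tags used to split the validation output into "valid" and "failed" documents.
final TupleTag<String> validTag = new TupleTag<String>() {};
final TupleTag<String> failedTag = new TupleTag<String>() {};

Pipeline p = Pipeline.create(options);

PCollectionTuple validated = p
    .apply("ReadFromPubSub",
        PubsubIO.readStrings().fromTopic("projects/my-project/topics/incoming-docs"))
    .apply("ValidateAndEnrich", ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        String doc = c.element();
        if (isValid(doc)) {          // hypothetical validation check
          c.output(enrich(doc));     // hypothetical enrichment, e.g. a Cloud Bigtable lookup
        } else {
          c.output(failedTag, doc);  // route bad documents to the dead-letter branch
        }
      }
    }).withOutputTags(validTag, TupleTagList.of(failedTag)));

// Valid documents are indexed with ElasticsearchIO.write(), as shown earlier.
validated.get(validTag).apply("IndexIntoES",
    ElasticsearchIO.write().withConnectionConfiguration(esConnection));

// Failed documents go to a separate topic for manual or automated reprocessing.
validated.get(failedTag).apply("PublishFailures",
    PubsubIO.writeStrings().to("projects/my-project/topics/failed-docs"));

p.run();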

As an added benefit, autoscaling can be easily enabled for a streaming pipeline by setting a couple of flags:

--autoscalingAlgorithm=THROUGHPUT_BASED
--maxNumWorkers=<N>

This allows Cloud Dataflow to adjust the number of workers based on the amount of backlog it needs to process, so resources are used only when needed!
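These flags are picked up automatically when the pipeline options are parsed from the command line, and the same settings can be applied programmatically as well (a sketch, assuming the pipeline targets the Dataflow runner):

DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
    .withValidation()
    .as(DataflowPipelineOptions.class);
options.setStreaming(true);
options.setAutoscalingAlgorithm(
    DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType.THROUGHPUT_BASED);
options.setMaxNumWorkers(10);  // the <N> above: upper bound Dataflow may scale out to
Pipeline p = Pipeline.create(options);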

Great! But how do we address the data that already exists in our system?

The only change needed is a simple modification to the pipeline so that it reads its input from a Google Cloud Storage bucket instead of Cloud Pub/Sub.
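In Beam terms, only the source transform changes (a sketch; the bucket path is a placeholder):

// Instead of PubsubIO.readStrings().fromTopic(...), the batch pipeline reads from GCS:
PCollection<String> docs =
    p.apply("ReadFromGCS",
        TextIO.read().from("gs://my-bucket/existing-docs/*.json"));
// The rest of the pipeline (validate, enrich, index into Elasticsearch) stays the same.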

So can we use this approach with the latest and greatest release of Elasticsearch?


Well, almost :) Currently, Apache Beam's ElasticsearchIO supports Elasticsearch v5.x (and v2.x) out of the box. However, there is active work in the Apache Beam community to support Elasticsearch v6.x as well. An interim solution is to use a custom transform that talks to Elasticsearch 6.x through its Java High Level REST Client.
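Such a custom transform could look roughly like the DoFn below. This is a sketch only: it assumes the Elasticsearch 6.x high-level REST client is on the classpath, the host and index names are placeholders, and a production version would batch documents into bulk requests and handle failures.

static class IndexIntoEs6Fn extends DoFn<String, Void> {
  private transient RestHighLevelClient client;

  @Setup
  public void setup() {
    // Placeholder host; the client is created once per DoFn instance.
    client = new RestHighLevelClient(
        RestClient.builder(new HttpHost("my-es-host", 9200, "http")));
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws IOException {
    // Index a single JSON document; a real pipeline would buffer and use bulk requests.
    IndexRequest request = new IndexRequest("my-index", "_doc")  // placeholder index
        .source(c.element(), XContentType.JSON);
    client.index(request, RequestOptions.DEFAULT);
  }

  @Teardown
  public void teardown() throws IOException {
    if (client != null) {
      client.close();
    }
  }
}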

Ready for a working sample?

Getting started with this pipeline and the sample data for a quick demo (who doesn't like a demo!) couldn't be easier: point your browser to the accompanying GitHub repo and follow along!

Thanks to Daniel De Leo.
