Bulk Insert on ElasticSearch 7.1 using Java High Level Rest Client

Pankaj Kumar Singh
May 30 · 5 min read

This article focuses on a real-world application of ElasticSearch that many of us will come across.

Problem Statement: Bulk inserting data records from a .tsv file using the Java High Level REST Client

Some theory first —

ElasticSearch: As rightly noted at https://qbox.io/blog/what-is-elasticsearch , Elasticsearch is an open-source, broadly-distributable, readily-scalable, enterprise-grade search engine. Accessible through an extensive and elaborate API, Elasticsearch can power extremely fast searches that support your data discovery applications. Think of it as a warehouse or store of documents, but in a NoSQL format.

The Java High Level REST Client works on top of the Java Low Level REST Client. Imagine it as a layer on top of the Low Level Client: it exposes API-specific methods that accept request objects and in turn return response objects. We can perform CRUD (Create, Read, Update, Delete) operations from the High Level REST Client against our ElasticSearch server.

Steps:

Step 1- Set up ElasticSearch (ES) 7.1 with JDK 8.

One can find plenty of articles on setting up ElasticSearch 7.1 and installing JDK 8, hence I won’t be explaining it here. Follow this link to install: https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-install.html

Start the ES server by running bin/elasticsearch. Our cluster will be available at

http://localhost:9200

Step 2- Add an ES extension to Chrome for easy visualization. (Optional)

It’s up to you whether to install the plugin. For reference, I have attached the plugin image. Once installed, click the extension button and a new tab with the cluster health will be visible.

Step 3- Set up IntelliJ for writing our Java code (Optional)

Follow the link for installing: https://www.javahelps.com/2015/04/install-intellij-idea-on-ubuntu.html

Step 4- Start a new project

Create a simple Java project in IntelliJ. You should get a folder structure like the following.

Inside the src/main/java folder of our java project create a new java class file. You can name it whatever you like, for example “BulkUpload.java”.

Add dependencies to the build.gradle file in the following format.
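The original gist is not reproduced here, so the following is a minimal sketch of the dependencies block. The 7.1.1 and Log4j version numbers are my assumption; match the client version to the version of your ES server.

```groovy
dependencies {
    // Elasticsearch REST High Level Client (pulls in the low-level client too)
    compile 'org.elasticsearch.client:elasticsearch-rest-high-level-client:7.1.1'
    // Elasticsearch core
    compile 'org.elasticsearch:elasticsearch:7.1.1'
    // Elasticsearch low-level REST client
    compile 'org.elasticsearch.client:elasticsearch-rest-client:7.1.1'
    // Log4j, required at runtime by the Elasticsearch client
    compile 'org.apache.logging.log4j:log4j-api:2.11.1'
    compile 'org.apache.logging.log4j:log4j-core:2.11.1'
}
```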

In the snippet above we are adding dependencies for:

  • the ElasticSearch REST High Level Client
  • ElasticSearch
  • the ElasticSearch REST Client
  • Log4j

Step 5- Time to Code…

In the BulkUpload.java file, add the imports needed for our code to work. Alternatively, we can add them later as IntelliJ flags the errors.
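The imports gist is not shown here; assuming the ES 7.1 High Level REST Client API used in the rest of the article, a plausible set is:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.common.xcontent.XContentType;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
```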

Next, add the variables that will be used throughout the class.
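A sketch of those fields follows. The names are mine, not from the original gist; the index name “test” comes from later in the article, and the batch size value is an assumption.

```java
// Index we will write into (referred to as "test" later in the article).
private static final String INDEX = "test";
// Number of documents per bulk request; 1000 is an assumed value.
private static final int BATCH_SIZE = 1000;
// Shared client instance, created in makeConnection() and closed in closeConnection().
private static RestHighLevelClient restHighLevelClient;
```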

Let’s define the main function:
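A minimal sketch of main, assuming the helper functions defined below; the .tsv path is a placeholder for wherever you saved the sample file.

```java
public static void main(String[] args) throws IOException {
    makeConnection();                      // open the client
    bulkProcessor("/path/to/sample.tsv");  // read the file and bulk-insert it
    closeConnection();                     // release the client
}
```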

makeConnection() function:
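A sketch, assuming a local single-node cluster on port 9200 and the client field defined earlier:

```java
private static synchronized RestHighLevelClient makeConnection() {
    if (restHighLevelClient == null) {
        // Connect to the local ES server started in Step 1.
        restHighLevelClient = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));
    }
    return restHighLevelClient;
}
```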

closeConnection() function:
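Correspondingly, a sketch of closing the client:

```java
private static synchronized void closeConnection() throws IOException {
    // Release the underlying HTTP resources held by the client.
    restHighLevelClient.close();
    restHighLevelClient = null;
}
```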

bulkProcessor() function:
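The original gist is not shown, so here is a sketch of the whole flow under the assumptions above. The mapped field name “word” is illustrative, not from the original code; the mapping simply shows where properties such as norms would be set.

```java
private static void bulkProcessor(String filePath) throws IOException {
    // Create the index with explicit mapping properties (e.g. norms disabled).
    CreateIndexRequest createIndex = new CreateIndexRequest(INDEX);
    createIndex.mapping(
            "{\"properties\": {\"word\": {\"type\": \"text\", \"norms\": false}}}",
            XContentType.JSON);
    restHighLevelClient.indices().create(createIndex, RequestOptions.DEFAULT);

    try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
        // First line of the .tsv is the header row -> our keyList.
        String[] keyList = reader.readLine().split("\t");

        BulkRequest bulkRequest = new BulkRequest();
        int rowCount = 0;
        String dataRow;
        while ((dataRow = reader.readLine()) != null) {
            // Each subsequent line is a data row -> our valueList.
            String[] valueList = dataRow.split("\t");

            // Pair keys with values into the HashMap used for the bulk insert.
            Map<String, Object> doc = new HashMap<>();
            for (int i = 0; i < keyList.length && i < valueList.length; i++) {
                doc.put(keyList[i], valueList[i]);
            }
            bulkRequest.add(new IndexRequest(INDEX).source(doc));

            // Once the batch size is exceeded, flush and start a fresh request.
            if (++rowCount % BATCH_SIZE == 0) {
                restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
                bulkRequest = new BulkRequest();
            }
        }
        // Flush any remaining documents from the last partial batch.
        if (bulkRequest.numberOfActions() > 0) {
            restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
        }
    }
}
```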

Okay, the code is a bit lengthy to absorb all at once. Not to worry, I’ll explain what we are doing here.

  • We start by reading the .tsv file.
  • We read each line of the .tsv file to extract the keys and values in the form of data rows.
  • We break the data rows down into individual tokens using a StringTokenizer and store them in the keyList and valueList arrays.
  • We calculate the number of rows and columns to process based on the keyList and valueList array sizes.
  • We create a new index with the mapping properties we want to define. See the following link to learn more about the properties applied in the code: https://www.elastic.co/guide/en/elasticsearch/reference/current/norms.html
  • We then build HashMaps from the keyList and valueList; these HashMaps are used later during the bulk insert.
  • We check whether we have exceeded the batch size defined earlier. If so, we create a new BulkRequest, which speeds up the bulk writes.
  • Finally, we run the code and can see the index “test” being populated with our rows.
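The tokenizing step above can be sketched in isolation. The class and method names here are mine, not from the original gists:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class TsvRow {
    // Pair each header key with the value at the same position in a data row,
    // mirroring the keyList/valueList step described above.
    static Map<String, String> toDoc(String headerLine, String dataRow) {
        StringTokenizer keys = new StringTokenizer(headerLine, "\t");
        StringTokenizer values = new StringTokenizer(dataRow, "\t");
        Map<String, String> doc = new LinkedHashMap<>();
        while (keys.hasMoreTokens() && values.hasMoreTokens()) {
            doc.put(keys.nextToken(), values.nextToken());
        }
        return doc;
    }
}
```

One caveat of StringTokenizer: it skips empty fields between consecutive tabs, so rows with missing values will shift; String.split("\t") preserves them if that matters for your data.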

You can use the sample .tsv file from the following link for testing: http://opensource.indeedeng.io/imhotep/docs/sample-data/

This is what the sample data looks like:

Result: Click the Browser tab to see the records for the index.

I hope the article was easy enough for ElasticSearch beginners to follow the flow. This is one of the use cases of ElasticSearch in the industry.

There are a lot of other optimizations that can be done to the above code, so let me know if anything is wrongly explained or if you have any suggestions.

Hi, I am Pankaj Kumar Singh: a software engineer, developer, and infosec enthusiast. If you find any issues regarding the post, feel free to reach out to me; I will be happy to resolve them. You can find me on LinkedIn and GitHub, or just drop a mail to singhpankajkumar65@gmail.com.
