Taming Elasticsearch to Load Large Custom JSON Datasets
Preface
Elasticsearch is an open-source, widely adopted, readily scalable, enterprise-grade search engine. Accessible through an extensive API, it supports extremely fast searches over large amounts of data, powering discovery applications.
We built an application that searches a large amount of GIS data to highlight geographical features such as roads, paths and lanes for a given location. The GIS data was made available in geoJSON, a JSON-based format (for more details refer to http://geojson.org/). The geoJSON data comprises road-specific information (highway, city road, one-way, paved/unpaved, latitude, longitude, etc.). The objective was to import this information into an Elasticsearch instance as efficiently as possible and then power an API library that would return all the coordinates of a road for a search query. For example, if a user enters the name of a road, Elasticsearch returns all the waypoints (coordinates) of that road, and the end application then highlights those coordinates to identify the road on the map and displays the metadata alongside.
The API library was developed using Node.js, and the user interface was built using AngularJS and OpenLayers 3 (for more details refer to https://openlayers.org/).
Below is a step-by-step explanation of our approach:
To understand what we did to develop our application and the challenges we faced along the way, the reader should know the basic structure of Elasticsearch and Node.js.
Step 1: Install Elasticsearch and integrate it with Node.js
The above image shows a comparison between Elasticsearch and a relational database (RDBMS). Elasticsearch has a simpler structure, and loading data into it follows these steps:
Step 1: Create an index (my_index in the image),
Step 2: Create a type (my_type in the image),
Step 3: Upload JSON data to the created type as documents, with a separate index entry for each data point (A, B, .. X, Y in the image).
Elasticsearch follows a tree structure: Index > Type > Document > Fields. Multiple indexes can be created in Elasticsearch; each index can hold multiple types, each type can hold multiple documents, and each document consists of multiple fields.
After installation (from https://www.elastic.co), Elasticsearch runs on the default port 9200. We used a Node.js application for index creation, type creation, uploading JSON data in bulk to Elasticsearch, and developing the API that returns all the coordinates of a road using an Elasticsearch search query. To connect Node.js with Elasticsearch, we need to install elasticsearch.js in our Node.js application. Elasticsearch.js is the official Elasticsearch client for Node.js, through which we can create indexes and types and upload bulk JSON data to Elasticsearch.
To install elasticsearch.js into a Node.js application, refer to https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/about.html
Step 2: Establish Connection between Elasticsearch and Node.js Application
As elasticsearch.js is the Elasticsearch client for Node.js, we have to instantiate the elasticsearch.Client class in the Node.js application to connect with Elasticsearch. We established the Elasticsearch connection through a JavaScript file we called Connection.js. The following is how Elasticsearch is called in Connection.js:
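Our Connection.js looked roughly like the following minimal sketch; the host, port and log level are assumptions for a local, default installation, and the client itself is installed with npm install elasticsearch:

```javascript
// Connection.js -- create and export a shared Elasticsearch client.
// Host and log level are assumptions for a local, default installation.
const elasticsearch = require('elasticsearch');

const client = new elasticsearch.Client({
  host: 'localhost:9200',  // default Elasticsearch port
  log: 'error'             // log only errors to keep the console quiet
});

module.exports = client;
```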
Using the statement module.exports = client, Connection.js can be imported anywhere in the Node.js application to keep the Elasticsearch connection alive via elasticsearch.Client.
Step 3: Create Index
Indexing in Elasticsearch is not quite like indexing in other databases. In Elasticsearch, an index is a place to store related documents. To create an index from a Node.js application, we import Connection.js and then create the index. The following illustrates importing the Elasticsearch connection and creating an index.
The JavaScript file named CreateIndex.js creates an index called 'bulkimport' in Elasticsearch; the variable client imports Connection.js. To run this script, we opened a command line, navigated to the location of CreateIndex.js and ran the command 'node CreateIndex.js'. The response (shown below) indicates that the index named 'bulkimport' was created successfully in Elasticsearch.
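Our script was along these lines; passing the client in as a parameter (rather than hard-wiring the require) is a choice made here so the function can be exercised without a live server:

```javascript
// CreateIndex.js -- create the 'bulkimport' index via the client's indices API.
function createIndex(client, indexName, done) {
  client.indices.create({ index: indexName }, function (err, resp) {
    if (err) return done(err);
    console.log('Index created:', JSON.stringify(resp));
    done(null, resp);
  });
}

// Usage (assumes Connection.js from the previous step):
// createIndex(require('./Connection.js'), 'bulkimport', function () {});

module.exports = createIndex;
```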
To run any JavaScript file on Node.js, prefix the file name with 'node' on the command line.
Step 4: Bulk import nested JSON structures
The above sections give a basic understanding of how Elasticsearch and a Node.js application can work together. Now it will be easier to understand and overcome the challenges that are typically faced around bulk imports. While there are multiple mechanisms/tools for uploading data into Elasticsearch, such as Kibana or Logstash, our objective was to use a custom-built Node.js application to upload the data in bulk.
Objective:
Our objective was to upload a large amount of geoJSON data to Elasticsearch. The geoJSON format is a derivative of the JSON format with a complex, nested JSON tree, as shown below:
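To make the nesting concrete, here is a minimal geoJSON FeatureCollection of the kind we dealt with; the property names and coordinates are illustrative, not our exact dataset:

```javascript
// A minimal geoJSON FeatureCollection: each feature nests road metadata
// under "properties" and the road's waypoints under "geometry.coordinates".
const sampleGeojson = {
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {
        "name": "Sample Road",       // illustrative field names
        "highway": "residential",
        "oneway": "no",
        "surface": "paved"
      },
      "geometry": {
        "type": "LineString",
        "coordinates": [[72.5714, 23.0225], [72.5721, 23.0231]]
      }
    }
  ]
};
```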
Challenges:
- How do we upload a nested JSON file structure into Elasticsearch using Node.js? Does Elasticsearch support a nested JSON format?
- How do we create a separate index entry for each data point, while avoiding the risk of indexing large quantities of data under a single entry, which is what is typically expected from a bulk upload?
It is easy to load a large amount of plain JSON data into Elasticsearch, where every data row is indexed separately. Our objective, however, was to upload a geoJSON file (a custom JSON data file) to Elasticsearch using a Node.js application, so we had to develop code in Node.js that could upload the nested JSON data in bulk to Elasticsearch.
As discussed earlier, we had already created an index named 'bulkimport'. We then needed to create a new document type in the index 'bulkimport', so that we could upload and index the nested JSON data against the created document type. To automate the creation of the document type and to upload the nested JSON data to Elasticsearch with a separate index entry per record, we used a JavaScript file called bulkupload.js. With this script we created the document type 'bulkdoc' in our index 'bulkimport' and tried to upload the geoJSON data.
You can see the code mentioned in the following image:
We were getting an error while parsing our geoJSON file using the command result = JSON.parse('./filename.geojson'); as shown in the above image. JSON.parse expects a string of JSON text, not a file path, so it tries to parse the path characters themselves and throws a SyntaxError. The error is shown below:
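The failing line can be reproduced in isolation (the file path is the article's placeholder name):

```javascript
// First attempt (broken): JSON.parse is handed a file *path*, not file *contents*,
// so it attempts to parse the literal string "./filename.geojson" and throws a SyntaxError.
function parseGeojsonNaively(path) {
  return JSON.parse(path); // BUG: should read the file first, then parse its contents
}
```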
Our Approach:
We had to modify our JavaScript code to overcome this issue. One thing was clear: to upload geoJSON (nested JSON) data to Elasticsearch from a Node.js application, the geoJSON data must be parsed appropriately. After considerable searching we found the 'jsonfile' package, which helped us parse the geoJSON data.
‘jsonfile’ is a Node.js package that has to be installed into the Node.js application (for installation and details refer to https://www.npmjs.com/package/jsonfile). ‘jsonfile’ can handle any type of JSON data, whether flat or complex: it reads the file, parses the JSON and hands back a JavaScript object, which can then be sent for bulk upload to Elasticsearch.
We installed the jsonfile package in our Node.js application, imported ‘jsonfile’ and modified the code of bulkupload.js. The updated code of bulkupload.js is given below:
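The updated script was along these lines. The helper that pairs one bulk action line with one document per feature is split out here for clarity; the index and type names come from the earlier steps, while reading the features array from the parsed geoJSON reflects the structure shown above:

```javascript
// bulkupload.js -- read the geoJSON file with 'jsonfile' and bulk-index
// every feature as its own document in bulkimport/bulkdoc.
const INDEX = 'bulkimport';
const TYPE = 'bulkdoc';

// Build the Elasticsearch bulk body: one action line followed by one
// document per geoJSON feature, each with its own _id.
function buildBulkBody(features) {
  const body = [];
  features.forEach(function (feature, i) {
    body.push({ index: { _index: INDEX, _type: TYPE, _id: i } });
    body.push(feature);
  });
  return body;
}

// Read, parse and upload (requires the 'jsonfile' package and Connection.js).
function uploadGeojson(filePath, done) {
  const jsonfile = require('jsonfile');
  const client = require('./Connection.js');
  jsonfile.readFile(filePath, function (err, geojson) {
    if (err) return done(err);
    client.bulk({ body: buildBulkBody(geojson.features) }, done);
  });
}

// Usage: uploadGeojson('./filename.geojson', function (err, resp) { /* ... */ });
module.exports = { buildBulkBody, uploadGeojson };
```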
By running this script on Node.js, we uploaded a large amount of geoJSON data into Elasticsearch in bulk, with a separate index entry for each record. We can run the following URL in a browser to cross-check that each record of geoJSON data was uploaded:
‘localhost:9200/bulkimport/bulkdoc/_search’.
- In the URL, localhost:9200 indicates that Elasticsearch is running on your local machine on port 9200.
- bulkimport is the index you created in Elasticsearch, and bulkdoc is the type you created in the index.
- _search is a REST API of Elasticsearch. We can also run a curl command against the _search API to cross-check the same on the command line or in ‘Kibana’.
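For example (the match query below is illustrative, and the field properties.name is an assumption about the geoJSON schema):

```shell
# List uploaded documents (Elasticsearch returns the first 10 hits by default)
curl -X GET "localhost:9200/bulkimport/bulkdoc/_search?pretty"

# Search for a road by name -- the field "properties.name" is an assumption
curl -X GET "localhost:9200/bulkimport/bulkdoc/_search?pretty" \
  -H 'Content-Type: application/json' \
  -d '{ "query": { "match": { "properties.name": "Sample Road" } } }'
```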
After importing the geoJSON file into Elasticsearch, we developed an HTTP API that calls the Elasticsearch search query. This was achieved with Node.js and Express.js.
Express.js and Node.js provide back-end functionality, allowing developers to build software with JavaScript on the server side. Together, they make it possible to build an entire site with JavaScript. We integrated the URL of the API into our front-end UI, developed using Angular 5, to retrieve the geoJSON data for a particular search text from Elasticsearch.
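A minimal sketch of such an API is shown below; the route path /roads/:name and the queried field properties.name are assumptions, and the search-parameter builder is split out so the request shape is visible:

```javascript
// server.js -- sketch of the search API built with Express.js on top of
// the Elasticsearch client from Connection.js.

// Build the search request: index/type from the earlier steps; the
// queried field "properties.name" is an assumption about the schema.
function buildSearchParams(roadName) {
  return {
    index: 'bulkimport',
    type: 'bulkdoc',
    body: { query: { match: { 'properties.name': roadName } } }
  };
}

// Wire up the route; express and client are injected so the app can be
// assembled wherever the dependencies live.
function createApp(express, client) {
  const app = express();
  app.get('/roads/:name', function (req, res) {
    client.search(buildSearchParams(req.params.name), function (err, resp) {
      if (err) return res.status(500).json({ error: err.message });
      res.json(resp.hits.hits); // each hit's _source is a geoJSON feature
    });
  });
  return app;
}

// Usage: createApp(require('express'), require('./Connection.js')).listen(3000);
module.exports = { buildSearchParams, createApp };
```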
The image given below is the UI of our application. End users can search for a particular area of a road, or the entire road, by its name or road reference number. For the road reference number entered in the search box in the image, the highlighted part of the image is the end result for the required location.
— Ami Bajwala (Software Engineer)
