Using Elasticsearch with Kotlin
Last week, I had to add search functionality to my website. There already was some kind of search mechanism, but it could only list files based on their title and not their content. Since the content is the most important part of these files, I needed to find ways to dynamically search the content of files without having to deal with performance issues. That was exactly where I stumbled upon Elasticsearch.
What is Elasticsearch?
According to Elastic’s own website:
Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. Elasticsearch is built on Apache Lucene and was first released in 2010 by Elasticsearch N.V. (now known as Elastic)
This means that you can search pretty much any kind of data, as long as it can be represented as a JSON document and its fields are mapped to suitable types.
How Elasticsearch works
Elasticsearch takes in data from different locations, stores and indexes it according to user-specified mapping (which can also be derived automatically from data), and makes it searchable. Its distributed architecture makes it possible to search and analyze huge volumes of data in near real time.
The main benefit of Elasticsearch is its speed. If you filtered your content manually by scanning every file, finding the matching document would take a long time. With Elasticsearch, the search takes dramatically less time, because documents are stored as JSON and their fields are indexed up front, so a query looks up terms in an inverted index instead of scanning the content.
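To make the speed argument concrete, here is a toy inverted index in Kotlin: a map from each term to the IDs of the documents that contain it. This is only a sketch of the idea. The function names are made up for illustration, and the structures Apache Lucene actually uses are far more elaborate.

```kotlin
// Toy inverted index: maps each lowercase term to the IDs of the documents containing it.
fun buildInvertedIndex(docs: Map<Int, String>): Map<String, Set<Int>> {
    val index = mutableMapOf<String, MutableSet<Int>>()
    for ((id, text) in docs) {
        text.lowercase()
            .split(Regex("\\W+"))          // split on anything that is not a letter, digit or underscore
            .filter { it.isNotEmpty() }
            .forEach { term -> index.getOrPut(term) { mutableSetOf() }.add(id) }
    }
    return index
}

// Looking up a term is a single map access instead of a scan over every document.
fun search(index: Map<String, Set<Int>>, term: String): Set<Int> =
    index[term.lowercase()] ?: emptySet()
```

The index is built once when documents are added, which is why lookups stay fast even when the amount of content grows.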
Elasticsearch uses shards to distribute data across the cluster. A shard is the unit in which the cluster organizes its data, and each index consists of one or more shards. If a disk suddenly stops working, replica shards on the other nodes of the cluster (provided replicas are configured) can be used to restore the data.
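How a document ends up on a particular shard can be sketched as "hash the routing value, then take it modulo the number of primary shards". The snippet below is a simplified illustration under that assumption. Elasticsearch uses a Murmur3 hash of the routing value (the document ID by default), and `String.hashCode()` is only a stand-in here. `shardFor` is a hypothetical helper, not part of any client.

```kotlin
// Simplified shard routing: the same document ID always maps to the same shard.
// Assumption: String.hashCode() stands in for the Murmur3 hash Elasticsearch really uses.
fun shardFor(documentId: String, numberOfPrimaryShards: Int): Int {
    require(numberOfPrimaryShards > 0) { "an index needs at least one primary shard" }
    // floorMod keeps the result in 0 until numberOfPrimaryShards, even for negative hash codes.
    return Math.floorMod(documentId.hashCode(), numberOfPrimaryShards)
}
```

Because the result depends on the number of primary shards, that number cannot simply be changed for an existing index without moving documents around.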
How to use Elasticsearch
Elasticsearch offers several ways to interact with it. There is the standard REST API, and there are API clients for most major programming languages. There is one for Java, which we could also use from Kotlin, but this client pulls in a lot of third-party dependencies. Most of these dependencies are uncommon in a standard Kotlin project, where they are normally replaced by their Kotlin alternatives. You could swap out the dependencies, but this would make the project unnecessarily complex.
The good news is that there is a great community-supported Kotlin client available on GitHub, kt-search. This client relies on standard Kotlin libraries that don’t interfere with the rest of the project.
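For comparison, talking to the REST API directly needs nothing beyond the JDK. The sketch below only builds a search request with `java.net.http` and does not send it. `buildSearchRequest`, the base URL and the index name are placeholders chosen for illustration.

```kotlin
import java.net.URI
import java.net.http.HttpRequest

// Builds (but does not send) a search request against the REST API.
// baseUrl and index are placeholders for your own cluster and index name.
fun buildSearchRequest(baseUrl: String, index: String, queryJson: String): HttpRequest =
    HttpRequest.newBuilder(URI.create("$baseUrl/$index/_search"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(queryJson))
        .build()
```

Sending it would be a call to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. Building queries and parsing responses by hand gets tedious quickly, which is where a client library helps.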
Creating the client
First of all, we need to declare the dependency in our build.gradle.kts file.
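A minimal sketch of what that declaration can look like, assuming the kt-search client. The repository URL and the `com.jillesvangurp:search-client` coordinates are taken from memory of the project's README and should be checked against it. The version is deliberately left as a placeholder.

```kotlin
// build.gradle.kts
repositories {
    mavenCentral()
    // Assumption: kt-search releases are published to this repository; verify in the README.
    maven("https://maven.tryformation.com/releases")
}

dependencies {
    // Replace <version> with the latest release listed in the kt-search README.
    implementation("com.jillesvangurp:search-client:<version>")
}
```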
After that, we can create our client like this:
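A minimal sketch of the client setup, assuming kt-search's `SearchClient` and `KtorRestClient` and a single local node. Host and port are placeholders for your own cluster.

```kotlin
import com.jillesvangurp.ktsearch.KtorRestClient
import com.jillesvangurp.ktsearch.SearchClient

// Host and port are placeholders; point them at your own Elasticsearch node.
val client = SearchClient(
    KtorRestClient(host = "localhost", port = 9200)
)
```

The client's functions are suspending, so they need to be called from a coroutine.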
The next thing to do is to create a data class which will be used to upload data as JSON to the index of Elasticsearch.
But first, we need to know what an index is. According to the Elastic website:
An index is like a ‘database’ in a relational database. It has a mapping which defines multiple types.
An index is a logical namespace which maps to one or more primary shards and can have zero or more replica shards.
Ok. So there are two concepts in that definition. First, an index is some type of data organization mechanism, allowing the user to partition data a certain way. The second concept relates to replicas and shards, the mechanism Elasticsearch uses to distribute data around the cluster.
As you can see, the indexing function takes a list of ElasticDocuments (which could be any data class) as an argument and loops through them. I have chunked this list into smaller pieces, because there is a practical limit on how large a single bulk request can get. A good value for “ES_BULK_CHUNK_SIZE” can only be found by experimenting and keeping an eye on performance.
After that, I call the bulk method of the client and index my ElasticDocument. Here it is important to know that you can use the DEFAULT_JSON provided by kt-search and don’t need to create your own JSON mapper. Each document is stored under an ID. That’s why we set the esDocument-id as the ID of the resource. Another thing to note is that you always need to specify the name of the index in which the document should be stored.
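The chunking step on its own can be sketched with the standard library's `chunked`. The `ElasticDocument` fields and the chunk size of 500 are assumptions made for this example, and `toBulkChunks` is a hypothetical helper.

```kotlin
// Hypothetical document type; the real ElasticDocument can be any data class.
data class ElasticDocument(val id: String, val title: String, val version: String)

// Assumed starting value; tune it by measuring indexing performance.
val ES_BULK_CHUNK_SIZE = 500

// Splits the documents into batches of at most ES_BULK_CHUNK_SIZE elements,
// so that each batch can be sent as one bulk request.
fun toBulkChunks(documents: List<ElasticDocument>): List<List<ElasticDocument>> =
    documents.chunked(ES_BULK_CHUNK_SIZE)
```

With 1201 documents this yields three batches of 500, 500 and 201 documents, and each batch is then sent in its own bulk request.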
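Under the hood, a bulk request is newline-delimited JSON: for every document there is one action line naming the index and the ID, followed by one line with the document itself. The sketch below builds those two lines by hand to show why both the index name and the ID are needed. `bulkIndexAction` is a hypothetical helper, and a client library does this for you.

```kotlin
// Builds the two NDJSON lines the bulk REST endpoint expects for one "index" action.
// sourceJson must be a single-line JSON object (for example produced by your JSON mapper).
fun bulkIndexAction(index: String, id: String, sourceJson: String): String {
    // Minimal escaping for quotes and backslashes; enough for simple index names and IDs.
    fun esc(s: String) = s.replace("\\", "\\\\").replace("\"", "\\\"")
    val actionLine = """{"index":{"_index":"${esc(index)}","_id":"${esc(id)}"}}"""
    return actionLine + "\n" + sourceJson + "\n"
}
```

Indexing a document under an ID that already exists replaces the stored document, which makes repeated indexing runs safe.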
Searching in the index
The most important part of Elasticsearch is the search itself. Elasticsearch provides many possible combinations of search filters, but I will only look at the one that I used in my project.
This function returns only the documents that contain parts of the title and the version we provide, in this case “Some Title” and “v1.2.3”.
You can configure the fuzziness of this search query by setting the fuzziness attribute to a different value.
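Expressed in the raw query DSL, a search like this is a bool query with two match clauses. The sketch below builds the JSON body as a string. The field names `title` and `version` and the helper `titleAndVersionQuery` are assumptions for illustration, and `"AUTO"` lets Elasticsearch pick the allowed edit distance based on the length of each term.

```kotlin
// Builds the JSON body of a bool query in which both match clauses must hold.
// "title" and "version" are assumed field names.
fun titleAndVersionQuery(title: String, version: String, fuzziness: String = "AUTO"): String {
    // Minimal escaping for quotes and backslashes in the user-provided values.
    fun esc(s: String) = s.replace("\\", "\\\\").replace("\"", "\\\"")
    return """
        {"query":{"bool":{"must":[
          {"match":{"title":{"query":"${esc(title)}","fuzziness":"${esc(fuzziness)}"}}},
          {"match":{"version":{"query":"${esc(version)}"}}}
        ]}}}
    """.trimIndent()
}
```

Passing `"0"` as fuzziness turns typo tolerance off for the title, while `"1"` or `"2"` allow that many character edits per term.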
If you want to match values exactly, you can use the term matcher. It compares the given value directly with the indexed values and returns the results. It doesn’t use any kind of fuzzy matching and is therefore ideal for deleting documents based on their identifying attributes.
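As a sketch, the JSON body of such a term query looks like this. The helper `termQuery` and the field name used in the example are hypothetical. Term queries work best on keyword fields, because text fields are analyzed (for example lowercased and split into words) when they are indexed.

```kotlin
// Builds the JSON body of an exact-match term query,
// usable for example with POST /<index>/_delete_by_query.
fun termQuery(field: String, value: String): String {
    // Minimal escaping for quotes and backslashes.
    fun esc(s: String) = s.replace("\\", "\\\\").replace("\"", "\\\"")
    return """{"query":{"term":{"${esc(field)}":{"value":"${esc(value)}"}}}}"""
}
```

For example, `termQuery("id", "doc-42")` selects exactly the documents whose `id` field holds the value `doc-42`.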
You can add more complexity by adding other keywords like should or by adding more parameters. Here is a list of what is possible with the Elasticsearch Query DSL.
If you want to see some more great examples, take a look at this blog post.
The .ids function at the end of the call returns only the IDs of the search results. If you need the whole object, you need to call .hit.hit.source.toString, which will return the object as a JSON string.
What went well
I think the actual implementation of the search, once I had figured out which client to use, went pretty well and was self-explanatory.
What needs improvement
I wasted way too much time trying to use the Java client instead of looking for a more suitable solution to my problem. The next problem was testing. At first, I could only test manually by running a local Docker container, which is far from optimal. After I asked my team for better solutions, I started using Testcontainers for my integration tests. With these tests, my code became much more reliable and I was finally able to write some more advanced logic for my project.