ElasticSearch: the COVID-19 problem

D F Catita
UpHill Health | Engineering & Design
5 min read · Apr 8, 2021

How is UpHill using ElasticSearch for complex search functionality?

If you visit www.uphillhealth.com you will notice that one of its main features is a search bar that allows searching for medical content: algorithms, events, clinical cases, articles, resources, collections, specialties, or institutions.

UpHill Health's front page.

Recently, with the ongoing COVID-19 pandemic (I'm sure you've heard of it), our team increased the amount of content regarding this new disease and noticed that some things were not quite right with the way our search results were presented:

  1. Searching for COVID, COVID-19 or COVID19 would present different results;
  2. Algorithms, the content we wanted to see prioritized, would not show up before other types of content;
  3. Searching for the title of something would not guarantee it showing up as the first result.

Powering up our search results is ElasticSearch — but what exactly is that?

ElasticSearch is an open source analytics and full-text search engine that allows for complex search functionality, including auto-completion, highlighting matches, handling synonyms, boosting relevance, etc. It is based on the Lucene library and provides an HTTP web interface and schema-free JSON documents (more: https://www.elastic.co).
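For context, "schema-free JSON documents" over an "HTTP web interface" means that storing something is as simple as sending a request like this (a made-up index and document, just to illustrate):

    PUT /medical-content/_doc/1
    {
      "title": "covid-19: diagnosis and treatment",
      "type": "algorithm"
    }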

At this point, I had heard of ElasticSearch and its power but had never actually implemented it, so I plunged headfirst into UpHill's current implementation of ElasticSearch and tons of documentation.

I started by learning that our content was stored in an index. The elastic website explains that an index would be something similar to a MySQL database — a concept easier to grasp — where documents are stored.

After figuring out what indexes we had, I searched for elastic’s HTTP web interface endpoints to help me figure out its settings. Luckily, that was an easy one: GET /{index_name}/_settings

Former index settings results.
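For reference, the response to that call generally has this shape when no custom analysis is configured (the values here are illustrative, not our former settings):

    {
      "{index_name}": {
        "settings": {
          "index": {
            "provided_name": "{index_name}",
            "number_of_shards": "5",
            "number_of_replicas": "1"
          }
        }
      }
    }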

As expected, these settings were kind of lacking and did not include the analyzers I had seen in other people’s examples all over the internet. The solution to some of my problems seemed to be close.

Analyzers determine how a string in a document is transformed into terms in the index. ElasticSearch comes with a bunch of analyzers out-of-the-box, but if no analyzers are suitable for your needs, custom ones can be created. For example, a whitespace analyzer divides the text into terms separated by whitespaces.

With the GET _analyze endpoint you can try out analyzers and see the outcome of a given analyzer (read more here). This is helpful for trying real-case scenarios and seeing exactly which analyzers you need to include.

With no analyzers in place, for ‘covid-19: diagnosis and treatment’ this would be the outcome:
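With the default standard analyzer (which is what you get when nothing is configured), the call and the resulting tokens look roughly like this:

    GET /_analyze
    {
      "analyzer": "standard",
      "text": "covid-19: diagnosis and treatment"
    }

    Tokens: covid, 19, diagnosis, and, treatment

Note how covid-19 is split into covid and 19, while covid19 would stay as a single term, which is why those searches returned different results.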

I knew I had to put some analyzers in:

We needed the hyphen to be discarded, the text to be split on whitespace, and accented characters (diagnosis in Portuguese is spelled diagnóstico) to be converted into their non-accented counterparts, with both versions kept as searchable tokens.

Now, because our elastic cluster was hosted on AWS, we couldn't just update our current index with the new analyzers, as that is not supported, which meant I had to create a new index altogether.

PUT {new_index_name}

{
  "settings": {
    "analysis": {
      "analyzer": {
        ...
      }
    }
  }
}

Our analysis would end up looking like this:

Current analysis settings.
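As a rough sketch of the idea (the analyzer and filter names here are my illustration, not necessarily our exact production settings), an analysis block covering those requirements could look something like this:

    PUT /{new_index_name}
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "strip_hyphen": {
              "type": "pattern_replace",
              "pattern": "-",
              "replacement": ""
            }
          },
          "filter": {
            "folding_keep_original": {
              "type": "asciifolding",
              "preserve_original": true
            }
          },
          "analyzer": {
            "content_analyzer": {
              "type": "custom",
              "char_filter": ["strip_hyphen"],
              "tokenizer": "standard",
              "filter": ["lowercase", "folding_keep_original"]
            }
          }
        }
      }
    }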

So now, testing the analyzer we can get the proper results:
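With the hypothetical content_analyzer from the sketch above, the test could look like this, with both the accented and non-accented forms kept as tokens:

    GET /{new_index_name}/_analyze
    {
      "analyzer": "content_analyzer",
      "text": "covid-19: diagnóstico e tratamento"
    }

    Tokens: covid19, diagnostico, diagnóstico, e, tratamento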

After creating this new index, we needed to clone the existing index’s contents to the new one. ElasticSearch also helps with that:

POST /_reindex

{
  "source": {
    "index": <old_index_name>
  },
  "dest": {
    "index": <new_index_name>
  }
}

With this request, all the content is cloned and the new index is ready to be used.
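A quick sanity check after reindexing is to compare document counts between the old and new index:

    GET /{old_index_name}/_count
    GET /{new_index_name}/_count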

And with that, I had solved problem number 1! The other problems would be fixed with a change to how the query to ElasticSearch was being made.

Elastic provides its own DSL (Domain Specific Language) for queries, based on JSON that contains all the needed query expressions, sometimes wrapped in each other (see here).

The queries in elastic contain a bunch of expressions with Boolean operators (AND, NOT, and OR) and support fields, ranges, wildcards, regex, fuzzy, and terms (exact matches).
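In the DSL those operators map onto a bool query, where must behaves like AND, should like OR, and must_not like NOT; the general shape (illustrative only) is:

    POST /{index_name}/_search
    {
      "query": {
        "bool": {
          "must": [ ... ],
          "should": [ ... ],
          "must_not": [ ... ]
        }
      }
    }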

To meet the requirements and fix problem number 2, we would have to boost the exact matches to be the first result AND boost results that represent algorithms.

In the end, we now have a combination of a fuzzy query (a multi-match, since it queries across multiple fields) and several wildcard queries.

A fuzzy query uses similarity based on the Levenshtein edit distance, and we are allowed to set the maximum edit distance, among other settings (more details here). One of the settings we can leverage here is "boost", so that if a match is found, its score is multiplied by a factor. With this, I was able to boost fields unique to algorithms and make those results pop up first.

As the name implies, a wildcard query contemplates placeholders for one or more characters (see here), which allows the search term 'covid algorithm' to match the title 'a covid algorithm'.
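In DSL terms, the combined query ends up with roughly this shape (the field names and boost values here are illustrative, not our exact production values):

    POST /{index_name}/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "bool": {
                "should": [
                  { "wildcard": { "title": { "value": "*covid*", "boost": 10 } } },
                  {
                    "multi_match": {
                      "query": "covid",
                      "fields": ["title^7", "description"],
                      "fuzziness": 2,
                      "operator": "and"
                    }
                  }
                ]
              }
            }
          ]
        }
      }
    }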

At UpHill, we are using Java and the ElasticSearch packages to generate the queries, combining 'shoulds' and 'musts' that represent 'ors' and 'ands' respectively, so our query building looks pretty much like this:

// Fuzzy multi-match across several fields, with per-field boosts (title weighs more)
MultiMatchQueryBuilder fuzzyQuery = QueryBuilders.multiMatchQuery(keyword)
        .field("title", 7)
        ...
        .fuzziness(Fuzziness.TWO)
        .prefixLength(2)
        .maxExpansions(10)
        .operator(Operator.AND)
        .analyzer("standard");

// 'should' clauses behave like ORs; exact-ish title matches get an extra boost
BoolQueryBuilder query = new BoolQueryBuilder()
        .should(QueryBuilders.wildcardQuery("title", "*" + keyword + "*").boost(10))
        ...
        .should(fuzzyQuery);

// 'must' behaves like AND and wraps the whole thing
BoolQueryBuilder finalQueryBuilder = new BoolQueryBuilder()
        .must(query);

With these small changes, we improved the way search works on our website and made it easier and faster for users to find and use the content they were looking for instead of scrolling a few pages before finding it.

Notice how the search has no hyphen, no colon, and is not in lower case, but the right content still shows up as the first result? Now that's awesome!

As an extra tip, during development, I wanted to see the exact query I was building, and so I added these properties to our application.properties file (we are using Spring):

logging.level.org.springframework.data.elasticsearch.client.WIRE=trace
logging.level.org.springframework.data.elasticsearch.core=DEBUG
logging.level.org.elasticsearch.index.search.slowlog.query=INFO
spring.data.elasticsearch.properties.index.search.slowlog.threshold.query.info=1ms
logging.level.tracer=TRACE

We are then able to copy the query that gets printed and use it directly in Postman. Additionally, you can add the explain functionality (check it) to understand how elastic arrived at the presented results.
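One way to do that is to set the explain flag directly on the search request you paste into Postman:

    POST /{index_name}/_search
    {
      "explain": true,
      "query": {
        ...
      }
    }

Each hit then comes back with an _explanation describing how its score was computed.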

Thanks for reading!
