Making your search not suck with Elasticsearch — Part 2: Elasticsearch is not magic

Alex Denton
4 min read · Apr 18, 2017


This is part 2 of a multi-part series. In this series I will be explaining important concepts in Elasticsearch and using a demo app I built to demonstrate these concepts. You can follow me for updates as they come out over the next few weeks. If you’d like to start at the beginning click here. You can also find a full index of the series at the end of this post.

In my last post I explained what an index is and why it’s necessary. Today we’re going to start actually using Elasticsearch to search content with the default settings, and then we’ll see why, unfortunately, Elasticsearch is not magic.

Since I’m going to start using my demo app today, I want to explain my setup and how it all works together. The app is composed of three main parts: an Elasticsearch index, a small REST API that communicates with the Elasticsearch index, and a small React app that communicates with the API. Once again you can find all of this on my GitHub here (note: none of this was made to be beautiful, production-ready code, so please don’t judge me too harshly for some of the rough edges).

The index itself is from an online data source I found of approximately 80,000 films (which is probably somewhere between a third and a sixth of all movies ever made). It has three fields: an ID, the name of the film, and a plot summary.

The React app is pretty straightforward:

You search on the left and results appear on the right. There are settings below the search box but we’re not going to worry about those today.

So that’s my setup. To start things off, let’s take the example above of searching for “star wars”. This search uses all of Elasticsearch’s default settings. The results look somewhat relevant, but they’re definitely not the results that you and I expect when we search for “star wars”.
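For readers who want to see what a default-settings search might look like on the wire, here is a minimal sketch of a request body using Elasticsearch’s `match` query. The helper `build_default_search` and the field name `name` are assumptions made for illustration; the demo API may build its query differently.

```python
import json


def build_default_search(text, field="name"):
    """Build a plain `match` query body with no custom analyzers or boosts.

    `field` is a hypothetical field name for the film title.
    """
    return {"query": {"match": {field: text}}}


body = build_default_search("star wars")
# This JSON is what would be sent to the index's _search endpoint.
print(json.dumps(body, indent=2))
```

Nothing here tells Elasticsearch how the text should be analyzed or ranked, which is exactly what “default settings” means in this post.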

Unfortunately, Elasticsearch is not magic. You can’t just give it your data and expect it to know the best way to index and search it. Over the course of this blog series I’m going to be making iterative tweaks to the search settings to progressively improve the search results until eventually they match the results we expect.

The default settings are pretty broken for actually searching human-readable content. In fact, they’re much more broken than they appear to be in the example above. To illustrate what I mean, I’ll search for “star war” instead of “star wars”:

And you can see the results get much worse. There’s nothing in the top results that even references the Star Wars films. Why did that happen?

In order to understand why this happened we need to understand exactly how Elasticsearch actually searches an index. Let’s assume we add “Star Wars: The Empire Strikes Back” to our example index so that now our index looks like this:

What we’re doing when we ask Elasticsearch to search for “star wars” is saying: go to the row in the index for “star” and return all the documents at that row, and then go to the row for “wars” and return all the documents at that row. If a document is in both of these result sets it should be weighted higher and returned at the top of the search results.

Specifically, when searching for “star wars” in this index we go to the row for “star” and we find that document 4 is there, and then we go to the row for “wars” and we find that document 4 is also there. Therefore document 4 will be returned at the top of the search results.

If, however, we search for “star war” then we go to the row for “star” and we find that document 4 is there, and then we go to the row for “war” and… it doesn’t exist. We don’t have any documents in our index that contain the word “war”; we only have a row for “wars”. Therefore searching “star war” only matches on “star” and not on “war”. This is why we didn’t get the results we expect.
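The two lookups described above can be sketched with a toy inverted index in Python. This is a simplified model, not how Elasticsearch is implemented: documents 1 to 3 are made-up placeholders, the tokenizer just lowercases and splits on non-alphanumeric characters, and the score is simply the number of query terms a document matches (real relevance scoring is more involved).

```python
import re
from collections import Counter, defaultdict

# Documents 1-3 are hypothetical placeholders; document 4 is from the post.
DOCS = {
    1: "The Old Man and the Sea",
    2: "Old Yeller",
    3: "Sea of Love",
    4: "Star Wars: The Empire Strikes Back",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term (a "row") to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, title in docs.items():
        for term in tokenize(title):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Look up each query term exactly and rank documents by terms matched."""
    hits = Counter()
    for term in tokenize(query):
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


index = build_index(DOCS)
print(search(index, "star wars"))  # [(4, 2)] -> both terms match document 4
print(search(index, "star war"))   # [(4, 1)] -> only "star" matches
print("war" in index)              # False -> there is no row for "war"
```

In this tiny index document 4 still comes back for “star war”, but with half the score. In an index of 80,000 films, every other document containing “star” ties with it, which is how the Star Wars films can fall out of the top results.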

It might seem like Elasticsearch is being obtuse, but remember that the whole point of building an index is to do really quick lookups. If Elasticsearch suddenly had to consider every variation of “war” (such as “wars”) on each search, it would introduce complexity and ultimately slowness into our search algorithm.
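To make that trade-off concrete, here is a rough sketch contrasting an exact row lookup with a naive “find anything that looks like this term” scan. The index contents and both function names are hypothetical, and the prefix scan is a deliberately crude stand-in chosen only to show the cost; it is not what Elasticsearch does.

```python
# Hypothetical index: term -> set of document IDs.
INDEX = {
    "star": {4},
    "wars": {4},
    "empire": {4},
    "warden": {9},  # made-up document, added to show over-matching
}


def exact_lookup(index, term):
    """One dictionary lookup: fast, but only finds the exact term."""
    return index.get(term, set())


def variant_lookup(index, term):
    """Naive variant matching: inspects every term in the index."""
    matches = set()
    for indexed_term, doc_ids in index.items():
        if indexed_term.startswith(term):
            matches |= doc_ids
    return matches


print(exact_lookup(INDEX, "war"))    # set() -> no row for "war"
print(variant_lookup(INDEX, "war"))  # documents 4 and 9
```

The scan finds “wars”, but it has to touch every row in the index on every search, and it also drags in “warden”, which has nothing to do with war. Doing this at query time is both slower and less precise than a direct lookup.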

Luckily, there is a pretty straightforward solution to this problem, but to explain it properly I’ll need to introduce the concept of analysis as it relates to search engines.

That’s probably a good stopping point for now. In my next post I’ll explain the concept of analysis and how you can use different analyzers to solve the problem I illustrated above.

Look out for the next part in the series Making your search not suck with Elasticsearch — Part 3: Analysis Paralysis.

Until then, happy searching!

The “Making your search not suck with Elasticsearch” series:
Part 1: What is an index?
Part 2: Elasticsearch is not magic
Part 3: Analysis Paralysis
Part 4: Overanalyzing it
Part 5: Are we still doing phrasing?
Part 6: Totally irrelevant
Part 7: Machines that learn
