Making your search not suck with Elasticsearch — Part 1: What is an index?

Alex Denton
Apr 17, 2017

This is part 1 of a multi-part series. In this series I explain important concepts in Elasticsearch, using a demo app I built to demonstrate them. You can follow me for updates as they come out over the next few weeks. If you’d like to start at the beginning, click here. You can also find a full index of the series at the end of this post.

This past weekend I spoke at Orlando Code Camp about my experience with using Elasticsearch. I’ve been meaning to blogify this topic for a while, and since I’ve been thinking about it a lot over the last couple of weeks, now seems like the right time.

I should say something upfront: I am by no means an expert on search engines or Elasticsearch. I had the luxury of a pretty intense 6-month deep-dive, so my knowledge is beyond superficial, but there are definitely people out there who know way more than I do, and I’m not going to impress them. But for those of you who know absolutely nothing or are just getting started, I think I can offer some insight that will help you see the bigger picture.

One of the great things about Elasticsearch is that it has good documentation and a strong community around it. With that said, it’s still a big, complicated beast and there’s a lot to learn. While there is plenty of good documentation about how to use Elasticsearch and what you can do with it, there’s not much out there about how to put it all together and make a search experience that doesn’t suck. That’s what I hope to accomplish with this blog series.

One more piece of housekeeping before I dive in. I created a demo app as a companion to my talk to help illustrate my points. It’s on my GitHub here, and I recommend you check it out if you want to see more of the technical details or just want to play around with a working search app.

With that said, let’s get started.

In this series I will try not to assume prior knowledge, which is why we’re going to start with the basics: what is an index?

In order to answer that question I want to think about the simplest way you could possibly search for something. Let’s say we have some text like:

The quick brown fox jumped over the lazy dog

And we want to search for the word “fox”. The simplest way we could do that would be something like:

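// Returns true if any word in the text matches the query.
static bool ContainsWord(string text, string query)
{
    var words = text.Split(' ');
    foreach (var word in words)
    {
        if (word == query)
        {
            return true;
        }
    }
    return false;
}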

Here we iterate over each of the words in the text, and if a word matches our query we have a match and can return that document. There are a lot of things wrong with this, but the biggest problem is that it’s just too slow. If we had a million documents with a hundred lines each, it would take too long to go document-by-document, line-by-line, and word-by-word on every single query. So what do we do about it? How do we make it fast?
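To make the cost concrete, here is a minimal sketch of that brute-force approach applied to a whole collection. The names `ContainsWord` and `NaiveSearch`, the case-insensitive comparison, and the two extra sample sentences are my own assumptions for illustration; the point is only that every query has to walk every word of every document.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: true if any word in the text equals the query.
// Ignoring case is an assumption so that "Fox" would also match "fox".
bool ContainsWord(string text, string query)
{
    foreach (var word in text.Split(' ', StringSplitOptions.RemoveEmptyEntries))
    {
        if (string.Equals(word, query, StringComparison.OrdinalIgnoreCase))
        {
            return true;
        }
    }
    return false;
}

// Naive search: visit every document on every query and collect matching IDs.
List<int> NaiveSearch(IReadOnlyList<string> documents, string query)
{
    var matches = new List<int>();
    for (var i = 0; i < documents.Count; i++)
    {
        if (ContainsWord(documents[i], query))
        {
            matches.Add(i + 1); // document IDs start at 1
        }
    }
    return matches;
}

// The first sentence is from the post; the other two are made up.
var documents = new[]
{
    "The quick brown fox jumped over the lazy dog",
    "A sentence with nothing relevant in it",
    "Another fox sighting",
};

Console.WriteLine(string.Join(", ", NaiveSearch(documents, "fox"))); // 1, 3
```

The work done per query grows with the total number of words stored, which is exactly what becomes painful at a million documents.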

Well, we do what we always do as smart programmers: we change the way the data is stored. For example, let’s say we have documents that look like this:

1. The quick brown fox jumped over the lazy dog
2. The Cat in the Hat
3. Fox in Socks

If we had to search for all the documents that contain the word “fox” in this format it would be slow and cumbersome. But what if we “inverted” the way we store this data? What if we created a new data structure by iterating over each word in each document? If we’ve never seen the word before, we add a new row with the word and the ID of the document; if we have seen the word before, we just add the document ID to the existing row. If we did that for the documents above (treating “Fox” and “fox” as the same word) we’d end up with something that looks like this:

word     document IDs
the      1, 2
quick    1
brown    1
fox      1, 3
jumped   1
over     1
lazy     1
dog      1
cat      2
in       2, 3
hat      2
socks    3

Now if we want to find all the documents that contain the word “fox”, we just go to the row for “fox” and read off an already compiled list of those documents. Instead of scanning every document we do a single lookup, so it’s now a nearly instantaneous operation.
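The steps described above can be sketched in a few lines. This is a toy version under stated assumptions: the `BuildIndex` name is mine, and lowercasing plus splitting on spaces stands in for the far more sophisticated text processing a real search engine does.

```csharp
using System;
using System.Collections.Generic;

// Build a toy inverted index: each word maps to the sorted set of IDs of
// the documents that contain it. Lowercasing and splitting on spaces are
// simplifying assumptions, not how Elasticsearch processes text.
Dictionary<string, SortedSet<int>> BuildIndex(IReadOnlyList<string> documents)
{
    var index = new Dictionary<string, SortedSet<int>>();
    for (var i = 0; i < documents.Count; i++)
    {
        var docId = i + 1; // document IDs start at 1
        var words = documents[i].ToLowerInvariant()
            .Split(' ', StringSplitOptions.RemoveEmptyEntries);
        foreach (var word in words)
        {
            if (!index.TryGetValue(word, out var ids))
            {
                ids = new SortedSet<int>(); // never seen this word: add a new row
                index[word] = ids;
            }
            ids.Add(docId); // a set ignores repeats within the same document
        }
    }
    return index;
}

var documents = new[]
{
    "The quick brown fox jumped over the lazy dog",
    "The Cat in the Hat",
    "Fox in Socks",
};
var index = BuildIndex(documents);

// Searching is now one dictionary lookup instead of a scan over every document.
var foxDocs = index.TryGetValue("fox", out var hits) ? hits : new SortedSet<int>();
Console.WriteLine(string.Join(", ", foxDocs)); // 1, 3
```

The up-front cost moves to indexing time: we pay once to build the structure so that each query afterwards is cheap.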

This data structure is called an “index”. Technically it’s an inverted index, but colloquially it’s just called an index. Chances are, if you know anything about search engines, you’ve at least heard of an index. Obviously a real index is a lot more complicated than this, but this is the key intuition behind what an index is and why it’s needed.

The index is really the heart of how a search engine works. I wanted to start with this because understanding indexes is pretty much required knowledge for making a search experience that doesn’t suck.

I want to stop here for now. In my next post we’ll start actually searching with Elasticsearch and we’ll see that unfortunately Elasticsearch is not magic.

Look out for the next part in the series Making your search not suck with Elasticsearch — Part 2: Elasticsearch is not magic.

Until then, happy searching!

The “Making your search not suck with Elasticsearch” series:
Part 1: What is an index?
Part 2: Elasticsearch is not magic
Part 3: Analysis Paralysis
Part 4: Overanalyzing it
Part 5: Are we still doing phrasing?
Part 6: Totally irrelevant
Part 7: Machines that learn
