Making your search not suck with Elasticsearch — Part 7: Machines that learn
This is part 7 of a multi-part series. In this series I will be explaining important concepts in Elasticsearch and using a demo app I built to demonstrate these concepts. You can follow me for updates as they come out over the next few weeks. If you’d like to start at the beginning click here. You can also find a full index of the series at the end of this post.
Last month I gave a talk at Orlando Devs that was basically this blog series in talk form. It’s really come full circle considering I started the blog series after giving a talk at Orlando Code Camp back in April. Writing the blog series has definitely helped crystallize my thoughts and the feedback I’ve received has helped me communicate it better. I’m very grateful to those of you who’ve followed along and provided that feedback.
Giving the talk forced me to think about how to end the series. When I started this series I originally intended to end it here. Along the way I’ve realized there’s a lot more I’d like to cover about Elasticsearch. That said, there are a lot of other topics I’d like to talk about as well. So I’ve decided to make this the last post of the series and leave the remaining topics for a future series.
I think it’s an appropriate place to wrap things up. The topics covered in the previous 6 parts are a very good starting point that will get you 90% of the way there in most use-cases. Today’s topic is a nice cherry on top.
In my last post we learned about how Elasticsearch calculates relevance and we found that unfortunately it doesn’t know anything about cultural relevance. Today we’re gonna learn about how we can teach Elasticsearch about cultural relevance. More generally we’re going to learn how we can boost our search results by popularity to improve our search experience.
It’s actually a pretty simple idea. Imagine that every time a user searches for something in your application and selects a result, you record which result was selected, so that your index ends up with a little more data that looks like this:
"name": "Star Wars Episode IV: A New Hope",
"name": "Saving Star Wars",
Over time what you’re doing is adding extra data to your index that represents what your users consider relevant. The cool thing is you can pretty much apply this to any application and any data set. In the case of movies we might have seeded the index with something like box office numbers but it’s the same basic idea.
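To make the recording step a bit more concrete, here’s a minimal sketch of what a “user clicked this result” update could look like. It assumes the counter lives in a field called popularity and that you send the body to Elasticsearch’s Update API for the selected document. The function names are made up for illustration and this is not the code from my demo app.

```python
def click_update_body(increment=1):
    """Build a request body for Elasticsearch's Update API that bumps
    a document's popularity counter (field name is an assumption)."""
    return {
        "script": {
            "lang": "painless",
            # Create the counter on the first click, add to it afterwards.
            "source": (
                "if (ctx._source.popularity == null) "
                "{ ctx._source.popularity = params.inc } "
                "else { ctx._source.popularity += params.inc }"
            ),
            "params": {"inc": increment},
        }
    }


def apply_click(doc, increment=1):
    """Pure-Python mirror of the script above, handy for reasoning
    about what the update does without a running cluster."""
    updated = dict(doc)
    updated["popularity"] = updated.get("popularity", 0) + increment
    return updated


if __name__ == "__main__":
    print(click_update_body())
    print(apply_click({"name": "Saving Star Wars"}))
```

In a real application you’d probably batch these updates rather than fire one per click, and the exact URL of the update endpoint depends on your Elasticsearch version.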
Using the data
We can then use this data with another feature of Elasticsearch called function scoring. Function scoring allows you to apply a mathematical operation to your relevance scores. For example, you could multiply all of your relevance scores by two. That wouldn’t be very useful on its own, but combining function scoring with a feature called the field value factor function is very useful. The field value factor function allows you to use values stored in your index to modify the relevance scores of your search results. Specifically, it allows you to multiply the relevance score of each search result by its popularity so you can effectively boost the more popular results over the less popular ones. Mathematically that looks something like:
new_score = old_score * log(1 + popularity)
Pretty simple. You might notice we’re logarithmically scaling the popularity boost. That’s pretty important because without it your popular content could be easily over-boosted and returned as the first result in unrelated searches. Applying the log function to the popularity helps prevent that from happening.
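Here’s a rough sketch of how that could look as an actual query, plus the same math in plain Python so you can see the dampening effect. It assumes the counter is stored in a field called popularity, uses a plain match query on name for brevity (rather than the phrase matching from part 5), and relies on the log1p modifier, which takes the base-10 log of 1 + the field value. The function names and the example scores are made up for illustration.

```python
import math


def popularity_boosted_query(text, field="name", popularity_field="popularity"):
    """Build a function_score query body that multiplies each result's
    relevance score by log10(1 + popularity)."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {field: text}},
                "field_value_factor": {
                    "field": popularity_field,
                    "modifier": "log1p",  # log10(1 + value)
                    "missing": 0,         # treat documents without the field as 0
                },
                "boost_mode": "multiply",  # new_score = old_score * factor
            }
        }
    }


def boosted_score(old_score, popularity):
    """The formula from the post, assuming a base-10 logarithm."""
    return old_score * math.log10(1 + popularity)


if __name__ == "__main__":
    # A relevant but modestly popular document vs. a barely relevant blockbuster.
    relevant = (2.0, 10)
    unrelated = (0.5, 1000)

    # Multiplying by raw popularity lets the blockbuster win (20 vs. 500).
    print(relevant[0] * relevant[1], unrelated[0] * unrelated[1])

    # With the log, the relevant document stays on top (roughly 2.08 vs. 1.50).
    print(boosted_score(*relevant), boosted_score(*unrelated))
```

One thing to watch for: with this exact formula a document with a popularity of zero ends up with a score of zero. If that’s a problem for your data, the log2p modifier or a boost_mode of sum are options worth looking at.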
Trying it out
Remember that back in part 5 we started using phrase matching with the standard analyzer and the snowball analyzer but the results were still not quite what we expected when we searched for “star wars”:
When I search for “star wars” I expect to see the movies right at the top. Let’s see if we can use popularity boosting to make that happen. What I’ve done is faked the data to artificially inflate the popularity field for the Star Wars movies in the index. Using popularity boosting in my demo app we get:
In the words of young Anakin Skywalker: yipeeeee! That’s the kind of results I expect when I search for “star wars”.
What we’ve done is a very simple and crude way of training your data set and there are certainly more advanced things you can do with it. The Elasticsearch team itself has been hard at work trying to make Elasticsearch a platform for machine learning. But I would say once again that this approach will get you pretty far in most use-cases.
It’s been a long but fun road getting to this point. I think it’s worth doing a recap of the whole series. My original motivation for doing this series was to communicate some of the key insights that helped me understand how to effectively work with Elasticsearch and help others avoid some of the pitfalls that I fell into.
So here’s what we covered in a nutshell:
- An index is basically a data structure that “inverts” the way content is stored so that you can quickly find which documents contain a given term
- The snowball analyzer can be used to help find non-exact matches but…
- It should be used in conjunction with the standard analyzer so that exact matches will be ranked over non-exact ones
- Use phrase matching so Elasticsearch will take into account the phrasing of your query in relation to the documents
- Relevance is calculated based on three factors: how many term matches there are, how large the document is, and how rare the matched query term is
- Use popularity boosting as the cherry on top if your search results aren’t quite good enough
And that’s where I’ll leave it for now. As I said there are a lot of other things I could say about Elasticsearch. Some things I know for sure I’d like to cover eventually include edge ngrams, highlights, and suggestions. It’s my hope that the series has been a good primer for someone just getting started with Elasticsearch. If you liked it I hope you’ll follow along for future updates about Elasticsearch and more.
Once again, thanks for reading and happy searching!
The “Making your search not suck with Elasticsearch” series:
Part 1: What is an index?
Part 2: Elasticsearch is not magic
Part 3: Analysis Paralysis
Part 4: Overanalyzing it
Part 5: Are we still doing phrasing?
Part 6: Totally irrelevant
Part 7: Machines that learn