Fun with Search Algorithms

Roberta Doyle
Published in Unsplash Blog
3 min read, Aug 13, 2018
Photo by Annie Theby on Unsplash

At Unsplash we’re constantly working on improving our search. This is obviously not a surprise as search lies at the heart of our platform. As a team, we work on 6-week roadmaps or sprints.

Overall, we’ve been taking a hard look at our search algorithm and evaluating how to scale it more efficiently. During our most recent roadmap, we dedicated some time to testing different approaches to a challenge that we have in search: incorporating confidence in photo tags.

I ended up spending a few days working on a prototype using FunctionScore in Elasticsearch. I’ve been actively working with Elasticsearch for more than a year now, but my co-workers and I often joke about the wide range of emotions we feel using ES: from feeling like a rockstar one second to feeling like a total neophyte the next, one who still hasn’t learned a fraction of the potential that lies within ES.

As I was working on this prototype, I learned a few more things about Elasticsearch, and I thought it would be useful to share them.

Index with Nested Datatype

I’ve sent PUT requests to the mapping API before to add fields to an existing index, but I hadn’t done that with the nested datatype specifically. I wanted to store each tag in the index together with its confidence, and to maintain that relationship I looked into the nested datatype, which indexes an array of JSON objects so that each object can be queried independently of the others. I came to realize that if I wanted a nested datatype I needed to delete and recreate the index, defining the nested datatype in the initial mapping.
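To make the idea concrete, here is a minimal sketch of what such an initial mapping could look like, written as a Python dict that serializes to the request body. The index name `photos` and the field names `tags`, `title` and `confidence` are assumptions for illustration, not our real mapping, and the typeless `mappings` layout assumes Elasticsearch 7 or later (older versions expect a mapping type name one level down).

```python
import json

# Hypothetical index name; the post does not show the real one.
INDEX_NAME = "photos"


def build_index_body():
    """Return a create-index body whose `tags` field uses the nested datatype."""
    return {
        "mappings": {
            "properties": {
                "tags": {
                    # `nested` keeps each tag object as its own hidden document,
                    # so a tag's title stays tied to that same tag's confidence.
                    "type": "nested",
                    "properties": {
                        "title": {"type": "text"},        # assumed field name
                        "confidence": {"type": "float"},  # assumed field name
                    },
                }
            }
        }
    }


# Illustrative document shape only; the values are made up.
sample_photo = {
    "tags": [
        {"title": "dog", "confidence": 0.97},
        {"title": "beach", "confidence": 0.42},
    ]
}

if __name__ == "__main__":
    # The body would be sent when creating the index.
    print(f"PUT /{INDEX_NAME}")
    print(json.dumps(build_index_body(), indent=2))
```

Without `nested`, the array would be flattened into separate lists of titles and confidences, and a query could no longer tell which confidence belongs to which tag.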

Highlighting

Over the past year I’ve become familiar with the explain API and I’ve used it several times. It’s a great resource for debugging why a photo scores high or low in Elasticsearch for a specific query. I’ve also used it when an unexpected photo shows up in the results, as it shows an explanation of the scoring calculation.
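As a small illustration of the explain workflow, the sketch below turns on `explain` for a search request body. The `description` field and the example query are assumptions; `explain` itself is a standard search option, and with it each hit in the response carries an `_explanation` object describing how its score was computed.

```python
import json


def with_explain(search_body):
    """Return a copy of a search body with per-hit score explanations enabled."""
    body = dict(search_body)  # shallow copy so the caller's dict is untouched
    body["explain"] = True
    return body


# Hypothetical query against an assumed `description` field.
query = {"query": {"match": {"description": "dog"}}}

if __name__ == "__main__":
    print(json.dumps(with_explain(query), indent=2))
```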

What I liked about highlighting is that it allows you to understand how a query matches a certain field. In this case, I wanted to look at the confidence of the tags on the photos that were being returned in the search results.

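When requesting highlights in the search query, each document returned gets an additional object in the response: highlight. For a nested field, the related inner_hits option on the nested query adds an inner_hits object to each document, containing the nested objects that actually matched, which is what lets you see a matching tag alongside its confidence.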
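Here is a hedged sketch of a nested query that asks for `inner_hits`, with highlighting requested inside them. The `tags` path and the `tags.title` field are assumed names, and the helper that reads the response assumes the Elasticsearch 6+ behaviour where the `_source` of a nested inner hit is the nested object itself.

```python
import json


def build_tag_query(term):
    """Build a nested query on the assumed `tags` field that returns inner hits."""
    return {
        "query": {
            "nested": {
                "path": "tags",
                "query": {"match": {"tags.title": term}},
                # inner_hits returns the nested objects that matched;
                # highlighting is requested inside the inner hits.
                "inner_hits": {
                    "highlight": {"fields": {"tags.title": {}}}
                },
            }
        }
    }


def matched_tags(hit):
    """Return the nested tag objects that matched for one search hit."""
    nested_hits = (
        hit.get("inner_hits", {})
        .get("tags", {})
        .get("hits", {})
        .get("hits", [])
    )
    return [nested["_source"] for nested in nested_hits]


if __name__ == "__main__":
    print(json.dumps(build_tag_query("dog"), indent=2))
```

Calling `matched_tags` on each entry of `hits.hits` in the response gives back only the tags that matched the query, so the confidence of the matching tag can be read directly.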

Profiling

Each person in the backend team was working on a different prototype or idea. We looked at the search results later to compare quality and performance.

Adding profile: true to the search query lets you see how long each part of your query takes to execute. You can also see how the query is translated to Lucene, with the breakdown reported per shard.
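For comparing prototypes, something like the following sketch could be used to enable profiling and total up query time per shard. The example query and the `description` field are assumptions, and the response layout (`profile.shards[].searches[].query[].time_in_nanos`) is what recent Elasticsearch versions return; older versions may report timings differently.

```python
import json


def with_profile(search_body):
    """Return a copy of a search body with the Profile API enabled."""
    body = dict(search_body)  # shallow copy so the caller's dict is untouched
    body["profile"] = True
    return body


def shard_query_times(response):
    """Sum the top-level query timings (in nanoseconds) for each shard."""
    times = {}
    for shard in response.get("profile", {}).get("shards", []):
        total = 0
        for search in shard.get("searches", []):
            # Each entry is one top-level Lucene query; its time_in_nanos
            # already includes the time spent in its children.
            for lucene_query in search.get("query", []):
                total += lucene_query.get("time_in_nanos", 0)
        times[shard["id"]] = total
    return times


# Hypothetical query against an assumed `description` field.
query = {"query": {"match": {"description": "dog"}}}

if __name__ == "__main__":
    print(json.dumps(with_profile(query), indent=2))
```

The `type` and `description` fields on each profiled query entry show the Lucene query the request was rewritten into, which is useful when two prototypes return similar results at very different costs.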

Next on the agenda is to explore the impact of a FunctionScore in a complex ES query and decide what can be done at index-time that could give us some performance gains at query time.

We know how important search is and we’re working hard to make constant improvements. I want to thank our incredible community for reporting the issues they find and for their patience while we fix some issues and work to get our search to the next level. 💥
