A Complete Guide to Lightning-Fast Autocomplete Search Suggestions

Piyush Arya
9 min read · Sep 23, 2018


So after binge-watching the first season of The Marvelous Mrs. Maisel, which was long overdue, I wondered: what else?

-Binge Watcher

Now that you have read the completely unrelated opening lines of a tech blog, the content that follows can't possibly disappoint you, so read on.

I chose something as simple as autocomplete search suggestions over deep learning, AI and big data because:
1. The search box at the top of the website/app is the most fundamental feature of any product.
2. Everyone uses it. It is a gateway to your product.
3. There is little documentation on its implementation and optimization.
4. My knowledge of ML is not yet worth writing a blog about.

Just so that we are on the same page, the suggested searches are what we are trying to implement. On a lighter note, I still don't understand why Google has an I'm Feeling Lucky button.

I have divided this guide into multiple sections to make it useful for multiple use cases, from a quick-and-dirty solution in 15 minutes to an advanced, lightning-fast microservice. I will start with a vanilla implementation and improve upon usability, speed and scale.

To begin with, an autocomplete service has two challenges:
1. Speed: a very low response time (~100ms).
2. Ranking: due to the limited real estate, only 6–10 suggestions can be shown, so ranking them optimally is a challenge.

A peek into implementation

There are two parts to the service

  1. Client-side: responsible for sending keystrokes, caching on the client side and rendering the results. All of this is bundled nicely into the jQuery UI Autocomplete widget, and a fully working snippet can be found in its official demo.
  2. Server-side: responsible for returning ranked results for a query. If your data size exceeds a couple of MBs, this is where the action happens. I will dig deep into this API implementation for the rest of the article. Let's say this API is

GET /service/suggestions?q={term}

Implementation for API

Let's start with the quickest solution to get up and running in a couple of minutes. Use any SQL database to store data. A simple LIKE query is all you need.

SELECT name FROM movies WHERE name LIKE '%term%' LIMIT 10;
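For reference, here is a minimal sketch of the whole endpoint in PHP using PDO. The DSN, the credentials and the movies(name, img) schema are my assumptions, not part of the original setup.

<?php
// GET /service/suggestions?q={term} : naive LIKE-based lookup (sketch)
$pdo  = new PDO('mysql:host=localhost;dbname=movies_db;charset=utf8mb4', 'user', 'pass');
$term = $_GET['q'] ?? '';

// Bind the pattern instead of interpolating it, to avoid SQL injection.
$stmt = $pdo->prepare('SELECT name, img FROM movies WHERE name LIKE ? LIMIT 10');
$stmt->execute(['%' . $term . '%']);

header('Content-Type: application/json');
echo json_encode($stmt->fetchAll(PDO::FETCH_ASSOC));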

That's it! Your implementation is as good as 90% of the websites out there, but keep reading if you plan to write software for human beings.

The problem with the above implementation is the ranking of the results. Let's look at the results from the jQuery UI demo.

Snapshot of the jQuery UI demo while searching for Hawk. The results marked in red are unintuitive because "ha" matches in the middle of a word instead of at a word boundary.

I am trying to search for hawk. On typing "ha" it shows everything that matches "ha", and to make it worse, anything with hawk doesn't even show up due to the limited placeholders.
Practically speaking, if somebody starts typing in the search box, the query should match at the word boundaries of all possible suggestions instead of anywhere in the string. For example, for the query "go" over a set of programming languages, I am probably looking for Go, Golo or Pro Go, and not for Logo or Hugo.

Which brings us to our next improvement, reinforcing the 80–20 rule. Just by modifying the first SQL query slightly, the suggestions will improve drastically by matching at word boundaries instead of anywhere in the string.

SELECT name FROM movies WHERE name LIKE 'term%' OR name LIKE '% term%' LIMIT 10; -- mind the space in '% term%'

Congratulate yourself! Just by adding OR name LIKE '% term%' you have made your search suggestions better than most of the products on the internet. This solution will work in reasonable time if your dataset is of the order of 100k rows.
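If you wire this into the PDO sketch from earlier, the only change is the prepared statement and its two bound patterns. The ORDER BY clause that floats "starts with" matches to the top is my own optional addition, not part of the original query.

// Word-boundary matching: the name starts with the term, or one of its words does.
$stmt = $pdo->prepare(
    "SELECT name, img FROM movies
     WHERE name LIKE ? OR name LIKE ?
     ORDER BY CASE WHEN name LIKE ? THEN 0 ELSE 1 END  -- optional: prefix matches first
     LIMIT 10"
);
$stmt->execute([$term . '%', '% ' . $term . '%', $term . '%']);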

Implementation for API — Lightning Fast Beast Mode

Depending on your business needs the solution above may suffice, but for products where speed and user experience are core features, there is a lot of scope for optimization. Before diving deep, let me add some context, as there are multiple ways to further optimize speed and ranking. It will justify our decisions (smartly slipping in the promo). MyMovieRack is primarily a movie & TV show platform with the core value of personalizing entertainment, so optimizing the search experience serves as a heartbeat for the product. A caching layer over the SQL query was working fine for us until a couple of months back. What forced us to move to an even faster implementation (~50ms) was:

  1. Growth in the number of movies & TV shows, and our recent hit: web series.
  2. More loyal users visiting the site regularly. There is a drastic behavior change when direct users are compared with SERP traffic: loyal users tend to use the search box for rating, reviewing etc. more frequently, while SERP traffic lands on the relevant page straight from Google.
  3. Last but not least, and the deal-breaker: the feature to create lists of movies & shows. To create a list, one has to add movies by searching and selecting from suggestions. For a list of 100 films, the user has to do this 100 times. For a customer-centric product, a delay of ~1.2 sec (benchmarked from multiple clients in India) for each suggestion call was not acceptable. Hence the need for a faster API.

We started with two major approaches to solve the problem. The basic idea in both approaches is to precompute all possible prefixes of every name and do a fast, constant-time lookup after each keystroke of the user. Each of these prefixes will point to the actual name, e.g. The Matrix. For example, all prefixes for The Matrix are listed below.

T
Th
The
The
The M
The Ma
The Mat
The Matr
The Matri
The Matrix
M
Ma
Mat
Matr
Matri
Matrix

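In code, generating these word-boundary prefixes could look something like the sketch below. The helper name getPrefixes is mine; the indexing pseudocode further down assumes a helper like it exists.

<?php
// Generate every prefix starting at each word boundary of a name,
// e.g. "The Matrix" -> "T", "Th", ..., "The Matrix", "M", "Ma", ..., "Matrix".
function getPrefixes(string $name): array
{
    $prefixes = [];
    $words = preg_split('/\s+/', trim($name));
    foreach (array_keys($words) as $i) {
        // Suffix starting at word $i: "The Matrix", then "Matrix".
        $suffix = implode(' ', array_slice($words, $i));
        for ($len = 1; $len <= mb_strlen($suffix); $len++) {
            $prefixes[] = mb_substr($suffix, 0, $len);
        }
    }
    return array_unique($prefixes);
}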
This idea can be implemented using the following approaches:

  1. Using a Trie: Insert all prefixes in a trie and link the leaf node to the object required by the autocomplete widget, something like {"name":"The Matrix","img":"img url"}. Implementation-wise, the leaf node will link to a priority queue of ids sorted by relevance score (remember optimal ranking), and these ids will point to the actual data.
    We chose not to use a trie because our backend is in PHP (stop judging), so every new request would have to initialize its own trie, which doesn't make sense. Too lazy to set up a Java (or any other) server just for this. Too lazy to write other reasons when there is a much easier way to implement this using Redis.
  2. Using an In-Memory Cache: I am a fan of Redis. Big fan. From this point onwards I will use the word Redis interchangeably with in-memory cache. In addition to being an in-memory key-value store, Redis supports a few basic data structures which make it the best tool for the problem.
    The implementation basically involves having a sorted set for each prefix of every movie name, with movie ids as members of those sets. Initially it feels unintuitive, redundant and like a keyspace explosion, so we cross-checked on the internet whether this approach is nuts. This blog by Redis creator Antirez talks about a similar service, and this blog covers two different approaches. A part of our solution overlaps with these approaches. I encourage you to go through those blogs for the detailed implementation.
    Here is our rough implementation: create a sorted set for every prefix of every name and add ids to these sets, and store the data corresponding to these ids for the final lookup. The pseudocode looks like this:

foreach ($movies as $m) {
    foreach (getPrefixes($m->name) as $p) {
        // one sorted set per prefix; movie id as member, relevance as score
        $redis->zAdd('autocomplete::' . $p, $m->score, $m->id);
    }
    // data blob for the final lookup by id
    $redis->set('data::' . $m->id, json_encode(['n' => $m->name, 'i' => $m->img]));
}
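For completeness, the read path that serves GET /service/suggestions is then one sorted-set range query plus an id-to-data fetch. Here is a minimal sketch using the phpredis extension, with the key names from the pseudocode above; input normalisation, scoring details and error handling are left out.

<?php
// Suggestion lookup: ZREVRANGE on the prefix's sorted set, then hydrate the ids.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$term = $_GET['q'] ?? '';

// Highest-scored (most relevant) movie ids for this prefix.
$ids = $redis->zRevRange('autocomplete::' . $term, 0, 9);

$results = [];
foreach ($ids as $id) {
    $blob = $redis->get('data::' . $id);   // e.g. {"n":"The Matrix","i":"img url"}
    if ($blob !== false) {
        $results[] = json_decode($blob, true);
    }
}

header('Content-Type: application/json');
echo json_encode($results);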

We spawned the cheapest Redis instance (AWS ElastiCache), which offers 500 MB of storage. Since MyMovieRack is free for all users, frugality is practiced religiously to save on server costs. The naive implementation on this box took more than 400 MB of space. Below is the series of optimizations done to reduce the memory footprint by 75%.

The quest to reduce space by 75%

We ran the test on a limited set of ~90,000 movie names. After the naive implementation of the above-mentioned pseudocode:

Total Prefix Keys: 2,312,783
Total Space Used: 400MB
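How you collect these numbers is up to you; one quick way, assuming the phpredis client from earlier, is to ask Redis itself (a sketch):

// Total number of keys (prefix sets plus data blobs) and memory usage.
$totalKeys  = $redis->dbSize();
$memoryInfo = $redis->info('memory');
echo $totalKeys . ' keys, ' . $memoryInfo['used_memory_human'] . ' used';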

Now, this was a setback, as the total number of films was much higher than that, not to mention the extensibility plans. What if we wanted to support celebrity name suggestions in the same search box, taking a holistic approach to discovery? (By the way, we now support celebrity search. :))

Time to put on the optimization hat and dig into the data.

The constraints of a problem are actually an opportunity to optimize.

Optimisation 1: Don't store prefixes of length 1

Suggestions based on a single input character don't make much sense. Even if you want to support them, you can hardcode the top suggestions for each of the 26 single-letter buckets instead of storing all the names, since every possible name falls into one of these buckets and wastes a lot of space. Results after this optimization:

Total Prefix: 2,225,827 (3% less)
Total Space: 350MB (12% less)
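If you still want to serve something on the very first keystroke, a hardcoded per-letter fallback in the endpoint is enough. This is a sketch; the lists in the map are purely illustrative.

// Hardcoded top suggestions per starting letter, so nothing of length 1
// ever needs to be stored in Redis.
function singleLetterFallback(string $term): array
{
    static $topByLetter = [
        'a' => ['Avatar', 'Avengers: Infinity War'],
        't' => ['The Matrix', 'The Lord of the Rings'],
        // ... one short, manually curated list per letter ...
    ];
    $letter = mb_strtolower(mb_substr($term, 0, 1));
    return $topByLetter[$letter] ?? [];
}

// In the suggestion endpoint, before hitting Redis:
if (mb_strlen($term) < 2) {
    echo json_encode(singleLetterFallback($term));
    exit;
}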

Optimisation 2: Don't store prefixes starting with a stop word (if that stop word is not the first word of the movie name)

This observation was based on the historical search data we had. People generally don't start their search with a stop word (a, an, the, of, etc.) unless it is the first word of the name. For example, the possible search attempts for The Pursuit of Happyness could be "the pursuit", "pursuit of", "happyness" but not "of happyn".

Similarly, for The Lord of the Rings, possible attempts could be "the lord", "lord of", "rings", but nobody starts their search with "of the rings". The list below gives an idea of the redundant prefixes being generated.

All prefixes for "The Lord of the Rings" (64)
T
Th
The
The
The L
The Lo
The Lor
The Lord
The Lord
The Lord o
The Lord of
The Lord of
The Lord of t
The Lord of th
The Lord of the
The Lord of the
The Lord of the R
The Lord of the Ri
The Lord of the Rin
The Lord of the Ring
The Lord of the Rings
L
Lo
Lor
Lord
Lord
Lord o
Lord of
Lord of
Lord of t
Lord of th
Lord of the
Lord of the
Lord of the R
Lord of the Ri
Lord of the Rin
Lord of the Ring
Lord of the Rings
o
of
of
of t
of th
of the
of the
of the R
of the Ri
of the Rin
of the Ring
of the Rings
t
th
the
the
the R
the Ri
the Rin
the Ring
the Rings
R
Ri
Rin
Ring
Rings

The prefixes starting with a non-leading stop word (the "of ..." and lowercase "the ..." entries above, originally marked in bold) will be removed due to this optimization: 19 out of the 64. The single-letter prefixes are removed because of the first optimization.

The best part of this optimization is that it makes the results more robust, because it ignores the stop words. Results after this optimization:

Total Prefix: 1,831,758 (17% less than last optimization)
Total Space: 280MB (20% less than the last optimization)

Optimisation 3: An offer you can't refuse. Limit the maximum prefix length to 8 characters (or n characters in general, based on business logic)

We hit the jackpot with this one. On investigating the distribution of prefixes (sorted sets), it turned out that the cardinality of ~95% of the prefixes was 1. That means a name can be uniquely identified after its first n characters, and storing the rest of the characters doesn't improve the solution. Also, by limiting the maximum length of a prefix, the average cardinality can be increased.

Instead of having 100 prefixes (sorted sets) with one element each, it is optimal to have 10 prefixes with 10 elements each.

For example, after restricting the maximum prefix length to 8 for "The Lord of the Rings", the keyspace is drastically reduced and the average cardinality becomes more than 1, as the same prefix is shared by many movie names. The final set of prefixes after all optimizations:

All prefixes for "The Lord of the Rings" after all optimisations (18)
Th
The
The
The L
The Lo
The Lor
The Lord
Lo
Lor
Lord
Lord
Lord o
Lord of
Lord of
Ri
Rin
Ring
Rings

Total Prefix: 357,465 (80% less than previous)
Total Space: 102MB (63% less than previous)

Hell yeah! A total space reduction of 75%.
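Putting the three pruning rules together, the getPrefixes() helper sketched earlier might be adjusted roughly as below. The stop-word list and the 8-character cap are the business-specific knobs; both are illustrative here.

<?php
// Prefix generation with all three optimisations applied.
function getOptimisedPrefixes(string $name, int $maxLen = 8): array
{
    $stopWords = ['a', 'an', 'the', 'of', 'in', 'on'];   // illustrative list
    $prefixes  = [];
    $words     = preg_split('/\s+/', trim($name));

    foreach (array_keys($words) as $i) {
        // Optimisation 2: skip suffixes starting with a stop word,
        // unless it is the very first word of the name.
        if ($i > 0 && in_array(mb_strtolower($words[$i]), $stopWords, true)) {
            continue;
        }
        $suffix = implode(' ', array_slice($words, $i));
        // Optimisation 3: never index beyond the first $maxLen characters.
        $limit = min($maxLen, mb_strlen($suffix));
        // Optimisation 1: start at length 2, not 1.
        for ($len = 2; $len <= $limit; $len++) {
            $prefixes[] = mb_substr($suffix, 0, $len);
        }
    }
    return array_unique($prefixes);
}

Running it on "The Lord of the Rings" with the defaults reproduces the 18 prefixes listed above.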

Moving further

This is just an overview of what we have implemented. Scoring of the results can easily be implemented on top of the sorted sets based on business logic. The blog mentioned earlier uses an alternative approach which limits prefixes to a single word and relies on Redis set union to get the final results. Feel free to explore that approach and post your findings. Also, this is my first blog and it took a lot of time to finish, so I might have skipped some implementation details or explanations. If something is not clear, don't hesitate to post in the comments.
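As one concrete example of such business logic, a simple rule is to bump a movie's score in each of its prefix sets whenever its suggestion is clicked, so popular picks float to the top over time. A sketch, reusing the hypothetical helpers from above:

// On click of the suggestion for movie $id with name $name,
// nudge its relevance score up in every prefix set it belongs to.
foreach (getOptimisedPrefixes($name) as $p) {
    $redis->zIncrBy('autocomplete::' . $p, 1, $id);
}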

You can see the autocomplete in action on the movie discovery page.

This is Piyush Arya, founder of MyMovieRack, working as a Technical Team Lead with the super smart people of Directi (Media.net). Always up for discussing crazy ideas. If you are a movie buff or a tech enthusiast, feel free to connect with me and join us in the mission of making entertainment personalized at MyMovieRack.
