
Building Search into Your Application

Bryan Hanner
Sep 21, 2018 · 10 min read

Introduction

Search is often one of the most important elements of a content-centered service because it’s the main way people find the content they are looking for precisely and quickly. Search is prominent in our knowledge base system too. When customer service agents open our application or when customers use the help center on one of Walmart’s e-Commerce sites, they’ll usually have a particular topic they need to learn more details about, and we plan for search to be their go-to approach.

One of the search bars I powered this summer

Terminology

First, the index defines what data is stored, and how, for each entity or document to be searched. Each piece of data, or field, within a document can have certain properties, including being searchable, sortable, and filterable. Each field’s properties are defined in the index.

Each field is then split, or tokenized, into words or terms (also called tokens). Each field has one analyzer that specifies both tokenization (e.g. finding words separated by white space and transforming them into their root form in English) and filtering (e.g. stripping out HTML), at a granularity of either tokens or characters. The analyzer is also specified in the index. When a document is indexed, each filtered token for each field is stored in the index.

A popular token filter is the EdgeNGram, as Lucene and Azure Search call it. An edge n-gram is one flavor of an n-gram, which is itself “a contiguous sequence of n items from a given sample of text or speech”. In the case of edge n-grams, the items are characters, and an edge n-gram filter creates contiguous combinations of characters starting from one end of the word or the other (not the middle). An edge n-gram filter allows for matching prefixes of a word to occurrences of the full word in searchable text (e.g. a search for “ret” matches an article with “return”), which is called prefix matching. An edge n-gram filter starting from the beginning of each word with a min n-gram size of 1 and a max n-gram size of 20 would transform the term “return” into “r”, “re”, “ret”, “retu”, “retur”, and “return”, all of which would be indexed.

The indexer keeps the data stored in the index up to date. In our system design, the indexer is its own microservice that listens to the database for changes.
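To make the edge n-gram idea concrete, here is a minimal Python sketch of what a front-edge n-gram filter does to a single token. It is not the Lucene or Azure Search implementation; the function name `edge_ngrams` and its parameter names are made up for illustration.

```python
def edge_ngrams(term: str, min_gram: int = 1, max_gram: int = 20) -> list[str]:
    """Return the prefixes of `term` from min_gram to max_gram characters long.

    Illustrative helper only: it mimics a front-edge n-gram token filter
    applied to one token and is not taken from any search library.
    """
    # A prefix cannot be longer than the term itself.
    longest = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams("return"))
# ['r', 're', 'ret', 'retu', 'retur', 'return']
```

With the min size of 1 and max size of 20 described above, “return” produces six prefixes, and each one would be stored in the index.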

Choosing your Search Technology

Understanding your Priorities

Trust and integration

The knowledge base design serving companies A, B, and C

Customization

Performance

Design

Below is a simplified version of my search API endpoint.

An example of what a user sends the API in the body of a POST request:

{
  "search": "price match",
  "fields": ["content", "title"],
  "where": {
    "createdBy": "John.Smith"
  }
}

And here is what I then send to Azure Search behind the scenes:

{
  "search": "price match",
  "searchFields": "content, title",
  "filter": "createdBy eq 'John.Smith'",
  ...and other params for number of docs returned, highlighting...
}

The API endpoint specifies the functionality that we provide, while the implementation details of proxying the request to Azure Search are hidden. By providing a layer of indirection between the search service and the front-end(s), I can change the underlying service to Elasticsearch or any other technology by rewriting only the translation from our API’s format to the search service’s format. The front-ends need not change, which makes sense because nothing on their side is changing. This abstraction makes our system more reliable and backward compatible, and it keeps interface changes to a minimum as we add new features.
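The translation layer between the two request formats shown above could look roughly like the following Python sketch. It is an illustration under assumptions, not the code running in our service: the function names are hypothetical, joining several `where` entries with `and` is my assumption (the example has only one), and the parameters elided above (number of docs returned, highlighting) are left out.

```python
def to_filter_literal(value) -> str:
    """Render a value for use in a filter expression.

    Assumption: strings are single-quoted with embedded quotes doubled,
    following the OData convention; numbers are emitted unquoted.
    """
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return str(value)
    escaped = str(value).replace("'", "''")
    return f"'{escaped}'"


def translate_request(body: dict) -> dict:
    """Translate our API's request body into the search service's format."""
    payload = {"search": body["search"]}

    fields = body.get("fields") or []
    if fields:
        # Our API takes a list; the search service takes a comma-separated string.
        payload["searchFields"] = ", ".join(fields)

    where = body.get("where") or {}
    if where:
        # Assumption: every entry is an equality test, and all must hold.
        clauses = [f"{field} eq {to_filter_literal(value)}" for field, value in where.items()]
        payload["filter"] = " and ".join(clauses)

    return payload


request_body = {
    "search": "price match",
    "fields": ["content", "title"],
    "where": {"createdBy": "John.Smith"},
}
print(translate_request(request_body))
```

Swapping the search technology would then mean rewriting only `translate_request`; the request body that the front-ends send stays the same.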

Tune the Parameters

Where to use EdgeNGram (Y = correct behavior, N = incorrect behavior)


I discovered that we needed to separate the analyzer into two parts: one for indexing the content and one for searching it. If you choose a generic “analyzer”, it is used for both by default. When we included the n-gram filter on both indexing and searching, the search terms were also broken into n-grams and matched against the searchable content. A search for “retail” was broken into prefixes (including “ret”), which would then match “return” in an article (since “return” also has the prefix “ret”). This was the wrong behavior: a user generally expects only the entire input to be matched against the content, not its prefixes as well.

To fix this, I removed the n-gram filter from the search analyzer. Now when the user searches “return”, “ret” is not created to match “retail”, and only articles with “return” show up. We kept the n-gram filter in the index analyzer, so if you search for “ret”, the prefixes still exist in the index to match both “return” and “retail”. Azure Search provides the n-gram filter as an orthogonal feature to its English analysis (separating token filters from tokenizers), so I could use both features together without interference. The search experience was then what I had imagined.
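The difference between the two setups can be reproduced with a toy inverted index in Python. This is a simplified simulation, not Azure Search: the tokenizer is a plain lowercase whitespace split, the document ids and texts are invented, and the `ngram_at_search` flag is a hypothetical stand-in for using the same analyzer at search time.

```python
from collections import defaultdict


def edge_ngrams(term, min_gram=1, max_gram=20):
    """Prefixes of a term, shortest first (illustrative helper)."""
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]


def tokenize(text):
    """Toy tokenizer: lowercase and split on white space."""
    return text.lower().split()


def build_index(docs):
    """Index analyzer: tokenize, then store every edge n-gram of each token."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            for gram in edge_ngrams(token):
                index[gram].add(doc_id)
    return index


def search(index, query, ngram_at_search=False):
    """Search analyzer: tokenize, and expand to n-grams only if asked to."""
    hits = set()
    for token in tokenize(query):
        terms = edge_ngrams(token) if ngram_at_search else [token]
        for term in terms:
            hits |= index.get(term, set())
    return hits


docs = {
    "returns-article": "how to return an item",
    "retail-article": "retail store hours",
}
index = build_index(docs)

# Same analyzer on both sides: "retail" expands to "r", "re", "ret", ...
# and wrongly matches the article about returns as well.
print(search(index, "retail", ngram_at_search=True))

# Separate search analyzer without the n-gram filter: whole term only.
print(search(index, "retail"))  # only the retail article

# Prefix matching still works because the prefixes live in the index.
print(search(index, "ret"))  # both articles
```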

Conclusion

If you like the challenges of building distributed systems and are interested in solving complex problems, check out our job openings.
