This summer, I interned with a new team at Jet.com that is building a multi-tenant knowledge base system for customer FAQs, service agent FAQs, and chat-bot data. My focus has been building out our search functionality. Successfully building a greenfield project involves getting many elements right from the start, and my hope is to share some main takeaways from the summer that will help you build new projects with top-notch search experiences.
Search is often one of the most important elements of a content-centered service because it’s the main way people find the content they are looking for precisely and quickly. Search is prominent in our knowledge base system too. When customer service agents open our application or when customers use the help center on one of Walmart’s e-Commerce sites, they’ll usually have a particular topic they need to learn more details about, and we plan for search to be their go-to approach.
Here is a brief overview of the main search-specific terminology that will be used in this article and in documentation you will find. The specific use of each term varies, so I focus on the way they are used in Azure Search.
First, the index defines what data is stored, and how, for each entity (or document) to be searched. Each piece of data, or field, within a document can have certain properties, including being searchable, sortable, and filterable. Each field’s properties are defined in the index.
Each field is subsequently split, or tokenized, into words or terms (also called tokens). Each field has one analyzer that specifies both tokenization (e.g. finding words separated by white space and transforming them into their root form in English) and filtering (e.g. stripping out HTML) at the granularity of either tokens or characters. The analyzer is also specified in the index. When a document is indexed, each filtered token for each field is stored in the index, or indexed.
A popular token filter is the EdgeNGram, as Lucene and Azure Search call it. An edge n-gram is one flavor of an n-gram, which is itself “a contiguous sequence of n items from a given sample of text or speech”. In the case of edge n-grams, the items are characters, and an edge n-gram filter creates contiguous combinations of characters starting from one end of the word or the other (not the middle). An edge n-gram filter allows for matching prefixes of a word to occurrences of the full word in searchable text (e.g. a search for “ret” matches an article with “return”), which is called prefix matching. An edge n-gram filter starting from the beginning of each word with a min n-gram size of 1 and a max n-gram size of 20 would transform the term “return” into “r”, “re”, “ret”, “retu”, “retur”, and “return”, all of which would be indexed.
Finally, the indexer keeps the data stored in the index up to date. In our system design, the indexer is its own microservice that listens to the database for changes.
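To make the edge n-gram example above concrete, here is a minimal Python sketch of what a front-edge n-gram filter does to a single term. The function name and signature are my own for illustration; this is not the Lucene or Azure Search implementation, only the same idea in a few lines.

```python
def edge_ngrams(term: str, min_gram: int = 1, max_gram: int = 20) -> list[str]:
    """Return the prefixes of `term` with lengths from min_gram to max_gram.

    Illustrative helper only: it mimics a front-edge n-gram token filter
    applied to one already-tokenized term.
    """
    # A term shorter than max_gram simply yields fewer n-grams.
    longest = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams("return"))
# ['r', 're', 'ret', 'retu', 'retur', 'return']
```

Every one of those prefixes is stored as a term for the document, which is why a query for “ret” can find an article containing “return”.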
Choosing your Search Technology
Understanding your Priorities
Now, what are the main things to consider when you’re figuring out how to build search into your application or service? The first is a resource prioritization decision: do you want to build your solution from the ground up using lower-level search technology like Lucene, or do you want to use a higher-level system like Azure Search that provides a more abstracted API with extra features built in and some design decisions made for you? Of course, this depends on the resources and time available, as well as the system’s requirements for integration and customization. For my team, we needed to get this system up to scale across multiple companies (aka tenants) and multiple user groups with only a handful of engineers, so it was necessary for us to pursue a solution that could nearly be used “out of the box”. We decided to go with Azure Search, Microsoft Azure’s search service, but I planned for future changes in case priorities shift down the road. Azure Search is one of many third-party options, but the same considerations go into deciding on any search technology out there. We will step through the important ones.
Trust and integration
The first question to ask about a search technology is whether you trust the third party that created it, or, if the solution is open-source, whether it looks like a good starting point to build on. For us at Jet, we are already a Microsoft Azure shop, so going with Azure Search was almost second nature. There is also the consideration of how the search technology will plug in with your existing systems. Storing search content in its own index brings both performance and scalability benefits, but this requires integration with the primary data store, or the “source of truth”. Fortunately, this decision also became much easier at Jet this summer as we were integrating with other Microsoft services such as Blob storage and Cosmos DB. For instance, Cosmos DB provides a change feed to easily subscribe to changes that occur in the database. Microsoft also provides its own out-of-the-box indexer that reacts to changes in Cosmos DB, but we decided that we wanted more control over this part of our pipeline. In particular, we wanted to make custom formats and interpretations of our data available for search, so building a custom indexer microservice was more appropriate for us. We have a strong tradition of building microservices at Jet, so this was a relatively easy thing to fit into our ecosystem as a custom solution.
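As a rough illustration of what a custom indexer does with each change it receives, the Python sketch below maps changed database documents to an indexing batch. The document fields (`id`, `title`, `sections`, `isDeleted`) are hypothetical and not our real schema. The `@search.action` values and the `{"value": [...]}` envelope follow the request body of the Azure Search REST API for adding, updating, or deleting documents. The soft-delete flag is an assumption on my part, since the Cosmos DB change feed reports inserts and updates rather than deletions.

```python
from typing import Any


def to_index_action(db_doc: dict[str, Any]) -> dict[str, Any]:
    """Map one changed database document to a search indexing action."""
    if db_doc.get("isDeleted"):  # hypothetical soft-delete flag
        return {"@search.action": "delete", "id": db_doc["id"]}
    return {
        "@search.action": "mergeOrUpload",
        "id": db_doc["id"],
        "title": db_doc["title"],
        # Custom interpretation of the data: flatten all sections into
        # one searchable text field (hypothetical field names).
        "content": " ".join(s["text"] for s in db_doc.get("sections", [])),
    }


def build_index_batch(changed_docs: list[dict[str, Any]]) -> dict[str, Any]:
    """Wrap the actions for a set of changed documents in one batch body."""
    return {"value": [to_index_action(doc) for doc in changed_docs]}
```

The point of owning this step is the middle of `to_index_action`: it is where a custom format or interpretation of the data can be produced before anything reaches the index.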
This takes me to a larger point about customization. By its nature, this consideration is very specific to what you’re building and for what purpose. There are a lot of knobs and switches you will want to adjust when creating a search experience, so make sure there is a path for accomplishing what you need and that you have the amount of control you need over performance specifics. This summer, the most important functionality for us ended up being the integration between language-based tokenization (potentially for multiple languages) and prefix matching. The way these two significant search functionalities work together will likely be critical to understand before implementing a system with them both. This turns out to work well in Azure Search (more on that later), although complexity is added when handling multiple languages since a given field can currently only have one language tokenizer. It’s also handy to have a plan in place for making changes to your search as things shift and grow. For instance, most index updates in Azure Search require deleting the index, recreating it with the desired features, and then repopulating it with the content. This situation can be covered pretty neatly in production by creating a new index, populating it, and then swapping over to the new one with no downtime, but the sooner you have a plan in place the better.
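The create, populate, and swap sequence described above can be sketched with plain in-memory structures. Everything here is hypothetical scaffolding (the `SearchRouter` class, the `indexes` dictionary, the index names); a real version would call the search service at each numbered step. What matters is the ordering: queries keep hitting the old index until the new one is fully populated.

```python
class SearchRouter:
    """Holds the name of the index that queries are currently sent to."""

    def __init__(self, active_index: str) -> None:
        self.active_index = active_index


def rebuild_and_swap(router, indexes, new_name, new_definition, documents):
    """Build a replacement index, then switch queries over to it."""
    if new_name in indexes:
        raise ValueError(f"index {new_name!r} already exists")
    old_name = router.active_index
    # 1. Create the new index with the desired definition.
    indexes[new_name] = {"definition": new_definition, "docs": []}
    # 2. Populate it while the old index keeps serving queries.
    indexes[new_name]["docs"].extend(documents)
    # 3. Swap: from here on, queries go to the new index.
    router.active_index = new_name
    # 4. Only now remove the old index.
    del indexes[old_name]
    return old_name
```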
If performance is a concern, it will also be useful to know the underlying technology for your service and what you can expect based on its strengths and weaknesses. What are the low-level implications of high-level settings like prefix matching or language-based tokenization, which could take your search experience from working smoothly to moving like molasses? Having documentation and a community surrounding the search technology is particularly helpful to understanding these subtleties, especially standout features or temporary work-arounds. For example, Azure Search is built off of Lucene, and Lucene itself is a well-documented, open-source text search engine library that has quite a following. Azure Search also has good documentation of its own for anything from scaling up computing resources to knowing which replica/partition configuration provides the needed queries per second.
At this point, it’s critical not to jump in and start implementing. This is a unique opportunity to design your system from scratch and choose exactly what you want to commit to and expose! You now have full control over your feature set. If you directly plug your third-party search service into other services (such as a front-end application) and then want to change underlying search systems down the line, you have a lot to change. This becomes especially complicated if you are not the person writing the front-end, or if you are integrating with several front-ends with the same back-end search. For this reason, I didn’t have the Search users call Azure directly. An API will go a long way here to helping your work now become something that can handle current needs and evolve to fit future needs.
Below is a simplified version of my search API endpoint.
An example of what a user sends the API in the body of a POST request:
And here is what I then send to Azure Search behind the scenes:
"filter":"createdBy eq 'John.Smith'",
...and other params for number of docs returned, highlighting...
The API endpoint specifies the functionality that we provide, while the implementation details of proxying the request to Azure Search are hidden. By providing a layer of indirection between the search service and the front-end(s), I can change the underlying service to Elasticsearch or any other technology by only rewriting the translation from our API’s format to the search service’s format. The front-end need not change, and that makes sense because nothing on its side is changing. This abstraction makes our system more reliable and backward compatible, and it keeps interface changes to a minimum as we add new features.
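The translation step described above might look roughly like the following Python sketch. The shape of the incoming request (`query`, `createdBy`, `limit`) is hypothetical, since it only stands in for our API’s format. The outgoing keys `search`, `filter`, and `top` are parameters of the Azure Search REST API’s search request, and the filter string follows the `createdBy eq '...'` form shown earlier.

```python
from typing import Any


def to_azure_search_body(request: dict[str, Any]) -> dict[str, Any]:
    """Translate a request in our own API format into a search service body.

    Swapping search technologies later means rewriting only this function.
    """
    body: dict[str, Any] = {
        "search": request["query"],
        "top": request.get("limit", 10),  # default page size is an assumption
    }
    created_by = request.get("createdBy")
    if created_by is not None:
        # OData string literals escape a single quote by doubling it.
        escaped = created_by.replace("'", "''")
        body["filter"] = f"createdBy eq '{escaped}'"
    return body


print(to_azure_search_body({"query": "return policy", "createdBy": "John.Smith"}))
# {'search': 'return policy', 'top': 10, 'filter': "createdBy eq 'John.Smith'"}
```

Because callers never build the filter string themselves, the front-end stays unaware of OData syntax or any other detail specific to the search service.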
Tune the Parameters
Now that you have picked the technology, made sure it fits your needs, and figured out how you will expose its functionality, it’s time to plug it into your system and try it out. You’ll need to tune many parameters to get the desired search experience. I relied on quick iterations: making changes and then making queries that included the general case and edge cases. As I entered more and more queries and read through their results, I discovered enhancements I would see in the likes of Google Search, and hunted down the implementation with Azure Search. One of the last ones I added was support for partial matches (specifically for the prefix of each term), which came in two parts. The first was choosing an “EdgeNGram” token filter to create a term for each combination of characters starting from the start of the word, so if a user types “ret”, he or she will find “return” since “ret” is a term associated with any complete words of which it is a prefix. The second part was getting the exact right behavior for the prefix matching. When we first added the EdgeNGram, our analyzer would find an article containing “return” if you searched for “retail”. Through all these changes, I also wanted to keep the English analysis that reduced words to their root form before matching terms (so “buy” matches “bought”, for instance).
Where to use EdgeNGram (Y = correct behavior, N = incorrect behavior)
I ended up discovering that we needed to separate the analyzer into two parts: one for indexing and one for searching the content. If you choose a generic “analyzer”, the same analyzer is used for both by default. When we included the n-gram filter on both indexing and searching, the search terms were broken into n-grams and matched against the searchable content. “Retail” was then broken into prefixes (including “ret”), which would then match “return” in an article (since “return” also has the prefix “ret”). This was the wrong behavior: a user generally expects only the entire input to be matched against the content, not its prefixes as well. To fix this, I removed the n-gram filter from the search analyzer. Now when a user searches for “return”, the term “ret” is no longer generated, so the query cannot match “retail” and only articles containing “return” show up. We kept the n-gram filter in the index analyzer, however, so if you search for “ret”, the prefixes still exist in the index to match both “return” and “retail”. Azure Search provides the n-gram filter as an orthogonal feature to its English analysis (separating token filters from tokenizers), so I could use both features together without interference. The search experience was then what I had imagined.
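The difference between the two configurations can be reproduced with a small, self-contained Python simulation. It is deliberately simplified: tokenization is a lowercase split on white space, English stemming is left out, and a query matches if any of its terms appears among the indexed terms. All function names are mine, not Azure Search’s.

```python
def edge_ngrams(term: str, min_gram: int = 1, max_gram: int = 20) -> list[str]:
    """Prefixes of `term` with lengths from min_gram to max_gram."""
    return [term[:n] for n in range(min_gram, min(len(term), max_gram) + 1)]


def analyze(text: str, use_edge_ngrams: bool) -> set[str]:
    """Tokenize on white space, optionally expanding tokens into prefixes."""
    terms: set[str] = set()
    for token in text.lower().split():
        if use_edge_ngrams:
            terms.update(edge_ngrams(token))
        else:
            terms.add(token)
    return terms


def matches(query: str, document: str, ngrams_at_search: bool) -> bool:
    """The index analyzer always uses n-grams; the search analyzer may not."""
    indexed_terms = analyze(document, use_edge_ngrams=True)
    query_terms = analyze(query, use_edge_ngrams=ngrams_at_search)
    return bool(indexed_terms & query_terms)


article = "return policy"
print(matches("retail", article, ngrams_at_search=True))   # True  (wrong behavior)
print(matches("retail", article, ngrams_at_search=False))  # False (fixed)
print(matches("ret", article, ngrams_at_search=False))     # True  (prefix match kept)
```

With n-grams on both sides, “retail” and “return” share the terms “r”, “re”, and “ret”, which is exactly the false match described above. Dropping the filter from the search side removes it while keeping prefix matching intact.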
These are what you might call “Bryan’s Best Practices” for building a search system: choose a service or a custom solution, abstract the interactions, and tune the parameters. In the end, you’ll have a unique search experience that is customized for your application’s functionality and your system design. Developing search over the summer and getting the product as a whole ready for release has been an exciting adventure, and I hope this article gives some insight into what goes into a search system and what to think about as you choose your approach. Best of luck with your own projects!