Building Blocks for Search Linguistics — dogs, canines & other furry things
When you think about getting relevant results for your queries, the last thing on your mind is search architecture. But I'm going to explain why you should spend time on it to get the best out of matching terms or facet selections to the documents returned.
Are you thinking that we just need to search against an index?
Let’s look at the phrase “the dog walked into the room, it was furry”. Do we want to match on furry dogs, furry rooms, or just a dog walking into a room?
We need the right combination of query and index techniques to get the right results.
If we query “give me all the hairy hound canines”, how can we get this to match the phrase above?
One technique might be to introduce an ontology and/or a synonym list containing equivalences (either one-way or two-way). One entry in the list might map "furry dog" to "hairy hound".
In combination, we might want to introduce stemming and/or a lemmatization list (based on either an algorithm or a dictionary) for words such as "dog" and "canine".
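A minimal, dictionary-based sketch of the lemmatization idea might look like the following. The word list here is illustrative only; real engines typically use an algorithmic stemmer (such as Porter) or a full lemma dictionary.

```python
# Minimal dictionary-based lemmatiser sketch.
# The LEMMAS mapping is illustrative, not a real stemming dictionary.
LEMMAS = {
    "dogs": "dog",
    "canines": "canine",
    "hounds": "hound",
    "walked": "walk",
}

def lemmatise(token: str) -> str:
    """Lower-case the token and return its dictionary lemma,
    falling back to the token itself when it has no entry."""
    return LEMMAS.get(token.lower(), token.lower())

lemmatise("Dogs")    # -> "dog"
lemmatise("furry")   # -> "furry" (no entry, passed through)
```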
We may only have documents whose content contains "furry dog", but that's OK: we're still going to match.
So the query might be translated from "hairy hound canines" to "furry dog": we look up each term in the list, find its equivalents, and add them to the query.
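A query-time expansion along these lines could be sketched as follows. The synonym table, its one-way/two-way flags, and the simple substring substitution are all illustrative assumptions, not a real engine's API:

```python
# Sketch of query-time synonym expansion.
# Each entry is (phrase, equivalent, two_way): two-way entries expand
# in both directions, one-way entries only from phrase to equivalent.
SYNONYMS = [
    ("hairy hound", "furry dog", True),   # two-way equivalence
    ("canines", "dog", False),            # one-way: canines -> dog only
]

def expansion_map() -> dict:
    """Flatten the synonym entries into a phrase -> equivalents table."""
    table = {}
    for phrase, equivalent, two_way in SYNONYMS:
        table.setdefault(phrase, set()).add(equivalent)
        if two_way:
            table.setdefault(equivalent, set()).add(phrase)
    return table

def expand_query(query: str) -> set:
    """Return the original query plus one variant per synonym hit."""
    variants = {query}
    for phrase, equivalents in expansion_map().items():
        if phrase in query:
            for eq in equivalents:
                variants.add(query.replace(phrase, eq))
    return variants

expand_query("give me all the hairy hound canines")
# The result includes "give me all the furry dog canines"
```

All the variants are then ORed together into one search call, which is exactly where the query-side expense comes from.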
Sounds expensive, right? But relevance is worth it, isn't it? In some cases this trade-off is fine, but let's come back to that later.
The last thing we want to talk about is search architecture; however, it's crucial to address how we get our hairy canines to match. Plus, we want really quick response times.
Let's see what we can do on the indexing side so our query logic doesn't become a tangled mess of top-heavy logic…
Let's talk about index techniques. We index our documents into a search index by feeding content into fields, or by crawling a website, grabbing the raw document structures, and indexing them into fields. We might also index content from structured databases and/or other sources such as social media (Twitter feeds and the like).
These analysis pipelines can extract and normalise document sections and introduce linguistic matching and re-mapping of content into our engine.
We may choose to apply synonyms, lemmatisation, and/or stemming to match our furry dog to hairy hound (indexing both "furry dog" and "hairy hound" into separate fields). Don't forget we also want to normalise the words and match regardless of case. (Perhaps we even want to translate to another language, but that is another story.)
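An index-time analysis step along these lines might be sketched as below. The field names follow the article's `Title.raw`/`Title.synonym` convention; the synonym table and the simple substring replacement are illustrative assumptions:

```python
# Sketch of an index-time analysis step: normalise the title, then
# write both the raw form and a synonym-expanded form into separate
# fields of the document we are about to index.
SYNONYMS = {"furry dog": "hairy hound"}  # illustrative entry

def analyse(title: str) -> dict:
    """Produce the per-field values for one document's title."""
    raw = title.lower()  # case normalisation
    synonym = raw
    for phrase, equivalent in SYNONYMS.items():
        synonym = synonym.replace(phrase, equivalent)
    return {"Title.raw": raw, "Title.synonym": synonym}

analyse("Furry Dog")
# -> {"Title.raw": "furry dog", "Title.synonym": "hairy hound"}
```

The expansion cost is paid once, at indexing time, instead of on every query.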
So, let's call one field Title.raw and the other Title.synonym.
Title.raw contains furry dog, Title.synonym contains hairy hound.
At query time, if we match against both fields, we can return a document whose title is "furry dog" for a query such as "give me all the hairy hound canines".
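The multi-field match can be sketched as a simple boolean check; a real engine would score and weight each field rather than just test for a hit, so treat this as illustrative only:

```python
# Sketch of multi-field matching at query time: a document matches if
# any query term appears in either the raw or the synonym field.
# Real engines score per-field hits with weights; this only tests them.
def matches(doc: dict, query: str) -> bool:
    terms = query.lower().split()
    fields = (doc.get("Title.raw", ""), doc.get("Title.synonym", ""))
    return any(term in field.split() for term in terms for field in fields)

doc = {"Title.raw": "furry dog", "Title.synonym": "hairy hound"}
matches(doc, "give me all the hairy hound canines")
# -> True: "hairy" and "hound" hit Title.synonym
```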
So here is the dilemma: do we want lots and lots of fields, and their associated weightings (either manually curated or machine-learned), to match and sort through? Or do we just want to expand and match furry dogs to hairy hounds at query time? We might even conduct multiple search calls.
There is no right or wrong answer, however you should take the following factors into account:
- the size of your document corpus
- the number of queries your website or intranet generates
- what level of control you want over your expansions (such as real-time updates versus index updates)
If speed is crucial to your search experience, use the aforementioned pipelines and expand the index. If speed is less of an issue, then consider expanding on the query side.
If you have 8 million documents in your index and re-indexing your content takes a day, then updating "furry dog" to "hairy hound" in the synonym list using the index-side technique won't be feasible, and the query-side technique might be the best option.
What if you want real time updates?
Remember, most search engines have to replace a whole document inside the index/shard to update even a single field within that document.
If you have 100,000 documents and can re-index by swapping out indexes within minutes, then definitely go for the index-side scenario. If not, and you can afford the expense, then consider the query-side method.
Better still, consider a hybrid model: go query-side for regular updates and index-side for infrequent ones. Your lookup lists will be smaller, and that might suffice.
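The hybrid routing rule can be sketched as below. The update-frequency threshold and the entry structure are purely illustrative assumptions; in practice you would base the split on your own re-indexing cadence:

```python
# Sketch of a hybrid routing rule: frequently-updated synonym entries
# stay in a query-side list (applied at search time); stable entries
# get baked into the index at the next (re)build.
# The threshold of 1 update/month is an illustrative assumption.
def route(entries: list, max_updates_per_month: int = 1):
    """Split synonym entries into (query_side, index_side) lists."""
    query_side, index_side = [], []
    for entry in entries:
        if entry["updates_per_month"] > max_updates_per_month:
            query_side.append(entry)
        else:
            index_side.append(entry)
    return query_side, index_side
```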
To the end user it's all seamless, but we decide when and how we match those furry hounds, so consider your architecture before jumping in.
Happy flea matching…