Fuzzy Database Search for Customer Details using Trigrams

Cristian Constantinescu
TrustYou Engineering
7 min read · Feb 22, 2018

Abstract

Business is a dynamic game that often drives us to introduce features that were not part of the roadmap in the early stages of our projects. This can happen in various situations where we choose to embrace constant change in order to make the most of new opportunities in the market.

Usually in product meetings stakeholders decide to start with an initial iteration that corresponds to a minimum viable product (MVP) covering the most basic use cases. As soon as the first version is stable in production, work starts on new functionality that makes the user experience nicer. This seems to be how software evolves nowadays in the increasingly accelerating world we live in. Every iteration is carefully monitored by gathering product usage statistics, and new decisions are made based on the collected data.

In practice this frequently means that while the initial technical implementation generally works well for the MVP, at some point requirements inevitably get introduced that demand more elegant solutions to a problem that is critical for the software system. Situations like this usually require constructs that are more sophisticated than the naive implementations the tools available at that point allow.

Objective

This article discusses a situation where we start with a very basic MVP and follows how the design evolves to support new requirements.

In the beginning our MVP only features user profiles holding two contact methods: phone numbers and email addresses. With time, a new requirement is thrown in: develop a global fuzzy search that can efficiently look up a query term in the database, considering matches in customer names and contact details.

In order to accomplish this objective, development follows several steps: activating new database extensions, building a precomputed customer search table, writing triggers to keep the data in sync, and using specialized indexes for full text search and text similarity computation. The implementation is wrapped in a view that always queries the most up-to-date data.

Follow along by executing the code snippets in PostgreSQL (9.5+) or just skim through them to get the high-level idea. Let's dive into this adventure!

Basic Schema

A configuration of four tables is proposed as the starting point for the exercise: business, customer, phone and email. The naming is suggestive: every customer belongs to a business and has a "has many" relationship with the email and phone tables, which store the contact details.

Fig. 1: Diagram with basic schema

Now let's look at what to execute to create this in our PostgreSQL database:

Fig. 2: Build the database schema
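The original listing is an image, so here is a minimal sketch of what it might contain. The relationships follow the description above; the individual column names (and the `deleted` flag on customer, which the final query relies on) are assumptions.

```sql
-- Sketch of the four-table schema; column names are assumptions.
CREATE TABLE business (
    id   serial PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE customer (
    id          serial PRIMARY KEY,
    business_id integer NOT NULL REFERENCES business (id),
    name        text NOT NULL,
    deleted     boolean NOT NULL DEFAULT false  -- soft-delete flag
);

CREATE TABLE email (
    id          serial PRIMARY KEY,
    customer_id integer NOT NULL REFERENCES customer (id),
    address     text NOT NULL
);

CREATE TABLE phone (
    id          serial PRIMARY KEY,
    customer_id integer NOT NULL REFERENCES customer (id),
    number      text NOT NULL
);
```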

Looks like we are already starting to build something! At this point we have a database schema that is ready to handle CRUD operations and perhaps lookups across a combination of tables with very basic optimizations. For the sake of creating a challenge, let's assume the inevitable occurs and a new requirement is thrown in: support a global fuzzy search by a query term across customer names and contact details.

As with any other problem, there are several possible approaches, some simple, others more complicated. For starters there is the naive implementation: for example, a join across these tables with a filtering condition involving an operator such as LIKE. There are several problems with this approach, since both the join and the lookup can easily get costly as soon as the number of records increases significantly. Not to mention that we still need to support the fuzzy part of the requirement.

Instead of approaching the challenge from this simple perspective, which doesn't meet the requirements, we can try something else. For instance, what about creating a table with precomputed data so as to eliminate the need for joins in the search? Good! But still not good enough, since we also need to compute the similarity between the records and a specific search term in order to make the search truly "fuzzy".

While there are certainly many ways to implement this requirement, for the sake of learning in the current article we will continue with a solution that uses trigrams.

Trigrams to the rescue!

In essence a trigram (or trigraph) is a group of three consecutive characters taken from a string. Trigrams can be used to efficiently measure the similarity of strings in natural languages by counting the number of trigrams they share. You can read more on the dedicated page of the PostgreSQL manual. Through the pg_trgm extension, PostgreSQL offers a couple of handy functions and operators for determining the similarity of alphanumeric text.
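To get a feel for this, here is a quick demo of the two building blocks (it assumes the pg_trgm extension, which we install in the next section, is already available):

```sql
-- show_trgm lists the trigrams extracted from a string (the string is
-- padded with spaces, so short words still produce trigrams).
SELECT show_trgm('john');

-- similarity returns a value between 0 and 1 based on how many
-- trigrams the two strings share; higher means more alike.
SELECT similarity('john', 'jon');
SELECT similarity('john', 'mary');
```

A misspelling like 'jon' scores much closer to 'john' than an unrelated name, which is exactly the behavior a fuzzy search needs.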

The other utility we will use is unaccent, a text search dictionary that removes accents from lexemes. In the following code snippet unaccent is used as a filtering dictionary that passes its output to the english_stem dictionary. The latter is a Snowball dictionary, placed deliberately at the end of the dictionary list because a Snowball dictionary recognizes everything.

Fig. 3: Required PostgreSQL extensions
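A sketch of what the listing might contain, following the standard pattern from the PostgreSQL unaccent documentation (the exact token type list is an assumption):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Build an "en" text search configuration based on the built-in
-- english one, with unaccent filtering ahead of the Snowball stemmer.
CREATE TEXT SEARCH CONFIGURATION en (COPY = english);

ALTER TEXT SEARCH CONFIGURATION en
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH unaccent, english_stem;

-- Make "en" the default configuration (session level here; use
-- ALTER DATABASE ... SET to persist it).
SET default_text_search_config = 'en';
```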

In the code listing above the mappings for the specified token types (asciiword, asciihword, etc.) are altered to use the unaccent and english_stem dictionaries, and the resulting "en" text search configuration is set as the default.

With the dependencies in place we are now ready to create the new customer_search table. The main objective here is to eliminate the need for joins in the query by creating one common place to store all the data the fuzzy search needs. In addition we throw some indexes into the mix to make things faster.

Fig. 4: The customer_search table with Indexes
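A sketch of the table and its indexes; the exact columns are assumptions inferred from the description in this article (a name, a contact term, a precomputed document, plus deleted and stale flags):

```sql
CREATE TABLE customer_search (
    customer_id integer NOT NULL REFERENCES customer (id),
    name        text,
    term        text,      -- one email address or phone number
    document    tsvector,  -- full text document built from name and term
    deleted     boolean NOT NULL DEFAULT false,
    stale       boolean NOT NULL DEFAULT true
);

-- GIN index for full text matching on the precomputed document.
CREATE INDEX customer_search_document_idx
    ON customer_search USING gin (document);

-- Trigram GIN indexes for fuzzy matching; gin_trgm_ops comes
-- from the pg_trgm extension.
CREATE INDEX customer_search_term_idx
    ON customer_search USING gin (term gin_trgm_ops);
CREATE INDEX customer_search_name_idx
    ON customer_search USING gin (name gin_trgm_ops);
```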

Three GIN indexes are created: one on document (computed from name and term) and one each on the term (email or phone) and name fields (gin_trgm_ops is required by the pg_trgm extension). We now have a table for precomputed data, with indexes, so it's time to consider how it gets populated. Assuming there is no initial data in the tables whatsoever, let's look at how this table can be kept up to date as new records are created in the system.

Ideally our new precomputed table stays synchronized as new records get inserted into the customer, email or phone tables. While this synchronization can certainly be achieved in many ways, in this exercise we will use triggers, a built-in feature of PostgreSQL frequently used for such purposes. Let's take a look at what the triggers could look like for our existing tables:

Fig. 5: Customer table triggers
Fig. 6: Email and Phone table triggers
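One representative trigger, sketched for the customer table (the function and trigger names are assumptions; the email and phone variants would reference NEW.customer_id instead of NEW.id, which is why the full set becomes so verbose):

```sql
-- Mark the matching customer_search rows as stale whenever a
-- customer row is inserted or updated; the actual recomputation
-- is deferred to the refresh function shown later.
CREATE OR REPLACE FUNCTION mark_customer_search_stale()
RETURNS trigger AS $$
BEGIN
    UPDATE customer_search
       SET stale = true
     WHERE customer_id = NEW.id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customer_search_stale_trg
    AFTER INSERT OR UPDATE ON customer
    FOR EACH ROW
    EXECUTE PROCEDURE mark_customer_search_stale();
```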

Now that's a lot of triggers! It turned out to be a lot more verbose than initially expected. Anyway, with the triggers in place we need one last piece to see the mechanism fit together: a view with a refresh function.

Fig. 7: Refresh customer_search table
Fig. 8: The quick_customer_search view
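A sketch of the two pieces; the names and the exact recomputation logic are assumptions based on the description below (the refresh deletes and re-derives a customer's rows, and the view refreshes anything stale before handing data back):

```sql
-- Recompute the customer_search rows for one customer.
CREATE OR REPLACE FUNCTION refresh_customer_search(cid integer)
RETURNS void AS $$
    DELETE FROM customer_search WHERE customer_id = cid;

    INSERT INTO customer_search (customer_id, name, term,
                                 document, deleted, stale)
    SELECT c.id, c.name, t.term,
           to_tsvector('en', c.name || ' ' || t.term),
           c.deleted, false
      FROM customer c
      JOIN (SELECT customer_id, address AS term FROM email
            UNION ALL
            SELECT customer_id, number AS term FROM phone) t
        ON t.customer_id = c.id
     WHERE c.id = cid;
$$ LANGUAGE sql;

-- Set-returning function: refresh every stale customer, then
-- return the now up-to-date table.
CREATE OR REPLACE FUNCTION quick_customer_search_rows()
RETURNS SETOF customer_search AS $$
BEGIN
    PERFORM refresh_customer_search(s.cid)
       FROM (SELECT DISTINCT customer_id AS cid
               FROM customer_search
              WHERE stale) s;
    RETURN QUERY SELECT * FROM customer_search;
END;
$$ LANGUAGE plpgsql;

CREATE VIEW quick_customer_search AS
    SELECT * FROM quick_customer_search_rows();
```

Note that selecting from this view performs writes as a side effect, so it will not work inside a read-only transaction; that trade-off is what makes the view "quick" to query without a separate refresh step.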

Following the logic of the triggers configured for our tables, any record marked as stale in the customer_search table has in fact recently been modified in one of the other tables, a modification that demands a data "refresh". In this context a refresh means taking the relevant records from the customer, email and phone tables and precomputing new values for the corresponding record in the customer_search table.

The purpose of the quick_customer_search view is to wrap the querying logic while also making sure that stale records are refreshed.

That’s it! With this neat little setup we are able to finally perform our fuzzy search query:

Fig. 9: The customer search query
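A sketch of the final query for the search input 'john', using the customer_search column names assumed earlier in the article (the % operator is pg_trgm's similarity test, true when the two strings exceed the configured similarity threshold):

```sql
SELECT *
  FROM (SELECT DISTINCT ON (cs.customer_id)  -- keep the best row per customer
               cs.customer_id,
               cs.name,
               cs.term,
               ts_rank_cd(cs.document,
                          plainto_tsquery('en', 'john')) AS rank_score,
               GREATEST(similarity(cs.term, 'john'),
                        similarity(cs.name, 'john'))     AS sim_score
          FROM quick_customer_search cs
         WHERE NOT cs.deleted
           AND (cs.document @@ plainto_tsquery('en', 'john')
                OR cs.term % 'john'
                OR cs.name % 'john')
         ORDER BY cs.customer_id, sim_score DESC) best
 ORDER BY rank_score DESC, sim_score DESC
 LIMIT 10;
```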

The path was long but we are finally there! Taking a closer look at the customer fuzzy search query, we can see that the search is fundamentally a lookup on the quick_customer_search view. In the example we take the character sequence 'john' as search input and use the previously created indexes to perform the following operations: a full text match on document, or similarity-based filtering on term or name that picks only those records with a similarity greater than the configured minimum threshold. Records that are not marked as deleted and that satisfy these criteria are included in the working set of the query.

The results are aggregated by customer_id, in the sense that only the contact detail that is "most similar" to the search input is considered for any given customer. The results are sorted in descending order by rank_score (cover density ranking) and sim_score (similarity score), and only the top ten scoring entries are returned. Note the use of the ts_rank_cd and similarity functions to compute the scores used in sorting.

Conclusions

That's it! In this article we started by creating a basic set of tables, then introduced a new challenge, implementing a global fuzzy search, which required additional database elements: extensions, a table holding precomputed values, trigram indexes and a custom view wrapping the entire logic.

Once the pieces of the infrastructure fell into place, we studied the execution of the global fuzzy search by term across customer details and saw how the individual elements form a mechanism that can efficiently deliver the functionality from the new requirements.

For readers who had the patience to read this far, here is a small gift: a visualization of how the described feature works in the TrustYou Messaging product:

Fig. 10: Customer search in TY Messaging

Special thanks to the colleagues from TrustYou (and from the former Checkmate) who worked on the similarly named TrustYou Messaging product feature that greatly inspired this article. Well done, guys!

Hope you enjoyed reading this article, and until next time I wish you only the best and happy hacking!
