Dev Notes: Fuzzy Search on PostgreSQL

Ben Orozco
May 6, 2016 · 3 min read

My previous experience with search in Rails was 5+ years ago, then I used Sphinx as a Full-Text external search engine, which seems to have gone out of fashion.

At Nimbo X we use pg_search rails gem for searching purposes, with Full-Text already enabled in PostgreSQL itself, which has worked for us so far.

Nevertheless, with the increase of usage among practicians, along with our recent introduction of Medispan, support tickets related to mismatched search queries have become a cause of pain for the team. We knew our users deserve better.

#MayThe4thBeWithYou

First we needed to understand the current landscape on psql search. Luckily PgSearch documentation pointed us in the right direction: along with full-text search, other algorithms are supported through PostgreSQL contrib packages: fuzzystrmatch (Levenshtein, Double Metaphone) & pg_trgm (Trigrams).

Algorithms

Levenshtein (a.k.a. match difference)

Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.

https://en.wikipedia.org/wiki/Levenshtein_distance

Soundex (a.k.a. match soundalikes)

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

https://en.wikipedia.org/wiki/Soundex

Double Metaphone (a.k.a. better soundex)

Metaphone is a phonetic algorithm… It fundamentally improves on the Soundex algorithm… which does a better job of matching words and names which sound similar.

https://en.wikipedia.org/wiki/Metaphone

Trigrams (match misspellings)

Trigrams are a special case of the n-gram, where n is 3. They are often used in natural language processing for doing statistical analysis of texts.

https://en.wikipedia.org/wiki/Trigram

Implementation

fuzzystrmatch

PgSearch makes fuzzystrmatch straightforward to implement:

class Word < ActiveRecord::Base
include PgSearch
pg_search_scope :that_sounds_like,
:against => :spelling,
:using => :dmetaphone
end


four = Word.create! :spelling => 'four'
far = Word.create! :spelling => 'far'
fur = Word.create! :spelling => 'fur'
five = Word.create! :spelling => 'five'

Word.that_sounds_like("fir") # => [four, far, fur]

pg_trgm

On the other side, pg_trgm queries are more expensive computationally. Creating a separate GIST (Generalized Index Search Tree) can be a palliative by increasing performance, although creating the index itself is expensive as well, and might involve some downtime. PgSearch makes pg_trgm a breeze to implement too:

Rails Migrations

It is nice to have rails migrations that activate the contrib packages in Postgres:

Resources

https://gist.github.com/benoror/e78f6bdff1b02de7ceed28c9b08fe435

PgSearch

Fuzzy Search

fuzzystrmatch

pgtrgm

The Backlog by Ecaresoft

Sharing our journey: from software development to company…

Ben Orozco

Written by

Healthtech Hacker — Full Stack Dev — Open Source & Crypto Enthusiast — CTO 🌳 @HealthTreeNet — Previously @ecaresoft & @Nimbo_X

The Backlog by Ecaresoft

Sharing our journey: from software development to company culture and productivity hacks.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade