Dev Notes: Fuzzy Search on PostgreSQL

Ben Orozco
The Backlog by Ecaresoft
3 min readMay 6, 2016

--

My previous experience with search in Rails was 5+ years ago, then I used Sphinx as a Full-Text external search engine, which seems to have gone out of fashion.

At Nimbo X we use pg_search rails gem for searching purposes, with Full-Text already enabled in PostgreSQL itself, which has worked for us so far.

Nevertheless, with the increase of usage among practicians, along with our recent introduction of Medispan, support tickets related to mismatched search queries have become a cause of pain for the team. We knew our users deserve better.

#MayThe4thBeWithYou

First we needed to understand the current landscape on psql search. Luckily PgSearch documentation pointed us in the right direction: along with full-text search, other algorithms are supported through PostgreSQL contrib packages: fuzzystrmatch (Levenshtein, Double Metaphone) & pg_trgm (Trigrams).

Algorithms

Levenshtein (a.k.a. match difference)

Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.

https://en.wikipedia.org/wiki/Levenshtein_distance

Soundex (a.k.a. match soundalikes)

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

https://en.wikipedia.org/wiki/Soundex

Double Metaphone (a.k.a. better soundex)

Metaphone is a phonetic algorithm… It fundamentally improves on the Soundex algorithm… which does a better job of matching words and names which sound similar.

https://en.wikipedia.org/wiki/Metaphone

Trigrams (match misspellings)

Trigrams are a special case of the n-gram, where n is 3. They are often used in natural language processing for doing statistical analysis of texts.

https://en.wikipedia.org/wiki/Trigram

Implementation

fuzzystrmatch

PgSearch makes fuzzystrmatch straightforward to implement:

$ rails g pg_search:migration:dmetaphone
$ rake db:migrate
class Word < ActiveRecord::Base
include PgSearch
pg_search_scope :that_sounds_like,
:against => :spelling,
:using => :dmetaphone
end


four = Word.create! :spelling => 'four'
far = Word.create! :spelling => 'far'
fur = Word.create! :spelling => 'fur'
five = Word.create! :spelling => 'five'

Word.that_sounds_like("fir") # => [four, far, fur]

pg_trgm

On the other side, pg_trgm queries are more expensive computationally. Creating a separate GIST (Generalized Index Search Tree) can be a palliative by increasing performance, although creating the index itself is expensive as well, and might involve some downtime. PgSearch makes pg_trgm a breeze to implement too:

class Website < ActiveRecord::Base
include PgSearch
pg_search_scope :kinda_spelled_like,
:against => :name,
:using => :trigram
end

yahooo = Website.create! :name => "Yahooo!"
yohoo = Website.create! :name => "Yohoo!"
gogle = Website.create! :name => "Gogle"
facebook = Website.create! :name => "Facebook"

Website.kinda_spelled_like("Yahoo!") # => [yahooo, yohoo]

Rails Migrations

It is nice to have rails migrations that activate the contrib packages in Postgres:

Resources

https://gist.github.com/benoror/e78f6bdff1b02de7ceed28c9b08fe435

PgSearch

Fuzzy Search

fuzzystrmatch

pgtrgm

--

--