The “TLDR”:
- A basic full text search implementation in Postgres looks like this:
tsvector_column @@ plainto_tsquery(user_input)
- The SQL to generate the tsvector column from a table field ( and continuously update) is
ALTER TABLE mytable
ADD COLUMN searchable tsvector GENERATED ALWAYS AS (
to_tsvector('english', coalesce(column, '')), 'A') ||
) STORED;
- You should add a GIN index to this column as well
- Alternatively, you can use the pg_trgm extension for “fuzzy” search
Why full text search?
The simplest way of implementing search in a relational database is using a LIKE%
query. However, using this query simply returns matching all result containing the targeted phrase without any ranking or indication of relevance. In contrast, depending on how it’s implemented, Full Text Search allows for
- Match multiword phrases even if the words aren’t in the same order or contain filler words. For example, searching for “how to train your dragon” using full text search might return “5 ways to train a dragon”, even though though the phrase “how to train your” isn’t in the result.
- Normalizing words so that conjugation and suffixes don’t limit the results. For example “training your dragon” could still return the result “train your dragon”
- Removing “stop words”, which are common words like “the”, “of”,
“for” and “to” so they aren’t part of the requirements to create a match. - Allowing users the option to search for either exact or inexact phrases by using quotes, or using OR statements.
Implementation
To demo this functionality using Postgres, I built a search function for good read quotes and Star Trek quotes with Postgres database using Ruby On Rails
The search function has two “stages” — parsing the user input into as tsvector
containing the words to search against, and comparing that tsvector
against the an index of other tsvectors
containing the normalized quotes from the database. The to_tsvector
funciton is responsible for removing stop words and trimming verb endings such as “ing” (depending on the dictionary used).
Note: you can also use the pg_trgm extension to perform “similarity” or fuzzy search, which is better at catching mispellings of words. This isn’t covered by this article, but you can read about it here
To get started, let’s create the “dialogues” table, each “dialogue” consisting of a line, character, series and episode.
class CreateDialogues < ActiveRecord::Migration[7.0]
def change
create_table :dialogues do |t|
t.string :line
t.string :episode
t.string :series
t.string :character
t.timestamps
end
end
end
Next, let’s add the column we will search against.
def up
execute <<-SQL
ALTER TABLE dialogues
ADD COLUMN searchable tsvector GENERATED ALWAYS AS (
setweight(to_tsvector('english', coalesce(line, '')), 'A') ||
setweight(to_tsvector('english', coalesce(character,'')), 'B')
) STORED;
SQL
add_index :dialogues, :searchable, using: :gin
end
Lets break this down. I am creating a tsvector field through a subquery. GENERATED ALWAYS means that this field will automatically update every time the fields used in query change, in this case line
or character
. I am using the “setweight” function to indicate that I am matching primarily against the characters lines, but we also want to match against the characters names. That way, I can search for the name of the character and see all their lines, as well as searching dialogue, in a single query.
Take note that the SQL ||
operator represents concatenation and not “OR” as in some other programming languages.
Next, I add an GIN index to this newly created field. This allows quicker searching.
Finally, lets handle the query:
class Dialogue < ApplicationRecord
conn = ActiveRecord::Base.connection
scope :search, -> (query){ select("dialogues.*, ts_rank_cd(dialogues.searchable, websearch_to_tsquery('english',
#{conn.quote(query)})) as rank").where("dialogues.searchable @@ websearch_to_tsquery('english', ?)", query).order('rank DESC')}
end
Above, I use websearch_to_tsquery
to convert the users words into a query. Normally, ts_query requires a specific syntax. “ fat & cat” to search for documents containing both fat and cat, and ‘|’ for either fat and cat. websearch_to_tsquery
however is able to parse queries using english such as “fat OR cat”. Read more about parsing queries here:
For ranking the search results, I applied the ts_rank_cd
function, which is a ranking function which takes into account the postion of the words in the phrase. I chose this function in order to bias search results towards similar “phrases” over ranking solely by the number of matching words.
This is all I needed to implement the search functionality! With the code above implemented, I could call Dialogue.search("Picard")
to get all of that characters lines, or I could call Dialogue.search("Transporter Accident")
to get all of the lines containing the words “transporter” and “accident”, and so on.
Below are some screen shots of the search in action.