Comprehend TSVECTOR and TSQUERY in Postgres for Full-Text Search

Published in

Geek Culture

10 min readOct 12, 2022

Introduction:

Full-Text Search searches the documents that match with a query and, optionally, sort the records according to their relevance to the query. A document, such as a magazine article, email, or database table serves as the search unit in a full-text search system. The most common type of search returns all documents that contain query terms in the order of their similarity to the query. The concepts of query and similarity are very flexible and vary depending on the application. The most fundamental search treats the query as a group of words and compares them based on how frequently they appear in the document. It is a technique used by web search engines such as Google, Firefox, and others.

The text search engine must be able to parse documents and link lexemes (keywords) to the parent document. These associations are then used to find documents that contain query words. You’ve probably heard of the ‘ilike’ and ‘like’ operators in SQL databases, which are used to search text. The ability of the ‘like’ and ‘ilike’ operators to perform full-text search analysis tasks is limited.

Problem with Like operator:

The like operator searches for exact words and cannot search for word variants. For example, if you search for the word ‘produce’ in a document where some records only contain the word ‘producing,’ the query will not return those records. When searching for words in any form, you must explicitly include the like clause for each variant, which can be time-consuming if you need to find a large number of words.
The Like operator does not return a ranking for documents that match the query term. For example, if a query word appears more than once in multiple documents, the document with the most occurrences of the query term is more important than the document with the fewest occurrences. The like operator does not provide a relative ranking among multiple documents.
The like operator works very slowly on large documents, leading to poor performance.

The same problems we have with the “like” operator persist when we use the “ilike” operator. The only difference is that ‘ilike’ is not case-sensitive.

Full-text search in PostgreSQL: tsvector and tsquery

PostgreSQL offers two data types, tsvector, and tsquery, to aid in full-text search because the “like” and “ilike” operators are constrained in their ability to carry out full-text search analysis tasks. TS stands for text search in tsvector and tsquery. The tsvector type is used to represent a text-searchable document, whereas the tsquery type is used to represent text queries in the text-query format. In Postgresql, both are used to perform full-text searches. A tsvector is a sorted collection of individual lexemes, which are normalized words that include multiple variants of the same word.

We’ll go over the to_tsvector() and to_tsquery() functions in depth, which return the tsvector and tsquery data types, respectively.

In Postgres, one table is created with the following syntax, and 4 records are added to it. This is the code:

create table document_text(file_id int, content text, description text);insert into document_text values(1, 'The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. As always predicting the future market is an uncertain game and all investors should follow their instincts and best practices.','Impact of bull market on investors');insert into document_text values(2, 'A bull market is one in which the market is rising and the economy is doing well, whereas a bear market is one in which the economy is collapsing and most equities are losing value. Although some investors are "bearish," the vast majority of investors are "bullish." Over lengthy time periods, the stock market has generally produced positive returns.','Market: positive and  negative sentiments');insert into document_text values(3, 'During a bear market, market sentiment is negative, and investors begin to shift their money away from stocks and into fixed-income instruments as they wait for the stock market to recover. To summarise, the drop in stock market values has shaken investor confidence. As a result, investors keep their money out of the market, causing a general price decrease as outflow grows.','Bull and bear market');insert into document_text values(4, ' the decrease in stock market values has shaken investor confidence, As they would have predicted that the market will fall more.  This causes investors to keep their money out of the market, causing a general price decrease as outflow grows. In bear market, value of stocks falls while in bull market the value of stocks falls. However the market has been producing the positive result over length periods.','Impact of price decrease in market');

The table looks like this:

1. How to use to_tsvector()?

to_tsvector() is Postgres’s internal function that returns a tsvector for the text or document passed as an argument, which can be used to perform efficient text searches. The tsvector is a sorted collection of key-value pairs, where the key represents the lexeme and the values represent the lexeme’s position. Lexemes are normalized representations of words. There are no stop words in tsvector, such as ‘has,’ ‘in,’ or ‘as,’ and lexemes are sorted alphabetically.

In the query below, for example, the confidence is converted into the ‘confid’ lexeme and found at position 10. The lexeme ‘market’ appears twice, at positions 5 and 15, therefore, followed by two numbers.

Query:

select to_tsvector('The decline in stock market values has shaken investor confidence, as they expected the market to fall further.');

Output:

2. How to use to_tsquery()?

to_tsquery() returns a tsquery for the searched pattern in the text (or document). Each word’s normalized lexeme representation differs slightly from the word itself; for example, confidence has been transformed into a ‘confid’ lexeme. Searching confidence without using the to_tsquery() function returns false because Postgres internally casts the word ‘confidence’ into the tsquery data type, which returns ‘confidence,’ which is not present in the ts vector.

Query 1:

select to_tsvector('The decline in stock market values has shaken investor confidence, as they expected the market to fall further.') @@ 'confidence';

Output 1:

Difference between tsquery typecast and to_tsquery() function:

As shown in the following query, typecasting to the tsquery data type does not produce accurate results in Postgres. When we use the cast operator(::) on the word confidence, we get the word ‘confidence’ back. the normalized representation of confidence, ‘confid,’ is returned by to_tsquery() in contrast. If PostgreSQL functions are available, you should always use them for casting instead of the cast operator. Poor query performance is also caused by the cast operator.

Query 2:

select 'confidence'::tsquery, to_tsquery('confidence');

Output 2:

If we use to_tsquery() around ‘confidence’ then it will return true.

Query 3:

select to_tsvector('The decline in stock market values has shaken    investor confidence, as they expected the market to fall further.') @@ to_tsquery('confidence');

Output 3:

to_tsquery() can be used to form queries for multiple words using boolean operators like and, or, and not. We will see them one by one on the above-created table document_text.

2.1. to_tsquery() with And(&)operator:

Multiple words separated by ‘&’ can be passed as arguments into the to_tsquery() function to match all of them within a document. The query below looks for records where the content field contains all of the words ‘confidence,’ ‘investors,’ and ‘prediction’ in any form and in any order. The record with file_id 4 is returned because it contains all three words: confidence, investors, and predicted given in the query.

Query 4:

select file_id, content from document_text where to_tsvector(content) @@ to_tsquery('confidence & prediction & investors');

Output 4:

2.2 to_tsquery() with Or(|)operator:

Multiple words separated by or operator(‘|’) can be passed as arguments into the to_tsquery() function to match any one of them within a document. The following search looks for those records which contain either “negative” or “economy,” or both, in any order in the content field. The records with file ids 2 and 3 meet the specified criteria therefore present in the output.

Query 5:

select file_id, content from document_text where to_tsvector(content) @@ to_tsquery('negative | economy');

Output 5:

2.3. to_tsquery() with Not(!)operator:

To find records that do not contain the specified word, use the NOT operator (‘!’). Not(!) is a unary operator, therefore appearing before the word. The following query looks for records where the content field does not consist word ‘bear’ in any order. Only the record with file id 4 does not consist of the word ‘bear’ therefore present in the output.

Query 6:

select file_id, content from document_text where to_tsvector(content) @@ to_tsquery('!bear');

Output 6:

2.4. to_tsquery() with Multiple boolean operators:

Most of the time, we’re looking for records that contain a specific combination of words, some of which must occur while others should not. This can be accomplished by combining multiple boolean operators in a single tsquery. For example, To find records that include the word ‘predict’ but not ‘bear,’ or records that include the word ‘instrument’, we can pass ‘(predict & !bear)| instrument’ inside to_tsquery() function. Records with file_id 3 and 4 are returned because the word “instrument” appears in the record with file_id 3 and the word “predict” but not “bear” appears in the record with file_id 4.

Query 7:

select file_id, content from document_text where to_tsvector(content) @@ to_tsquery('(!bear & predict)| instruments');

Output 7:

Optimization for fast text search:

Full-text searches are slower on large datasets. To improve the performance of full-text search, Postgres offers the option of indexing on the tsvector field, which leads to faster query execution. You can create an index on your table to speed up and optimize your text search for larger documents. GIN is the most commonly used index for full-text searches. The Postgres documentation suggests two approaches: one is to create the to_tsvector() field directly in an index, and the other is to first create another tsvector column in your table using to_tsvector() and then create a gin index on this field.

We will follow the second approach to create an index. We can also specify multiple columns inside to_tsvector() using concatenate operator. We will modify the table to include a tsvector field on the field content and description named tsvector_content_desc before creating the GIN index. We have used the coalesce() function on the content and description to replace them with an empty string if contains null values.

Query to add columns:

alter table document_text 
       add column tsvector_content_desc  tsvector 
       GENERATED ALWAYS AS (to_tsvector('english', coalesce(content, '') || ' ' || coalesce(description, ''))) 
       STORED;

Output:

Query to create an index:

CREATE INDEX textsearch_idx ON document_text USING GIN (tsvector_content_desc);

Output:

Now we can use the tsvector_content_desc field to perform a text search in the following query. You can notice the time consumed by this query is 0.88 ms which is significantly lower than other above query execution times which are executed without using an index.

Query:

select file_id, content 
       from document_text 
       where tsvector_content_desc @@ 
                     to_tsquery('(!bear & predict)| instruments');

Output:

Conclusion:

Postgres’ full-text search feature is fantastic and relatively fast (enough). It will enable your application to grow without the need for another tool.

Key Takeaways:

The ‘like’ and ‘ilike’ operators can not be used for full-text searches for larger documents due to their limited capability to perform a full-text search.
tsvector and tsquery are the datatypes available in Postgres, which overcome the limited capability of ‘like’ and ‘ilike’ operators for full-text search.
to_tsvector() and to_tsquery() are the function available in Postgres for searching the variants of query terms.
Indexing can be done on the tsvector field to do full-text search faster and more efficiently for larger documents to avoid any performance issues.

You can select the best text search strategy per your application requirements. Knowing the available data types and functions for text search in Postgres will assist you in deciding which data types to use in your database. If the application requires a short and exact word search word, you can use the ‘like’ and ‘ilike’ operators. If your application requires more complex searches for larger documents, you can use the to_tsvector() and to_tsquery() functions.

References:

https://www.postgresql.org/docs/current/textsearch-tables.html

Comprehend TSVECTOR and TSQUERY in Postgres for Full-Text Search

Introduction:

Full-text search in PostgreSQL: tsvector and tsquery

1. How to use to_tsvector()?

2. How to use to_tsquery()?

Optimization for fast text search:

Written by R. Gupta