Fast Text Search to Boost User Experience in Kampus Merdeka Platform

Published in

GovTech Edu

9 min readJan 26, 2023

Does your application have a feature to find some words, names, or phrases? Is it not fast enough? You should read this one.

Text Search Feature in the Application

Text search is a common feature many applications have. We usually use this feature on Instagram, Twitter, or any social media to find someone by their account username, first name, or last name. We use this feature to find an address in Google Maps, (Apple) Maps, Waze, or any Map Navigation.

In Kampus Merdeka, a platform that allows Indonesian college students to hone their skills outside their home university through several choices of programs (internship, independent study, student exchange, and so on), the text search feature is used to ease up the student to find any programs that can match their needs, to help ”mitra” a.k.a. Industry partners to find a specific mentor, and to support all of our stakeholders for stuff related to the information seeking and verification, to make Indonesian higher education better than ever.

Database to Store Texts

These days, we know that the choice of database technology that can potentially fit product needs is abundant. For example, we can choose a relational database such as MySQL, PostgreSQL, and so on or a non-relational database like MongoDB or Redis.

Considering many factors, we choose to utilize PostgreSQL as the main database management system in Kampus Merdeka platform to store the data of students, industry partners, and universities.

Why PostgreSQL? About that one, we probably need to write a separate blog post since it is an interesting story on its own :)

In this article, we share a bit of our experience on how to speed up our text search feature in Kampus Merdeka platform with a simple manipulation in PostgreSQL.

Existing Method

We, GovTech Edu, highly value user-centricity in developing our technology product. Putting users’ needs at the center when designing and developing the product is the way forward for the Kampus Merdeka platform to generate real impact. One of the aspects that we strive for is ensuring that some critical features' response time is in a “good enough” state since it is highly correlated with user satisfaction.

Ensuring fast response time for the text search feature is worth pursuing since it is quite heavily used. The initial text search feature was built through the standard pattern matching technique in PostgreSQL, i.e., using LIKE operator. We can form a simple SQL query to use the LIKE operator. Below is an example:

SELECT name FROM mentors WHERE name LIKE ‘%ridho%’;

Let’s say that we have data in our database like this:

When we run the SELECT query, it will return the name string from row number 2, “ridho rhoma.” The result is correct, as expected.

Everything is fine until it’s not; that is when we have a huge amount of text data to be queried. . In a situation where we have ~1 million rows of data containing names of persons with around 5 words for each name, the query takes at ~210ms planning time and ~460ms execution time.

The 460ms execution time seems quite fast. However, what if the number of data becomes larger and larger? Or what if our application receives an abundance of simultaneous requests from users? Surely we want to avoid the response time being above the psychological thresholds that may decline the user experience.

Full Text Search

We want a faster response time for our text search feature. But where to start?

These days “Googling” is perhaps one of the most essential skills that everyone should have :) We started by searching “how to find a word from a keyword in Postgres” and stumbled upon the Full Text Search feature in PostgreSQL, which could be an alternative method to the LIKE operator.

Implementing Full Text Search

Implementing the Full Text Search (FTS) in Postgres is similar to using the LIKE operator, i.e., text query-document matching. In FTS, the text matching is conducted by the operator @@ given <doc_text> as the document and <query_text> as the query, e.g., <doc_text> @@ <query_text>.

In order to run the FTS properly, each text is explicitly identified by either of the following data types: tsvector and tsquery. The tsvector provides the preprocessed documents that are splitted into lexemes, while tsquery represents the normalized search terms, which are also in the form of lexemes. Below is an example to run full text search query (copied from Postgres docs):

SELECT ‘a fat cat sat on a mat and ate a fat rat’::tsvector @@ ‘cat & rat’::tsquery;

The query will return true, because cat and rat exist at the vector data of “a fat cat sat on a mat and ate a fat rat” words.

So far, the FTS approach seems normal, i.e., the task above can still be done using the LIKE operator. That statement can be true in some simple cases, but FTS will provide more values when things get more complicated.

We notice one important functionality in the FTS query: it converts the text documents and queries into normalized tokens/lexemes, which are more favorable for text search with some ambiguity. The normalized documents can be pre-calculated within the database creation to allow a much faster query process.

Handling the text matching with ambiguity is complicated or impossible to achieve through the LIKE operator, in which manipulating the query with wildcards or regular expressions is the only thing we can do. In short, FTS is more equipped with the linguistic support that makes our life easier in developing the search feature. Now let us get our hands on simulating implementing the FTS in Postgres. We initially create a simple DB table called MOCK_DATA in which the “name” column contains the documents to be queried as follows:

CREATE TABLE mock_data (
id SERIAL
name VATCHAR(100),
created_at timestampz,
ts tsvecror generated always as (to_tsvector('english'::regconfig, name))
);

Note that the most important part of the table creation script above is the existence of the “ts” column, in which the lexemes of documents in the “name” column, generated by to_tsvector() function, will be stored. The future queries with the tsquery data type will be compared against the lexemes in the “ts” column.

Indexing

One of the most impactful ideas in information retrieval is indexing, which is about collecting, parsing, and storing data to facilitate more efficient retrieval. Indexing has become an indispensable capability in any search engine. Without indexing, the text search process needs to scan all rows in the corresponding table, which can be intractable as data grows. This means we should embed an indexing process in the FTS to gain more speed in the query execution.

PostgreSQL provides two kinds of indexing that can be used in the Full Text Search: Generalized Search Tree (GiST) and Generalized Inverted Index (GIN). Which is more suitable for our case? We ended up choosing GIN as our indexing method. The consideration is two-fold. First, GIN fits more naturally with the FTS’s lexemization outcomes since it was designed to deal with data-types that can be divided into sub-elements / tokens and we want to search for those individual element values. Second, we prefer a simpler yet effective indexing method, which is fulfilled through GIN rather than GiST — in GIST indexing we need to configure the additional siglen parameter to obtain expected results.

Here is how we implement the GIN indexing. We first seeded some data into the mock_data table that was created before. Around 1 million rows of data were populated with the query as follows:

INSERT INTO mock_data (name, created_at) VALUES (?, ?);

Note that the data populated into the “name” column are a person’s full name with up to 5 words.

Then we applied the GIN indexing on the table based on the tsvector lexemes stored in the “ts” column:

CREATE INDEX trgm_idx_fts ON mock_data USING GIN (ts);

We can see that embedding the GIN is pretty simple.

Testing

To verify our hypothesis that using FTS plus GIN would result in faster text search, we ran a few search query simulations and captured both the planning and execution time with different settings as follows:

LIKE operator without indexing
LIKE operator with GIN
FTS without indexing
FTS with GIN

All the simulations were executed inside a Dockerized PostgreSQL 13 on the same machine. The following are the detailed results:

The simulation above shows that the FTS with GIN indexing generates the fastest query among the other settings. Furthermore, GIN's impact significantly reduces the overall query time in both LIKE and FTS cases. The GIN indexing can be effective with the LIKE operator when the pg_trgm extension is installed in Postgres, potentially introducing more production complexity.

Actual Impact on Kampus Merdeka Platform

The previous simulation provides us with a convincing insight that we should implement the FTS and apply GIN indexing in the real world. One of the use cases is the mentor’s name search when a Mitra needs to add a previous Mentor to a new program. This was initially powered by the query using LIKE operator under the hood and we revamped the query implementation by incorporating FTS and GIN like what we did in the simulation.

The text search feature is triggered when the user types the Mentor’s name as shown below:

We collect the response time data retrieved from Prometheus that monitors the API calls related to the search feature, before and after the FTS with GIN is implemented. Taking the aggregated response time in the form of (min, max, average) over 2 weeks before and after the FTS implementation, we observe a quite significant speed improvement:

Before optimization

After Optimization

This confirms the real-world effectiveness of the FTS with GIN in speeding up the mentor’s name search over the LIKE operator in Kampus Merdeka platform.

Conclusion

The Full Text Search (FTS) and Generalized Inverted Index (GIN) in PostgreSQL has been around for quite a while and are versatile weapons as building blocks for any search features in apps. It is expected that the FTS with GIN provides better performance for the text search feature than the pattern matching with LIKE operator in terms of the basic search functionalities and query speed. FTS and GIN are also relatively simple to implement.

Due to the familiarity with the LIKE operator, it is sometimes tempting to use that approach as the backbone of the text search feature. However, we recommend that the FTS and GIN are utilized in the first place for a more optimized text search feature.

We eventually implemented the FTS with GIN in all parts of Kampus Merdeka platform equipped with text search capabilities so that the overall response time of the platform is reduced and, therefore, user satisfaction improves.

Authors

Ridho Perdana, Software Engineer specializing in Backend at GovTech Edu, specifically in Kampus Merdeka Platform. He is responsible for identifying the problem, implementing solutions, and ensuring the solution that has been created runs as it should. Before Joining GovTech Edu, he had experience as Senior Backend Engineer at Amartha Mikrofintek and Kurio (News Aggregator).

Muhammad Ghifary, Head of Engineering at GovTech Edu who is responsible for the development of Kampus Merdeka platform. Before joining GovTech Edu, he had experiences in both academia and industry in several companies: Bank Rakyat Indonesia, Bukalapak, and Weta Digital. He was involved in the filmmaking of many blockbuster movies, notably as an AI research engineer in Avatar 2: The Way of Water. He earned a doctorate degree in computer science in the area of artificial intelligence.

References:

Postgres Text Search: Full Text vs Trigram Search — Aapeli Vuorinen

Postgres (or PostgreSQL, if you like) is an amazingly powerful database engine. I’ve recently been working hard on…www.aapelivuorinen.com

12.9. Preferred Index Types for Text Search

There are two kinds of indexes that can be used to speed up full text searches: and . Note that indexes are not…www.postgresql.org

Optimizing Postgres Text Search with Trigrams

In this post, we’ll implement and optimize a text search system based on Postgres Trigrams. We’ll start with some…alexklibisz.com