Reaching the limits of full text search on Postgres

Kevin Lee
Published in Gem Software
8 min read · Sep 21, 2022

Gem’s mission is to help companies hire great teams. Our talent engagement platform enables recruiting teams to build relationships that lead to diverse, high-quality talent pipelines, a great candidate experience, and predictable hiring at any scale. One of our products is Prospect Search, which enables recruiters to re-engage with talent they’ve been in communication with previously. Prospect Search supports queries over 50+ attributes such as “software engineers with database experience who we haven’t reached out to in the past year”.

We built the initial version of Prospect Search in 2018 using AWS Aurora Postgres. This worked well for years. However, as the number of prospects in our database grew and we added more search features, we started to notice slow queries that took over 30 seconds to complete. We didn’t want latency to become a constraint for teams, so we set an ambitious internal goal: fix our latency problem within three months with a small team of three engineers. We achieved that goal by migrating from Postgres to Elasticsearch, and we can now search over 200 million prospects with latencies under one second.

Challenges with Postgres

After looking into the slow queries, we learned that full text queries over large blocks of text incur a high I/O cost in Postgres, particularly when they return many hits. The most common queries in Prospect Search tend to be full text queries over resumes, which are large blocks of text, and these queries tend to return many hits. For example, “software engineers in San Francisco” returns over 100,000 prospects.

Our Postgres table stores each resume in a single row, with the text in a tsvector column, a representation optimized for fulfilling full text queries, and a GIN index on that column to speed up searches. However, Postgres does not support index-only scans for GIN indexes (or for GiST indexes, the other kind of index that speeds up tsvector searches), so it must fetch the full row for each hit, which diminishes the benefit of the index when a query returns many hits.
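As a minimal sketch of this setup using psycopg2 (the table, column, and index names here are hypothetical, not our actual schema):

```python
import psycopg2

# Hypothetical schema mirroring the setup described above: one resume
# per row, with a tsvector column indexed by GIN.
conn = psycopg2.connect("dbname=prospects")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE resumes (
        id BIGINT PRIMARY KEY,
        body TEXT,
        body_tsv TSVECTOR
    );
    CREATE INDEX resumes_body_tsv_idx ON resumes USING GIN (body_tsv);
""")

# The GIN index quickly finds the IDs of matching rows, but Postgres
# still fetches each full row from the heap to return it.
cur.execute(
    "SELECT id FROM resumes WHERE body_tsv @@ plainto_tsquery('english', %s)",
    ("software engineer san francisco",),
)
print(cur.fetchmany(10))
```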

For example, using the GIN index by itself to look up 100,000 resumes that contain the terms “software engineer san francisco” incurs a low I/O cost of a few MB, since the inverted index directly maps each term to a list of IDs for resumes that contain the term. However, the subsequent bitmap heap scan that fetches the 100,000 matching resumes incurs a very high I/O cost. An average resume row is 4.5KB, so fetching 100,000 average resumes requires loading at least 450MB. In practice, we may fetch much more than that because Postgres loads data in fixed-size pages. The matching resumes tend to be scattered across different pages, so we might fetch an 8KB page but use only a single 2KB resume from it.
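The back-of-envelope math, using decimal units as above:

```python
hits = 100_000     # matching resumes
avg_row_kb = 4.5   # average resume row size
page_kb = 8        # Postgres reads fixed 8KB pages

print(hits * avg_row_kb / 1000)  # 450.0 MB of row data at a minimum
# Worst case, every matching row sits on a different page:
print(hits * page_kb / 1000)     # 800.0 MB of pages read
```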

After learning why Prospect Search queries have a high I/O cost, we had a follow-up question: where does the bottleneck in I/O occur? To get to the bottom of this, we needed to understand how Aurora Postgres works under the hood. Aurora decouples compute and storage, such that all compute instances access the same storage layer when they handle queries. The storage layer partitions the dataset into 10GB segments that are automatically replicated and distributed across a fleet of SSD-based nodes. In Aurora, the I/O bottleneck comes not from any individual SSD-based node but rather from the network between the compute and storage layers.

Our CloudWatch monitoring supported the idea that network bandwidth was indeed the bottleneck during peak times. The instance class we use, db.r5.16xlarge, comes with a network bandwidth of 20 gigabits (2.5 gigabytes) per second. During peak times, our network throughput (StorageNetworkThroughput + NetworkThroughput) approached the 2.5GB/second limit. DiskQueueDepth, which counts the number of pending I/O requests, should ideally be near 0, but for us it occasionally spiked above 200.
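One way to pull these metrics, sketched here with boto3 and a hypothetical instance identifier:

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

def peak_metric(name: str) -> float:
    """Return the per-minute peak of an RDS metric over the last hour."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=name,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "prospects-db"}],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=60,
        Statistics=["Maximum"],
    )
    return max((dp["Maximum"] for dp in resp["Datapoints"]), default=0.0)

for metric in ("StorageNetworkThroughput", "NetworkThroughput", "DiskQueueDepth"):
    print(metric, peak_metric(metric))
```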

Picking a solution

We first tried to improve performance while continuing to use Aurora Postgres. We investigated the following:

  • Materialized views that transform the search dataset into a read-optimized form did not work out, because Aurora Postgres was unable to refresh such a view fast enough (within 48 hours).
  • Scaling network bandwidth by adding more read replicas would have been very expensive in terms of AWS cost, as queries would still have incurred a high I/O cost.
  • Enabling index-only scans for full text search by using the RUM index extension was not feasible because neither Aurora nor regular RDS Postgres support the extension.

Given our challenges with Aurora Postgres, we also considered alternative databases that more specifically target the search use case. We looked at the following:

  • Elasticsearch seemed viable: it is purpose-built for full text search, and its sharded index design scales horizontally.
  • Algolia was risky because we exceeded their recommended index size of 102GB by over 10x.
  • Typesense was very expensive in terms of AWS cost, as its non-sharded, in-memory index design meant that the entire index would have needed to fit in memory in a single instance.

Elasticsearch implements optimizations that make it very I/O-efficient for full text queries over large blocks of text, even when they return many hits. Like the Postgres GIN index, Elasticsearch indexes text fields using an inverted index. Unlike Postgres, however, it supports index-only scans on the inverted index, so we don’t need to fetch the full document for each match. Then, when Elasticsearch ranks the top K hits, it further reduces I/O by skipping blocks of documents whose maximum possible scores cannot make it into the top K (an optimization known as block-max WAND).
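For illustration, here is a minimal sketch of the kind of top-K full text query this enables, using the Elasticsearch 8.x Python client; the index and field names are hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Rank only the top 25 hits, and skip fetching each document's stored
# _source entirely; the IDs and scores from the inverted index suffice.
resp = es.search(
    index="prospects",
    query={"match": {"resume": "software engineer san francisco"}},
    size=25,
    source=False,
)
hit_ids = [hit["_id"] for hit in resp["hits"]["hits"]]
```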

Elasticsearch was also well-suited to support upcoming projects we had at the time on the Prospect Search product roadmap. One project was to improve the relevance of the search results, and Elasticsearch supports a rich set of tools to improve search relevance. Another project was to recommend prospects similar to an individual prospect, and Elasticsearch supports More Like This queries for this use case.
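A More Like This query, sketched below with a hypothetical prospect document ID, asks Elasticsearch to extract representative terms from that document and find others like it:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# "More prospects like prospect-123" (a hypothetical document ID).
resp = es.search(
    index="prospects",
    query={
        "more_like_this": {
            "fields": ["resume"],
            "like": [{"_index": "prospects", "_id": "prospect-123"}],
            "min_term_freq": 1,
            "max_query_terms": 25,
        }
    },
    size=10,
)
```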

Over a two-week sprint, we built an end-to-end prototype of Prospect Search on Elasticsearch that indexed all prospects and most of the required prospect attributes. When we manually tried out the prototype, queries returned in under one second. That gave us the confidence to commit to migrating to Elasticsearch.

Migrating to Elasticsearch

It was then time to plan our migration. We had two months remaining and three full-time engineers on the project. There were three main questions to figure out:

  1. How might we load data from upstream sources into Elasticsearch?
  2. What should the schema for the Elasticsearch index look like?
  3. How might we catch data discrepancies introduced during the migration?

For loading data into Elasticsearch, we first implemented a nightly batch ETL job, because the product could tolerate up to 48 hours of data staleness and because it was quick to build. The ETL job extracts data from Snowflake (which holds a copy of the raw data from our production database), transforms it on the fly in a single Snowflake query, and loads it into Elasticsearch via an in-house connector built on the Elasticsearch Python client. We were lucky that Snowflake was already set up for us, as its massively parallel processing compute cluster runs the complex query that returns the full prospects dataset in just five minutes.
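In outline, the load step looks something like the following sketch, built on snowflake-connector-python and the Elasticsearch Python client. The connection details, the query, and the assumption that the transform emits each document as a JSON variant are all hypothetical:

```python
import json

import snowflake.connector
from elasticsearch import Elasticsearch, helpers

# Hypothetical connection details; ours differ.
sf = snowflake.connector.connect(account="...", user="...", password="...")
es = Elasticsearch("http://localhost:9200")

def prospect_actions():
    """Stream transformed prospect rows out of Snowflake as bulk actions."""
    cur = sf.cursor()
    cur.execute("SELECT id, doc FROM analytics.prospect_search_export")
    for prospect_id, doc in cur:
        # Assume the transform query emits each document as a JSON variant.
        yield {"_index": "prospects", "_id": prospect_id, "_source": json.loads(doc)}

# helpers.bulk batches the actions into _bulk API requests for us.
helpers.bulk(es, prospect_actions())
```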

Up to 48 hours of data staleness sufficed for the initial release, but customers would still prefer real-time data. We planned to work on a streaming ETL job, which is more complex to build, to provide this for them in a future release.

For the Elasticsearch index schema, the main challenge was managing the relations between prospects and other objects such as job applications. In general, there are three approaches for managing relations in Elasticsearch:

  • A fully denormalized schema is efficient to query. For example, if a prospect submits two applications, the index contains two documents, one per prospect-application pair. At query time, we use the collapse parameter on the prospect ID to efficiently de-dupe prospects among the top hits (sketched below).
  • Nested objects are also efficient to query. However, compared to querying denormalized flat objects, querying nested objects comes with more limitations that could have surprised us mid-migration.
  • Joins are not as efficient to query, and their main benefit, more efficient writes, mattered less because we do not yet promise real-time data in the product.

We decided to use a fully denormalized schema for prospects as it was fast to query and was least likely to surprise us with a limitation during the migration.
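As a rough sketch of what querying that schema looks like (index and field names hypothetical), collapse de-dupes the prospect-application documents at query time:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# With one document per prospect-application pair, collapse on the
# prospect ID so each prospect appears at most once in the top hits.
resp = es.search(
    index="prospects",
    query={"match": {"resume": "software engineer"}},
    collapse={"field": "prospect_id"},
    size=25,
)
```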

For catching data discrepancies introduced during the migration, we set up a tool that routed sample requests to the two backends and flagged diffs in the results. The diff tool enabled us to test that a broad, representative set of requests was working before we rolled out the new backend to users.
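The core of the tool amounted to something like the following sketch, where the two search functions are hypothetical wrappers around our Postgres and Elasticsearch query paths:

```python
from typing import Callable, Iterable

SearchFn = Callable[[dict], Iterable[str]]  # request -> ranked prospect IDs

def diff_request(request: dict, postgres: SearchFn, elastic: SearchFn) -> bool:
    """Replay one sampled request against both backends and flag any diff."""
    old_ids, new_ids = set(postgres(request)), set(elastic(request))
    if old_ids != new_ids:
        print(
            f"diff for {request!r}: "
            f"missing={old_ids - new_ids}, extra={new_ids - old_ids}"
        )
    return old_ids == new_ids
```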

Outcome

We succeeded in migrating Prospect Search to Elasticsearch in three months with three engineers, and we are happy with how Elasticsearch has been performing. p99 latency with Elasticsearch is under one second, compared to over 30 seconds with Postgres. Average latency with Elasticsearch is under 100 milliseconds. We now have plenty of headroom to scale.

Acknowledgements

Drew Regitsky and Ben Hiller for working with me as the core team on this project.

Einas Haddad, Rachel Bloch Mellon, Dmitri Brereton, Iulia Tamas, Krishelle Hardson-Hurley, Raymond Luong, Emil Ibrishimov, Nick Bushak, Desiree Caballero, and Chelsie Chan for reviewing this blog post.

Are you interested in being a part of #LifeAtGem? Head to our careers site to learn more about our culture and what it’s like to work at Gem.
