PostgreSQL Query Optimisation with Efficient Use of UPSERT Queries

Pushkar Singh
Mar 12 · 4 min read

At Tokopedia, we follow the microservice architecture and almost every service has its own database. Data on each service is increasing exponentially over time. Therefore, having the right database, schema and proper use of queries is paramount to match Tokopedia’s scale.

This article shares insights from some of the improvements our engineering team made while working with the PostgreSQL database at Tokopedia. Let's unleash some PostgreSQL Qi!

Before we begin, let me describe the data set we worked on.

At Tokopedia, we have data related to our ads, called statistics data. Businesses or individuals trying to market products or services often create ads to drive their promotion efforts. The ads' performance data consists of three common terms:

Impressions- Whenever an ad becomes visible to a user, we call it an impression. The impressions of an ad are the number of times that ad was actually shown to end users.
Clicks- The number of times users click on an ad.
Conversions- A conversion is a click that is converted into an action.
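To make the additive semantics of these counters concrete, here is a minimal Go sketch; the `AdStats` type and `Merge` function are illustrative names, not part of our actual codebase:

```go
package main

import "fmt"

// AdStats holds the running counters for one ad.
// Field names here are illustrative, not our real schema.
type AdStats struct {
	Clicks      int64
	Impressions int64
	Conversions int64
}

// Merge adds a new batch of counts to the existing totals,
// mirroring the additive update the consumer performs in SQL.
func Merge(old, batch AdStats) AdStats {
	return AdStats{
		Clicks:      old.Clicks + batch.Clicks,
		Impressions: old.Impressions + batch.Impressions,
		Conversions: old.Conversions + batch.Conversions,
	}
}

func main() {
	total := Merge(
		AdStats{Clicks: 10, Impressions: 200, Conversions: 2},
		AdStats{Clicks: 3, Impressions: 50, Conversions: 1},
	)
	fmt.Println(total) // {13 250 3}
}
```

Every write for an ad is a fold of this kind over the incoming batches, which is why the database query must add to the stored counters rather than overwrite them.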

We have some APIs that publish statistics data in real time; later, this data is dumped into our PostgreSQL database asynchronously by consumers. Since we have an enormous number of ads and a large volume of data to process in real time, we run multiple threads/processes to store this data. Later, we trigger other processes to disable those ads whose daily credit/balance/budget is exhausted.

Here, I will discuss the optimization done on the consumer, which is responsible for dumping statistics data into our PostgreSQL database.

Problem Statement

Our system was taking longer than expected to disable ads. As a result, there was a delay in disabling marketing ads, which was leading to revenue loss.

What went wrong?
We used a select-then-insert/update pattern (choosing insert or update depending on whether the select returned a row) inside an explicit transaction.

Code snippet for the consumer process-

// begin an explicit transaction
tx, err := dbMaster.Begin()
if err != nil {
	return err
}
// set the isolation level for this transaction
if _, err = tx.Exec(`set transaction isolation level repeatable read`); err != nil {
	tx.Rollback()
	return err
}
// select query to find existing data for an ad
var oldClicks, oldImpressions, oldConversions int64
query1 := `select click, impression, conversion from stats_table where ad_id = $1`
err = tx.QueryRow(query1, adID).Scan(&oldClicks, &oldImpressions, &oldConversions)

switch {
case err == sql.ErrNoRows:
	// no row yet: use an insert query
	query2 := `insert into stats_table (click, impression, conversion, ad_id)
	           values ($1, $2, $3, $4)`
	_, err = tx.Exec(query2, newClicks, newImpressions, newConversions, adID)
case err == nil:
	// row exists: use an update query that adds the new counts to the old ones
	query2 := `update stats_table set
	           click = $1, impression = $2, conversion = $3
	           where ad_id = $4`
	_, err = tx.Exec(query2,
		oldClicks+newClicks, oldImpressions+newImpressions,
		oldConversions+newConversions, adID)
}
if err != nil {
	// if error, roll back the transaction
	tx.Rollback()
	return err
}
// if no error, commit the transaction
return tx.Commit()

The purpose of the above implementation

  • As the statistics data dump is handled by many threads/processes to boost the dumping rate, the whole read-modify-write was wrapped in an explicit transaction block to make the writes atomic and non-conflicting with other threads/processes.
  • The select query was used to decide whether to update an existing row in the stats table or insert a new one for a particular ad.

Problems with above implementation

  1. As the statistics data volume is very high, two processes may try to update the statistics for the same ad simultaneously; one transaction is then aborted, because the row lock is already held by the other process. Unless the aborted write is retried, this leads to data loss.
  2. The number of queries fired at the DB for a single data insertion is 5 (BEGIN transaction; SET transaction isolation level; SELECT to fetch existing data and decide between insert and update; INSERT/UPDATE; ROLLBACK/COMMIT transaction), resulting in higher latency.

UPSERT to the rescue.

Instead of a select followed by an insert/update, we use a single UPSERT (INSERT ... ON CONFLICT) query.

Code snippet for the new implementation-

//new upsert query
query := `insert into stats_table (click, impression, conversion, ad_id)
          values ($1, $2, $3, $4)
          on conflict (ad_id) do update set
          click = stats_table.click + excluded.click,
          impression = stats_table.impression + excluded.impression,
          conversion = stats_table.conversion + excluded.conversion`
//execute the upsert query
_, err = dbMaster.Exec(query, newClicks, newImpressions, newConversions, adID)

Please note that ad_id in stats_table was unique; ON CONFLICT requires a unique index or constraint on the conflict target.

Benefits of the new approach

  1. Queries reduced from 5 to 1, therefore lower latency.
  2. No need for explicit transactions; the single statement runs in an implicit transaction managed at the PostgreSQL level, which guarantees the write does not conflict with other threads/processes by taking an exclusive row-level write lock.
  3. As there is no explicit multi-statement transaction, there are no transaction conflicts to abort, eliminating the data-loss possibility.


Results

  1. DB server CPU usage decreased by 3%.
  2. PostgreSQL locks reduced by 75%.
  3. Statistics data accuracy improved by eliminating explicit transactions; the count of conflicting transactions dropped to 0.
  4. The time cost of dumping data into the PostgreSQL table improved by 60%.
[Chart: PostgreSQL lock reduction]
[Chart: DB server CPU usage decrease]

We have shared some common problems faced while optimizing queries in a database and how to solve them. If you have any suggestions or questions about optimizing a query, please reach out to us by commenting below.

Happy SQLing. :)

Tokopedia Engineering

Story from people who build Tokopedia