Working with big PostgreSQL databases

Ernestas Kardzys
Published in Zedge Engineering
Dec 16, 2020

Zedge systems are used by millions of clients every day, resulting in a heavy load and millions of records being written to our PostgreSQL databases daily. Unfortunately, we cannot afford downtime, so making database structure changes (such as adding a table or a column, or adding an index) becomes a challenging process.

This blog post describes some strategies we have learned that allow us to modify our PostgreSQL tables without taking the database down.

Strategies

Creating a new table

Creating a new table is a relatively easy process, as the table is not yet used anywhere in the system. Please note that PostgreSQL automatically creates an index on the primary key.
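
For example, a minimal version of the table used in the examples below could be created like this (the column definitions are illustrative assumptions, not our actual schema):

CREATE TABLE my_table (
    id BIGSERIAL PRIMARY KEY,  -- the primary key gets an index automatically
    my_column BIGINT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);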

Adding a new column

This is trickier, as adding a new column requires a lock on the table, and if the table is large and heavily used, it might result in downtime.

If you add a column with a default value, PostgreSQL needs to set that default on every existing row of the table (before PostgreSQL 11 this rewrote the whole table; newer versions avoid the rewrite for constant defaults):

ALTER TABLE my_table ADD COLUMN new_column VARCHAR(256) NOT NULL DEFAULT 'Hello';

The query above could take some time to run, and while it does, no other queries would be able to use the table. What could you do?

The solution is to add a nullable column, as that requires only a very short lock on the table:

ALTER TABLE my_table ADD COLUMN new_column VARCHAR(256) NULL;

The drawback of this approach is that the new column starts out as null, so you later need to backfill it with a value, either in application code or with additional SQL queries.
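
One hedged way to do that backfill purely in SQL, assuming the nullable column was added as above and using an illustrative batch size, is to update in small batches and only then add the constraint:

-- Backfill existing rows in small batches to keep locks and transactions short
UPDATE my_table
SET new_column = 'Hello'
WHERE id IN (
    SELECT id FROM my_table WHERE new_column IS NULL LIMIT 10000
);
-- Repeat the statement above until it updates 0 rows

-- Optionally enforce the constraint once every row has a value
-- (SET NOT NULL scans the table to validate, but does not rewrite it)
ALTER TABLE my_table ALTER COLUMN new_column SET NOT NULL;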

Renaming a column

The safest way to rename a column is to follow this process (a SQL sketch follows the list):

  • Create a new column that will be replacing the old column
  • Deploy a change that would write to both columns simultaneously
  • Copy all the data from the old column to the new column
  • Deploy a change that would write to the new column only
  • Drop the old column
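
A rough SQL sketch of those steps is below; the column names are illustrative, and the two deployment steps live in application code, so they appear only as comments:

-- 1. Create the new column that will replace the old one
ALTER TABLE my_table ADD COLUMN new_name VARCHAR(256) NULL;

-- 2. Deploy application code that writes to both old_name and new_name

-- 3. Copy the existing data across (in batches on a big table)
UPDATE my_table SET new_name = old_name WHERE new_name IS NULL;

-- 4. Deploy application code that writes to new_name only

-- 5. Drop the old column
ALTER TABLE my_table DROP COLUMN old_name;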

Please note that dropping a column in PostgreSQL does not physically remove the data; it only makes the column invisible to SQL operations, so it won't immediately reduce the size of the table on disk. For more information, please see the official PostgreSQL documentation on ALTER TABLE.

Adding an index on a column

When adding an index on a column, PostgreSQL locks the indexed table against write operations (reads can still happen) and builds the index in a single scan of the table:

CREATE INDEX my_index_idx ON my_table(my_column);

This works well if the table is small and the index can be built quickly.

Unfortunately, indexing a big table in a live environment may take a while, and being unable to write to the table for that long might not be an option. Therefore, PostgreSQL offers concurrent index creation, which does not block writes to the table:

CREATE INDEX CONCURRENTLY my_concurrent_index_idx ON my_table(my_column);

Please note that building a concurrent index takes longer than building a non-concurrent one, as it requires two table scans instead of one. Since concurrent index creation lets the live environment keep working, it is the preferred way of creating indexes. For more information, please see the official PostgreSQL documentation on CREATE INDEX.

Inserting a large number of records into the table

When inserting data into a table, be aware that indexes and foreign keys on that table slow the process down significantly. If you can drop the indexes and foreign keys, insert the data, and then recreate them, insert performance improves considerably.
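
A hedged sketch of that approach, with an assumed index name and an assumed CSV file as the data source:

-- Drop the secondary index before the bulk load
DROP INDEX my_index_idx;

-- Load the data; COPY is typically much faster than individual INSERTs
COPY my_table (id, my_column) FROM '/tmp/bulk_data.csv' WITH (FORMAT csv);

-- Recreate the index afterwards; CONCURRENTLY keeps the table writable meanwhile
CREATE INDEX CONCURRENTLY my_index_idx ON my_table (my_column);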

Updating large number of records

When updating a large number of records, the tricky part is maximizing the table's availability, as a long-running update holds locks on the affected rows for its whole duration.

If disk space is not an issue, probably the easiest approach is to select the data into a temporary table and use it for the updates. This maximizes the table's availability because locks only need to be held for a short period of time.

You could use something like:

-- Create a temporary table with the required data
SELECT id, 12345678 AS value_to_update_to
INTO my_table_with_update_data
FROM my_table
WHERE my_column > 800000;

-- Use the temporary table to update the data
WITH rows_to_update AS (
    SELECT id, value_to_update_to
    FROM my_table_with_update_data
)
UPDATE my_table SET my_column = rows_to_update.value_to_update_to
FROM rows_to_update
WHERE my_table.id = rows_to_update.id;

-- Drop the temporary table
DROP TABLE my_table_with_update_data;

Conclusion

At Zedge we receive thousands of requests every minute that eventually find their way into various databases. As the load constantly grows, it becomes a challenge to add new features and maintain the database without causing an outage of the system.

Working with big tables is a challenge, but by using the practices described above, we minimize the risk of downtime.


Ernestas Kardzys
Zedge Engineering

Software Engineer, specialising in Java, Kotlin, Spring and Docker