Creating partitions automatically in PostgreSQL
There are several use cases to split up tables to smaller chunks in a relational database. Our choice of SQL server is PostgreSQL the most advanced open source and free database out there for regular SQL workloads. It has decent support for partitioning data in tables but it is not automatically done. While I was working with a client it came up as a potential optimization to reduce the time it takes to run a query against a smaller portion of the data.
What use cases benefit from partitioning?
There is great coverage on the Postgres website about what benefits partitioning has.
Partitioning refers to splitting what is logically one large table into smaller physical pieces. Partitioning can provide several benefits:
Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. The partitioning substitutes for leading columns of indexes, reducing index size and making it more likely that the heavily-used parts of the indexes fit in memory.
When queries or updates access a large percentage of a single partition, performance can be improved by taking advantage of sequential scan of that partition instead of using an index and random access reads scattered across the whole table.
Bulk loads and deletes can be accomplished by adding or removing partitions, if that requirement is planned into the partitioning design. ALTER TABLE NO INHERIT and DROP TABLE are both far faster than a bulk operation. These commands also entirely avoid the VACUUM overhead caused by a bulk DELETE.
Seldom-used data can be migrated to cheaper and slower storage media.
Implementing daily partitions based on dates
First we are going to create a table with only two fields. In production there obvisously more fields but for the sake of simplicity I have trimmed down the rest.
There is only one thing to note here, OIDS=FALSE, that basically tells to Postgres not to assign any OIDS (object identifiers) for the rows in the newly created table. This is the default behaviour of Postgres after the 8.0 release. More about it here: link.
After creating the table we need to create a function that will be used as a trigger to create a partition if it does not exist when inserting to the table. Postgres functions are fun, you should check out what other useful things can be done with them.
One thing to note that this relies on the “date” field being present in the table, and that controls the name of the partition. The “date” field is date type (surprise) and we need to convert it to text so it can be used as a field name in Postgres. Luckily the to_char function does exactly that, we can give a mask how we would like to receive the string. I was choosing YYYY_MM_DD as the mask that gives us nice tables names. There is only one more thing left before we can try to insert into our new system. We need to create a trigger that runs before the actual insert happens. Creating the trigger is simple. The only important thing to note here is that it has to be before insert.
Now we have everything in place for testing partitioning. Lets execute few INSERT statements to see it works as expected.
One minor problem you might notice is that the function does not return how many rows were inserted into the table. Other than that it seems everything is working.
We have few partitions in our setup but there is no good way to check how many exactly there. For checking on our partitions we can craft a simple query and roll it into a view for easier execution.
Lets select all of the partitions we got for the table so far:
Perfect, now we have a good start to use our new setup with automatic partition creation. Few open questions left on the table:
- h̶o̶w̶ ̶t̶o̶ ̶r̶e̶t̶u̶r̶n̶ ̶t̶h̶e̶ ̶c̶o̶r̶r̶e̶c̶t̶ ̶n̶u̶m̶b̶e̶r̶ ̶i̶n̶s̶e̶r̶t̶e̶d̶ ̶t̶o̶ ̶t̶h̶e̶ ̶t̶a̶b̶l̶e̶to
- h̶o̶w̶ ̶t̶o̶ ̶r̶e̶t̶u̶r̶n̶ ̶t̶h̶e̶ ̶n̶e̶w̶l̶y̶ ̶c̶r̶e̶a̶t̶e̶d̶ ̶i̶d̶ ̶f̶r̶o̶m̶ ̶t̶h̶e̶ ̶f̶u̶n̶c̶t̶i̶o̶n̶
Changing the INSERT statement to include patent_id when returns:
We can add the same to the actual insert we are issuing.
I am going to update this post when I figure out these things. Thanks for reading!
About our company
StreamBright Data works on empowering companies to use their data to its full potential and accelerate adoption of big data technologies. Its team has many years of experience with data architectures and distributed systems, including SQL and other solutions (key-value, document stores). Visit https://streambrightdata.com to learn more how they can help you with your data questions.