Archiving an Individual Table from PostgreSQL Using Python

Vaishali Jain
Tata 1mg Technology
5 min read · Dec 11, 2019
Could there be any more data!

Every time you start organizing your room, there are two things that always happen:

A) You start putting the stuff you need all the time within easy reach.

B) There’s stuff that you should get rid of but you don’t because the big “WHAT IF you need it in the future” bug attacks you. So, you end up keeping it safely in your house for when that future arrives.

A similar situation arises with the production db when you need to deal with huge volumes of data on a daily basis. The data may not remain useful after a while, but you still keep it because it can be used for multiple purposes in the future.

At 1mg, at least 1 lakh (100,000) carts are created every day. Each cart is specific to a user’s needs and must remain intact until the user acts on it, such as placing an order or modifying the cart. But this ends up creating around 20 million rows in the cart table. And that’s worrisome. 😥

Following is the description of the cart table:

                      Table "public.cart"      Column      |            Type            |   Modifiers                  
------------------+----------------------------+-------------------
id | bigint | not null (serial)
user_id | text |
create_time | timestamp without time zone|
finish_time | timestamp without time zone|
.... | |
.... | |
etc
Indexes:
"cart_pkey" PRIMARY KEY, btree (id)
"cart_idx1" btree (finish_time)
"cart_idx2" btree (user_id, create_time)
We use PostgreSQL 9.5.18.
  1. The index on (user_id, create_time) is used to fetch a user’s active carts.
  2. The index on finish_time is used to fetch carts for a specific time range.

The objective is to keep db queries fast, because active carts are critical to us and we can’t afford to compromise on them. We can increase the size of the server (vertical scaling), but that will only help us up to a certain point.

Indexing alone can’t save us either: with such a huge volume of data, the planner may well choose a full table scan over an index scan. That is why we go for DATA ARCHIVING.

Choosing the right data to archive is important. We use the following criteria for that (sketched as WHERE fragments right after this list):

  1. Archive the inactive carts of users who haven’t performed any action in the past 3 months.
  2. Remove/archive inactive guest carts, i.e. carts that don’t have a user_id.
  3. Look for inactive users in the users table and archive their carts.
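
As a rough illustration, these criteria could translate into WHERE fragments like the ones below, kept as plain strings the way the config does later. The column semantics and the users table’s active flag are assumptions for the sketch, not the actual 1mg schema.

# 1. Carts with no activity in the last 3 months (column semantics assumed)
stale_carts = "(finish_time <= now() - interval '3 months')"

# 2. Guest carts, i.e. carts without a user_id
guest_carts = "(user_id IS NULL)"

# 3. Carts of users marked inactive in the users table (flag column assumed)
inactive_user_carts = "(user_id IN (SELECT user_id FROM users WHERE active = false))"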

Here comes the SOLUTION! 😃

A Python script to save our data into a new table. The new table can live in the same database or in a different one.

A few things to keep in mind before we start writing the code:

  1. Keep all the configuration in a config file so the script stays dynamic.
  2. Run the queries in batches to avoid overloading the production database, even though we plan to run the script at night.

So, let’s start!

Step 1: Set up the config file. Apart from the db connection information, we store several other keys:

"TABLE_NAME": "cart",  
"WHERE_CLAUSE": {
"LIMIT": 1000,
"WHERE_KEYS" : "(create_time <= %s)",
"VALUES": "('2018-01-01',)",
"KEY": "id"
}
  1. TABLE_NAME: The table that needs to be archived
  2. LIMIT: The number of rows to process in one batch
  3. WHERE_KEYS: The condition on which you want to archive the table, as per your use case
  4. VALUES: The values corresponding to the where keys
  5. KEY: The primary key used in the delete query
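
For completeness, here is a minimal sketch of loading such a config, assuming it is stored as JSON in a file called config.json (the file name and the use of the json module are assumptions, not from the original script):

import json

# Load the archiving configuration from disk; the file name is an assumption
with open("config.json") as f:
    config = json.load(f)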

Step 2: Establish the database connection. We use the Python module psycopg2.

psycopg2 is a DB API 2.0 compliant PostgreSQL driver

import psycopg2

connection = psycopg2.connect(
    user=config["POSTGRES_USER"],
    password=config["POSTGRES_PASS"],
    host=config["POSTGRES_HOST"],
    port=config["POSTGRES_PORT"],
    database=config["POSTGRES_DB"]
)

We use psycopg2 instead of any other module for two reasons:

  1. We need separate cursors to insert, delete, and count rows.
  2. We need to run raw SQL queries.
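
The snippets in the following steps use three separate cursors named count_cursor, insert_cursor, and delete_cursor. Their creation isn’t shown in the original snippets; a minimal sketch, assuming they all come from the same connection:

# One cursor per kind of operation, all on the same connection
count_cursor = connection.cursor()
insert_cursor = connection.cursor()
delete_cursor = connection.cursor()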

Step 3: Create the archive table if the script is running for the first time for a given table.

cursor = connection.cursor()
# PostgreSQL folds unquoted identifiers to lower case, so compare against the lower-cased name
cursor.execute(
    "SELECT EXISTS(SELECT * FROM information_schema.tables WHERE table_name = %s)",
    ((config["TABLE_NAME"] + "_ARCHIVED_TABLE").lower(),)
)

if not cursor.fetchone()[0]:
    # LIKE copies the column definitions by default, but not the indexes
    create_table_query = (
        "CREATE TABLE PUBLIC.{}_ARCHIVED_TABLE "
        "(LIKE PUBLIC.{})".format(config["TABLE_NAME"], config["TABLE_NAME"])
    )
    cursor.execute(create_table_query)
    connection.commit()

Step 4: Now, we count the number of rows that need to be archived, using the ‘where’ clause and values stored in the config file.

where_clause = config["WHERE_CLAUSE"]["WHERE_KEYS"]
# eval assumes the config file is trusted; ast.literal_eval would be a safer alternative
values = eval(config["WHERE_CLAUSE"]["VALUES"])
count_query = "SELECT COUNT(*) FROM {} WHERE ({})"
final_count_query = count_query.format(config["TABLE_NAME"], where_clause)
count_cursor.execute(final_count_query, values)

Step 5: In this step, we will crack the Real Deal 💥.

One thing to keep in mind: we first insert the rows into the new table and then delete them from the original table.

while count_cursor.fetchone()[0] > 0:
    insert_query = (
        "INSERT INTO {}_ARCHIVED_TABLE "
        "SELECT * FROM {} WHERE ({}) LIMIT {} RETURNING {}"
    )
    final_insert_query = insert_query.format(
        config["TABLE_NAME"],
        config["TABLE_NAME"],
        where_clause,
        config["WHERE_CLAUSE"]["LIMIT"],
        config["WHERE_CLAUSE"]["KEY"]
    )
    insert_cursor.execute(final_insert_query, values)
    connection.commit()

    insert_count = insert_cursor.rowcount
    print(insert_count, "Records inserted successfully")

    if insert_count > 0:
        # The delete picks another batch with the same WHERE clause and limit;
        # the ids returned by the insert could also be used to delete exactly those rows.
        delete_query = (
            "WITH row_batch AS (SELECT * FROM {} WHERE ({}) LIMIT {}) "
            "DELETE FROM {} o USING row_batch b "
            "WHERE b.{} = o.{} RETURNING o.{}"
        )
        final_delete_query = delete_query.format(
            config["TABLE_NAME"],
            where_clause,
            config["WHERE_CLAUSE"]["LIMIT"],
            config["TABLE_NAME"],
            config["WHERE_CLAUSE"]["KEY"],
            config["WHERE_CLAUSE"]["KEY"],
            config["WHERE_CLAUSE"]["KEY"]
        )
        delete_cursor.execute(final_delete_query, values)
        connection.commit()

        delete_count = delete_cursor.rowcount
        print(delete_count, "Records deleted successfully")

    # Re-run the count so the loop stops once nothing is left to archive
    count_cursor.execute(final_count_query, values)
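
Once the loop finishes, it’s good hygiene to close the cursors and the connection. This cleanup isn’t shown in the original script, so treat it as a sketch:

# Release database resources after archiving completes
count_cursor.close()
insert_cursor.close()
delete_cursor.close()
cursor.close()
connection.close()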

That’s it!

This script can now run on a weekly basis. We can also set up a cron job that triggers it at night.

At 1mg, we haven’t faced any major issues with it so far, and we keep monitoring the results it gives us.

Feels good to solve a crucial problem, doesn’t it? 😎

And just so you know, data archiving is not that common, it doesn’t happen everywhere, and it IS a big deal. — Rachel Karen Green

By doing this, we have significantly improved our API response times, halved the size of the table, and cleaned up the mess by archiving the data into a new table.

I hope I helped you with this article. If you have any questions, please let me know! Till then, Happy Coding. 🙂

If you liked this blog, hit the 👏 and stay tuned for the next one!
