How Kampus Merdeka Platform Can Live a Frugal Life in Cloud

GovTech Edu

Writers: Muhammad Imaduddin, Ibnul Qoyyim, Muhammad Haris & Muhammad Ghifary

TLDR

Reducing the operational cost of cloud utilization by migrating document files from the RDBMS to cloud storage with minimal downtime.

Background

Kampus Merdeka is a platform that connects universities, students, and industries in a dynamic and mutually beneficial way. During their final academic year, it enables students to gain out-of-campus experience such as internships, student exchanges, and K-12 teaching. The platform has been generating an enormous amount of data from the various activities it administers, including documents in binary file formats.

Previously, the Kampus Merdeka platform ran on an on-premises server, where infrastructure cost was not a major concern. However, the server could not handle the platform's growing requirements. Therefore, the Kampus Merdeka platform was gradually migrated to a cloud environment, i.e., Google Cloud Platform (GCP).

The cloud migration of the Kampus Merdeka platform has been a success, safeguarding its uptime at 99.985% in 2022 (equivalent to roughly 1 hour and 18 minutes of downtime over the whole year), a massive improvement over the previous year. However, because we used a semi lift-and-shift strategy to migrate the legacy database of Kampus Merdeka to the cloud, it introduced a byproduct: cost inefficiency in utilizing cloud resources, due to the large size of databases (DBs) with BLOB or base64 data types hosted in Cloud SQL.

We realized that such lavish cloud utilization was wasteful, and we believed there was a way to improve cost efficiency.

Our premise for reducing the cloud utilization cost is simple: migrate all the document files stored as BLOB or base64 into Google Cloud Storage (GCS), since storing files there costs much less than keeping them in Cloud SQL.

So here is our journey to make the Kampus Merdeka platform more frugal in utilizing cloud resources.

Journey 1: Migrating Documents from DB to Cloud Storage

Many applications need to store unstructured data or binary files, such as documents, images, videos, or audio. A common question is whether to store them as Binary Large Objects (BLOBs) in a database or as files in a file system. Both options have advantages and disadvantages; the best choice depends on the use case.

In the past, storing documents as BLOBs in a database was standard practice in the small data world. This method has some benefits, such as ensuring data integrity and consistency. It also makes managing access control and security at the database level easier. This is how the legacy system of Kampus Merdeka Platform handled files and documents.

However, as individual file sizes grow and the demands for scalability and reliability increase, storing files as BLOBs in the DB can cause issues. With a total DB size of 20 TB or more, the Kampus Merdeka platform suffered from performance degradation that slowed data retrieval, increased infrastructure cost, and reduced reliability. This was the case because the platform stored all documents and binary files in PostgreSQL DBs inside Cloud SQL.

To mitigate these issues, we simply moved all the documents and binary files into Google Cloud Storage (GCS), analogous to storing files in a common file system, and then stored an appropriate link in the database. In more technical terms, we retrieved the BLOB data from its table in the DB and decoded the base64 string into its original binary format. The original data was then uploaded to the target GCS bucket. Lastly, we stored links to the GCS objects as file paths in the corresponding DB tables.
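
As an illustration, here is a minimal sketch of that flow for a single row, assuming a hypothetical documents table with a base64-encoded file_content column and a new file_path column (table, column, and bucket names are illustrative, not the actual schema):

# Pick a unique object name, export one row's base64 content, decode it back to binary,
# upload it to GCS, and store the object path back in the DB
OBJECT_NAME="documents/$(uuidgen).pdf"

psql "$SRC_DB_URI" -At \
  -c "SELECT file_content FROM documents WHERE id = ${DOC_ID}" \
  | base64 -d > /tmp/document.bin

gsutil cp /tmp/document.bin "gs://${BUCKET_NAME}/${OBJECT_NAME}"

psql "$SRC_DB_URI" \
  -c "UPDATE documents SET file_path = '${OBJECT_NAME}' WHERE id = ${DOC_ID}"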

We wanted to ensure data integrity, so we used a Universally Unique Identifier (UUID) to generate a unique name for each file. This way, we prevent any files in the GCS bucket from being overwritten, even if they share the same original filename. During the transition phase, we modified the document insertion and update processes to make sure that future documents would be stored in GCS instead of the DB tables. We also added simple logic for document retrieval: retrieve from GCS first, or else fall back to the DB tables.
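
The transition-phase read path can be thought of as a simple fallback, sketched here with CLI tools for brevity (the real logic lives in the back-end services; names are illustrative):

# Try GCS first, otherwise fall back to the legacy base64 column in the DB
if gsutil -q stat "gs://${BUCKET_NAME}/${FILE_PATH}"; then
  gsutil cp "gs://${BUCKET_NAME}/${FILE_PATH}" /tmp/document.bin
else
  psql "$DB_URI" -At \
    -c "SELECT file_content FROM documents WHERE id = ${DOC_ID}" \
    | base64 -d > /tmp/document.bin
fi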

To protect our data from catastrophic interventions such as unauthorized access, data breaches, and unintentional changes, we set up strict access control and security on the GCS bucket.
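
For instance, a bucket can be locked down along these lines; the bucket and service account names below are made up for illustration:

# Enforce uniform bucket-level access so only IAM policies control access
gsutil uniformbucketlevelaccess set on gs://km-documents

# Prevent the bucket from ever being exposed publicly
gsutil pap set enforced gs://km-documents

# Keep previous object versions to recover from unintentional changes
gsutil versioning set on gs://km-documents

# Grant object access only to the back-end service account
gsutil iam ch \
  serviceAccount:km-backend@<PROJECT_ID>.iam.gserviceaccount.com:roles/storage.objectAdmin \
  gs://km-documents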

After completing all the changes above, we moved on to the next phase: testing. We tested the platform rigorously for some time, checking for any glitches or errors. The DB scale-down was not performed until we were confident everything was working smoothly. Finding no disruption after the testing period marked the completion of the first half of our journey.

Journey 2: Scaling Down DB Instances with Minimum Downtime

After the BLOB data migration to GCS was completed, we were ready to scale down the corresponding Kampus Merdeka DBs. The basic idea is twofold: replicate the DBs into new instances, excluding the BLOB data, and then switch over to the replicas.

The main challenge was to ensure minimum or near-zero downtime for the DB switchover process. After conducting a few experiments, we performed the following steps:

  1. Create new DBs with much lower capacity
  2. Replicate data to the new databases
  3. Adjust Kampus Merdeka back-end services to point to the new DBs and monitor the changes
  4. Delete the old DBs

We would like to highlight a few interesting technical insights.

Data Replication and DB Size Reduction

Data replication in this context means copying and synchronizing all data from the old DB instances to their new counterparts, except the BLOB data, since it had already been copied as files to GCS as mentioned above. We initialized the new instances with the smallest storage capacity and automatic storage increase enabled, and pre-created the users and schema, so that the final storage capacity would simply fit the data size. In other words, data replication is our means of reducing DB size.

DBs are replicated to the new Cloud SQL instances, excluding the BLOB data that had previously been migrated to Cloud Storage
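
On GCP, for example, a downsized destination instance with a small initial disk and automatic storage increase can be created along these lines (the instance name, database version, tier, and region are illustrative):

# Create the destination instance with the smallest practical disk;
# the disk grows automatically as the replicated data arrives
gcloud sql instances create km-db-new \
  --database-version=POSTGRES_13 \
  --tier=db-custom-16-102400 \
  --region=asia-southeast2 \
  --storage-size=10GB \
  --storage-auto-increase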

One may wonder why simply removing the BLOB data from the existing DBs was not enough. In fact, that was our first attempt to reduce the DB size. However, we found that the DB size did not shrink as expected. The reason is that big chunks of the BLOB data still remained in the pg_toast tables, since The Oversized-Attribute Storage Technique (TOAST) is applied by default in PostgreSQL whenever a row exceeds the size of a data block (commonly 8 KB). Routine vacuuming did not automatically reclaim this space after we removed the corresponding BLOB column.

A possible way to clean up pg_toast is VACUUM FULL. However, VACUUM FULL rewrites the table under an exclusive lock, which would have cost us an unacceptably long DB downtime. Therefore, we decided to pursue the size reduction through data replication.
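
The effect is easy to observe by comparing the size of the main table with the size of its TOAST table, which shows where the bulk of the storage actually sits (the table name below is illustrative):

# Compare the main table size with its associated TOAST table size
psql "$DB_URI" -c "
  SELECT c.relname,
         pg_size_pretty(pg_relation_size(c.oid))           AS table_size,
         pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_size
  FROM pg_class c
  WHERE c.relname = 'documents';"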

We found two feasible solutions for replicating the data from one database to another. The first option is pg_logical, a PostgreSQL extension that implements a logical replication mechanism between two DB servers: a publisher and a subscriber. Whenever a change is made on the source DB (the publisher), the destination DB (the subscriber) receives a copy of the replicated DB objects in near real time.
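
To illustrate the publisher/subscriber flow, here is a minimal sketch using PostgreSQL's built-in logical replication commands; pg_logical follows the same publish-and-subscribe idea with its own extension functions. Names and connection details are illustrative:

# On the source DB (publisher): publish the tables to be replicated
psql "$SRC_DB_URI" -c "CREATE PUBLICATION km_pub FOR TABLE <TBL_1>, <TBL_2>;"

# On the destination DB (subscriber): subscribe to the publication
psql "$DST_DB_URI" -c "CREATE SUBSCRIPTION km_sub
  CONNECTION 'host=<SRC_IP> dbname=<SRC_DB> user=<SRC_USER> password=<SRC_PASS>'
  PUBLICATION km_pub;"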

The second option is to utilize Bucardo, a free and open-source tool that also enables asynchronous PostgreSQL replication. We compared the pros and cons of both options.

Considering the pros and cons of each option, we use either pg_logical or Bucardo depending on the characteristics of the database (size, downtime tolerance, etc.). The reason pg_logical remains an option is its simplicity.

Here is an example of how to replicate data between two DB servers using Bucardo:

# Add source DB
bucardo add db <SRC_NAME> dbhost=<SRC_IP> dbuser=<SRC_USER> dbpass=<SRC_PASS> dbname=<SRC_DB>

# Add destination DB
bucardo add db <DST_NAME> dbhost=<DST_IP> dbuser=<DST_USER> dbpass=<DST_PASS> dbname=<DST_DB>

# Add all tables and sequences to be replicated
bucardo add table % db=<SRC_NAME> relgroup=<TBL_GROUP> --verbose
bucardo add sequence % db=<SRC_NAME> relgroup=<TBL_GROUP> --verbose

# Select only specific columns to be synchronized (e.g., to exclude the BLOB columns) - add as needed
bucardo add customcols <TBL_NAME> "select <col1>, <col2>, ..."

# Add sync information
bucardo add sync <DB_SYNC> relgroup=<TBL_GROUP> dbs=<SRC_NAME>,<DST_NAME> onetimecopy=2

# Start the Bucardo process
bucardo start

# Check sync status
bucardo status <DB_SYNC>

We applied the following settings to reduce technical complexities:

  • We deferred table constraints such as indexes and foreign keys until the initial table copy was completed to make the overall process faster.
  • Specific to GCP, we disabled point-in-time recovery during the initial table copy to avoid huge storage increases from the Write-Ahead Log (WAL) archive (see the example after this list).
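
For reference, this is roughly what the point-in-time recovery toggle looks like on Cloud SQL (the instance name is illustrative):

# Temporarily disable point-in-time recovery on the destination instance
gcloud sql instances patch km-db-new --no-enable-point-in-time-recovery

# ... run the initial Bucardo copy ...

# Re-enable point-in-time recovery once the initial copy has finished
gcloud sql instances patch km-db-new --enable-point-in-time-recovery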

With these settings, we successfully copied ~400 GB of data within ~4 hours. In this case, the source DB instance had 32 vCPUs, 200 GB of memory, and 10 TB of storage, while the destination DB instance had 16 vCPUs, 100 GB of memory, and only 10 GB of initial storage. Note that the memory and storage of the destination instance were set much lower than those of the source instance.

Switch-Over Process

At this stage, we have all the ingredients to execute the switch-over process: switching the back-end services of the Kampus Merdeka platform to the new, downsized DBs and then decommissioning the old ones. This process causes a temporary downtime for write operations.

Before executing the switch-over process, some configurations related to the user privileges and attributes in the source database need to be adjusted to enable minimum downtime and avoid data loss:

  1. Set one user that can alter other users’ privileges. This user must not be shared with other processes.
  2. Set only one user account that can run the replication process. This user can be the same as in point (1), but must not be shared with other processes.
  3. All users, except for the built-in cloudsql* users and the one(s) mentioned above, must be able to switch between read-only, read-write, and no-access modes (a sketch of this toggling follows the list).
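
A minimal sketch of this toggling, assuming a hypothetical application role app_rw and an admin connection string, could look like this:

# Put the application role into read-only mode before the switch-over
psql "$ADMIN_DB_URI" -c "ALTER ROLE app_rw SET default_transaction_read_only = on;"

# Existing sessions keep their old setting, so terminate them to enforce the change
psql "$ADMIN_DB_URI" -c "SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity WHERE usename = 'app_rw';"

# ... after the switch-over, restore read-write mode
psql "$ADMIN_DB_URI" -c "ALTER ROLE app_rw SET default_transaction_read_only = off;"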

The switch-over process can be done with a single script that follows these steps:

  1. Block the write access to the databases in the back-end services
  2. Deploy the services with the updated configuration. Write downtime starts here.
  3. Verify that Bucardo has copied all the data from the old databases to the new ones (a simple spot check is sketched after this list). There should be no new data written to the databases at this point.
  4. Unblock the write access to the databases in the back-end services.
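
One simple way to do the verification in step 3 is a row-count spot check on a few key tables against both databases, combined with Bucardo's own status report (table and sync names are illustrative):

# Row counts on a key table should match once Bucardo has caught up
OLD_COUNT=$(psql "$OLD_DB_URI" -At -c "SELECT count(*) FROM <TBL_NAME>")
NEW_COUNT=$(psql "$NEW_DB_URI" -At -c "SELECT count(*) FROM <TBL_NAME>")

if [ "$OLD_COUNT" = "$NEW_COUNT" ]; then
  echo "Replication caught up for <TBL_NAME>; safe to unblock writes"
else
  echo "Still behind: old=$OLD_COUNT new=$NEW_COUNT"
  bucardo status <DB_SYNC>
fi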

It is worth noting that the total downtime for the write operations is roughly equal to the time it takes to deploy the updated services.

If issues are found during the monitoring step, we can go back to the old databases by performing the same switch-over steps in the reverse direction, i.e., moving from the new databases back to the old ones, as a rollback strategy without data loss.

After ensuring the system is working fine without any reliability issues, we can stop the Bucardo replication and remove the old database instances. This marks the end of our journey to improve the efficiency of Kampus Merdeka platform in utilizing cloud resources.

Results

The migration of document files from the RDBMS to cloud storage has had a significant impact on the Kampus Merdeka platform. It successfully reduced the storage size of the DBs in Cloud SQL from ~23 TB to ~4 TB, which led to a decrease in cloud costs of more than 50%.

Cloud operational cost reduction up to 50% after the file migration to cloud storage

Since the Kampus Merdeka platform has kept running smoothly after the migration, it shows that the platform can indeed be more economical in utilizing cloud resources. Therefore, we are confident that the Kampus Merdeka platform will operate within our yearly budget allocation for its infrastructure cost.

Acknowledgment

Special thanks to all product engineers in GovTech Edu’s Kampus Merdeka team, especially Yulistiyan Wardhana (who graduated from GovTech Edu in 2022 and now serves as a Senior Software Engineer at GoPay Indonesia), as well as the Cloud Infra team, especially Muhammad Saiful Islam and Calvin Aditya Jonathan, for their substantial technical contributions. Without their support, the migration and cost reduction would not have been possible.
