How to save terabytes of cloud storage
A simple way to periodically clean up Google CloudSQL and Google Cloud Storage data
Whatever cloud provider a company may use, costs are always a factor that influences decision-making and the way software is written. As a consequence, almost any approach that helps save costs is likely worth investigating.
At Vendasta, we use the Google Cloud Platform. The software we maintain, design, and create often makes use of Google Cloud Storage and Google CloudSQL; the team I am part of is no exception. The team works on a product called Website, a platform for website hosting. The product has two editions: Express (free, limited functionality) and Pro (paid, full functionality). Among the features the product offers are periodic backups of websites; these backups are archived as tar files. Each backup archive has an accompanying record stored in CloudSQL, which contains metadata such as the timestamp of the archive.
Recently, the team wanted to design and implement an approach to deleting old backups in order to reduce costs. The requirements were:
1. Delete archives and records older than 1 year;
2. For the Express edition, keep archives and records created in the last 10 days; in addition, keep the backups from the beginning of each month for the last year;
3. For the Pro edition, keep archives and records created in the last 60 days; in addition, keep the backups from the beginning of each month for the last year;
4. Perform steps 1–3 every day for all the sites the product hosts.
We are going to walk through how each requirement was satisfied, and use Go code, along with an SQL statement, to demonstrate how to execute the steps.
Google Cloud Storage offers object lifecycle management, which lets users set rules such as deleting bucket objects, or changing their storage class, once they reach a certain age. While this works for many use cases, it did not satisfy our requirements, because our retention rules differ per product edition. In addition, transitioning the whole platform to a different naming scheme for backups (the current scheme carries no notion of edition) just to be able to use lifecycle management would have been an enormous amount of work. As a consequence, the team went ahead with a custom solution that is extensible: additional custom cleanups can be added to it in the future.
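For context, here is roughly what an age-based lifecycle rule looks like with the Go client library; the bucket name is hypothetical. A rule like this can only reason about object attributes such as age, not about which edition a backup belongs to:

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// A single age-based rule: delete every object older than 365 days.
	// Lifecycle conditions know nothing about our Express/Pro editions,
	// which is why this approach could not satisfy the requirements.
	// "site-backups" is a hypothetical bucket name.
	_, err = client.Bucket("site-backups").Update(ctx, storage.BucketAttrsToUpdate{
		Lifecycle: &storage.Lifecycle{
			Rules: []storage.LifecycleRule{{
				Action:    storage.LifecycleAction{Type: storage.DeleteAction},
				Condition: storage.LifecycleCondition{AgeInDays: 365},
			}},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```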
The first thing that was implemented was the part that iterates over all the sites, and creates tasks for archive and record cleanups. We chose Golang Machinery, a library for asynchronous job scheduling and processing. Our team often uses Memorystore, which was the storage layer of choice for our use of Machinery. The scheduling was performed by an internal project that the team uses for scheduling other types of work:
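What follows is a minimal sketch of that scheduling code, with the assumptions called out: robfig/cron stands in for the internal scheduler's daily trigger, and listSites, the site struct, and the Memorystore URL are placeholders.

```go
package cleanup

import (
	"fmt"
	"log"

	"github.com/RichardKnop/machinery/v1"
	"github.com/RichardKnop/machinery/v1/config"
	"github.com/RichardKnop/machinery/v1/tasks"
	"github.com/robfig/cron/v3"
)

// site is a placeholder for the platform's site model.
type site struct {
	ID      string
	Edition string // "express" or "pro"
}

// listSites is a stub; the real implementation pages through every site
// the product hosts.
func listSites() ([]site, error) { return nil, nil }

// NewServer connects Machinery to Memorystore (Redis), which acts as the
// broker for the cleanup tasks. The URL is a placeholder.
func NewServer() (*machinery.Server, error) {
	return machinery.NewServer(&config.Config{
		Broker:        "redis://10.0.0.1:6379",
		DefaultQueue:  "backup_cleanup",
		ResultBackend: "redis://10.0.0.1:6379",
	})
}

// ScheduleBackupCleaningTasks iterates over all hosted sites and enqueues
// one archive cleanup task and one record cleanup task per site.
func ScheduleBackupCleaningTasks(server *machinery.Server) error {
	sites, err := listSites()
	if err != nil {
		return err
	}
	for _, s := range sites {
		for _, name := range []string{"BackupArchiveCleanup", "BackupRecordCleanup"} {
			signature := &tasks.Signature{
				Name: name,
				Args: []tasks.Arg{
					{Type: "string", Value: s.ID},
					{Type: "string", Value: s.Edition},
				},
			}
			if _, err := server.SendTask(signature); err != nil {
				return fmt.Errorf("enqueue %s for site %s: %w", name, s.ID, err)
			}
		}
	}
	return nil
}

// StartDailyCleanupCron triggers the scheduling function once a day
// (robfig/cron stands in for the team's internal scheduling project).
func StartDailyCleanupCron(server *machinery.Server) (*cron.Cron, error) {
	c := cron.New()
	_, err := c.AddFunc("@daily", func() {
		if err := ScheduleBackupCleaningTasks(server); err != nil {
			log.Printf("scheduling cleanup tasks: %v", err)
		}
	})
	if err != nil {
		return nil, err
	}
	c.Start()
	return c, nil
}
```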
The code above fulfills three main objectives:
- Creates a cron job for daily task creation;
- Iterates over all the sites;
- Creates the archive and record cleaning tasks.
The second part of the implementation was to design and create the software that consumes the scheduled cleaning tasks. For this part, the team constructed an SQL query that selects all the records that have to be deleted and satisfies the constraints imposed by the edition of the product: Express or Pro.
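A sketch of those consumers follows, continuing the same package. The table and column names (backups, site_id, created), the bucket name, and the object layout are all assumptions standing in for the real schema; the query mirrors the logic explained after the code sample.

```go
package cleanup

import (
	"context"
	"database/sql"
	"time"

	"cloud.google.com/go/storage"
	"google.golang.org/api/iterator"
)

// deleteOldRecords holds the cleanup query. Table and column names
// ("backups", "site_id", "created") are placeholders for the real schema.
const deleteOldRecords = `
DELETE FROM backups
WHERE site_id = ?
  AND (
    -- Requirement 1: nothing is kept past one year.
    created < DATE_SUB(NOW(), INTERVAL 1 YEAR)
    OR (
      -- Requirements 2 and 3: the edition-specific cutoff (10 or 60 days)...
      created < DATE_SUB(NOW(), INTERVAL ? DAY)
      -- ...except the first backup of each month.
      AND created NOT IN (
        SELECT first_of_month FROM (
          SELECT MIN(created) AS first_of_month
          FROM backups
          WHERE site_id = ?
          GROUP BY TIMESTAMPDIFF(MONTH, created, NOW())
        ) AS monthly
      )
    )
  )`

// BackupRecordCleanup deletes the CloudSQL backup records for one site.
func BackupRecordCleanup(ctx context.Context, db *sql.DB, siteID, edition string) error {
	retentionDays := 10 // Express
	if edition == "pro" {
		retentionDays = 60
	}
	_, err := db.ExecContext(ctx, deleteOldRecords, siteID, retentionDays, siteID)
	return err
}

// BackupArchiveCleanup deletes a site's tar archives from Cloud Storage.
// One plausible shape, shown here with a simple age cutoff for brevity:
// list the site's objects and delete those older than the cutoff. The
// bucket name and prefix layout are assumptions.
func BackupArchiveCleanup(ctx context.Context, client *storage.Client, siteID string, cutoff time.Time) error {
	bucket := client.Bucket("site-backups")
	it := bucket.Objects(ctx, &storage.Query{Prefix: siteID + "/"})
	for {
		attrs, err := it.Next()
		if err == iterator.Done {
			return nil
		}
		if err != nil {
			return err
		}
		if attrs.Created.Before(cutoff) {
			if err := bucket.Object(attrs.Name).Delete(ctx); err != nil {
				return err
			}
		}
	}
}
```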
Note that the above code sample is incomplete. To make it a completely workable example, the Machinery task consumer has to be implemented, and BackupArchiveCleanup, along with BackupRecordCleanup, must be set as the functions that process the tasks created by ScheduleBackupCleaningTasks from the previous example. For this code sample, let's focus on the SQL query, as it was the part that was most heavily tested by the team (we do not have backups of backups, and testing is important in general).
The query scans the CloudSQL table that contains the backup records (internalDatabaseName) and deletes records with a timestamp older than a maximum date cutoff (the one-year requirement). Records that are not older than one year are deleted conditionally: their timestamp must be less than the minimum date cutoff (60 or 10 days, for Pro and Express respectively), and it must not appear in the specified sub-query. The sub-query uses SQL's TIMESTAMPDIFF to group timestamps by their difference in months, which results in the first backup of each month being selected (see the MySQL documentation).
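To tie everything together, the Machinery consumer registers the two functions, wrapped so that they only receive the serializable task arguments, and launches a worker. Continuing the same sketch:

```go
// RunWorker registers the cleanup functions under the task names used by
// ScheduleBackupCleaningTasks, then starts consuming tasks from Memorystore.
func RunWorker(server *machinery.Server, db *sql.DB, gcs *storage.Client) error {
	err := server.RegisterTasks(map[string]interface{}{
		// Machinery passes the string args from each task signature; the
		// database and storage clients are captured in the closures.
		"BackupRecordCleanup": func(siteID, edition string) error {
			return BackupRecordCleanup(context.Background(), db, siteID, edition)
		},
		"BackupArchiveCleanup": func(siteID, edition string) error {
			// The one-year cutoff from requirement 1; edition-specific
			// handling for archives is elided in this sketch.
			return BackupArchiveCleanup(context.Background(), gcs, siteID, time.Now().AddDate(-1, 0, 0))
		},
	})
	if err != nil {
		return err
	}
	// "backup-cleanup-worker" is the consumer tag; 10 is the concurrency.
	return server.NewWorker("backup-cleanup-worker", 10).Launch()
}
```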
The overall architecture of the cleanup process is straightforward: a service that creates cleanup tasks and one that processes them, both backed by the same underlying storage layer (Memorystore).
Results
The team started the development outlined in this blog with the intent of saving storage costs. To measure that, there are two options:
- Use gsutil du to measure the total bytes used by specific buckets; however, because of the size of the buckets the team uses, this tool is not recommended for measuring total bytes used;
- Use Cloud Monitoring to get the total bytes used by GCS buckets.
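For example, the first option boils down to a single command (bucket name hypothetical); the relevant Cloud Monitoring metric for the second is storage.googleapis.com/storage/total_bytes:

```
# -s prints only the per-bucket summary; slow on very large buckets.
gsutil du -s gs://site-backups
```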
After testing for approximately one week in a demo environment, on a subset of websites (again, no backups of backups), the code was released to production. The results are outlined in the following chart:
The chart shows the total bytes, over the past six weeks, that our team is currently using in its main Google Cloud Platform project where backups are stored. The chart contains two lines because of the two storage classes the team uses, NEARLINE and COLDLINE. Throughout July and the beginning of August, the chart shows a fast rate of increase in storage use of approximately 40TB. After releasing the cleanup process to the production environment, we noticed a much more sustainable rate of growth in total bytes used. This has resulted in approximately $800/month in cost savings. While not much for a company the size of Vendasta, it frees the team from periodically performing manual backup cleanup, and automates a process that results in significant cost savings long-term. In addition, the process can be replicated to clean up other types of data. As a note, you may have noticed that the team did not measure the cost savings from deleting CloudSQL records: the deleted table rows amounted only to megabytes of space, which is not significant compared to the storage savings from deleting archives.
Lastly, it is possible that your company's data retention policy does not allow you to delete data at all. In that case, I suggest using the Archive storage class, which was released this year and provides the most affordable option for long-term data storage.
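If that applies to you, a lifecycle rule can transition old objects to Archive instead of deleting them; a sketch reusing the client and ctx from the earlier lifecycle example:

```go
// Move objects older than a year to the ARCHIVE storage class instead of
// deleting them. The bucket name is again hypothetical.
_, err = client.Bucket("site-backups").Update(ctx, storage.BucketAttrsToUpdate{
	Lifecycle: &storage.Lifecycle{
		Rules: []storage.LifecycleRule{{
			Action: storage.LifecycleAction{
				Type:         storage.SetStorageClassAction,
				StorageClass: "ARCHIVE",
			},
			Condition: storage.LifecycleCondition{AgeInDays: 365},
		}},
	},
})
if err != nil {
	log.Fatal(err)
}
```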
Vendasta helps small and medium-sized businesses around the world manage their online presence. A website is pivotal for managing an online presence and doing eCommerce. My team works on a website-hosting platform that faces numerous interesting scalability (millions of requests per day), performance (loading speed), and storage (read and write speed) problems. Consider reaching out to me, or applying to work at Vendasta, if you'd like a challenge writing Go and working with Kubernetes, NGINX, Redis, and CloudSQL!