Accurately calculate the sizes of Google Cloud Storage buckets over time and graph them using Cloud Data Studio
At eBay, we have leveraged Google Cloud Platform for about three years, and in that time we have amassed a huge amount of data, including our AI training data, our inventory data (including images) and tracking data.
Most of this data is held in Google Cloud Storage buckets, which are both cheap and versatile. However, it is not easy to see how much data is actually present in these buckets, or to graph the bucket storage sizes over time, without introducing some additional tooling. For a deeper understanding of why the traditional tooling, including Stackdriver and gsutil, does not work, I would encourage reading this blog entry. The gist is that Stackdriver provides a very crude estimate which is often not even close to the actual bucket size, and gsutil does not scale to petabyte-scale buckets.
To get around this problem, we developed a tool to monitor the sizes of Google Cloud Storage buckets over time. The first step is to enable access logs & storage logs for the buckets you want to monitor.
Access Logs & Storage Logs
A way to get a daily report of your bucket’s statistics is the Access Logs & Storage Logs for Google Cloud Storage. Google Cloud Storage offers access logs and storage logs in the form of CSV files that you can download and view. Access logs provide information for all of the requests made on a specified bucket and are created hourly, while the daily storage logs provide information about the storage consumption of that bucket for the last day.
Enable Access & Storage Logs
This is a sample command to enable access & storage logs on the sample-bucket bucket:

gsutil logging set on -b gs://storage-logs-bucket -o n_project_sample_bucket gs://sample-bucket

The above command stores the CSV log files in the storage-logs-bucket bucket, with each file name prefixed with n_project_sample_bucket.
Cloud Storage Analyzer Job
Once the access & storage logs have been enabled on the bucket:
- CSV files with the _usage substring are added to the storage bucket on every access.
- CSV files with the _storage substring are added at the end of each day with the bucket size details.
For very busy buckets with tens of thousands of accesses per second, the usage files can get really big (each being hundreds of MBs), and we don't really need them for determining the size.
We are only interested in the storage files, which contain the bucket size. Each storage file is a CSV with two fields: the bucket name and storage_byte_hours. To get the actual size (in bytes), divide storage_byte_hours by 24.
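As a sketch, here is how one of these daily storage log files could be parsed. The sample CSV content follows the documented two-field storage log format; the actual numbers are made up for illustration:

```python
import csv
import io

# Sample contents of a daily _storage log file: the bucket name and
# the byte-hours of storage consumed over that day.
storage_log = '"bucket","storage_byte_hours"\n"sample-bucket","5678187743232"\n'

for row in csv.DictReader(io.StringIO(storage_log)):
    byte_hours = int(row["storage_byte_hours"])
    size_bytes = byte_hours // 24  # 24 hours in a day
    size_gib = size_bytes / 2**30
    print(f'{row["bucket"]}: {size_bytes} bytes (~{size_gib:.1f} GiB)')
```

Dividing by 24 works because storage_byte_hours accumulates one sample of the bucket's byte count per hour, so the quotient is the average size over the day.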
Google Cloud Storage Analyzer Job
The code for this job is here: ptagr/google-cloud-storage-analyzer (GitHub).
On each run:
- It deletes all the _usage access files from the storage bucket.
- It reads the _storage storage files and pushes a summary CSV file to the same storage bucket.
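Assuming the job is a straightforward script built on the google-cloud-storage Python client, its per-run logic might look like the sketch below. The summarisation step is factored into a pure function so it can be exercised without touching GCS; the function names, summary columns, and summary.csv object name are my own, not taken from the repository:

```python
import csv
import io

def summarize_storage_logs(storage_csvs):
    """Given {log_file_name: csv_text} for the _storage log files,
    return (log_file_name, bucket, size_in_bytes) rows."""
    rows = []
    for name, text in sorted(storage_csvs.items()):
        for rec in csv.DictReader(io.StringIO(text)):
            # storage_byte_hours / 24 gives the average size in bytes
            # over the day the log covers.
            rows.append((name, rec["bucket"], int(rec["storage_byte_hours"]) // 24))
    return rows

def run_once(client, logs_bucket_name):
    """One pass of the analyzer job. `client` is a
    google.cloud.storage.Client (requires google-cloud-storage)."""
    bucket = client.bucket(logs_bucket_name)
    storage_csvs = {}
    for blob in list(bucket.list_blobs()):
        if "_usage" in blob.name:
            blob.delete()  # access logs are large and not needed for sizing
        elif "_storage" in blob.name:
            storage_csvs[blob.name] = blob.download_as_text()
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["file", "bucket", "size_bytes"])
    writer.writerows(summarize_storage_logs(storage_csvs))
    bucket.blob("summary.csv").upload_from_string(out.getvalue())
```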
Note that the job goes through all the storage files and recreates the summary file on every run; this could be improved by processing only new files incrementally.
In our case, the job runs every 30 minutes as a Kubernetes CronJob.
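A minimal CronJob manifest for that schedule might look like the following; the job name and container image are placeholders, not taken from the source:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: storage-analyzer            # hypothetical name
spec:
  schedule: "*/30 * * * *"          # every 30 minutes
  concurrencyPolicy: Forbid         # skip a run if the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: analyzer
            image: gcr.io/example-project/storage-analyzer:latest  # placeholder
          restartPolicy: OnFailure
```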
Cloud Data Studio
You can create a dashboard in Cloud Data Studio and add a data source that points to the summary file. This lets you graph the sizes of the monitored buckets over time.
This setup is inspired by this blog.