Building and using an inventory of your Google Cloud Storage objects

Dom Zippilli
Google Cloud - Community
Jan 8, 2020

Disclaimer

I am a Googler, and I work in Google Cloud specifically. All opinions stated here are my own, not those of Google, LLC.

Introduction

It might surprise you to hear this, but a lot of object storage customers don’t really know what data they’ve got in their buckets. Just as you find surprises when you search the remote corners of your desktop filesystem, they find things they didn’t expect once they start looking. In the case of object storage, this can lead to wasteful storage charges at best, and leaked sensitive data at worst. Often, customers know they have this problem, but they don’t know enough about their data to even start cleaning up.

If this sounds like a problem you might have, there’s good news. By building an inventory of your objects that is queryable with a powerful database like BigQuery, you can easily know what you’ve got, and make smart decisions about what to do with it. And you don’t need to write any code to get started.

How To Do This

Have a Google Cloud Platform account and stored objects

Most people who would want to go down this road already have a GCP account, but if this is part of an evaluation of cloud platforms, or you’re thinking about using BigQuery for an inventory of another object storage offering, it’s very easy to get started with GCP. As for the charges you can expect: storage has ongoing monthly costs based on how much you store. It also has operations charges, but those will be negligible in this case. BigQuery likewise has ongoing monthly storage costs, and queries are pay-for-what-you-use.

For most users, the BigQuery storage and query costs will be relatively low compared to object storage costs. If you’re just trying this out, a few objects is all you need — just upload a few photos in the UI.

Clone the repo with the tool I wrote for this and run it

Dramatization of me, saving you time and effort, hopefully.

To hopefully save you a bunch of time and effort, I went ahead and wrote a command line utility that lists objects and streams them into BigQuery. It should run at about 1,000,000 objects every 150 seconds on a decent-sized (~32 vCPU) VM. I recommend running it from a GCE VM for the best throughput and to avoid egress charges, but it will work from home or on-premises, too.

The utility is written for Python 3.6. The operating principle is pretty simple. Up to two buckets are listed at a time, and each page of the bucket listing (about 1,000 objects) is dispatched to a separate thread in a pool. This lets the utility ingest lots of object information very quickly and then stream it to BigQuery.
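
If you’re curious what that looks like in code, here’s a minimal sketch of the same idea (not the actual tool) using the google-cloud-storage and google-cloud-bigquery client libraries. The table name and row fields are placeholders; the real utility adds configuration, batching, and error handling:

from concurrent.futures import ThreadPoolExecutor
from google.cloud import bigquery, storage

TABLE = "project.dataset.table"  # placeholder destination table

gcs = storage.Client()
bq = bigquery.Client()

def stream_page(blobs):
    # Turn one page of object metadata into rows and stream them to BigQuery.
    rows = [{
        "bucket": b.bucket.name,
        "name": b.name,
        "size": b.size,
        "storageClass": b.storage_class,
        "crc32c": b.crc32c,
    } for b in blobs]
    errors = bq.insert_rows_json(TABLE, rows)
    if errors:
        raise RuntimeError(errors)

def inventory_bucket(bucket_name):
    # Each page (~1,000 objects) goes to a worker thread while the main
    # thread keeps paging through the listing.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for page in gcs.list_blobs(bucket_name).pages:
            pool.submit(stream_page, list(page))

inventory_bucket("my-bucket")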

By default, the utility will gather inventories of all buckets in the specified project. If you have a lot of objects in your project, you can provide specific buckets to “shard” the listing work by bucket. You can also use this feature to list buckets across multiple projects.

You can also provide a prefix to filter the listing, if you need to shard the list of a single very large bucket.
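
To make that concrete, this is roughly what prefix “sharding” means at the API level (a hypothetical illustration with the client library; the tool exposes this through its own options, documented in the README):

from google.cloud import storage

gcs = storage.Client()

# Hypothetical split of one very large bucket's namespace by prefix, so each
# slice can be listed by a separate run, worker, or VM.
for prefix in ("logs/2019/", "logs/2020/"):
    count = sum(1 for _ in gcs.list_blobs("my-huge-bucket", prefix=prefix))
    print(prefix, count)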

Follow the steps in README.md to install and configure the tool. For most users, it should suffice to just run it without any options:

gcs_inventory load

You should get output that informs you of progress by object count. Since the utility doesn’t know ahead of time how many objects you have, it can’t show a percentage complete, but you can get an idea of how many you’ve got by examining the storage/object_count Stackdriver metric.
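
If you’d rather check that metric programmatically than in the console, something like this should work with the Cloud Monitoring client library (a sketch; the project ID is a placeholder, and object_count is only sampled about once per day):

import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 24 * 60 * 60}, "end_time": {"seconds": now}}
)
results = client.list_time_series(
    request={
        "name": "projects/my-project",  # placeholder project ID
        "filter": 'metric.type = "storage.googleapis.com/storage/object_count"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    # One time series per bucket; points are typically returned newest first.
    print(series.resource.labels["bucket_name"], series.points[0].value.int64_value)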

Finally, open up BigQuery and HAVE FUN

Arvind, it’s time to glean insights from our data with SQL!

Once the utility is done running, you can start querying data (well, technically you can do so before it’s done running, but the data will be incomplete). Here’s some BigQuery SQL that will help you find out interesting stuff about your GCS objects. And to me, at least, finding out interesting stuff is always fun.

What are my top 10 largest objects, in GB? This query will show you.

SELECT
  name,
  ROUND(size / 1000 / 1000 / 1000, 1) AS sizeGB,
  storageClass
FROM `project.dataset.table`
ORDER BY size DESC
LIMIT 10

From this, I can see I have a bunch of 2GB files lying around in what happens to be my development project:

name       sizeGB storageClass
<redacted> 2.2 NEARLINE
<redacted> 2.2 NEARLINE
<redacted> 2.1 NEARLINE
<redacted> 2.1 NEARLINE
<redacted> 2.1 NEARLINE
[...]

Perhaps these objects aren’t useful anymore? They’re taking up a lot of space, and now I know, so I can look into it.

Do I have duplicate objects? Use each object’s MD5 or CRC32C checksum to detect duplicate data. In this example, I’ll use CRC32C because all objects have one (composed objects don’t have an MD5).

SELECT
  crc32c,
  size,
  COUNT(crc32c) AS copies
FROM `project.dataset.table`
GROUP BY crc32c, size
ORDER BY copies DESC

From this, I can see I have a bunch of objects with the same size and checksum — these are almost definitely duplicates.

crc32c   size     copies
IlUZiA== 51130562 12
j0uDyA== 51130562 12
OVgmZA== 51130562 12
i2S8Ww== 1073741792 12
i2pspw== 51130562 12
RnLi+Q== 51130562 10
[...]

Now, I’ll perform an inner join between this query and my inventory to produce a “hit list” of duplicate objects:

SELECT
  object.name,
  object.crc32c,
  object.size
FROM
  `project.dataset.table` AS object
INNER JOIN (
  SELECT
    crc32c,
    COUNT(crc32c) AS copies
  FROM
    `project.dataset.table`
  GROUP BY crc32c
  ORDER BY copies DESC
) AS checksums
ON
  object.crc32c = checksums.crc32c AND checksums.copies > 1
ORDER BY object.size DESC

Now I have a set of objects that I can investigate further and consider for de-duplication:

name       crc32c   size
<redacted> z/SBzA== 2150214565
<redacted> z/SBzA== 2150214565
<redacted> L4Z8vg== 2148600803
<redacted> L4Z8vg== 2148600803
<redacted> pb4tWw== 2147908484
<redacted> pb4tWw== 2147908484
<redacted> Mi8Jnw== 2145435876
<redacted> Mi8Jnw== 2145435876
<redacted> rs+BzQ== 2142987458
<redacted> rs+BzQ== 2142987458
[...]

Do I have data where it shouldn’t be? I can use object metadata to look for files that probably shouldn’t be in, for example, a publicly-accessible bucket:

SELECT
  name,
  bucket
FROM `project.dataset.table`
WHERE (SUBSTR(name, LENGTH(name) - 2) IN (".db") OR
       SUBSTR(name, LENGTH(name) - 3) IN (".csv", ".txt") OR
       SUBSTR(name, LENGTH(name) - 4) IN (".json"))
  AND bucket = "very-public-bucket"

A malicious actor could easily sidestep this, but such a periodic check could catch honest mistakes… and a lot of big problems start with those.
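
If you want to make the check periodic, a small script along these lines (a sketch; table and bucket names are placeholders) could run from cron or Cloud Scheduler and complain whenever the query returns rows:

from google.cloud import bigquery

bq = bigquery.Client()

QUERY = """
SELECT name, bucket
FROM `project.dataset.table`
WHERE (SUBSTR(name, LENGTH(name) - 3) IN (".csv", ".txt") OR
       SUBSTR(name, LENGTH(name) - 4) IN (".json"))
  AND bucket = "very-public-bucket"
"""

rows = list(bq.query(QUERY).result())
if rows:
    print(f"WARNING: found {len(rows)} objects that may not belong here:")
    for row in rows:
        print(f"  gs://{row['bucket']}/{row['name']}")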

What’s Next

Stay tuned for more articles describing clever things you can do with this information and similar data sources to be smart about managing your data in GCS.

Also, a few improvements for the metadata loader are in sight. The top one is including ACLs and IAM bindings, but I’m still at the “feasibility” step.

Known Issues

One known issue, which you’re more likely to run into if you’re developing rather than just using this utility: if you delete and recreate a table, then begin streaming into it shortly afterward, some of the records might not make it into the new table. I’m not sure how long a deleted table has this effect, so you may want to load new inventories into a new table each time you run this.
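
One simple way to do that is to create a date-stamped table for each run and point the tool’s configuration at it. Here’s a sketch with the BigQuery client library; the schema shown is illustrative, not necessarily the tool’s exact schema:

from datetime import date

from google.cloud import bigquery

bq = bigquery.Client()

# Illustrative schema only; match it to whatever your inventory rows contain.
schema = [
    bigquery.SchemaField("bucket", "STRING"),
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("size", "INTEGER"),
    bigquery.SchemaField("storageClass", "STRING"),
    bigquery.SchemaField("crc32c", "STRING"),
]

table_id = f"project.dataset.inventory_{date.today():%Y%m%d}"
bq.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)
print(f"Load this run's inventory into {table_id}")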

Caveats

  1. It’s already stated in the license file, but all the code in the gcs-inventory-loader repo is provided AS-IS, with no guarantees whatsoever. This utility is not supported by Google. Make sure you understand the code thoroughly before you run it so that you know it won’t cause problems for you.
  2. The inventory this creates is not a snapshot. The difference is subtle, but important. A “snapshot” would refer to an inventory of GCS metadata that’s consistent for a particular point in time. The list API doesn’t work that way; each page is a portion of live metadata, so as you progress through the pages the rest of the metadata may change. Objects may be written that you miss, or deleted that show up in your inventory. Producing a consistent snapshot is technically possible, but the approach is completely different and requires significant groundwork right now. That’s a topic for another day.
Inside a Google data center. I think I see your objects!

