How to validate files uploaded to GCS via Transfer Appliance

Yunus Durmuş
Google Cloud - Community
2 min read · Jul 20, 2023

When you upload files to GCS via Transfer Appliance, it verifies the uploaded data in the background. Nevertheless, issues may still occur while copying data onto the appliance itself, so it is good practice to validate all the files that have been migrated to your GCS buckets at the end.

Transfer Appliance is a great option for uploading large amounts of data to the cloud

Methodology

We need to check the integrity of the files with a hash. GCS provides hashes as object metadata in two flavors: MD5 and CRC32C. MD5 has a lower hash-collision probability, but GCS doesn't create an MD5 digest for large files, as they are stored in multiple pieces in the backend. CRC32C is available for every object and is good enough to check integrity. You can spot-check the digests by hand, as shown below.
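A minimal sketch of such a spot check, assuming gsutil is installed and that a hypothetical object gs://my-bucket/path/file.bin has a local copy file.bin:

# compute the CRC32C of the local copy (printed as a base64 digest)
gsutil hash -c file.bin
# print the object's metadata, including its "Hash (crc32c)" line
gsutil stat gs://my-bucket/path/file.bin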

Steps

1. Create an inventory of the GCS bucket. GCS has an inventory report feature that produces a report with object names and their CRC32C digests. The minimum report frequency is daily, and running it once is enough. Download the report to your local machine for the later steps, as sketched below.
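The report lands in a destination bucket as one or more CSV shards. A minimal sketch for fetching them, assuming gs://my-inventory-bucket/reports/ is your (hypothetical) report destination:

# download every CSV shard of the inventory report to the current directory
gcloud storage cp 'gs://my-inventory-bucket/reports/*.csv' .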

2. Create hashes of the files locally. rhash is a great utility for generating them. You can install it with your system's package manager; on macOS: brew install rhash

# the command below computes base64-encoded CRC32C digests recursively
# for a directory and writes them to a file
rhash --crc32c --base64 -r --simple -o hashes.tsv .

You can create separate hash files for different locations and merge them afterwards (see the sketch below). In the next step you will see how to load them into a data warehouse for comparison.
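If you did produce several hash files, it may be simplest to merge them before loading. A minimal sketch, assuming per-location files named hashes_*.tsv (a hypothetical naming scheme):

# merge per-location hash files into one file for the comparison step
cat hashes_*.tsv > hashes.tsv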

3. Compare hashes. We will compare the two CSVs by loading them into a data warehouse. I use DuckDB for simplicity.

brew install duckdb
# open the DuckDB CLI:
duckdb
-- create a table from the on-premise hashes
CREATE TABLE onpremise_hashes(crc32c_hash VARCHAR, filename VARCHAR);
COPY onpremise_hashes FROM 'hashes.tsv' (HEADER FALSE, DELIMITER ' ');
-- verify that they are loaded correctly
SELECT * FROM onpremise_hashes LIMIT 3;
-- create a table from the GCS inventory; adjust the schema based on your export
CREATE TABLE gcs_inventory(project BIGINT, bucket VARCHAR, name VARCHAR, location VARCHAR, timeCreated TIMESTAMP, crc32c VARCHAR);
COPY gcs_inventory FROM 'gcs_inventory.csv' (AUTO_DETECT TRUE, HEADER TRUE);
SELECT * FROM gcs_inventory LIMIT 3;
-- after this part, you should start playing with the tables to see the differences
-- for instance, list every local file alongside its GCS match (if any)
SELECT * FROM onpremise_hashes AS o
LEFT JOIN gcs_inventory AS g
  ON g.name = o.filename AND g.crc32c = o.crc32c_hash;
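The most useful query is the one that lists local files with no matching name-and-digest pair in GCS. A minimal sketch building on the tables above; note that rhash prefixes paths with ./ when run against the current directory, while GCS object names usually carry no such prefix, so the join below normalizes it (the regexp reflects an assumption about your path layout):

-- files that are missing from GCS or whose digest differs;
-- an empty result means everything validated
SELECT o.filename, o.crc32c_hash
FROM onpremise_hashes AS o
LEFT JOIN gcs_inventory AS g
  ON g.name = regexp_replace(o.filename, '^\./', '')  -- strip rhash's leading "./"
 AND g.crc32c = o.crc32c_hash
WHERE g.name IS NULL;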
