Data validation made easy with Google Cloud’s Data Validation Tool (DVT)

Thakurswati
Google Cloud - Community
5 min read · May 15, 2023

Data migration is like moving to a new house — it requires careful planning, organization, and attention to detail. Just as you would double-check your belongings before moving, data validation plays a vital role in the migration process, providing assurance that your valuable data makes the move securely and error-free.

In this article, we’ll explore the different methods to execute data validation using Google Cloud’s Data Validation Tool (DVT), both locally and through GCS, and how it can help make your migration journey smoother.

Data validation is not just a one-time activity in the migration process; it’s an ongoing journey towards data excellence. But it can be a daunting task that takes up valuable time and resources.

That’s where Google Cloud’s data validation tool (DVT) comes in. Whether you’re migrating databases, applications, or other workloads to the cloud, DVT can help streamline your data validation process.

What is DVT?

DVT is an open-source Python CLI tool that provides an automated and repeatable solution for validation across different environments. The tool uses the Ibis framework to connect to a large number of data sources, including BigQuery, Cloud Spanner, Cloud SQL, Teradata, and more.

DVT supports the following types of validations:

  • Column validation (count, sum, avg, min, max, group by)
  • Row validation (BQ, Hive, Teradata, Oracle, SQL Server, Postgres, MySQL only; see the example after this list)
  • Schema validation
  • Custom Query validation
  • Ad hoc SQL exploration
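
As a quick example, once a source and target connection exist (they are created in the steps below), a row validation might look like the command below. This is a hedged sketch: the table and primary-key names are hypothetical placeholders, so check the DVT documentation for the exact flags your version supports.

data-validation validate row -sc MY_SQL_CONN -tc MY_BQ_CONN -tbls dbo.orders=my_dataset.orders --primary-keys order_id --hash '*'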

Key Components

  1. Connections — Before running validations, DVT requires setting up a source and target connection. These connections can be stored as JSON files locally or in a GCS directory.
  2. Validations Config — DVT allows you to save or edit validation configurations in a YAML file. This is useful for running common validations or updating the configuration. YAMLs are not mandatory to execute a validation; you can provide all the configs within the CLI command as well.

Let’s get on the console now and get our hands dirty!

a. Prerequisites for using DVT:

  1. Install DVT: follow the steps mentioned here (a typical pip install is shown after this list)
  2. Install the external source driver package (required if using an external source like SQL Server, Teradata, DB2, etc.)
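
For reference, installing DVT typically looks like the snippet below. The PyPI package name is taken from the DVT documentation at the time of writing; verify it against the instructions linked above.

pip install google-pso-data-validator
data-validation --help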

Method 1 — Configs stored locally

In Method 1, we’ll store the connection and validation details locally and then execute the validation.

b. Create Source & Target connections

Command Template:

data-validation connections add --secret-manager-type <None|GCP> --secret-manager-project-id <SECRET_PROJECT_ID> --connection-name CONN_NAME source-type

Secret manager flags are optional

--secret-manager-type <None|GCP>
--secret-manager-project-id <SECRET_PROJECT_ID>
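
As an illustration, a connection whose credentials live in GCP Secret Manager could be added as shown below. This is a hedged sketch: my-secrets-project and the password secret name are hypothetical, and the exact way DVT resolves secret names can vary by version, so confirm against the DVT docs.

data-validation connections add --secret-manager-type GCP --secret-manager-project-id my-secrets-project --connection-name MY_SQL_CONN MSSQL --host=<host_name> --port=<port> --user=<username> --password=<password_secret_name> --database <database>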

Create the source (here SQL Server) connection:

data-validation connections add --connection-name MY_SQL_CONN MSSQL --host=<host_name> --port=<port> --user=<username> --password=<password> --database <database>

Create the target (here BigQuery) connection:

data-validation connections add --connection-name MY_BQ_CONN BigQuery --project-id MY_GCP_PROJECT
Connection JSON files are stored in the default local directory path
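
To confirm both connections were created, you can list them:

data-validation connections list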

c. Create Validation config YAML

Generate the YAML configuration file by specifying the --config-file (-c) flag and the YAML file you want to create.

data-validation validate column -sc MY_SQL_CONN -tc MY_BQ_CONN -tbls <src_tbl>=<tgt_tbl> -c <filename>.yaml 

Here DVT will perform a column validation with the default aggregation type of COUNT(*).
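
To give a feel for what gets generated, here is an abridged sketch of a column-validation YAML. The field names and defaults can vary slightly between DVT versions, so treat it as illustrative rather than exact.

source: MY_SQL_CONN
target: MY_BQ_CONN
result_handler: {}
validations:
- type: Column
  schema_name: <src_schema>
  table_name: <src_tbl>
  target_schema_name: <tgt_schema>
  target_table_name: <tgt_tbl>
  aggregates:
  - field_alias: count
    source_column: null
    target_column: null
    type: count
  threshold: 0.0
  format: table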

d. Run the Validation

Run the validation using the YAML config file created above:

data-validation configs run -c <filename>.yaml
DVT Validation execution using YAML file

You can also execute a validation without a YAML file by providing all the configuration in the CLI command:

data-validation validate column -sc my_bq_conn -tc my_bq_conn -tbls bigquery-public-data.new_york_citibike.citibike_trips
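
You can also request specific aggregations and write the results to a BigQuery table instead of printing them to stdout. In the example below, the --count, --sum and --bq-result-handler flags come from the DVT CLI, while the column, dataset and table names are placeholders:

data-validation validate column -sc MY_SQL_CONN -tc MY_BQ_CONN -tbls <src_tbl>=<tgt_tbl> --count '*' --sum <numeric_column> --bq-result-handler <project>.<dataset>.<results_table>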

Method 2 — Configs in GCS

In Method 2, we won’t store the connection and validation details locally; instead, we’ll access the connection and validation files directly from a GCS bucket.

a. Set Environment Variable

The validation config file is saved to the GCS path specified by the PSO_DV_CONFIG_HOME env variable if that has been set; otherwise, it is saved to wherever the tool is run.

export PSO_DV_CONFIG_HOME=gs://my-bucket/my/connections/path/

Note that if this environment variable is set, both the validation YAML configs and the connection JSONs need to be present in the GCS path (they are tightly coupled).

b. Create Source & Target connections

Command Template:

data-validation connections add --secret-manager-type <None|GCP> --secret-manager-project-id <SECRET_PROJECT_ID> --connection-name CONN_NAME source-type

Secret manager flags are optional

--secret-manager-type <None|GCP>
--secret-manager-project-id <SECRET_PROJECT_ID>

Create the source (here SQL Server) connection:

data-validation connections add --connection-name MY_SQL_CONN MSSQL --host=<host_name> --port=<port> --user=<username> --password=<password> --database <database>

Create the target (here BigQuery) connection:

data-validation connections add --connection-name MY_BQ_CONN BigQuery --project-id MY_GCP_PROJECT
Connection JSON files are saved in the specified GCS path inside the connections folder

Note: Connection JSON files are created inside a subfolder named ‘connections’ under the GCS path specified by the PSO_DV_CONFIG_HOME variable.
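
You can verify that the files landed in the bucket with a quick gsutil listing, for example:

gsutil ls gs://my-bucket/my/connections/path/connections/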

c. Create Validation config YAML

Generate the YAML configuration file by specifying the --config-file (-c) flag and the YAML file you want to create.

data-validation validate column -sc MY_SQL_CONN -tc MY_BQ_CONN -tbls <src_tbl>=<tgt_tbl> -c <filename>.yaml
Validation YAML config file is saved in the specified GCS path inside the validations folder

Note: Config YAML files are created inside a subfolder named ‘validations’ under the GCS path specified by the PSO_DV_CONFIG_HOME variable.
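
Again, a gsutil listing confirms the YAML was written to the validations subfolder:

gsutil ls gs://my-bucket/my/connections/path/validations/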

d. Run the validation

data-validation configs run -c <filename>.yaml
DVT execution using YAML config file in GCS path

And we are done. I hope this blog helps you get started with a smooth validation process using DVT. See you next time!
