Build Data Vault 2.0 using Dataform

Moukhtar Ahmed
Google Cloud - Community
3 min readJan 24, 2023

What is Dataform ?

Dataform is a GCP tool which allows you to manage data transformation in the ELT process for data integration. After raw data is extracted from source systems and loaded into BigQuery, Dataform helps you to transform it into a well-defined, tested, and documented suite of data tables.

DVForm

DVForm is a Dataform package that generates and executes the ETL you need to build a Data Vault 2.0 Data Warehouse on BigQuery.

Installation

  • Install dataform cli using npm
npm i -g @dataform/cli
  • Create a new dataform project for BigQuery
dataform init bigquery project-dir --default-database <your-google-cloud-project-id>
dataform project structure
  • In your dataform project add the dvform package to your package.json file . You can find the most up-to-date package version on the releases page or you can download the index.js file and import it into includes/index.js then you can refer it from JS or SQLx scripts for more details refer to dataform documentation for reuse code guide.
# An example of package.json file 

{
"name": "demo",
"dependencies": {
"@dataform/core": "2.0.1",
"dvform":"https://github.com/mokhahmed/dvform/archive/0.1.tar.gz"
}
}
  • Install the package
dataform install

How to use DVForm to create Data Vault Model ?

1. Create citibike entities configuration object preferably in a separate models file. includes/models.js [Citibike models example]

  • Define the entity configurations in the following structure
var entity_example = {
"name": "entity_name",
"columns": entity_columns_obj,
"columns_descriptions": entity_columns_descriptions_obj
}
  • Define the entity columns configurations in the following structure
var entity_columns_obj =  [{
"source_column_name": "col1",
"source_data_type": "integer",
"target_column_name": "col2",
"target_data_type": "integer",
"is_key": "Y",
"is_calculated": "N",
"calculated_by": "",
"description": "col2 brief description"
}, ... ]
  • Define the entity columns configurations in the following structure
var entity_columns_descriptions_obj =  [{
"col2": "col2 breif description",
}, ... ]

2. Based on the level of customization required you can

create_data_vault_from_model(
load_id,
source_schema_name,
stage_schema_name,
data_vault_schema_name,
source_tables_prefix,
stage_tables_prefix,
hubs_tables_prefix,
satellites_tables_prefix,
links_tables_prefix,
tables_type,
all_models,
links_tables
)
e.g. to create staging layer tables
create_staging_tables(
load_id,
schema_name,
source_tables_prefix,
target_tables_prefix,
tables_type,
staging_tables
)
e.g. to create data vault layer hub tables
create_hubs_tables(
source_tables_prefix,
target_tables_prefix,
table_type,
schema_name,
hubs_tables
)

e.g. to create users entity stage table

config {
type: "incremental",
schema: "citibike_stage_sql",
name: "stg_sql_users",
description: "Cleaned up data for users data source",
columns: models.users.columns_descriptions,
tags: ["stage-sql"],
bigquery: {
partitionBy: "DATE(load_time)",
clusterBy: ["load_id"]
}
}
${
dvform.get_stage_table(
dataform.projectConfig.vars.load_id,
ref("src_sql_users"),
models.users.columns
)
}

e.g. to create users hub table

config {
type: "incremental",
schema: "citibike_sql_dvm",
name: "hub_sql_users",
columns: models.users.columns_descriptions,
description: "hub users table",
uniqueKey: ["users_hash_id"],
tags: ["data-vault-sql"]
}
${
dvform.get_hub(
models.users.columns,
ref("stg_sql_users"),
"users_hash_id"
)
}

Run Dataform Code

  • Compile the dataform code to confirm that there is no compilation error
dataform compile
  • Run test cases if any.
dataform test
  • Execute and Create the Data Vault Tables
dataform run --vars=load_id=run_1234 --schema-suffix "stage|data-vault"

Next Step is to build a CI/CD pipeline and schedule the pipeline to run periodically.

--

--