Build Data Vault 2.0 using Dataform
What is Dataform?
Dataform is a GCP tool that lets you manage data transformations in the ELT process for data integration. After raw data is extracted from source systems and loaded into BigQuery, Dataform helps you transform it into a well-defined, tested, and documented suite of data tables.
DVForm
DVForm is a Dataform package that generates and executes the ETL you need to build a Data Vault 2.0 Data Warehouse on BigQuery.
Installation
- Install the Dataform CLI using npm
npm i -g @dataform/cli
- Create a new Dataform project for BigQuery
dataform init bigquery project-dir --default-database <your-google-cloud-project-id>
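This command scaffolds a project directory. The layout below is a sketch of the typical structure (exact files may vary by CLI version):
project-dir/
├── definitions/     # SQLx and JS files that define tables, assertions, and operations
├── includes/        # reusable JavaScript (models, helper functions)
├── dataform.json    # project settings (default database, schema, location, vars)
├── package.json
└── .gitignore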
- In your Dataform project, add the dvform package to your package.json file. You can find the most up-to-date package version on the releases page. Alternatively, download the index.js file and import it into includes/index.js; you can then reference it from JS or SQLx scripts (see the code-reuse guide in the Dataform documentation for details, and the sketch after the install step below).
# An example of package.json file
{
  "name": "demo",
  "dependencies": {
    "@dataform/core": "2.0.1",
    "dvform": "https://github.com/mokhahmed/dvform/archive/0.1.tar.gz"
  }
}
- Install the package
dataform install
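To make the package callable as dvform from SQLx scripts (as in the examples later in this post), one option is a small includes file that re-exports it. The file name includes/dvform.js below is an assumption, not part of the package:
// includes/dvform.js -- minimal sketch, assuming the package is declared in package.json
// Re-exporting it here makes dvform.get_hub(...), dvform.get_stage_table(...), etc.
// available in SQLx and definitions/*.js files under the name "dvform".
module.exports = require("dvform");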
How to use DVForm to create a Data Vault model?
1. Create the citibike entity configuration objects, preferably in a separate models file, e.g. includes/models.js
[Citibike models example]
- Define the entity configurations in the following structure:
var entity_example = {
    "name": "entity_name",
    "columns": entity_columns_obj,
    "columns_descriptions": entity_columns_descriptions_obj
}
- Define the entity columns configurations in the following structure:
var entity_columns_obj = [{
    "source_column_name": "col1",
    "source_data_type": "integer",
    "target_column_name": "col2",
    "target_data_type": "integer",
    "is_key": "Y",
    "is_calculated": "N",
    "calculated_by": "",
    "description": "col2 brief description"
}, ... ]
- Define the entity columns descriptions in the following structure:
var entity_columns_descriptions_obj = [{
    "col2": "col2 brief description"
}, ... ]
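Putting the three pieces together, a minimal sketch of includes/models.js for a users entity might look like the following; the column names and descriptions are illustrative, not the actual citibike model:
// includes/models.js -- illustrative sketch of a single "users" entity
var users_columns = [{
    "source_column_name": "user_id",
    "source_data_type": "integer",
    "target_column_name": "user_id",
    "target_data_type": "integer",
    "is_key": "Y",
    "is_calculated": "N",
    "calculated_by": "",
    "description": "Unique identifier of the user"
}];

var users_columns_descriptions = [{
    "user_id": "Unique identifier of the user"
}];

var users = {
    "name": "users",
    "columns": users_columns,
    "columns_descriptions": users_columns_descriptions
};

// Exported so scripts can reference models.users.columns, etc.
module.exports = { users };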
2. Based on the level of customization required, you can choose one of the following approaches:
- Use one function to create everything: create_data_vault_from_model (see the easy_setup example). The signature is shown below, followed by an illustrative call.
create_data_vault_from_model(
    load_id,
    source_schema_name,
    stage_schema_name,
    data_vault_schema_name,
    source_tables_prefix,
    stage_tables_prefix,
    hubs_tables_prefix,
    satellites_tables_prefix,
    links_tables_prefix,
    tables_type,
    all_models,
    links_tables
)
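For example, a definitions/ JS file could invoke it as follows. All argument values here are hypothetical and assume the includes/dvform.js re-export and the models file sketched above:
// definitions/easy_setup.js -- illustrative sketch with hypothetical values
dvform.create_data_vault_from_model(
    dataform.projectConfig.vars.load_id,  // load_id supplied at run time
    "citibike_source",                    // source_schema_name
    "citibike_stage",                     // stage_schema_name
    "citibike_dvm",                       // data_vault_schema_name
    "src_",                               // source_tables_prefix
    "stg_",                               // stage_tables_prefix
    "hub_",                               // hubs_tables_prefix
    "sat_",                               // satellites_tables_prefix
    "lnk_",                               // links_tables_prefix
    "incremental",                        // tables_type
    [models.users],                       // all_models (assumed shape: entities from includes/models.js)
    []                                    // links_tables (left empty; link definitions are not covered here)
);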
- Create the model layer by layer [Stage and Data Vault]; an illustrative call follows the signatures below.
create_staging_tables(
    load_id,
    schema_name,
    source_tables_prefix,
    target_tables_prefix,
    tables_type,
    staging_tables
)
create_hubs_tables(
    source_tables_prefix,
    target_tables_prefix,
    table_type,
    schema_name,
    hubs_tables
)
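For instance, the staging layer alone could be built with a call like this; the values are hypothetical and follow the same assumptions as the sketch above:
// definitions/staging.js -- illustrative sketch with hypothetical values
dvform.create_staging_tables(
    dataform.projectConfig.vars.load_id,  // load_id
    "citibike_stage",                     // schema_name
    "src_",                               // source_tables_prefix
    "stg_",                               // target_tables_prefix
    "incremental",                        // tables_type
    [models.users]                        // staging_tables (assumed: entities from includes/models.js)
);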
- Create tables one by one using SQLx [Staging Tables or Hub Tables].
For example, to create the users stage table:
config {
    type: "incremental",
    schema: "citibike_stage_sql",
    name: "stg_sql_users",
    description: "Cleaned up data for users data source",
    columns: models.users.columns_descriptions,
    tags: ["stage-sql"],
    bigquery: {
        partitionBy: "DATE(load_time)",
        clusterBy: ["load_id"]
    }
}

${
    dvform.get_stage_table(
        dataform.projectConfig.vars.load_id,
        ref("src_sql_users"),
        models.users.columns
    )
}
And to create the users hub table:
config {
    type: "incremental",
    schema: "citibike_sql_dvm",
    name: "hub_sql_users",
    columns: models.users.columns_descriptions,
    description: "hub users table",
    uniqueKey: ["users_hash_id"],
    tags: ["data-vault-sql"]
}

${
    dvform.get_hub(
        models.users.columns,
        ref("stg_sql_users"),
        "users_hash_id"
    )
}
Run Dataform Code
- Compile the Dataform code to confirm that there are no compilation errors
dataform compile
- Run the test cases, if any.
dataform test
- Execute the pipeline and create the Data Vault tables
dataform run --vars=load_id=run_1234 --schema-suffix "stage|data-vault"
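The --vars flag supplies the load_id compilation variable referenced earlier as dataform.projectConfig.vars.load_id. Declaring a default in dataform.json keeps dataform compile working without the flag; the values below are illustrative:
# An example of dataform.json with a default load_id
{
    "warehouse": "bigquery",
    "defaultDatabase": "<your-google-cloud-project-id>",
    "defaultSchema": "dataform",
    "defaultLocation": "US",
    "vars": {
        "load_id": "run_0000"
    }
}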
The next step is to build a CI/CD pipeline and schedule it to run periodically.