Working with variables in Dataform
Variables make a developer’s life easier. They store values and allow us to reuse them during development. Dataform enables us to use GoogleSQL, JavaScript, and special compilation variables.
GoogleSQL variables
The GoogleSQL procedural language allows us to define and reuse variables in multi-statement queries [1]. To define variables, we use the DECLARE
statement. The optional DEFAULT
clause specifies the default value. If it’s omitted, it is equal to NULL
. The SET
statement assigns the variable values. Next, we can use this variable in the query.
DECLARE d DATE DEFAULT CARRENT_DATE();
SET d = CARRENT_DATE() - 1;
SELECT d as date_value
DECLARE
and SET
operations are performed during the query execution. We can use the calculation results to set the variable value. We can declare and use variables only in one query. However, we can create a UDF function to share values between queries instead.
Dataform JS variables
Dataform allows us to use JavaScript variables or functions in action code [2]. In this article, our focus will be on variables.
JS variables are declared using the const
keyword in the js
block.
js {
const VARIABLE_NAME = VARIABLE_VALUE;
module.exports = { VARIABLE_NAME };
}
The JS variables support various data types, but their use as the Dataform variables is limited by the SQL code's processing capability. Dataform variables can accept integer, float, boolean, DateTime, string scalar values, and JS functions. The string value can also be an SQL code snippet. An array is considered valid if all elements have the same type.
config { type: "view" }
js {
const int_value = 4;
const float_value = 5.5;
const bool_value = true;
const string_value = "hey there";
// Using SQL code snippet
const previous_date = `CURRENT_DATE()-1`;
// Using JS function
const this_year = new Date().getFullYear();
// Array is wrapped up with backquotes. Use the extra variable inside.
const array_value = `["one", "two", "${string_value}", "three"]`
// JSON is wrapped up with backquotes
const json_value = `{"first_name": "John", "last_name": "Snow"}`
// Array JSON elements are wrapped up with double quotes
const json_array = `[
"{'first_name': 'John', 'last_name': 'Snow'}",
"{'first_name': 'Alex', 'last_name': 'Feldman'}"
]`
module.exports = {
int_value, float_value, bool_value, previous_date,
this_year, array_value
};
}
SELECT ${int_value} as int_value,
${float_value} as float_value,
${bool_value} as bool_value,
'${string_value}' as string_value, -- Add extra quotes
${previous_date} as previous_date,
${this_year} as this_year,
'${json_value}' as json_value, -- Add extra quotes
${array_value} as array_value,
${json_array} as json_array
When declaring variables, it is essential to use quotes for string and array values. This is also relevant for JavaScript objects because GoogleSQL can only process the JSON-like string directly. We can use different quotes, such as single, double, or triple quotes, backquotes, or their combinations. However, since the string variable values are inserted in SQL code without quotes, it is necessary to add extra quotes when we call them (For example, SELECT ‘${string_value}’
).
JavaScript supports the null
value to represent an empty value in SQL code. An array can not have a null
value.
Compiled action code.
SELECT 4 as int_value,
5.5 as float_value,
true as bool_value,
'hey there' as string_value,
CURRENT_DATE()-1 as previous_date,
2024 as this_year,
["one", "two", "hey there", "three"] as array_value,
'{"first_name": "John", "last_name": "Snow"}' as json_value,
[
"{'first_name': 'John', 'last_name': 'Snow'}",
"{'first_name': 'Alex', 'last_name': 'Feldman'}"
] as json_array
Query result:
The main advantage of JavaScript over GoogleSQL variables is that they can be used across different actions in the repository. We can declare JS variables in any .sqlx
or .js
file and make them accessible in another file using the "require”
function.
config { type: "view" }
js {
const { my_var } = require("definitions/path/to/declared/variable.sqlx");
}
SELECT ${my_var} as var_value
Dataform has provided a special includes/
directory as a location for global variables and functions.
// filename is includes/constants.js
const date = `CURRENT_DATE()`;
module.exports = { date };
If the variable is declared inside the includes
directory, we can access it without the js
block.
config { type: "view" }
SELECT ${constants.date} as date_value
Dataform compilation variables
Dataform supports compilation variables that can be used as environment variables [3]. We can define specific JS variable values for each development workspace and release configuration. This functionality helps us to configure different environments and alter the behavior of actions depending on the environment in which the code is executed.
We can define the compilation variable in the dataform.json
file for the development workspace.
{
"defaultSchema": "dataform",
"assertionSchema": "df_assertions",
"warehouse": "bigquery",
"defaultDatabase": "my-project-development",
"defaultLocation": "us-east1"
"vars": {
"df_environment": "development"
}
}
We defined the df_environment
variable with the development
value.
We can add the compilation variable during the release configuration creation for the release configuration.
To utilize the compilation variable in the action code, the when
JS function can be employed.
config { type: "view" }
SELECT ${when(
dataform.projectConfig.vars.df_environment === "production",
`"This is the production" as value`,
`"This is the development" as value`
)};
When the action is executed in the development workspace, "This is the development"
will be displayed. If the production release configuration is used, "This is the production"
will be present.
For more information about compilation variables and configuring environments in Dataform, please refer to my article "Configuring Data Pipeline Environments in Dataform [4]."
Using variables in actions
Let’s consider an example of the use of variables. Our organization collaborates with agencies to attract customers. Each agency has unique fee conditions. We need to create customer rosters per agency, calculate fees, and store the results in CSV format. We will make a configuration file to store the fee conditions for each agency. We will develop the operations-type action, using procedural language to create a loop that processes data for each agency and exports it to Google Cloud Storage as a CSV file.
Let’s create the config .js
file and save the condition parameters.
// Agency fee parameters
const agency_fee_dict = `[
"{'agency': 'First broker agency', 'fee': '0.2', 'file_prefix': 'first_broker'}",
"{'agency': 'Simple marketing', 'fee': '0.1', 'file_prefix': 'simple_mark'}",
"{'agency': 'Best customers', 'fee': '0.15', 'file_prefix': 'best_cust'}"
]`;
module.exports = {
agency_fee_dict
};
The agency_fee_dict
variable is a dictionary that includes three parameters: agency
—an agency name, fee
—a percent of sales amount, and file_prefix
—a prefix for the result CSV file name.
Source table with our customers, agencies they were attracted to, and the sale amount looks like this.
The action code.
config {
type: "operations",
schema: "customer_agency_rosters"
}
js {
const { agency_fee_dict } = require("definitions/customer_agency_rosters/customer_agency_rosters_config.js");
}
DECLARE _agency STRING;
DECLARE _fee STRING;
DECLARE _file_prefix STRING;
DECLARE _iter_item INT64 DEFAULT 1;
DECLARE bucket STRING DEFAULT ${when(
dataform.projectConfig.vars.df_environment === "production",
`'gs://bucket_prod/';`,
`'gs://bucket_dev/';`
)}
WHILE _iter_item <= (ARRAY_LENGTH(${agency_fee_dict})) DO
SET _agency = JSON_EXTRACT_SCALAR (${agency_fee_dict}[SAFE_ORDINAL(_iter_item)], '$.agency');
SET _fee = JSON_EXTRACT_SCALAR (${agency_fee_dict}[SAFE_ORDINAL(_iter_item)], '$.fee');
SET _file_prefix = JSON_EXTRACT_SCALAR (${agency_fee_dict}[SAFE_ORDINAL(_iter_item)], '$.file_prefix');
EXECUTE IMMEDIATE """
EXPORT DATA
OPTIONS (
uri = '""" || bucket || _file_prefix || '_*.csv' """',
format = 'CSV',
overwrite = true,
header = true
)
AS (
SELECT *, sale_amount * CAST( '""" || _fee || """' as FLOAT64) as fee_amount
FROM ${ref('customer_table')}
WHERE agency = '""" || _agency || """'
)
""";
SET _iter_item = _iter_item + 1;
END WHILE;
Let’s consider the code step by step.
The config block includes the operationals
type action and the dataset's name in the schema
field.
Next, we declared the GoogleSQL variables using DECLARE
. The _agency
, _fee
, and _file_prefix
will store their current values in the loop. The _iter_item
is responsible for the loop iteration. The bucket
variable defines the bucket name using the when
function and the compilation variable df_environment
.
The WHILE
, SET _iter_item
, and END WHILE
statements form a loop. The loop continues until the value of _iter_item
reaches the array's length.
In the loop body, we extract specific JSON fields using the JSON_EXTRACT_SCALAR
function and save them in the current value variable as _agency
, _fee
, and _file_prefix
.
The EXPORT DATA
statement [5] fetches data from the source table for each agency, computes the fee amount using the current fee value, and then exports the result to GCS in CSV format. The EXECUTE IMMEDIATE
function assists in creating the statement string. Using concatenation operators ||,
we inserted the bucket
, _file_prefix
, and _agency
variables into the string.
After the action execution, three CSV files will be created on the GCS bucket, with names according to the file prefix.
Summary
Using Dataform variables- whether GoogleSQL, JavaScript, or compilation variables — boosts our data pipeline’s flexibility and efficiency. GoogleSQL variables help with dynamic queries, JS variables are great for reusability across actions, and compilation variables allow us to customize settings for different environments. These tools help us write cleaner, more adaptable code, making it easier to handle complex data operations.
Thank you for taking the time to read this.
Related Links
- [1] Google, Procedural language.
- [2] Google, Reuse variables and functions with includes in Dataform.
- [3] Google, Ways to configure Dataform compilation.
- [4] Medium, Alex Feldman. Configuring Data Pipeline Environments in Dataform.
- [5] Google, Export and load statements in GoogleSQL.