Working with variables in Dataform

Alex Feldman
7 min readJul 3, 2024

--

Photo by James Wainscoat on Unsplash

Variables make a developer’s life easier. They store values and allow us to reuse them during development. Dataform enables us to use GoogleSQL, JavaScript, and special compilation variables.

GoogleSQL variables

The GoogleSQL procedural language allows us to define and reuse variables in multi-statement queries [1]. To define variables, we use the DECLARE statement. The optional DEFAULT clause specifies the default value. If it’s omitted, it is equal to NULL. The SET statement assigns the variable values. Next, we can use this variable in the query.

DECLARE d DATE DEFAULT CARRENT_DATE();

SET d = CARRENT_DATE() - 1;

SELECT d as date_value

DECLARE and SET operations are performed during the query execution. We can use the calculation results to set the variable value. We can declare and use variables only in one query. However, we can create a UDF function to share values between queries instead.

Dataform JS variables

Dataform allows us to use JavaScript variables or functions in action code [2]. In this article, our focus will be on variables.

JS variables are declared using the const keyword in the js block.

js { 
const VARIABLE_NAME = VARIABLE_VALUE;
module.exports = { VARIABLE_NAME };
}

The JS variables support various data types, but their use as the Dataform variables is limited by the SQL code's processing capability. Dataform variables can accept integer, float, boolean, DateTime, string scalar values, and JS functions. The string value can also be an SQL code snippet. An array is considered valid if all elements have the same type.

config { type: "view" }

js {

const int_value = 4;

const float_value = 5.5;

const bool_value = true;

const string_value = "hey there";

// Using SQL code snippet
const previous_date = `CURRENT_DATE()-1`;

// Using JS function
const this_year = new Date().getFullYear();

// Array is wrapped up with backquotes. Use the extra variable inside.
const array_value = `["one", "two", "${string_value}", "three"]`

// JSON is wrapped up with backquotes
const json_value = `{"first_name": "John", "last_name": "Snow"}`

// Array JSON elements are wrapped up with double quotes
const json_array = `[
"{'first_name': 'John', 'last_name': 'Snow'}",
"{'first_name': 'Alex', 'last_name': 'Feldman'}"
]`

module.exports = {
int_value, float_value, bool_value, previous_date,
this_year, array_value
};
}

SELECT ${int_value} as int_value,
${float_value} as float_value,
${bool_value} as bool_value,
'${string_value}' as string_value, -- Add extra quotes
${previous_date} as previous_date,
${this_year} as this_year,
'${json_value}' as json_value, -- Add extra quotes
${array_value} as array_value,
${json_array} as json_array

When declaring variables, it is essential to use quotes for string and array values. This is also relevant for JavaScript objects because GoogleSQL can only process the JSON-like string directly. We can use different quotes, such as single, double, or triple quotes, backquotes, or their combinations. However, since the string variable values are inserted in SQL code without quotes, it is necessary to add extra quotes when we call them (For example, SELECT ‘${string_value}’).

JavaScript supports the null value to represent an empty value in SQL code. An array can not have a null value.

Compiled action code.

SELECT 4 as int_value, 
5.5 as float_value,
true as bool_value,
'hey there' as string_value,
CURRENT_DATE()-1 as previous_date,
2024 as this_year,

["one", "two", "hey there", "three"] as array_value,

'{"first_name": "John", "last_name": "Snow"}' as json_value,

[
"{'first_name': 'John', 'last_name': 'Snow'}",
"{'first_name': 'Alex', 'last_name': 'Feldman'}"
] as json_array

Query result:

Query result

The main advantage of JavaScript over GoogleSQL variables is that they can be used across different actions in the repository. We can declare JS variables in any .sqlx or .js file and make them accessible in another file using the "require” function.

config { type: "view" }

js {
const { my_var } = require("definitions/path/to/declared/variable.sqlx");
}

SELECT ${my_var} as var_value

Dataform has provided a special includes/ directory as a location for global variables and functions.

  // filename is includes/constants.js
const date = `CURRENT_DATE()`;
module.exports = { date };

If the variable is declared inside the includes directory, we can access it without the js block.

config { type: "view" }

SELECT ${constants.date} as date_value

Dataform compilation variables

Dataform supports compilation variables that can be used as environment variables [3]. We can define specific JS variable values for each development workspace and release configuration. This functionality helps us to configure different environments and alter the behavior of actions depending on the environment in which the code is executed.

We can define the compilation variable in the dataform.json file for the development workspace.

{
"defaultSchema": "dataform",
"assertionSchema": "df_assertions",
"warehouse": "bigquery",
"defaultDatabase": "my-project-development",
"defaultLocation": "us-east1"
"vars": {
"df_environment": "development"
}
}

We defined the df_environment variable with the development value.

We can add the compilation variable during the release configuration creation for the release configuration.

Adding the compilation variable during the release configuration creation

To utilize the compilation variable in the action code, the when JS function can be employed.

config { type: "view" }

SELECT ${when(
dataform.projectConfig.vars.df_environment === "production",
`"This is the production" as value`,
`"This is the development" as value`
)};

When the action is executed in the development workspace, "This is the development" will be displayed. If the production release configuration is used, "This is the production" will be present.

For more information about compilation variables and configuring environments in Dataform, please refer to my article "Configuring Data Pipeline Environments in Dataform [4]."

Using variables in actions

Let’s consider an example of the use of variables. Our organization collaborates with agencies to attract customers. Each agency has unique fee conditions. We need to create customer rosters per agency, calculate fees, and store the results in CSV format. We will make a configuration file to store the fee conditions for each agency. We will develop the operations-type action, using procedural language to create a loop that processes data for each agency and exports it to Google Cloud Storage as a CSV file.

Let’s create the config .js file and save the condition parameters.

// Agency fee parameters
const agency_fee_dict = `[
"{'agency': 'First broker agency', 'fee': '0.2', 'file_prefix': 'first_broker'}",
"{'agency': 'Simple marketing', 'fee': '0.1', 'file_prefix': 'simple_mark'}",
"{'agency': 'Best customers', 'fee': '0.15', 'file_prefix': 'best_cust'}"
]`;

module.exports = {
agency_fee_dict
};

The agency_fee_dict variable is a dictionary that includes three parameters: agency—an agency name, fee—a percent of sales amount, and file_prefix—a prefix for the result CSV file name.

Source table with our customers, agencies they were attracted to, and the sale amount looks like this.

Source table

The action code.

config {
type: "operations",
schema: "customer_agency_rosters"
}

js {
const { agency_fee_dict } = require("definitions/customer_agency_rosters/customer_agency_rosters_config.js");
}

DECLARE _agency STRING;
DECLARE _fee STRING;
DECLARE _file_prefix STRING;
DECLARE _iter_item INT64 DEFAULT 1;
DECLARE bucket STRING DEFAULT ${when(
dataform.projectConfig.vars.df_environment === "production",
`'gs://bucket_prod/';`,
`'gs://bucket_dev/';`
)}

WHILE _iter_item <= (ARRAY_LENGTH(${agency_fee_dict})) DO

SET _agency = JSON_EXTRACT_SCALAR (${agency_fee_dict}[SAFE_ORDINAL(_iter_item)], '$.agency');
SET _fee = JSON_EXTRACT_SCALAR (${agency_fee_dict}[SAFE_ORDINAL(_iter_item)], '$.fee');
SET _file_prefix = JSON_EXTRACT_SCALAR (${agency_fee_dict}[SAFE_ORDINAL(_iter_item)], '$.file_prefix');

EXECUTE IMMEDIATE """
EXPORT DATA
OPTIONS (
uri = '""" || bucket || _file_prefix || '_*.csv' """',
format = 'CSV',
overwrite = true,
header = true
)
AS (
SELECT *, sale_amount * CAST( '""" || _fee || """' as FLOAT64) as fee_amount
FROM ${ref('customer_table')}
WHERE agency = '""" || _agency || """'
)
""";

SET _iter_item = _iter_item + 1;
END WHILE;

Let’s consider the code step by step.

The config block includes the operationals type action and the dataset's name in the schema field.

Next, we declared the GoogleSQL variables using DECLARE. The _agency, _fee, and _file_prefix will store their current values in the loop. The _iter_item is responsible for the loop iteration. The bucket variable defines the bucket name using the when function and the compilation variable df_environment.

The WHILE, SET _iter_item, and END WHILE statements form a loop. The loop continues until the value of _iter_item reaches the array's length.

In the loop body, we extract specific JSON fields using the JSON_EXTRACT_SCALAR function and save them in the current value variable as _agency, _fee, and _file_prefix.

The EXPORT DATA statement [5] fetches data from the source table for each agency, computes the fee amount using the current fee value, and then exports the result to GCS in CSV format. The EXECUTE IMMEDIATE function assists in creating the statement string. Using concatenation operators ||, we inserted the bucket, _file_prefix, and _agency variables into the string.

After the action execution, three CSV files will be created on the GCS bucket, with names according to the file prefix.

Summary

Using Dataform variables- whether GoogleSQL, JavaScript, or compilation variables — boosts our data pipeline’s flexibility and efficiency. GoogleSQL variables help with dynamic queries, JS variables are great for reusability across actions, and compilation variables allow us to customize settings for different environments. These tools help us write cleaner, more adaptable code, making it easier to handle complex data operations.

Thank you for taking the time to read this.

Related Links

--

--