Configuring Data Pipeline Environments in Dataform

Alex Feldman
Google Cloud - Community
10 min read · Apr 25, 2024



Two great Dataform features let us manage code versions and configure separate environments: Git integration and overriding the default compilation settings.

This article considers options for separating development, production, and other required data pipeline environments on Google Cloud Platform. An environment, in this case, means separate data sources, compiled code that performs the data transformations, and the data pipeline output. We will discuss data pipelines built from SQL code and connected GCP services. The code we create for the data pipeline should be identical across all environments; we only need GitHub pull requests to move code between them.

Dataform development environment configuration

Dataform supports working in different workspaces, with the option to override the default compilation settings. This feature allows us to separate compilation results for development, QA, and production in BigQuery.

We can configure workspace compilation overrides in the repository settings after creating the Dataform repository. These compilation override settings are applied only to manual executions during the development process, that is, the development environment. Let’s look at the settings (SETTINGS tab → Workspace compilation overrides → EDIT button).

Workspace compilation overrides → EDIT.
  • Google Cloud Project. We can use another Google Cloud project as a place where tables will materialize after execution. For example, my-project-development.
  • Schema suffix. We can use a BigQuery schema (dataset) suffix. For example, _development. In this case, the destination dataset will be my_dataset_development. Instead of a static suffix, we can use the workspace name (the branch name in GitHub terms) via the ${workspaceName} value. This can be helpful if several developers work on the same project. For example, for the alex branch, the dataset name is my_dataset_alex.
  • Table prefix. We can add a prefix to table names. For example, development_total_amount, or alex_total_amount using ${workspaceName}.

We can use one of the options or their combination.
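
As a quick illustration (the file name and the orders source declaration are hypothetical), the same table definition compiles to different destinations depending on these overrides:

-- definitions/total_amount.sqlx (hypothetical example)
config {
  type: "table",
  schema: "my_dataset"
}

-- Possible destinations, depending on the workspace compilation overrides:
--   project override only:                my-project-development.my_dataset.total_amount
--   plus the _development schema suffix:  my-project-development.my_dataset_development.total_amount
--   plus a workspace-name table prefix
--   (branch alex):                        my-project-development.my_dataset_development.alex_total_amount

SELECT
  customer_id,
  SUM(amount) AS total_amount
FROM ${ref("orders")}
GROUP BY customer_id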

Dataform production and QA environment configuration

To configure the production or QA environment, we first need to connect to a remote GitHub repository (it can also be GitLab or Bitbucket) to record changes, manage file versions, and handle pull requests. To do so, we must authenticate to the remote repository through HTTPS or SSH.

If you plan to use GitHub Actions (as we do below), I recommend including the workflow scope when creating the personal access token on GitHub, as it is not included by default.

When configuring the Git connection settings, we define the default branch name. For example, main for production or qa for QA environments.

The default branch is the GitHub branch containing finished and approved code after a pull request. This branch serves as the source for compiling the production (or QA) environment code. Overriding the default compilation settings for the default branch configures the Dataform production (or QA) environment. In the Dataform code lifecycle, these overrides are defined when creating a release configuration. The same options as in the workspace compilation overrides can be used here: replacing the GCP project, adding a schema (dataset) suffix, or adding a table prefix.

Create a release configuration form.

Besides that, we can add compilation variables. We will consider an example below.

The decision to use specific compilation override options depends on various factors, such as the number of developers working on the project, whether other GCP services are part of the data pipelines, whether mock data sources are used for development and testing, and who is responsible for testing.

Let’s look at how the environment elements relate when we use schema (dataset) suffix overriding.

Environments configured on schema (dataset) suffix overriding.

When we develop code in the Dataform development workspace, the automatically compiled code materializes the tables in the BigQuery my_dataset_development dataset. The _development suffix is configured in the development workspace settings. Once we have finished coding, we commit the code to the GitHub development branch and then submit a pull request to the QA branch.

To compile code for the QA environment, we need to create a QA release configuration where we define the _qa suffix. Once the code is compiled and executed, the tables will be materialized in the my_dataset_qa dataset in BigQuery.

After merging the QA pull request into the production branch, we can compile and execute the production environment code to materialize the tables in the my_dataset_production dataset.

Mock and production data sources

One of the requirements for environment separation is using the same code in every environment. But what if we use one data source during development and testing and another in production? One way to handle this is with project compilation overrides. Because the project override affects not only the destination tables but also the source tables, we can use it to configure mock data sources for the development environment and actual data sources for production.

When we declare a data source in Dataform, we can either include or omit the database parameter in the config block. For example, in the following case the source table is always queried from the production project:

config {
  type: "declaration",
  database: "my-project-production",
  schema: "my_source_dataset",
  name: "my_source_table",
}

But here, the source table’s project depends on the override settings:

config {
  type: "declaration",
  schema: "my_source_dataset",
  name: "my_source_table",
}

If we have BigQuery table sources in the production project, we can create tables with the same names containing mock data in the development project and use them in the development environment. Or, if we have two Cloud SQL databases, one for QA and one for production, we can create BigQuery views with the same name in both projects using different external connections. The Dataform source declaration uses identical code, but the output data depends on the Dataform project override settings.
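
For the Cloud SQL case, the same-named views might be sketched like this (the connection names and the external query are assumptions):

-- In the development project: a view over the QA Cloud SQL database
CREATE OR REPLACE VIEW `my-project-development.my_source_dataset.my_source_table` AS
SELECT *
FROM EXTERNAL_QUERY(
  'my-project-development.us-east1.cloudsql_qa_connection',
  'SELECT * FROM my_source_table'
);

-- In the production project: the same view name over the production Cloud SQL database
CREATE OR REPLACE VIEW `my-project-production.my_source_dataset.my_source_table` AS
SELECT *
FROM EXTERNAL_QUERY(
  'my-project-production.us-east1.cloudsql_prod_connection',
  'SELECT * FROM my_source_table'
);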

The same Dataform code for both environments with different results.

For data sources where we can’t create resources with the same name in different projects, such as GCS buckets and external connections, we can use compilation variables and ${when()} constructions.

Define environments using compilation variables

Let’s return to the Create release configuration form and add the df_environment compilation variable.

Adding the compilation variable

For example, we have a task to export data to the GCS bucket. Unfortunately, GCP does not allow the creation of buckets with the same name across different projects within the same organization. To distinguish between production and development buckets, let's name the buckets output_prod and output_dev, respectively. The Dataform action code can look like this.

config {
  type: "operations",
  schema: "my_dataset"
}

DECLARE bucket_uri STRING;

SET bucket_uri = ${when(
  dataform.projectConfig.vars.df_environment === "production",
  `"gs://output_prod/file_*.csv"`,
  `"gs://output_dev/file_*.csv"`
)};

-- EXPORT DATA requires a wildcard (*) in the destination URI
EXECUTE IMMEDIATE """
  EXPORT DATA
  OPTIONS (
    uri = '""" || bucket_uri || """',
    format = 'CSV',
    overwrite = true
  )
  AS (
    SELECT *
    FROM ${ref('source_table')}
  )
""";

If the df_environment variable equals production, the destination bucket is output_prod; otherwise, output_dev. If you want to define the variable’s value explicitly for the current project, add it to the dataform.json file:

{
  "defaultSchema": "dataform",
  "assertionSchema": "df_assertions",
  "warehouse": "bigquery",
  "defaultDatabase": "my-project-development",
  "defaultLocation": "us-east1",
  "vars": {
    "df_environment": "development"
  }
}

Another example: if we need to create a BigQuery remote function in different GCP projects that triggers a different Cloud Function depending on the environment (see my article for further information on BigQuery remote functions), we can rely on the default project setting stored in the dataform.json file. Use the defaultDatabase value in the ${when()} construction.

config {
  type: "operations",
  schema: "my_dataset",
  hasOutput: true
}

DECLARE connection_id STRING;
DECLARE endpoint_id STRING;

${when(
  dataform.projectConfig.defaultDatabase === "my-project-production",
  // settings for the production project
  `SET connection_id = 'my-project-production.us-east1.cloud_func_connection';
   SET endpoint_id = 'https://us-east1-my-project-production.cloudfunctions.net/my_function';`,
  // settings for the development project
  `SET connection_id = 'my-project-development.us-east1.cloud_func_connection_dev';
   SET endpoint_id = 'https://us-east1-my-project-development.cloudfunctions.net/my_function';`
)}

EXECUTE IMMEDIATE """
  CREATE FUNCTION IF NOT EXISTS ${self()}() RETURNS STRING
  REMOTE WITH CONNECTION `""" || connection_id || """`
  OPTIONS (
    endpoint = '""" || endpoint_id || """'
  )
""";

Deploying Cloud Functions within Dataform

Let’s continue the previous example about deploying the BigQuery remote function and consider how to deploy the invoked Cloud Function (my_function) depending on the environment. We can deploy a Cloud Function with the same name to the development or production project at the same time as the SQL code. To do this, we need to add the function code to the Dataform repository and configure a GitHub Actions workflow using a YAML file. We can create a new cloud-functions directory and copy the function code into it.

Adding Cloud function code to the repository
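
The repository layout (a sketch; file names other than the paths used by the workflow below are assumptions) could look like this:

definitions/                      # Dataform SQLX actions
includes/
cloud-functions/
  my_function/
    src/
      main.py                     # Cloud Function code, entry point "main"
      requirements.txt
.github/
  workflows/
    my_function_deploy.yaml       # GitHub Actions workflow added in the next step
dataform.json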

I’m not advocating developing the function’s Python code in the Dataform workspace (hey, Dataform developers, seize the idea), but we can connect PyCharm (or another tool) to the repository to develop the function code.

We also need to add a YAML file that configures the GitHub Actions workflow to automate the Cloud Function deployment depending on the environment. To do it, let’s create the my_function_deploy.yaml file in the .github/workflows directory.

Adding the YAML file.

The my_function_deploy.yaml code can look like this:

on:
  pull_request:
    branches:
      - main
      - qa
    paths:
      - '**cloud-functions/my_function/src/**'

env:
  CLOUD_FUNCTION_NAME: 'my_function'
  GCP_REGION: 'us-east1'
  FILE_SOURCE: './cloud-functions/my_function/src/'
  ENTRY_POINT: 'main'

jobs:
  deploy-cloud-function:
    runs-on: ubuntu-latest
    steps:
      - name: 'Set project based on branch'
        run: |
          if [[ "${{ github.event.pull_request.base.ref }}" == "main" ]]; then
            echo "GCP_PROJECT_ID=my-project-production" >> "$GITHUB_ENV"
          elif [[ "${{ github.event.pull_request.base.ref }}" == "qa" ]]; then
            echo "GCP_PROJECT_ID=my-project-development" >> "$GITHUB_ENV"
          else
            echo "Error: Unsupported destination branch. GCP_PROJECT_ID not set."
            exit 1
          fi
        shell: bash

      - name: Checkout repository
        uses: actions/checkout@v3

      - name: 'Authenticate to Google Cloud'
        id: 'auth'
        uses: 'google-github-actions/auth@v1'
        with:
          credentials_json: '${{ secrets.SERVICE_ACCOUNT }}'

      - name: 'Set up Cloud SDK'
        id: 'sdk'
        uses: 'google-github-actions/setup-gcloud@v1'

      - name: 'Deploy Cloud function'
        id: 'deploy'
        run: |
          gcloud functions deploy $CLOUD_FUNCTION_NAME \
            --project=$GCP_PROJECT_ID \
            --region=$GCP_REGION \
            --no-allow-unauthenticated \
            --ingress-settings=internal-and-gclb \
            --runtime=python311 \
            --source=$FILE_SOURCE/ \
            --entry-point=$ENTRY_POINT \
            --trigger-http \
            --timeout=500 \
            --memory=512MB

The on section triggers the workflow on pull requests targeting the main and qa branches, but only if the changes affect a specific path.

on:
  pull_request:
    branches:
      - main
      - qa
    paths:
      - '**cloud-functions/my_function/src/**'

The environment variables are defined in the env section, except for the GCP_PROJECT_ID variable.

env:
  CLOUD_FUNCTION_NAME: 'my_function'
  GCP_REGION: 'us-east1'
  FILE_SOURCE: './cloud-functions/my_function/src/'
  ENTRY_POINT: 'main'

The GCP_PROJECT_ID variable’s value depends on the destination branch, which can be either main or qa. It is defined in the Set project based on branch step. The github.event.pull_request.base.ref variable holds the destination branch name. If the branch is main, GCP_PROJECT_ID is set to the production project; for qa, it is set to the development project.

- name: 'Set project based on branch'
  run: |
    if [[ "${{ github.event.pull_request.base.ref }}" == "main" ]]; then
      echo "GCP_PROJECT_ID=my-project-production" >> "$GITHUB_ENV"
    elif [[ "${{ github.event.pull_request.base.ref }}" == "qa" ]]; then
      echo "GCP_PROJECT_ID=my-project-development" >> "$GITHUB_ENV"
    else
      echo "Error: Unsupported destination branch. GCP_PROJECT_ID not set."
      exit 1
    fi
  shell: bash

Then the workflow authenticates to Google Cloud and sets up the Cloud SDK. Ensure that the service account key is added to the GitHub repository secrets and that the service account’s permissions allow deploying Cloud Functions.

- name: Checkout repository
  uses: actions/checkout@v3

- name: 'Authenticate to Google Cloud'
  id: 'auth'
  uses: 'google-github-actions/auth@v1'
  with:
    credentials_json: '${{ secrets.SERVICE_ACCOUNT }}'

- name: 'Set up Cloud SDK'
  id: 'sdk'
  uses: 'google-github-actions/setup-gcloud@v1'

Google Cloud recommends using workload identity federation instead of service account keys. See the best practices for using service accounts in pipelines.
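
With workload identity federation, the authentication step could be sketched as follows (the pool, provider, and service account names are placeholders), and the job needs the id-token: write permission:

permissions:
  contents: 'read'
  id-token: 'write'  # required to mint the OIDC token for workload identity federation

steps:
  - name: 'Authenticate to Google Cloud'
    id: 'auth'
    uses: 'google-github-actions/auth@v1'
    with:
      # Placeholder resource names: replace with your pool, provider, and service account
      workload_identity_provider: 'projects/123456789/locations/global/workloadIdentityPools/my-pool/providers/my-provider'
      service_account: 'deploy-sa@my-project-production.iam.gserviceaccount.com'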

Finally, the Cloud SDK command deploys the function using variables.

gcloud functions deploy $CLOUD_FUNCTION_NAME \
  --project=$GCP_PROJECT_ID \
  --region=$GCP_REGION \
  --no-allow-unauthenticated \
  --ingress-settings=internal-and-gclb \
  --runtime=python311 \
  --source=$FILE_SOURCE/ \
  --entry-point=$ENTRY_POINT \
  --trigger-http \
  --timeout=500 \
  --memory=512MB

After the code is pushed to GitHub, the function is deployed to the my-project-development project when we open a pull request from the dev branch into qa, and to my-project-production when the pull request targets main.

Environment security

Another question is how to organize appropriate access to the production environment for teams, including data engineers, analysts, QA specialists, and other users, to prevent intentional or accidental pipeline executions or removal of schedules.

For example, if the QA team should test a new data pipeline after development, this requires QA and production compilation results in their corresponding release configurations, with the ability to execute only the QA one. But we can’t restrict access for the QA team to specific release configurations. The solution is to separate environments by GCP project and configure Dataform in each project. This should not add any costs, as Dataform is a free service. We can then give the QA team access only to the development project, where the QA compilation result is configured.

This approach gives additional benefits, such as protection from mixing production and QA data. When we use Dataform in a single project, we have to grant the Dataform service account read-write permissions in every project where execution results materialize. If we configure Dataform in several GCP projects, we can grant full access to the local project and read-only access to the others. If, due to a code issue, a QA compilation result tries to materialize tables in the production project, the execution will fail with an error.
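
For illustration, such grants might be sketched with gcloud as follows (the Dataform default service account follows the service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com pattern; substitute your development project's number):

# Read-write access for the development Dataform service account in its own project
gcloud projects add-iam-policy-binding my-project-development \
  --member="serviceAccount:service-DEV_PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"

# Read-only access to production sources
gcloud projects add-iam-policy-binding my-project-production \
  --member="serviceAccount:service-DEV_PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataViewer"

Depending on your setup, the account may also need a role such as roles/bigquery.jobUser in its own project to run queries.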

Dataform's documentation provides best practices for configuring Dataform environments in various ways.

We discussed configuring release configurations and compilation overrides using the Dataform workflow tool, although other methods are available (Python client, Cloud Workflows, Airflow, Dataform API).

Thanks for reading. Please follow me to stay updated and not miss any upcoming articles.
