BigTesty: Integration testing library for BigQuery

Mazlum Tosun
Google Cloud - Community
7 min read · Sep 3, 2024

1. Why?

We have worked with several customers on Google Cloud and Data projects, and we kept seeing the same issue: Data Engineers and Developers generally test their SQL queries only manually in the BigQuery console, not with automated tests launched by CI/CD pipelines.

In Data projects, a very popular pattern called ELT (Extract Load Transform) is used, in which a lot of business logic ends up in SQL queries.

It’s important to check, with automated tests, that the SQL queries work as expected and return the expected data.

Like any other piece of software, SQL queries need to be tested with code and not only manually.

The idea behind BigTesty is to execute tests on real infrastructure, which is why we use the terms integration and end-to-end testing.

In this case, rather than using an emulator, we aim to execute tests directly within the BigQuery engine.

I created a video on this topic on my GCP YouTube channel; feel free to subscribe to the channel to support my work for the Google Cloud community:

English version

French version

2. Installation of the project

BigTesty is written in Python and published both as a PyPi package and as a Docker image.

It can be used from your project via either of them, according to your needs and preferences.

2.1. PyPi Package

The project can be installed from PyPi with pip:

pip install bigtesty

2.2 Docker image

Pull the Docker image with the following command:

docker pull groupbees/bigtesty

3. Architecture of BigTesty

3.1 The current project

On the left side, we have the current project that uses BigTesty.

This project needs to provide the following configuration files:

  • The test definition files, in JSON format
  • The tables configuration file, in JSON format

Example of code structure:

The root test folder

This folder contains all the test definition files and the test scenarios. The format is JSON.

The test definition files in JSON format

Example of a scenario with a nominal case, definition_spec_failure_by_job_name_no_error.json:

{
  "description": "Test of monitoring data",
  "scenarios": [
    {
      "description": "Nominal case find failure by job name",
      "given": [
        {
          "input": [
            {
              "featureName": "myFeature",
              "jobName": "jobPSG",
              "pipelineStep": "myPipeline",
              "inputElement": "myInputElement",
              "exceptionType": "myExceptionType",
              "stackTrace": "myStackTrace",
              "componentType": "myComponentType",
              "dwhCreationDate": "2022-05-06 17:38:10",
              "dagOperator": "myDagOperator",
              "additionalInfo": "info"
            }
          ],
          "destination_dataset": "monitoring",
          "destination_table": "job_failure"
        }
      ],
      "then": [
        {
          "fields_to_ignore": [
            "\\[\\d+\\]\\['dwhCreationDate']"
          ],
          "assertion_type": "row_match",
          "actual": "SELECT * FROM `monitoring.job_failure` WHERE jobName = 'jobPSG'",
          "expected": [
            {
              "featureName": "myFeature",
              "jobName": "jobPSG",
              "pipelineStep": "myPipeline",
              "inputElement": "myInputElement",
              "exceptionType": "myExceptionType",
              "stackTrace": "myStackTrace",
              "componentType": "myComponentType",
              "dwhCreationDate": "2022-05-06 17:38:10",
              "dagOperator": "myDagOperator",
              "additionalInfo": "info"
            }
          ]
        }
      ]
    }
  ]
}

In the main block, we have a description of the current tests and a list of scenarios.

A scenario object contains two blocks:

  • given
  • then

The given block:

This block provides the input test data:

  • input: an embedded JSON list
  • input_file_path: a separate file containing the input test data
  • destination_dataset: the BigQuery output dataset
  • destination_table: the BigQuery destination table where the input test data will be ingested

input and input_file_path are mutually exclusive; if both fields are passed, input takes priority (see the example below).
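
As an illustration, here is a minimal given block that uses input_file_path instead of an embedded input list; the file path is hypothetical and only shows the expected shape:

"given": [
  {
    "input_file_path": "monitoring/given/input_failures.json",
    "destination_dataset": "monitoring",
    "destination_table": "job_failure"
  }
]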

The then block:

  • assertion_type: the assertion type; row_match corresponds to a comparison between the actual and expected lists
  • actual: the BigQuery SQL query to test
  • actual_file_path: a separate file containing the SQL query to test
  • expected: the expected data (array)
  • expected_file_path: a separate file containing the expected data

actual and actual_file_path are mutually exclusive; if both fields are passed, actual takes priority.

The same principle applies to expected and expected_file_path (see the example below).
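
As an illustration, a row_match then block based on separate files could look like the following; the file paths are hypothetical, modeled on the examples in this article:

"then": [
  {
    "assertion_type": "row_match",
    "fields_to_ignore": [
      "\\[\\d+\\]\\['dwhCreationDate']"
    ],
    "actual_file_path": "monitoring/when/find_failures_by_job_name.sql",
    "expected_file_path": "monitoring/then/expected_failures_job_name.json"
  }
]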

It is also possible to use the function_assertion assertion type.
This assertion type gives more flexibility because users can write their own tests and assertion logic in several functions, as in the following then block:

"then": [
{
"assertion_type": "function_assertion",
"actual_file_path": "monitoring/when/find_failures_by_feature_name.sql",
"expected_file_path": "monitoring/then/expected_failures_feature_name.json",
"expected_functions": [
{
"module": "monitoring/then/assertion_functions/expected_functions.py",
"function": "check_actual_matches_expected"
},
{
"module": "monitoring/then/assertion_functions/expected_functions.py",
"function": "check_expected_failure_has_feature_name_psg"
},
{
"module": "monitoring/then/assertion_functions/expected_functions_failure_job_name.py",
"function": "check_expected_failure_has_job_name_my_job"
}
]
}
]

For an expected_functions object, we need to provide:

  • The module that corresponds to the Python file containing the function
  • The function that corresponds to the name of the function in the previous module

Example with the file expected_functions.py and the function check_actual_matches_expected:

import re
from typing import Dict, List

import deepdiff


def check_actual_matches_expected(actual: List[Dict], expected: List[Dict]) -> None:
    print("################### Assertion Function : check_actual_matches_expected ###########################")

    # Compare the actual and expected rows, ignoring order and the runtime-generated dwhCreationDate field.
    result: Dict = deepdiff.DeepDiff(actual,
                                     expected,
                                     ignore_order=True,
                                     exclude_regex_paths=[re.compile(r"\[\d+\]\['dwhCreationDate']")],
                                     view='tree')

    assert result == {}

    # dwhCreationDate is computed at runtime: it must be present and non-null.
    for element in expected:
        assert element['dwhCreationDate'] is not None

In this function, users can write their own assertion logic based on the actual and expected data; here, the logic checks whether the actual data matches the expected data.

This function also checks that the field dwhCreationDate, calculated at runtime, is present.

We plan to offer other assertion types in the future, to give users more possibilities.

The root tables folder

This folder contains the resources for the BigQuery datasets and tables to create.
For example, all the BigQuery table schemas are kept in this folder.

The tables config file

The configuration file lists, in JSON format, all the BigQuery datasets and tables to create.

Example:

[
  {
    "datasetId": "monitoring",
    "datasetRegion": "EU",
    "datasetFriendlyName": "Monitoring Dataset",
    "datasetDescription": "Monitoring Dataset description",
    "tables": [
      {
        "tableId": "job_failure",
        "autodetect": false,
        "tableSchemaPath": "schema/monitoring/job_failure.json",
        "partitionType": "DAY",
        "partitionField": "dwhCreationDate",
        "clustering": [
          "featureName",
          "jobName",
          "componentType"
        ]
      }
    ]
  }
]

In this example, we have a dataset called monitoring containing the job_failure table.

The tableSchemaPath field in the JSON block related to a table points to the BigQuery table schema, located in the root tables folder (see the excerpt below).
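
For illustration, a plausible excerpt of such a schema file (schema/monitoring/job_failure.json), written in the standard BigQuery JSON schema format; the actual file in the BigTesty examples may contain more fields or differ slightly:

[
  { "name": "featureName", "type": "STRING", "mode": "NULLABLE" },
  { "name": "jobName", "type": "STRING", "mode": "NULLABLE" },
  { "name": "componentType", "type": "STRING", "mode": "NULLABLE" },
  { "name": "dwhCreationDate", "type": "TIMESTAMP", "mode": "NULLABLE" }
]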

3.2 The BigTesty library logic

BigTesty is based on the Pulumi Automation API to manage the infrastructure.

By default, an ephemeral infrastructure is used for the tests:

  • Setup: create an ephemeral infrastructure
  • Teardown: destroy the ephemeral infrastructure
  • Rollback: if an error occurs in the pipeline and a test infrastructure remains, it is destroyed.

We can keep the infrastructure if necessary, via a CLI parameter called keep-infra, to let developers analyze the data generated for the tests.

4. Run with CLI and the Python package

You need to be authenticated with Google Cloud Platform before running tests with BigTesty.

We recommend authenticating with Application Default Credentials:

gcloud auth application-default login

To launch the tests, we need to pass, on the command line, the three parameters described in the previous section:

  • root-test-folder: the root folder containing all the test files
  • root-tables-folder: the root folder containing all the files needed to create the datasets and tables in BigQuery (JSON schema…)
  • tables-config-file: the JSON configuration file that lists all the datasets and tables to create in BigQuery

We also need to pass common GCP parameters:

  • project: the GCP project ID
  • region: the GCP region

BigTesty internally uses an ephemeral infrastructure via Infrastructure as Code, and the backend hosting the state must be a Cloud Storage bucket.
We need to pass the backend URL as a CLI parameter:

  • iac-backend-url

The tests can be executed with the following command line:

bigtesty test \
--project $PROJECT_ID \
--region $LOCATION \
--iac-backend-url gs://$IAC_BUCKET_STATE/bigtesty \
--root-test-folder $(pwd)/examples/tests \
--root-tables-folder $(pwd)/examples/tests/tables \
--tables-config-file $(pwd)/examples/tests/tables/tables.json

All the test files shown in the documentation are available in the examples folder at the root of the BigTesty repo.

5. Run with Docker

Instead of passing the arguments on the CLI, we can also pass them as environment variables.

export PROJECT_ID={{project_id}}
export LOCATION={{region}}
export IAC_BACKEND_URL=gs://{{gcs_state_bucket}}/bigtesty
export ROOT_TEST_FOLDER=/opt/bigtesty/tests
export ROOT_TABLES_FOLDER=/opt/bigtesty/tests/tables
export TABLES_CONFIG_FILE_PATH=/opt/bigtesty/tests/tables/tables.json
docker run -it \
-e GOOGLE_PROJECT=$PROJECT_ID \
-e GOOGLE_REGION=$LOCATION \
-e IAC_BACKEND_URL=$IAC_BACKEND_URL \
-e TABLES_CONFIG_FILE="$TABLES_CONFIG_FILE_PATH" \
-e ROOT_TEST_FOLDER=$ROOT_TEST_FOLDER \
-e ROOT_TABLES_FOLDER="$ROOT_TABLES_FOLDER" \
-v $(pwd)/examples/tests:/opt/bigtesty/tests \
-v $(pwd)/examples/tests/tables:/opt/bigtesty/tests/tables \
-v $HOME/.config/gcloud:/opt/bigtesty/.config/gcloud \
groupbees/bigtesty test

Some explanations:

All the parameters are passed as environment variables.
We also need to mount the following as volumes:

  • the tests root folder: -v $(pwd)/examples/tests:/opt/bigtesty/tests
  • the tables root folder: -v $(pwd)/examples/tests/tables:/opt/bigtesty/tests/tables
  • the gcloud configuration: -v $HOME/.config/gcloud:/opt/bigtesty/.config/gcloud

When authentication is done with Application Default Credentials via the command gcloud auth application-default login,
a short-lived credential is generated in the local gcloud configuration: $HOME/.config/gcloud

To avoid using a long-lived service account key, we can mount the local gcloud configuration as a volume in the Docker container: -v $HOME/.config/gcloud:/opt/bigtesty/.config/gcloud
With this technique, the container is authenticated to Google Cloud securely, with the current user of your shell session.

6. Run with Cloud Build

export PROJECT_ID={{project_id}}
export LOCATION={{region}}
export IAC_BACKEND_URL=gs://{{gcs_state_bucket}}/bigtesty
gcloud builds submit \
--project=$PROJECT_ID \
--region=$LOCATION \
--config examples/ci/cloud_build/run-tests-cloud-build.yaml \
--substitutions _IAC_BACKEND_URL=$IAC_BACKEND_URL \
--verbosity="debug" .

Other CI tools are supported, like GitHub Actions and GitLab CI; check the official documentation for more details.

Conclusion

This article presented the BigTesty library and explained why this kind of tool can be really useful for the BigQuery and Data communities.

BigQuery is a very popular tool used in many projects.
The SQL queries contain a lot of logic that needs to be tested.

BigTesty tries to solve this issue with integration tests on BigQuery, and the executions can easily be automated in CI/CD pipelines.

We plan to bring further improvements in the future, including integration with popular tools like dbt and Dataform.

Stay tuned, and feel free to contribute and add a GitHub star 🌟 to help improve the project and make it more popular.

The GitHub repository:

The official documentation:

If you like my articles and videos and want to see my posts, follow me on:
