BigTesty: Integration testing library for BigQuery
1. Why?
We have worked with several customers on Google Cloud and data projects, and we kept seeing the same issue: data engineers and developers generally test their SQL queries only in the BigQuery console, not with automated tests launched by CI/CD pipelines.
In data projects, a very popular pattern called ELT (Extract Load Transform) is used, in which a lot of business logic lives in SQL queries.
It is important to check with automated tests that the SQL queries work as expected and return the expected data.
Like any other piece of software, SQL queries need to be tested with code and not only manually.
The idea behind BigTesty is to execute tests against real infrastructure, which is why we use the terms integration testing and end-to-end testing.
In this case, rather than using an emulator, we aim to execute tests directly within the BigQuery engine.
I created a video on this topic on my GCP YouTube channel; feel free to subscribe to the channel to support my work for the Google Cloud community:
- English version
- French version
2. Installation of the project
BigTesty is written in Python and published as:
- a Python package on PyPI
- a Docker image on Docker Hub
BigTesty can be used from your project via the PyPI package or the Docker image, depending on your needs and preferences.
2.1 PyPI package
The project can be installed from PyPI with pip:
pip install bigtesty
2.2 Docker image
Pull the Docker image with the following command:
docker pull groupbees/bigtesty
3. Architecture of BigTesty
3.1 The current project
On the left side, we have the current project that uses BigTesty.
This project needs to provide the following configuration files:
- The test definition files, in JSON format
- The tables configuration file, in JSON format
Example of code structure:
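The layout below is an illustrative sketch, built only from the example paths used later in this article; the exact placement of the files inside the two root folders is an assumption and is up to you:

examples/tests/                                      # root test folder
    definition_spec_failure_by_job_name_no_error.json
    monitoring/
        when/find_failures_by_feature_name.sql
        then/expected_failures_feature_name.json
        then/assertion_functions/expected_functions.py
    tables/                                          # root tables folder
        tables.json                                  # tables config file
        schema/monitoring/job_failure.json           # BigQuery table schema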
The root test folder
This folder contains all the test definition files and the test scenarios. The format is JSON.
Example of a scenario with a nominal case, definition_spec_failure_by_job_name_no_error.json:
{
"description": "Test of monitoring data",
"scenarios": [
{
"description": "Nominal case find failure by job name",
"given": [
{
"input": [
{
"featureName": "myFeature",
"jobName": "jobPSG",
"pipelineStep": "myPipeline",
"inputElement": "myInputElement",
"exceptionType": "myExceptionType",
"stackTrace": "myStackTrace",
"componentType": "myComponentType",
"dwhCreationDate": "2022-05-06 17:38:10",
"dagOperator": "myDagOperator",
"additionalInfo": "info"
}
],
"destination_dataset": "monitoring",
"destination_table": "job_failure"
}
],
"then": [
{
"fields_to_ignore": [
"\\[\\d+\\]\\['dwhCreationDate']"
],
"assertion_type": "row_match",
"actual": "SELECT * FROM `monitoring.job_failure` WHERE jobName = 'jobPSG'",
"expected": [
{
"featureName": "myFeature",
"jobName": "jobPSG",
"pipelineStep": "myPipeline",
"inputElement": "myInputElement",
"exceptionType": "myExceptionType",
"stackTrace": "myStackTrace",
"componentType": "myComponentType",
"dwhCreationDate": "2022-05-06 17:38:10",
"dagOperator": "myDagOperator",
"additionalInfo": "info"
}
]
}
]
}
]
}
In the main block, we have a description of the current tests and a list of scenarios.
A scenario object contains two blocks:
- given
- then
The given block:
This block provides the input test data:
- input: an embedded JSON list
- input_file_path: a separate file containing the input test data
- destination_dataset: the BigQuery output dataset
- destination_table: the BigQuery destination table where the input test data will be ingested
The input and input_file_path fields are mutually exclusive; if both are passed, input takes priority. A sketch of a given block using input_file_path is shown below.
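As an illustration, here is what the same given block could look like with input_file_path instead of an embedded input. The file path is hypothetical; the referenced JSON file would contain the same list of rows as the embedded input shown above:

"given": [
  {
    "input_file_path": "monitoring/given/job_failure_input.json",
    "destination_dataset": "monitoring",
    "destination_table": "job_failure"
  }
]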
The then block:
- assertion_type: indicates the assertion type; row_match corresponds to a comparison between the actual list and the expected list
- actual: the BigQuery SQL query to test
- actual_file_path: a separate file containing the SQL query to test
- expected: the expected data (array)
- expected_file_path: a separate file containing the expected data
The actual and actual_file_path fields are mutually exclusive; if both are passed, actual takes priority. The same principle applies to expected and expected_file_path.
We can also pass the function_assertion assertion type.
This assertion type gives more flexibility, because users can write their own tests and assertion logic in several functions, for example with the following then block:
"then": [
{
"assertion_type": "function_assertion",
"actual_file_path": "monitoring/when/find_failures_by_feature_name.sql",
"expected_file_path": "monitoring/then/expected_failures_feature_name.json",
"expected_functions": [
{
"module": "monitoring/then/assertion_functions/expected_functions.py",
"function": "check_actual_matches_expected"
},
{
"module": "monitoring/then/assertion_functions/expected_functions.py",
"function": "check_expected_failure_has_feature_name_psg"
},
{
"module": "monitoring/then/assertion_functions/expected_functions_failure_job_name.py",
"function": "check_expected_failure_has_job_name_my_job"
}
]
}
]
For an expected_functions object, we need to give:
- The module, which corresponds to the Python file containing the function
- The function, which corresponds to the name of the function in that module
Example with the file expected_functions.py and the function check_actual_matches_expected:
import re
from typing import Dict, List

import deepdiff


def check_actual_matches_expected(actual: List[Dict], expected: List[Dict]) -> None:
    print("################### Assertion Function : check_actual_matches_expected ###########################")
    # Compare the actual and expected data, ignoring element order and the runtime-computed dwhCreationDate field.
    result: Dict = deepdiff.DeepDiff(actual,
                                     expected,
                                     ignore_order=True,
                                     exclude_regex_paths=[re.compile(r"\[\d+\]\['dwhCreationDate']")],
                                     view='tree')
    assert result == {}
    # The runtime-computed dwhCreationDate field must exist and be non-null.
    for element in expected:
        assert element['dwhCreationDate'] is not None
In this function, users can write their own assertion logic based on the actual and expected data; here, the logic checks that the actual data matches the expected data.
The function also checks that the dwhCreationDate field, which is calculated at runtime, exists.
We plan to propose other assertion types in the future, to give users more possibilities.
The root tables folder
This folder contains the resources concerning the BigQuery datasets and tables to create.
For example, all the BigQuery table schemas are placed in this folder.
The tables config file
This config file lists all the BigQuery datasets and tables to create, in JSON format.
Example:
[
{
"datasetId": "monitoring",
"datasetRegion": "EU",
"datasetFriendlyName": "Monitoring Dataset",
"datasetDescription": "Monitoring Dataset description",
"tables": [
{
"tableId": "job_failure",
"autodetect": false,
"tableSchemaPath": "schema/monitoring/job_failure.json",
"partitionType": "DAY",
"partitionField": "dwhCreationDate",
"clustering": [
"featureName",
"jobName",
"componentType"
]
}
]
}
]
In this example, we have a dataset called monitoring containing the job_failure table.
The tableSchemaPath field, in the JSON block related to a table, points to the BigQuery table schema provided in the root tables folder, as illustrated in the sketch below.
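As an illustration, the schema/monitoring/job_failure.json file could look like the following sketch, using the standard BigQuery JSON schema format. The field list and types are assumptions derived from the test data shown earlier, and only a few fields are shown:

[
  {"name": "featureName", "type": "STRING", "mode": "NULLABLE"},
  {"name": "jobName", "type": "STRING", "mode": "NULLABLE"},
  {"name": "pipelineStep", "type": "STRING", "mode": "NULLABLE"},
  {"name": "componentType", "type": "STRING", "mode": "NULLABLE"},
  {"name": "dwhCreationDate", "type": "TIMESTAMP", "mode": "NULLABLE"},
  {"name": "additionalInfo", "type": "STRING", "mode": "NULLABLE"}
]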
3.2 The BigTesty library logic
BigTesty is based on the Pulumi Automation API to manage the infrastructure.
By default, an ephemeral infrastructure is used for the tests:
- Setup: create an ephemeral infrastructure
- Teardown: destroy the ephemeral infrastructure
- Rollback: if there is an error in the pipeline and a test infrastructure is left behind, we destroy it.
We can keep the infrastructure if necessary, via a CLI parameter called keep-infra, to let developers analyze the data generated for the tests. The sketch below illustrates this lifecycle.
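To make the lifecycle concrete, here is a minimal sketch of the ephemeral-infra pattern with the Pulumi Automation API. It only illustrates the concept and does not reflect BigTesty's actual internal code; the stack name, project name and run_tests callback are hypothetical:

from pulumi import automation as auto


def pulumi_program():
    # The Pulumi program declaring the test datasets and tables would go here.
    pass


def run_with_ephemeral_infra(run_tests, keep_infra: bool = False):
    stack = auto.create_or_select_stack(
        stack_name="bigtesty-tests",   # hypothetical stack name
        project_name="bigtesty",       # hypothetical project name
        program=pulumi_program,
    )
    try:
        stack.up()        # Setup: create the ephemeral infrastructure
        run_tests()       # run the test scenarios against the real BigQuery engine
    finally:
        if not keep_infra:
            stack.destroy()  # Teardown / rollback: destroy the ephemeral infrastructure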
4. Run with CLI and the Python package
You need to be authenticated with Google Cloud Platform before running tests with BigTesty.
We recommend authenticating with Application Default Credentials:
gcloud auth application-default login
To launch the tests, we need to pass the three parameters described in the previous section on the command line:
- root-test-folder: the root folder containing all the test files
- root-tables-folder: the root folder containing all the files needed to create the datasets and tables in BigQuery (JSON schemas…)
- tables-config-file: the JSON configuration file that lists all the datasets and tables to create in BigQuery
We also need to pass common GCP parameters, such as:
- project: the GCP project ID
- region: the GCP region
BigTesty uses an ephemeral infrastructure internally, through Infrastructure as Code, and the backend hosting the state must be a Cloud Storage bucket.
We need to pass the backend URL as a CLI parameter:
- iac-backend-url
The tests can be executed with the following command line:
bigtesty test \
--project $PROJECT_ID \
--region $LOCATION \
--iac-backend-url gs://$IAC_BUCKET_STATE/bigtesty \
--root-test-folder $(pwd)/examples/tests \
--root-tables-folder $(pwd)/examples/tests/tables \
--tables-config-file $(pwd)/examples/tests/tables/tables.json
All the test files shown in the documentation are available in the examples folder at the root of the BigTesty repository.
5. Run with Docker
Instead of passing the arguments on the CLI, we can also pass them as environment variables.
export PROJECT_ID={{project_id}}
export LOCATION={{region}}
export IAC_BACKEND_URL=gs://{{gcs_state_bucket}}/bigtesty
export ROOT_TEST_FOLDER=/opt/bigtesty/tests
export ROOT_TABLES_FOLDER=/opt/bigtesty/tests/tables
export TABLES_CONFIG_FILE_PATH=/opt/bigtesty/tests/tables/tables.json
docker run -it \
-e GOOGLE_PROJECT=$PROJECT_ID \
-e GOOGLE_REGION=$LOCATION \
-e IAC_BACKEND_URL=$IAC_BACKEND_URL \
-e TABLES_CONFIG_FILE="$TABLES_CONFIG_FILE_PATH" \
-e ROOT_TEST_FOLDER=$ROOT_TEST_FOLDER \
-e ROOT_TABLES_FOLDER="$ROOT_TABLES_FOLDER" \
-v $(pwd)/examples/tests:/opt/bigtesty/tests \
-v $(pwd)/examples/tests/tables:/opt/bigtesty/tests/tables \
-v $HOME/.config/gcloud:/opt/bigtesty/.config/gcloud \
groupbees/bigtesty test
Some explanations:
All the parameters are passed as environment variables.
We also need to mount the following as volumes:
- the tests root folder: -v $(pwd)/examples/tests:/opt/bigtesty/tests
- the tables root folder: -v $(pwd)/examples/tests/tables:/opt/bigtesty/tests/tables
- the gcloud configuration: -v $HOME/.config/gcloud:/opt/bigtesty/.config/gcloud
When authentication is done with Application Default Credentials via the command gcloud auth application-default login, a short-lived credential is generated in the local gcloud configuration: $HOME/.config/gcloud.
To avoid using a long-lived service account key, we can share the local gcloud configuration with the Docker container by mounting it as a volume: -v $HOME/.config/gcloud:/opt/bigtesty/.config/gcloud
With this technique, the container is authenticated to Google Cloud securely, with the current user of your shell session.
6. Run with Cloud Build
export PROJECT_ID={{project_id}}
export LOCATION={{region}}
export IAC_BACKEND_URL=gs://{{gcs_state_bucket}}/bigtesty
gcloud builds submit \
--project=$PROJECT_ID \
--region=$LOCATION \
--config examples/ci/cloud_build/run-tests-cloud-build.yaml \
--substitutions _IAC_BACKEND_URL=$IAC_BACKEND_URL \
--verbosity="debug" .
Other CI tools, such as GitHub Actions and GitLab CI, are also supported; check the official documentation for more details.
Conclusion
This article presented the BigTesty library and explained why this kind of tool can really be useful for the BigQuery and data communities.
BigQuery is a very popular tool used in many projects.
The SQL queries contain a lot of logic that needs to be tested.
BigTesty tries to solve this issue with integration tests on BigQuery, and the executions can easily be automated in CI/CD pipelines.
We plan to bring evolutions in the future, including integrations with popular tools like dbt and Dataform.
Stay tuned; feel free to contribute and add a GitHub star 🌟 to help improve the project and make it more popular.
The GitHub repository:
The official documentation:
If you like my articles and videos and want to see my future posts, follow me on: