Query validation and optimization for BigQuery client applications using dry run

KARTIK MALIK
Google Cloud - Community
4 min readSep 24, 2023

Objective

This article explains how we can use the BigQuery APIs for job run and do Query Validation and optimization in your BigQuery client Applications before executing the query via dry run config. BigQuery supports client libraries in C#, Go, Java, Node.js, PHP, Python, and Ruby. For a more general overview of client libraries within Google Cloud, see Client Libraries Explained.

Background

BigQuery is a fully managed, serverless enterprise data warehouse that helps you manage and analyze your data with built-in features like machine learning, geospatial analysis, and business intelligence. Read more: https://cloud.google.com/bigquery/docs/introduction

This article explains how to run a query using BigQuery Job API and perform a dry run . We will also provide code snippets in Java to explain it better.

To explain better a dry run in BigQuery provides the following information:

  • estimate of charges in on-demand mode (getTotalBytesProcessed, getTotalBytesBilled ), etc
  • validation of your query i.e. parsing and checking syntactical correctness and optimality of query by providing a query plan [not covered in this article].
  • approximate size and complexity of your query in capacity mode

Thus it can be used to build a validation layer for your BigQuery client application.

Prerequisites

Whatever service account you are using in your client application should have the following IAM roles: roles/bigquery.job user, roles/bigquery.dataViewer at least. How to grant these roles read here: rolesGrantRevokeAccessOnGCP

Choose your Query Job type: Interactive versus batch queries

In BigQuery, you can run two types of queries as shown below. We suggest understanding them before writing you a BigQuery client.

  • Interactive query jobs, are jobs that BigQuery runs on demand.
  • Batch query jobs, which are jobs that BigQuery waits to run until idle compute resources are available. The below snippet explains how we can set the Job in Batch mode in the Java client.
QueryJobConfiguration queryConfig =
QueryJobConfiguration.newBuilder(sqlQuery)
// Run at batch priority, which won't count toward concurrent rate limit.
.setPriority(QueryJobConfiguration.Priority.BATCH)
.build();
Job job = bigquery.create(JobInfo.of(queryConfig));

By default, BigQuery runs your queries as interactive query jobs, which are run as soon as possible.

Sample code to for Java Client Application for running Interactive dry run Job

 public void validateQueryWithDryRun(String query,String gcp_project_name){
try {
BigQuery bigquery = BigQueryOptions.getDefaultInstance()
.getService();
QueryJobConfiguration queryConfig =
QueryJobConfiguration.newBuilder(query).setDryRun(true).
setUseQueryCache(false).
setLabels(ImmutableMap.of(Label1,"userLabel")).build();
Job queryJob = bigquery.create(JobInfo.of(queryConfig));
JobStatistics.QueryStatistics statistics = queryJob.getStatistics(). ;
if (queryJob.getStatus().getState().name()!="DONE") {
log.info("BQ dry run validator not successful =={} : \n" +
queryJob);
throw new Exception();
}
// JobStatistics.QueryStatistics statistics = queryJob.getStatistics();
long bytesProcessed = statistics.getTotalBytesProcessed();
double bytesBilled = statistics.getTotalBytesBilled();
Long slotMs = statistics.getTotalSlotMs();
} catch (BigQueryException e) {
log.error("Bq dry run error: " + e.getMessage());
}
}

Explanation

BigQuery’s dry run validation checks not only the syntax of the query but also performs semantic analysis. It verifies table and column existence, data types, permissions, and other aspects that are specific to BigQuery.

Cost of performing dry run on BQ

Dry runs do not use query slots, and you are not charged for performing a dry run. You can use the estimate returned by a dry run to calculate query costs in the pricing calculator.

Note: A dry run of a federated query that uses an external data source might report a lower bound of 0 bytes of data, even if rows are returned. This is because the amount of data processed from the external table can’t be determined until the actual query is completed. Running the federated query still incurs a cost for processing this data.

Advantage

  1. Can be leveraged to build a validation layer for your client application to reduce error rates by validating and verifying syntactical correctness.
  2. Based on the validation results and cost estimation, the user can make necessary modifications to the query, such as fixing errors, optimizing performance, or adjusting resource utilization, to ensure the query meets their requirements.
  3. Accurate Results: Since dry run validation happens on the actual BigQuery service, it considers the current state of the dataset, tables, and columns, ensuring accuracy in validation results.
  4. Dry Run Execution: Instead of executing the query against the actual data, BigQuery simulates the execution of the optimized query plan. It estimates the resource usage, execution time, and potential costs associated with the query.
  5. Validation Checks: During the dry run execution, BigQuery performs various validation checks, including syntax validation, column, and table existence checks, data type compatibility verification, and access permissions validation.
  6. Validation Results: BigQuery provides the validation results to the user, indicating any errors, warnings, or informational messages discovered during the dry run. These results help the user identify and address potential issues in the query.

Conclusion

In this article, we have explained the use of dry run feature of BigQuery and how we can write a validation layer for over Job/Query API in BigQuery client. We have also talked about the cost and advantages of using dry run feature of BigQuery. We have also shown a sample validation layer for Java-based applications in this article.

References

Hope you liked this article and found it useful. You can reach out on LinkedIn.

--

--