Evaluating BigQuery’s Performance with Diverse Data Formats on External Databases: A Comprehensive Analysis

BigQuery Performance Analysis

Priyanshu
Techsalo Infotech
4 min readJul 12, 2024

--

Figure 1) Display Different Format and how their performance affect in Big query

Introduction

Google BigQuery is a powerful tool for large-scale data analytics, but its performance can vary based on data formats . This article provides a comprehensive analysis of how different data formats — such as CSV, JSON, Avro, ORC ,Google Sheets and Parquet — impact BigQuery’s performance. By understanding these factors, you can optimize your data workflows for maximum efficiency with BigQuery.

Performance Analysis of CSV Data

Using first CSV data from this website source

https://www.stats.govt.nz/large-datasets/csv-files-for-download/ as it contains 50,985 of data

CSV is a simple and widely-used format for storing tabular data. While it is easy to read and write, it is not optimized for performance.

Figure 1) 50,985 Lines of csv Data using for performance analysis

Below is the Example Query using For Performance Analysis

WITH IndustryStats AS (
SELECT
Year,
Industry_aggregation_NZSIOC,
Industry_code_NZSIOC,
Industry_name_NZSIOC,
Units,
Variable_code,
Variable_name,
Variable_category,
SAFE_CAST(Value AS FLOAT64) AS Value,
Industry_code_ANZSIC06,
SUM(SAFE_CAST(Value AS FLOAT64)) OVER (PARTITION BY Industry_name_NZSIOC, Variable_name) AS TotalValue,
AVG(SAFE_CAST(Value AS FLOAT64)) OVER (PARTITION BY Industry_name_NZSIOC, Variable_name) AS AvgValue,
ROW_NUMBER() OVER (PARTITION BY Industry_name_NZSIOC ORDER BY SAFE_CAST(Value AS FLOAT64) DESC) AS RankValue
FROM
`genaiexperiments-1.Avartify.Annual_table`
WHERE
SAFE_CAST(Value AS FLOAT64) IS NOT NULL
),

FilteredStats AS (
SELECT
Year,
Industry_aggregation_NZSIOC,
Industry_code_NZSIOC,
Industry_name_NZSIOC,
Units,
Variable_code,
Variable_name,
Variable_category,
Value,
Industry_code_ANZSIC06,
TotalValue,
AvgValue,
RankValue
FROM
IndustryStats
WHERE
RankValue <= 3 -- Top 3 values per industry
)

SELECT
Year,
Industry_aggregation_NZSIOC,
Industry_code_NZSIOC,
Industry_name_NZSIOC,
Units,
Variable_code,
Variable_name,
Variable_category,
Value,
Industry_code_ANZSIC06,
TotalValue,
AvgValue,
RankValue
FROM
FilteredStats
ORDER BY
Industry_name_NZSIOC, Variable_name, RankValue;

Note: Query will be same so that there will be No difference in query performance

Results :

for the CSV Data Type

Performance Analysis of JSON Data Type

For the Analysis Purpose Using the Data just we convert it into different Types convert csv into ndjson as bigquery accept this format.JSON is a lightweight data interchange format that is easy for humans to read and write. However, it is not optimized for performance in BigQuery. JSON files tend to have higher processing times compared to CSV, Avro, and Parquet.

Results:

Performance Analysis of AVRO Data Type

For the Analysis Purpose Using the Data just we convert it into different Types convert csv into Avro. Avro is a row-based storage format that is compact and efficient for serialization. It supports rich data structures and is often used in data exchange scenarios.

Results:

Performance Analysis of ORC Data Type

For the Analysis Purpose Using the Data just we convert it into different Types convert csv into ORC.
ORC is a columnar storage format that is optimized for Hadoop workloads. While it offers good compression and performance benefits in Hadoop

Results:

Performance Analysis of Parquet Data Type

For the Analysis Purpose Using the Data just we convert it into different Types convert csv into Parquet. Parquet is a columnar storage file format optimized for use with Big Data processing frameworks. It is highly efficient for both storage and query performance.

Results:

Performance Analysis of Google Sheet Data Type

For the Analysis Purpose Convert all the Data into Google Sheet and then performed analysis with same query.
Google Sheets is a convenient tool for small-scale data storage and manipulation. However, it is not designed for performance in BigQuery. Among the formats analyzed, Google Sheets had the highest processing time

Results:

The Different files type stored in Google Cloud Storage

Conclusion

Based on my BigQuery analysis, I have concluded that the time taken for processing different file formats varies significantly. Parquet files exhibit the lowest processing time, followed by CSV files. AVRO files come next in terms of efficiency, followed by JSON files. ORC files take longer to process, and Google Sheets have the highest processing time among the formats analyzed. This hierarchy of processing times highlights the importance of choosing the right file format for optimizing performance in BigQuery.

Calculations Result

Note: We at Techsalo Infotech are a team of engineers solving complex Data engineering and Machine learning problems. Please reach out to us at sales@techsalo.com for any query on How to build these systems at scale and in the cloud.

--

--