How to work with open formats on BigQuery

Rodrigo Souza
Google Cloud - Community
11 min read · Jan 10, 2025

TL;DR: BigQuery supports various open data formats like Avro, Parquet, ORC, JSON, and CSV, allowing you to query externally stored data through external tables. BigQuery also supports open table formats like Iceberg, Delta Lake, and Hudi for more advanced data management. This post provides code examples and explanations for working with these formats in BigQuery.

Introduction to Open Formats

In technology, “open formats” refer to publicly accessible specifications that define data structure and storage, allowing anyone to use, modify, and distribute them. These formats are crucial for data interoperability, enabling seamless data exchange between systems without vendor lock-in, while promoting transparency and long-term accessibility. Google BigQuery supports many open data and table formats, enabling efficient analysis of diverse data sources. This post explores BigQuery’s interaction with these formats, including data formats like Parquet and table formats like Iceberg.

Open Data Formats vs. Open Table Formats

When dealing with data, we often encounter two distinct types of open formats: data formats and table formats. While both are essential for managing and analyzing information, they serve different purposes and operate at different levels of abstraction.

Open Data Formats

Open data formats define the structure of individual data items or records. They specify how data is encoded and organized within a file. Key characteristics of open data formats include:

  • Focus on individual records: They define how a single unit of data is represented (e.g., a row in a table, a message in a stream).
  • Encoding specifics: They specify the encoding used for data types (e.g., how integers, strings, dates, or complex objects are represented as bytes).
  • File structure: They define the overall structure of the file containing the data (e.g., binary layout, text-based structure).

Common examples of open data formats include:

  • Avro: A row-based serialization framework known for its schema evolution capabilities.
  • Parquet: A columnar storage format optimized for analytical queries.
  • ORC (Optimized Row Columnar): Another columnar format similar to Parquet, often used in Hadoop ecosystems.
  • JSON (JavaScript Object Notation): A lightweight, text-based format widely used for data exchange on the web.
  • CSV (Comma-Separated Values): A simple, text-based format where values are separated by commas.

Open Table Formats

Open table formats, on the other hand, operate at a higher level of abstraction. They define how a collection of data files (typically stored in an open data format like Parquet or ORC) is organized and managed to represent a table. Key characteristics of open table formats include:

  • Metadata management: They manage metadata about the table, such as the schema, data partitioning, and data file locations.
  • Table operations: They provide support for table-level operations like schema evolution, time travel (querying historical data), and atomic updates.
  • Data organization: They define how data files are organized within the table’s storage location (e.g., directory structures, partitioning schemes).

Open table formats provide a structured layer on top of object storage. Notable examples are:

  • Apache Iceberg: This table format uses a metadata layer to track data files, enabling features like schema evolution, efficient query planning based on data partitioning and filtering, and time travel for auditing and data recovery.
  • Delta Lake: Built on top of Parquet files, Delta Lake adds a transaction log that provides ACID properties, enabling reliable data updates and consistent views of data for Spark-based workloads.
  • Apache Hudi: This format offers similar features to Iceberg and Delta Lake, with a focus on efficient record-level updates and deletes. It’s particularly well-suited for incremental data ingestion and change data capture use cases.

By understanding the distinction between open data formats and open table formats, you can better appreciate the different roles they play in modern data architectures and how they contribute to data interoperability and efficient data management.

Why Use BigQuery with Open Formats?

BigQuery shines when combined with open formats, offering a powerful and flexible solution for a true unified data platform. Here’s why:

  • Simplified Data Management: BigQuery eliminates the need for complex data ingestion pipelines or data movement. You can directly query your data in its native open format stored on a compatible storage system. This reduces data management overhead and simplifies your data infrastructure.
  • Cost Optimization: By leveraging existing open data files, you avoid unnecessary data duplication within BigQuery’s managed storage. This can lead to significant cost savings, especially for large datasets.
  • Advanced Analytics: BigQuery’s serverless architecture and pay-per-use model allow you to scale your analytics workloads efficiently. Combined with the rich functionality of open formats like schema evolution and efficient query planning, you can perform complex data analysis tasks with ease.
  • Interoperability: Open formats provide a common ground for data exchange between various tools and platforms. BigQuery seamlessly integrates with these formats, allowing you to leverage data stored outside of BigQuery for analysis or combine data from multiple sources for a holistic view.

Considerations for Using Open Formats

When talking about open formats in general, we need to understand some basic concepts that will later translate into either a benefit or a limitation, depending on the overall data architecture. Two main points drive most of this discussion:

  • Storage location: all open formats are stored outside of BigQuery, in data stores such as Cloud Storage, Amazon S3, Azure Blob Storage, or even Google Drive. Because the data lives “outside”, security, performance, and costs need to be considered.
  • Data management: practically all open formats (with the exception of BigQuery tables for Apache Iceberg) are not managed by BigQuery itself, which means you cannot modify them using DML or other methods. They are treated as read-only objects. Any modification to the data itself must be done by the system that creates and manages it.

Although the aim of this post is not to spark a full discussion of where and how to store the data, it is worth knowing that there are trade-offs. While querying external data avoids loading it into BigQuery and increases your ability to integrate systems using open formats, there are things to consider. Query performance can be slower than querying native BigQuery tables due to network latency and data format translation. External tables are read-only; DML is not supported. Data consistency may not be real-time; changes to the underlying data while a query is running can result in unexpected behavior. Finally, external storage and network costs (depending on the region/cloud provider) may apply.

Accessing Open Formats in BigQuery

BigQuery accesses open formats primarily through two mechanisms: external tables and BigLake external tables. I won’t cover BigQuery Omni tables or object tables in this post, but they are also considered part of the external table set.

The key difference between BigLake and non-BigLake external tables lies in how data access is managed.

  • External tables grant direct access to the external data source (e.g., GCS), requiring users to have permissions on both the BigQuery table and the underlying storage.
  • BigLake external tables use delegated access via an external connection tied to a service account.

BigLake external tables simplify security, as users only need permissions on the BigQuery table. BigLake also offers enhanced security features like row-level and column-level security and dynamic data masking for Cloud Storage-based tables. Additionally, BigLake can leverage cached metadata to improve query performance.
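
To illustrate the caching point, the sketch below shows the two caching-related options a BigLake external table can set at creation time. It is only a minimal example: the table name, connection, bucket path, and staleness window are placeholders, not recommendations.

CREATE OR REPLACE EXTERNAL TABLE `your-project.your_dataset.your_cached_table`
WITH CONNECTION `your-region.your_connection_name`
OPTIONS (
format = 'PARQUET',
uris = ['gs://your-bucket/parquet/*.parquet'],
max_staleness = INTERVAL 4 HOUR, -- queries may use cached metadata up to 4 hours old
metadata_cache_mode = 'AUTOMATIC' -- refresh the cache in the background; 'MANUAL' requires explicit refreshes
);

With 'MANUAL' mode, the cache is refreshed on demand by calling the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure.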

When creating either type of external table, you specify the data format and location (URI) of the data files. For self-describing formats like Avro, Parquet, and ORC, BigQuery infers the schema from the data itself. For non-self-describing formats (CSV, JSON), you must provide a schema using autodetect, an inline schema, or a JSON schema file in the CREATE EXTERNAL TABLE statement.

The following examples focus on BigLake external tables, which are recommended since they provide a better experience thanks to their security and caching features. For a complete list of CREATE options for both external and BigLake external tables, please consult the BigQuery documentation.

Avro

CREATE OR REPLACE EXTERNAL TABLE `your-project.your_dataset.your_avro_table`
WITH CONNECTION `your-region.your_connection_name`
OPTIONS (
format = 'AVRO',
uris = ['gs://your-bucket/avro/your_avro_file.avro']
);

Avro is a row-based serialization framework developed within the Apache Hadoop project. It uses JSON-defined schemas to describe data structure, offering advantages such as schema evolution, a compact binary format, and efficient serialization/deserialization. It is commonly used within Hadoop ecosystems, for data exchange between systems written in different programming languages, and in event streaming and messaging systems. However, it’s less efficient for analytical queries than columnar formats.

ORC (Optimized Row Columnar)

CREATE OR REPLACE EXTERNAL TABLE `your-project.your_dataset.your_orc_table`
WITH CONNECTION `your-region.your_connection_name`
OPTIONS (
format = 'ORC',
uris = ['gs://your-bucket/orc/your_orc_file.orc']
);

ORC (Optimized Row Columnar) is a columnar storage format optimized for Hadoop workloads, providing efficient storage and query performance for analytical queries. It is commonly used in data warehousing and analytical workloads. ORC offers several advantages, including efficient analytical query performance due to its columnar storage, support for schema evolution, and good compression. However, it is less widely used outside of Hadoop environments compared to Parquet.

Parquet

CREATE OR REPLACE EXTERNAL TABLE `your-project.your_dataset.your_parquet_table`
WITH CONNECTION `your-region.your_connection_name`
OPTIONS (
format = 'PARQUET',
uris = ['gs://your-bucket/parquet/your_parquet_file.parquet']
);

Parquet is a columnar storage format widely used in big data ecosystems, designed for efficient analytical queries and providing good compression. It finds common use in data warehousing and analytical workloads, data lakes, and machine learning applications. Its key advantages include high efficiency for analytical queries due to its columnar storage, good compression capabilities, and wide adoption and support across the industry. However, Parquet is less efficient for transactional workloads involving frequent updates or inserts.
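
Parquet data lakes are often laid out in Hive-style partition directories (for example gs://your-bucket/parquet/your_table/dt=2025-01-01/). When that is the case, BigQuery can derive partition columns from the directory names and prune files accordingly. A minimal sketch, assuming a hypothetical layout partitioned by a dt column:

CREATE OR REPLACE EXTERNAL TABLE `your-project.your_dataset.your_partitioned_parquet_table`
WITH PARTITION COLUMNS -- partition columns (here, dt) are inferred from the directory names
WITH CONNECTION `your-region.your_connection_name`
OPTIONS (
format = 'PARQUET',
uris = ['gs://your-bucket/parquet/your_table/*'],
hive_partition_uri_prefix = 'gs://your-bucket/parquet/your_table'
);

Queries that filter on dt can then skip entire partitions instead of scanning every file.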

CSV (and other delimited text files)

CREATE OR REPLACE EXTERNAL TABLE `your-project.your_dataset.your_csv_table`
(
<column_definition>
)
WITH CONNECTION `your-region.your_connection_name`
OPTIONS (
format = 'CSV', -- Even for other delimiters, use CSV format
uris = ['gs://your-bucket/csv/your_csv_file.csv'],
skip_leading_rows = 1, -- If the file has a header
field_delimiter = ',', -- Specify the delimiter
quote = '"' --Specify the quote character to handle embedded commas
);

CSV (Comma-Separated Values) is a simple, text-based, comma-delimited format commonly used for data import/export, basic data exchange, and storing text-based data like logs. Its advantages are simplicity, human-readability, and widespread support. However, CSV lacks a schema, requiring explicit definition in systems like BigQuery, is inefficient for complex data and large datasets, presents challenges with embedded delimiters, and lacks data type information, necessitating type conversions in queries. Other delimited files (TSV, etc.) are handled similarly by adjusting the delimiter.
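
Because an external CSV table only carries the types you declare (often just STRING), numeric and date columns usually need explicit conversion at query time. A minimal sketch, assuming hypothetical order_id, order_date, and amount columns declared as STRING:

SELECT
CAST(order_id AS INT64) AS order_id,
PARSE_DATE('%Y-%m-%d', order_date) AS order_date,
SAFE_CAST(amount AS NUMERIC) AS amount -- SAFE_CAST returns NULL instead of failing on malformed values
FROM `your-project.your_dataset.your_csv_table`;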

JSON

-- For Newline-Delimited JSON (one JSON object per line):
CREATE OR REPLACE EXTERNAL TABLE `your-project.your_dataset.your_json_nd_table`
WITH CONNECTION `your-region.your_connection_name`
OPTIONS (
format = 'NEWLINE_DELIMITED_JSON',
uris = ['gs://your-bucket/json/your_nd_json_file.json']
);

JSON (JavaScript Object Notation) is a lightweight, text-based format widely used for data exchange on the web. It is often used for log files, streaming data, and data exchange between web services. Its advantages include relatively easy parsing, flexible data structures, and widespread use and support. However, JSON is not as efficient for analytical queries as columnar formats and requires parsing within queries using JSON functions.
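
When part of the payload remains nested (for example, a column declared as JSON or kept as a raw JSON string), GoogleSQL’s JSON functions extract individual fields at query time. A minimal sketch, assuming a hypothetical payload column containing an object with user.id and event_type fields:

SELECT
JSON_VALUE(payload, '$.user.id') AS user_id, -- JSON_VALUE returns the scalar as a STRING
JSON_VALUE(payload, '$.event_type') AS event_type,
COUNT(*) AS events
FROM `your-project.your_dataset.your_json_nd_table`
GROUP BY user_id, event_type;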

Apache Iceberg

BigQuery offers two distinct approaches for working with Iceberg tables: “BigLake External Tables for Iceberg” and “BigQuery Tables for Apache Iceberg”.

BigLake External Tables for Iceberg (Managed Externally — Read Only)

CREATE OR REPLACE EXTERNAL TABLE `your-project.your_dataset.your_external_iceberg_table`
WITH CONNECTION `your-region.your_connection_name`
OPTIONS (
format = 'ICEBERG',
uris = ["gs://mybucket/mydata/mytable/metadata/iceberg.metadata.json"]
)

BigLake external tables for Iceberg enable BigQuery to query Iceberg tables managed by external systems like Apache Spark on Dataproc, Hive, or other Iceberg catalog implementations. This approach is useful for hybrid architectures requiring access to Iceberg data from multiple tools (e.g., Spark, Presto, Trino, BigQuery) while maintaining external control over table management.

Creating a BigLake external table for Iceberg in BigQuery involves connecting to the external metadata. While using a JSON metadata file to point to a specific table snapshot is possible (as in the example above), it requires manual updates whenever the table changes and is generally not recommended (see the BigQuery documentation). For Google Cloud environments, the BigLake Metastore is the preferred approach, as it enables automatic synchronization of tables between Spark and BigQuery workloads.

BigQuery Tables for Apache Iceberg (Managed by BigQuery)

CREATE OR REPLACE TABLE `your-project.your_dataset.your_iceberg_table`(
<column_definition>
)
WITH CONNECTION `your-region.your_connection_name`
OPTIONS (
file_format = 'PARQUET',
table_format = 'ICEBERG',
storage_uri = 'gs://your-bucket/iceberg/your_table_name'
);

BigQuery Tables for Apache Iceberg, on the other hand, offer a powerful and integrated approach to working with Iceberg directly within BigQuery, eliminating the need for a separate Iceberg catalog. These tables provide several key features for modern data warehousing and lakehouse architectures, including:

  • Table mutations using GoogleSQL DML.
  • Unified batch and high-throughput streaming through the Storage Write API with BigLake connectors (like Spark and Dataflow).
  • Flexible schema evolution (adding, dropping, renaming columns, and changing data types).
  • Automatic storage optimization (adaptive file sizing, automatic clustering, garbage collection, and metadata optimization).
  • Robust security features like column-level security and data masking.

At the time of this writing, this feature was in Preview. Creating an Iceberg table in BigQuery is similar to creating standard BigQuery tables but offers additional options due to its storage in open formats on Cloud Storage. These options include: specifying the Cloud resource connection with WITH CONNECTION to configure access to Cloud Storage; specifying the data storage file format with file_format (PARQUET is currently supported in Preview); and specifying the metadata table format with table_format (ICEBERG is currently supported in Preview).
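
Once such a table exists, the first bullet above translates into ordinary GoogleSQL DML. A minimal sketch, assuming hypothetical id, name, and updated_at columns in the column definition:

-- Regular DML works against BigQuery tables for Apache Iceberg
INSERT INTO `your-project.your_dataset.your_iceberg_table` (id, name, updated_at)
VALUES (1, 'first row', CURRENT_TIMESTAMP());

UPDATE `your-project.your_dataset.your_iceberg_table`
SET name = 'renamed row'
WHERE id = 1;

DELETE FROM `your-project.your_dataset.your_iceberg_table`
WHERE id = 1;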

Delta Lake

CREATE OR REPLACE EXTERNAL TABLE `your-project.your_dataset.your_external_deltalake_table`
WITH CONNECTION `your-region.your_connection_name`
OPTIONS (
format ="DELTA_LAKE",
uris=['DELTA_TABLE_GCS_BASE_PATH']);

BigLake external tables for Delta Lake enable BigQuery to query Delta Lake tables managed by external systems, typically Apache Spark on platforms like Databricks or other Spark deployments. As with BigLake External Tables for Iceberg, querying is supported, but data modification (DML operations) must be performed using the external system responsible for managing the Delta Lake table.

Manifest Files (Hudi/Delta workloads)

CREATE OR REPLACE EXTERNAL TABLE `your-project.your_dataset.your_hudi_table`
WITH CONNECTION `your-region.your_connection_name`
OPTIONS (
uris = ['gs://your-bucket/your_hudi_table/manifest/last-snapshot.csv'],
format = 'PARQUET',
file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST');

Manifest files are used to manage and query collections of data files. These files act as a central catalog, listing the individual data files that comprise a logical dataset. A manifest file is essentially a list of file paths or URIs, along with optional metadata about each file. BigQuery supports querying data through manifest files, providing a flexible way to access and analyze data stored in various formats and locations.
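
For context, a newline-delimited manifest is simply a text object in which each line is the URI of one data file belonging to the current snapshot. The paths below are purely illustrative:

gs://your-bucket/your_hudi_table/partition=2025-01-01/part-0000.parquet
gs://your-bucket/your_hudi_table/partition=2025-01-01/part-0001.parquet

These manifests are typically generated by the engine that manages the table (for example, Hudi’s BigQuery sync tooling or Delta Lake’s manifest generation) rather than written by hand.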

Conclusion

The landscape of open formats in BigQuery can be confusing at first due to the array of options with similar names and different sets of benefits and limitations. However, this also highlights BigQuery’s strength in supporting a wide variety of open formats, making it a versatile tool in an ever-changing data landscape.

As you navigate this landscape, keep in mind that the best choice is often the one that simplifies your workflow and empowers you to focus on what matters most: extracting insights from your data and bringing business value to your customers.

Now that you’ve gained a deeper understanding of how to work with open formats in BigQuery, it’s time to put your knowledge into action.

  • Explore the possibilities: Experiment with different open formats and discover how they can optimize your data storage, processing, and analysis.
  • Unlock the power of BigQuery: Leverage BigQuery’s capabilities to manage and analyze your data efficiently, regardless of its format and location, all within one unified data platform.

Ready to take the next step? Dive into BigQuery and experience the benefits of open formats firsthand. Your data-driven journey has just begun!
