Unlocking Efficiency and Flexibility: Clone Tables in BigQuery

Rio Dwi Putra Perkasa
GovTech Edu

--

Writer: Rio Dwi Putra Perkasa

In the world of data analytics and data engineering, data replication is a common practice, often performed to support a variety of purposes. Whether it’s creating a staging environment, enabling analytics teams to work on subsets of data, or ensuring disaster recovery, replicating tables is a necessity. However, while replication is essential, it can also lead to increased storage costs.

But why does data replication, a practice so vital to the data analytics and engineering domains, often come at the cost of inflated storage expenses? The answer lies in the inherent nature of replication. Each replicated dataset essentially multiplies the amount of data stored. As data volumes grow, so do the expenses. Additionally, the manual effort in keeping these replicated datasets in sync further contributes to the overall cost and the potential for errors.

The quest to find practical solutions to this common challenge led me to explore the possibilities for optimizing data replication. In this article, I want to share insights about features in BigQuery that potentially address the ever-persistent issue of increased storage costs when replicating data.

BigQuery has a feature called Table Clones. Cloning tables provide a simple and efficient way to create duplicate copies of your existing tables in BigQuery, enabling you to perform various operations without impacting the original dataset and, importantly, no initial cost for cloning the table. This article will explore the benefits, use cases, and best practices of using clone tables in BigQuery.

https://cloud.google.com/bigquery/docs/table-clones-intro

Understanding Table Clones

Table clone allows you to create exact replicas of your tables, including the table structure, metadata, partitioning, and, most importantly, the data itself. It’s like creating a twin of your table with all its contents intact. The best part? Clone Tables are cost-effective and offer a powerful way to work with data in BigQuery.

Advantages of Clone Tables:

  • Schema Preservation: When you clone a table, it keeps the same structure as the original. That means you don’t have to worry about redefining columns, data types, or field definitions. It’s a great way to maintain consistency in your data analysis processes.
  • Metadata Preservation: Clone tables also hold onto the metadata from the original table. So if you had essential descriptions, labels, or timestamps, they’re all carried over to the clone. It helps you keep track of the data’s context and history, which comes in handy for data governance and documentation.
  • Partitioning and Clustering: If your original table was partitioned or clustered, fear not! The clone table inherits the same configurations. So you can continue to query and analyze data based on partitions or clusters without starting from scratch. It saves time and optimizes performance.
  • Cost-Effectiveness: Here’s where clone tables really shine. Clone tables are smarter than traditional data copying methods that create full duplicates. They use a copy-on-write approach, which means the cloned table shares the same underlying data as the original until modifications are made. That’s a smart way to reduce storage costs and be budget-friendly.

Creating and Working with Cloned Tables

To demonstrate the power of Table Clones, let’s start by creating an initial table and then cloning it for further analysis.

Creating the Initial Table

For this example, let’s use the public “chicago_taxi_trips” dataset available in BigQuery. We’ll create an initial table called “taxi_trips_initial” by running the following query:

CREATE OR REPLACE TABLE `<project>.<dataset>.taxi_trips_initial`
AS
SELECT *
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE EXTRACT(YEAR FROM trip_start_timestamp) = 2022
LIMIT 1000

This query creates the “taxi_trips_initial” table by selecting 1000 rows from the “chicago_taxi_trips.taxi_trips” table where the trip start timestamp is in 2022. Change the project and dataset with yours.

Cloning the Initial Table

Once we have the initial table, we can create a clone table without affecting the original data. To clone the “taxi_trips_initial” table, we can use the following query:

CREATE TABLE `<project>.<dataset>.taxi_trips_clone`
CLONE `<project>.<dataset>.taxi_trips_initial`

This query creates a clone table called “taxi_trips_clone” based on the structure and data of the “taxi_trips_initial” table.

Now, with the cloned table at our disposal, we can continue exploring various use cases and running queries on the “taxi_trips_clone” table without affecting the original data. This provides a safe and isolated environment for analysis and experimentation.

For comparison, I have tried to clone a table with a storage description like this

Description of storage infor from initial table
Initial table

The result of clone is like this

cloned table

Although the storage info shows the total bytes the same as the initial table, the backend uses storage from the initial table. So, no new data is physically copied, but instead, the clone shares the underlying data with the original source.

Best Practices for Working with Table Clones

Table Clones in BigQuery provide a powerful tool for data analysis and experimentation. To make the most of this feature, consider the following best practices.

  1. Naming Convention: Use a clear and consistent naming convention for your cloned tables. Include a distinguishing term like “clone” or “copy” in the table name to differentiate it from the original table. This practice helps maintain clarity and avoids confusion when working with multiple tables.
  2. Documentation: Document the purpose and context of each cloned table. Add descriptive comments or annotations to the cloned table to provide information about its intended use, modifications made, or analysis performed. This documentation helps you and others understand the purpose and history of the cloned table.
  3. Regular Refresh or Recreate: Depending on your analysis requirements, periodically refresh or recreate your cloned tables to ensure they reflect the most up-to-date data. This step is crucial, especially if your original table is regularly updated or if you’re working with time-sensitive data.
  4. Storage Costs: Keep in mind that storage costs may apply to table clones. BigQuery only charges for the data in a table clone that is not already charged to another table. Be aware of the additional storage consumed by changes made to the cloned table compared to the original table. Regularly review and manage your cloned tables to optimize storage consumption and control costs.
  5. Security and Access Control: Ensure appropriate access controls are in place for cloned tables. Implement proper permissions and share access only with relevant individuals or teams. This practice safeguards sensitive or confidential data and prevents unauthorized access.

Limitations and Alternatives

While clone tables in BigQuery offer flexibility and convenience, they do have certain limitations to be aware of. Understanding these limitations can help you make informed decisions and explore alternative approaches when necessary. Consider the following:

Actually, you can see the limitation in official documentation https://cloud.google.com/bigquery/docs/table-clones-intro#limitations. Here, I want to add my own version.

  1. Data Availability Constraints: When working with clone tables, be aware that they might have data availability constraints. For example, if the original table is modified while queries run on the cloned table, the cloned table might not reflect the most recent changes. Ensure that the cloned table meets your data availability requirements, and consider refreshing or recreating the clone as needed.
  2. Schema Synchronization: Clone tables do not automatically synchronize their schema with the original table. If the schema of the original table changes, you will need to manually update the schema of the cloned table to reflect those changes. Failure to keep the schema in sync can lead to unexpected behavior or errors in your analysis.

Table Comparison: Clone Tables, Snapshots, and Copy Tables

As a bonus, when working with data in Google BigQuery, various methods are available for duplicating or preserving tables. Each method offers its own advantages and considerations, making it crucial to understand their differences to choose the most suitable approach for your data needs. This section will compare three commonly used methods: clone tables, snapshots, and copy tables.

I have prepared a comprehensive table comparison to help you better understand the differences between these methods. The comparison covers various aspects, including creation, data sharing, schema synchronization, cross-project capabilities, data availability, storage costs, and common use cases. By evaluating these metrics, you can determine which method aligns best with your requirements.

Tables Comparison

Conclusion

In conclusion, clone tables in BigQuery offer a flexible and efficient way to create new tables that share the schema and metadata of the original table. They are valuable for analysis, experimentation, and replications. By understanding the similarities and differences between clone tables, snapshots, and copy tables, you can choose the most suitable method for your specific needs.

--

--