GA4 raw data to Snowflake

Albert Franzi · 7 min read · Apr 12, 2024

From Google Analytics to the Cloud and Beyond: How we successfully integrated GA4 with BigQuery, streamlined Data into S3 and Iceberg, and made it all available in Snowflake.

GA4 data flow into Snowflake

Context

Welcome to our journey! As we dive into the world of data, it’s crucial to handle it well from start to finish — gathering, storing, and analyzing. We took on a challenge: making Google Analytics 4 (GA4) talk to BigQuery, then neatly tucking that data into Amazon S3 and Iceberg, and finally making sense of it all through Snowflake. This integration wasn’t just about keeping our data tidy; it was about unlocking insights that can make a difference.

In this article, I’ll walk you through how we connected these dots. We’ll explore each step — from the very first data export in GA4 to the deep dives into analysis with Snowflake. I’ll share the tricks and tools that helped us along the way and how each part of our setup contributes to smarter data handling.

Whether you’re wrestling with similar data challenges or just curious about how it’s done, I hope our story helps and inspires you. Let’s get into how we turned clicks into actionable insights!

Journey

1. Connect GA4 with BigQuery

Connecting Google Analytics 4 to BigQuery allows you to access and analyze your analytics data with greater flexibility and depth than is possible directly in GA4.

Note: Before setting up the export, ensure that you have admin access to both your Google Analytics and Google Cloud Platform accounts. This access is necessary to make the required configurations in GA4 and BigQuery.

After you’ve made your selections and reviewed the settings, submit your configuration. Google will start exporting data to the specified BigQuery dataset at the frequency you’ve selected (Daily vs Streaming).

Note on region selection: Since we are not on the GCP Enterprise plan, we cannot use a BigQuery Omni AWS region, so we chose the europe-west3 region to comply with our data storage regulations. Remember, the chosen region should align with your project's data regulation requirements and geographic location to minimize latency and costs.

2. Creating a GCP Service Account for Cross-Platform Authentication

Given that our Kubernetes (K8s) cluster is hosted on the AWS platform, we need a mechanism to authenticate and authorize access to the Google Cloud Platform (GCP) resources, including the newly created analytics dataset in BigQuery. By creating a GCP service account, we establish a secure method for cross-platform authentication, enabling seamless access to GCP resources from the AWS side.

Step-by-step:

• Service Account Creation: Create a dedicated service account for the analytics export in the GCP console and generate a JSON key for it.

• Role Assignment: Assign the ‘BigQuery User’ role to the service account to grant it the necessary permissions for executing queries on the analytics dataset.

• Upload GCP Credentials into AWS Secrets Manager: This ensures that the key can be retrieved securely by your Kubernetes cluster without storing it in less secure configurations like Kubernetes secrets or directly in container images.
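A minimal sketch of that last step with boto3, assuming the JSON key has already been created and downloaded locally; the secret name, region, and file path are placeholders:

import json

import boto3

SECRET_NAME = "gcp/ga4-bigquery-reader"    # placeholder
KEY_PATH = "ga4-bigquery-reader-key.json"  # placeholder

# Load the downloaded service account key and store it as a secret
# that the K8s workloads can read at runtime via IAM permissions.
with open(KEY_PATH) as f:
    sa_key = json.load(f)

secrets = boto3.client("secretsmanager", region_name="eu-central-1")
secrets.create_secret(
    Name=SECRET_NAME,
    Description="GCP service account key for the GA4 BigQuery export",
    SecretString=json.dumps(sa_key),
)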

3. Retrieving GA4 Datasets from BigQuery

For this mission, we will use python-bigquery-pandas (pandas_gbq), allowing us to authenticate with the SA key and access our analytics dataset remotely by downloading it as a Pandas DataFrame.

pip install google-api-python-client pandas-gbq

With the authentication set, connect to BigQuery to run queries. Use the credentials obtained to configure your client.

Note: Since we want to be memory-conservative, we decided to partition the data using the event_timestamp as a reference.
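A sketch of the extraction, assuming the service account key from step 2 is read back from AWS Secrets Manager; the project, dataset, table suffix, and timestamp bounds are placeholders:

import json

import boto3
import pandas_gbq
from google.oauth2 import service_account

# Rebuild the GCP credentials from the secret stored in step 2 (placeholder names).
secret = boto3.client("secretsmanager", region_name="eu-central-1").get_secret_value(
    SecretId="gcp/ga4-bigquery-reader"
)
credentials = service_account.Credentials.from_service_account_info(
    json.loads(secret["SecretString"])
)

# Pull one event_timestamp window at a time (microseconds since epoch)
# to keep the resulting DataFrame small.
query = """
    SELECT *
    FROM `my-gcp-project.analytics_123456789.events_20240412`
    WHERE event_timestamp >= 1712880000000000
      AND event_timestamp <  1712883600000000
"""
df = pandas_gbq.read_gbq(query, project_id="my-gcp-project", credentials=credentials)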

Alternative: If we were on an enterprise plan with GCP, we could have opted for creating a BigQuery Omni AWS region and scheduling a Data Transfer between BigQuery and S3 (omni-aws-create-connection & omni-aws-export-results-to-s3).

4. Validating GA4 Schema Using Pandera and Pydantic

The purpose of data validation in our pipeline is crucial. It ensures that the data extracted from GA4 and processed through BigQuery meets our quality standards before it’s used for further analysis or stored in other systems. This step helps prevent issues related to data quality, such as incorrect data types, missing values, or format inconsistencies, which can significantly impact downstream processes and decision-making.

To implement robust data validation, we use Pandera in conjunction with Pydantic. Pandera provides a flexible and expressive API for DataFrame validation, making it easy to define schema constraints and check data. Pydantic adds an additional layer of type checking and data parsing, which is particularly useful for handling complex and nested data structures typical in GA4 data.
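As an illustration, here is a trimmed-down Pandera model covering a handful of columns from the GA4 export schema (the real model validates many more fields, including the nested records), assuming a recent pandera version with the DataFrameModel API:

import pandera as pa
from pandera.typing import Series

class GA4EventsSchema(pa.DataFrameModel):
    event_date: Series[str] = pa.Field(str_matches=r"^\d{8}$")
    event_timestamp: Series[int] = pa.Field(gt=0)
    event_name: Series[str] = pa.Field(nullable=False)
    user_pseudo_id: Series[str] = pa.Field(nullable=True)
    platform: Series[str] = pa.Field(isin=["WEB", "ANDROID", "IOS"], nullable=True)

    class Config:
        coerce = True  # cast columns to the declared dtypes before validating

# df is the DataFrame pulled from BigQuery in step 3; raises SchemaError on violations.
validated_df = GA4EventsSchema.validate(df)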

Literature: [GA4] BigQuery Export schema & [GA4] BigQuery Export user-data schema.

Improvement: To further clean and normalize the data, we took advantage of Pydantic validators.
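For example, a hypothetical Pydantic model for the device record, using pydantic v2’s field_validator to normalize categories and turn empty strings into proper nulls (the rules shown are illustrative, not our full set):

from typing import Optional

from pydantic import BaseModel, field_validator

class Device(BaseModel):
    category: str
    operating_system: Optional[str] = None

    @field_validator("category")
    @classmethod
    def normalize_category(cls, value: str) -> str:
        # GA4 reports e.g. "desktop", "mobile", "tablet"; enforce lowercase consistently.
        return value.strip().lower()

    @field_validator("operating_system", mode="before")
    @classmethod
    def empty_to_none(cls, value):
        # Treat empty strings coming from the export as missing values.
        return value or None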

By incorporating Pydantic and Pandera for schema validation, you ensure that the data adheres to expected formats and meets the quality standards required for reliable analytics. This setup not only reinforces the integrity of your data pipeline but also automates checks that might otherwise be manual and error-prone, significantly enhancing the robustness and reliability of your data handling processes.

5. Uploading Clean Data to S3 + AWS Glue Iceberg Tables

AWS Wrangler is a powerful utility that simplifies data operations on AWS services like S3, Glue, and Redshift. For uploading data to S3 and creating Glue Iceberg tables, AWS Wrangler provides streamlined functions that handle underlying complexities.

If the table doesn’t exist yet in the Glue Catalog, awswrangler will create it for us, inferring the schema from the DataFrame.
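A sketch of the upload, assuming awswrangler 3.x and its athena.to_iceberg helper; the database, table, and bucket paths are placeholders:

import awswrangler as wr

wr.athena.to_iceberg(
    df=validated_df,                # the validated DataFrame from step 4
    database="analytics",           # Glue database (placeholder)
    table="ga4_events",             # Glue/Iceberg table (placeholder)
    table_location="s3://my-data-lake/analytics/ga4_events/",
    temp_path="s3://my-data-lake/tmp/ga4_events/",  # staging area used by Athena
    partition_cols=["event_date"],
)

Under the hood the write runs through Athena, which is also what registers the table in the Glue Catalog when it does not exist yet.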

6. Creating an External Iceberg Table in Snowflake

The objective of this section is to create an external Iceberg table in Snowflake. This setup allows us to utilize Snowflake’s capabilities to perform advanced analytics on our GA4 data stored in AWS Glue Iceberg tables. By linking Snowflake with AWS Glue, we leverage the best of both platforms: the robust data handling and scalability of AWS and the advanced querying and analytics features of Snowflake.

a. Create a Catalog Integration: This integration will connect Snowflake to the AWS Glue catalog, allowing Snowflake to understand the metadata of the Iceberg tables stored in AWS Glue.

b. Create an External Volume: This step specifies where Snowflake should look for the data files associated with the Iceberg tables.

c. Creating the Iceberg Table in Snowflake: Once the integrations are in place, the next step is to create the Iceberg table that points to the table registered in AWS Glue, as sketched below.
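Putting the three steps together, here is a sketch using the Snowflake Python connector; every identifier, ARN, account, and bucket value below is a placeholder for your own setup:

import snowflake.connector

conn = snowflake.connector.connect(
    account="myorg-myaccount",
    user="DATA_ENGINEER",
    password="***",
    role="SYSADMIN",
    warehouse="ANALYTICS_WH",
)
cur = conn.cursor()

# a. Catalog integration pointing at the AWS Glue Data Catalog.
cur.execute("""
    CREATE CATALOG INTEGRATION IF NOT EXISTS glue_catalog_int
      CATALOG_SOURCE = GLUE
      CATALOG_NAMESPACE = 'analytics'
      TABLE_FORMAT = ICEBERG
      GLUE_AWS_ROLE_ARN = 'arn:aws:iam::111122223333:role/snowflake-glue-access'
      GLUE_CATALOG_ID = '111122223333'
      GLUE_REGION = 'eu-central-1'
      ENABLED = TRUE
""")

# b. External volume pointing at the S3 prefix that holds the Iceberg data files.
cur.execute("""
    CREATE EXTERNAL VOLUME IF NOT EXISTS ga4_ext_volume
      STORAGE_LOCATIONS = ((
        NAME = 'ga4-s3'
        STORAGE_PROVIDER = 'S3'
        STORAGE_BASE_URL = 's3://my-data-lake/analytics/'
        STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::111122223333:role/snowflake-s3-access'
      ))
""")

# c. Iceberg table that reads the Glue-managed table through the integration.
cur.execute("""
    CREATE ICEBERG TABLE IF NOT EXISTS analytics.public.ga4_events
      EXTERNAL_VOLUME = 'ga4_ext_volume'
      CATALOG = 'glue_catalog_int'
      CATALOG_TABLE_NAME = 'ga4_events'
""")

Both IAM roles referenced here still need the trust relationship towards Snowflake described in the Snowflake documentation before the integration and the volume can actually reach Glue and S3.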

7. Refreshing Snowflake Iceberg Metadata

Iceberg tables in Snowflake store metadata about the data files and their organization. As new data is added or existing data is modified in AWS S3 through AWS Glue operations, the metadata stored in Snowflake becomes outdated. Regularly refreshing this metadata ensures that Snowflake queries return complete and up-to-date results.

[Option A]: Manual Refresh: For ad-hoc refreshes or initial testing, you can manually trigger the metadata refresh by running ALTER ICEBERG TABLE <table_name> REFRESH; in Snowflake.

[Option B]: Automated Refresh with Airflow: To automate the process and ensure that metadata is refreshed at regular intervals or after specific events (like data uploads), you can set up an Airflow job with the SnowflakeOperator.
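A minimal sketch of such a DAG, assuming the Snowflake Airflow provider is installed and a ga4_snowflake connection is configured in Airflow; the table name and schedule are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="refresh_ga4_iceberg_metadata",
    start_date=datetime(2024, 4, 1),
    schedule="@hourly",  # align with how often new GA4 data lands in S3
    catchup=False,
) as dag:
    refresh_ga4_events = SnowflakeOperator(
        task_id="refresh_ga4_events",
        snowflake_conn_id="ga4_snowflake",
        sql="ALTER ICEBERG TABLE analytics.public.ga4_events REFRESH",
    )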

Conclusion and Invitation for Feedback

We’ve now walked through the comprehensive process of integrating Google Analytics 4 with BigQuery, migrating data efficiently to AWS S3 using Glue Iceberg tables, and finally, using Snowflake’s advanced querying capabilities to manage and analyze that data. This journey underscores our commitment to leveraging state-of-the-art technology to enhance our data workflows, providing actionable insights that drive business decisions.

As we continue to refine and evolve our data pipeline, we are keen to learn from the community. Your experiences, insights, and feedback are invaluable as they help us identify opportunities for improvement and innovation. Whether you are a data engineer, a business analyst, or just someone passionate about data, we would love to hear your thoughts:

  • What has been your experience with similar data integrations?
  • Are there other tools or methods you’ve found effective?
  • Do you have suggestions on how we might further optimize our current setup?

Please share your feedback in the comments below or reach out to us directly. Let’s collaborate to make data handling as efficient and insightful as possible.

Thank you for taking the time to explore this process with us. We look forward to your contributions and to furthering the conversation around advanced data engineering solutions!
