Loading Complex Excel Files into BigQuery

Abhik Saha
Google Cloud - Community
Apr 23, 2023

If you want to load an Excel file into Google BigQuery, you can follow the manual approach below:

  1. Convert the Excel file into CSV format. You can do this by opening the Excel file and saving it as a CSV file.
  2. Upload the CSV file into a Google Cloud Storage bucket. If you do not have a bucket, create one by following the steps mentioned in the Google Cloud Storage documentation.
  3. In the BigQuery web UI or command-line tool, create a new dataset to store the data you will import.
  4. Create a new table in the dataset with the schema that matches the data in your CSV file.
  5. Use the BigQuery web UI or command-line tool to import the data from the CSV file in your Google Cloud Storage bucket into the new table in your dataset (a scripted version of this step is sketched just after this list).
  6. You can also use the Google Cloud Storage command-line tool to upload the file.
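
If you prefer to script step 5 rather than click through the web UI, a minimal sketch using the BigQuery Python client library could look like the following. The project, dataset, table, and bucket names are placeholders, and letting BigQuery auto-detect the schema is an assumption for brevity:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder identifiers -- replace with your own project, dataset, table and bucket.
table_id = "your-project.your_dataset.your_table"
gcs_uri = "gs://your-bucket/your_file.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row
    autodetect=True,      # assumption: let BigQuery infer the schema
)

# Start the load job and wait for it to finish.
load_job = client.load_table_from_uri(gcs_uri, table_id, job_config=job_config)
load_job.result()

table = client.get_table(table_id)
print(f"Loaded {table.num_rows} rows into {table_id}.")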

The automated approach


But when the Excel file is complex (for example, it has multiple tabs) and you need to load it automatically, you can take the approach below.

You will need the following APIs enabled in your GCP account:

  1. Google BigQuery
  2. Google Cloud Storage
  3. Google Cloud Functions
  4. Google Cloud Logging
  5. Google Cloud Alerting (only if you need automated alerts on failure)

Steps to follow

Creating necessary GCS Buckets and folders

  1. Create two storage buckets in Google Cloud Storage. The first bucket is used for processing and archiving. The second bucket is the landing bucket whose GCS trigger invokes the Cloud Function. (A programmatic version of this step is sketched below.)

Please note: DO NOT use the landing bucket as the processing bucket, as that would make the Cloud Function trigger itself in a cycle.

GCS buckets to be created
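
If you would rather create the two buckets programmatically, a minimal sketch with the google-cloud-storage client could look like this. The bucket names are the ones used elsewhere in this article, and the location is an assumption:

from google.cloud import storage

client = storage.Client()

# Bucket names used elsewhere in this article; the location is an assumption.
landing_bucket_name = "source-file-landing-bucket"
processing_bucket_name = "my-org-processing-bucket"

for name in (landing_bucket_name, processing_bucket_name):
    bucket = client.bucket(name)
    bucket.storage_class = "STANDARD"
    client.create_bucket(bucket, location="us-central1")  # pick your own region
    print(f"Created bucket {name}")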

2. Inside the processing bucket ‘my-org-processing-bucket’, create three folders as described below. (A scripted version of this step is sketched after the list.)

a) The first folder is for archiving historical files.

b) The second folder is for the CSV output generated after processing with the pandas dataframe.

c) The third folder is where the file is brought from the landing zone for processing.

Three folders inside ‘my-org-processing-bucket’
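
GCS has no real directories, so one way to pre-create these ‘folders’ is to upload zero-byte placeholder objects whose names end with a slash. A minimal sketch, assuming the folder names used by the Cloud Function code later in this article:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-org-processing-bucket")

# Folder names assumed from the Cloud Function code below.
for folder in ("Archive path/", "Output path/", "Source path/"):
    bucket.blob(folder).upload_from_string("")  # zero-byte object acts as a folder marker
    print(f"Created folder {folder}")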

Creating the Cloud Function

  1. Create a Google Cloud Function with the following specification. In the “Runtime, build, connections and security settings”, set the memory to 512 MB and the maximum number of instances to one. Keep the service account as the ‘App Engine default service account’.

My Cloud Function already exists, so it shows the error below.

Create a Cloud Function with the above specification

2. Deploy the Python code below. The code expects a sampledatahockey.xlsx file as input. The entry point of the Cloud Function is the ‘gcs_trigger_xlsx_loader’ function. Use Python 3.9 as the runtime.

The Cloud Function will ingest a ‘sampledatahockey.xlsx’ file. More details about the file can be found here, and you can download the sample file from the given link. The data, shown below, contains Olympic hockey data. We are supposed to read only the ‘PlayerData’ tab and either insert it into the BigQuery table (using the BigQuery client library) or generate a CSV equivalent that can be read by an external table in BigQuery.

sample hockey data
import pandas as pd
from openpyxl import load_workbook
from datetime import date
from datetime import datetime
from google.cloud import storage
from google.cloud import bigquery
import gcsfs
import re
import io
import numpy as np


def move_blob(bucket_name, blob_name, destination_bucket_name, destination_blob_name):
    """Moves a blob from one bucket to another with a new name."""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"
    # The ID of your GCS object
    # blob_name = "your-object-name"
    # The ID of the bucket to move the object to
    # destination_bucket_name = "destination-bucket-name"
    # The ID of your new GCS object (optional)
    # destination_blob_name = "destination-object-name"

    storage_client = storage.Client()

    source_bucket = storage_client.bucket(bucket_name)
    source_blob = source_bucket.blob(blob_name)
    destination_bucket = storage_client.bucket(destination_bucket_name)

    blob_copy = source_bucket.copy_blob(source_blob, destination_bucket, destination_blob_name)
    source_bucket.delete_blob(blob_name)

    print(
        "Blob {} in bucket {} moved to blob {} in bucket {}.".format(
            source_blob.name,
            source_bucket.name,
            blob_copy.name,
            destination_bucket.name,
        )
    )


def gcs_trigger_xlsx_loader(event, context):
    filename = event

    # The filename has a pattern of sampledatahockey.*xlsx. Process the file only when the pattern matches.
    pattern = re.compile(r'sampledatahockey.*xlsx$')
    if pattern.match(event['name']):  # event['name'] contains the name part of the event, here the filename
        print('Reading file {file}'.format(file=event['name']))

        try:
            print('Process execution started')
            hockey_file_load(filename)
            print('Process execution completed')

        except Exception:
            print('Process failed')
            exit(1)
    else:
        print('Error encountered. Filename does not match')


def hockey_file_load(filename):
    src_bucket_name_param = ''  # the bucket where the file is supposed to land
    op_bucket_name_param = ''   # the bucket where the file is processed and archived
    project_name_param = ''     # the GCP project name

    fs = gcsfs.GCSFileSystem(project=project_name_param)
    client = storage.Client()
    bucket = client.get_bucket(op_bucket_name_param)

    # Declare the filepaths
    absolute_path = 'gs://{bucket_name}/'.format(bucket_name=op_bucket_name_param)
    input_file_path = 'Source path/' + filename['name']
    output_file_path = absolute_path + 'Output path/sampledatahockey_processed.csv'

    # The tabs which we are supposed to read
    required_tabs = ['PlayerData']
    today = datetime.today().strftime("%Y-%m-%d %H:%M:%S")

    # Copy the file from the landing bucket to the processing bucket
    move_blob(bucket_name=src_bucket_name_param,
              blob_name=filename['name'],
              destination_bucket_name=op_bucket_name_param,
              destination_blob_name='Source path/' + filename['name'])

    # openpyxl and pandas cannot read a gs:// filepath directly, so the file is downloaded into an
    # instance of the io.BytesIO class.
    # io.BytesIO() is a class in Python's io module that creates an in-memory binary stream that can be
    # used to read from or write to bytes-like objects. It provides an interface similar to that of a file
    # object, allowing you to manipulate the contents of the stream using methods such as write(), read(),
    # seek() and tell().
    # One common use case for io.BytesIO() is when you want to work with binary data in memory instead of
    # having to create a physical file on disk.

    blob = bucket.blob(input_file_path)
    buffer = io.BytesIO()
    blob.download_to_file(buffer)
    buffer.seek(0)  # rewind the in-memory stream before reading it

    # Reading each tab in the list
    for sheet in required_tabs:
        wb = load_workbook(buffer)
        ws = wb[sheet]

        # Simple pandas operations
        print('Started processing sheet {}'.format(sheet))
        df = pd.read_excel(buffer, sheet, header=[2])
        df.columns = [x.lower() for x in df.columns]
        df.insert(0, "extract_date", today, True)
        df.drop_duplicates(inplace=True)
        print('Completed processing sheet {}'.format(sheet))

    # This part highlights the process if you want the file as a CSV output
    # logic starts here
    client = storage.Client()

    extract_datetime = str(datetime.today().strftime("%Y-%m-%d %H%M%S"))
    source_dir = 'Source path/'
    output_dir = 'Output path/'
    target_dir = 'Archive path/archived_files_' + extract_datetime + '/'

    file_name_src = list(client.list_blobs(op_bucket_name_param, prefix=source_dir))
    file_name_opt = list(client.list_blobs(op_bucket_name_param, prefix=output_dir))

    print('Archiving files')
    for file_name in client.list_blobs(op_bucket_name_param, prefix=source_dir):
        filename = str(file_name.name)
        filename = filename.split('/')[-1]
        print(filename)

        pattern_src = re.compile(r'sampledatahockey.*xlsx$')
        if pattern_src.match(filename):
            move_blob(bucket_name=op_bucket_name_param,
                      blob_name='Source path/' + filename,
                      destination_bucket_name=op_bucket_name_param,
                      destination_blob_name=target_dir + filename)
        else:
            pass

    for file_name in client.list_blobs(op_bucket_name_param, prefix=output_dir):
        filename = str(file_name.name)
        filename = filename.split('/')[-1]
        print(filename)

        pattern_src = re.compile(r'sampledatahockey_processed.*csv$')
        if pattern_src.match(filename):
            move_blob(bucket_name=op_bucket_name_param,
                      blob_name='Output path/' + filename,
                      destination_bucket_name=op_bucket_name_param,
                      destination_blob_name=target_dir + filename)
        else:
            pass

    df.to_csv(output_file_path, index=False)
    print('Generated files for Hockey data: {}'.format(output_file_path))
    # logic ends here

    # Direct load to BigQuery from the dataframe
    # logic starts here
    client = bigquery.Client()
    table_id_hockey = 'your-project-name.manual_input.hockey_data'
    job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")

    job = client.load_table_from_dataframe(df, table_id_hockey, job_config=job_config)
    job.result()

    table = client.get_table(table_id_hockey)
    print('Truncated and loaded {} rows into {} table.'.format(table.num_rows, table.table_id))
    # logic ends here
# The requirements.txt must contain the following dependencies
# Function dependencies, for example:
# package>=version
pandas==1.4.2
numpy
openpyxl
datetime
google-cloud
gcsfs
google-cloud-bigquery
google-cloud-storage
pyarrow

Explaining the Python Code

  1. First check whether the filename matches the expected pattern. Read the file only if it matches; otherwise log an error message.
  2. If it matches, declare the landing, processing and archiving directories.
  3. Then initialize a pandas dataframe and load the tab.
  4. Process the file in the pandas dataframe.
  5. Generate a CSV output from the dataframe in the Output folder. This file can be read by a BigQuery external table.
  6. Specify the archiving directory paths and the code for archiving the historical ‘csv’ and ‘xlsx’ files that will be generated daily.
  7. Call the BigQuery client libraries if you want to load the data directly into BigQuery. (A minimal local test of this flow is sketched after this list.)
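
Before deploying, you can smoke-test the function locally by calling it with a hand-built event dictionary that mimics the GCS trigger payload. This is only a sketch: it assumes your credentials are available locally (for example via GOOGLE_APPLICATION_CREDENTIALS), the bucket name parameters in hockey_file_load are filled in, and the file already sits in the landing bucket.

# Minimal local smoke test for the Cloud Function entry point.
# The bucket name below is an assumption; only the 'name' key is actually used by the function.
fake_event = {
    "bucket": "source-file-landing-bucket",
    "name": "sampledatahockey.xlsx",
}

# The context argument is not used by the function, so None is enough for a local run.
gcs_trigger_xlsx_loader(fake_event, None)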

Checking the Entire Workflow

  1. Drop the ‘sampledatahockey.xlsx’ file in the ‘source-file-landing-bucket’.
  2. Check the Cloud Function logs for process completion.
logging generates the following output

3. We can see that the CSV file is generated and the data is uploaded to BigQuery. Let us verify both.

generated CSV file
the hockey data table in BigQuery

4. The entire workflow is successful.

5. Now, to read the CSV file in BigQuery through an external table, we can use the code below. This script can also be integrated into a stored procedure, which can then be scheduled. (A Python-client equivalent is sketched after the query below.)

CREATE OR REPLACE EXTERNAL TABLE `plated-field-383807.manual_input.hockey_date_ext`
OPTIONS(
  field_delimiter = ',',
  format = 'CSV',
  allow_jagged_rows = true,
  allow_quoted_newlines = true,
  ignore_unknown_values = true,
  skip_leading_rows = 1,
  uris = ['gs://my-org-processing-bucket/Output path/sampledatahockey_processed.csv']
);

select * from `plated-field-383807.manual_input.hockey_date_ext`;
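
If you prefer to define the external table from Python rather than SQL, a rough equivalent using the BigQuery client's ExternalConfig could look like this. The table ID and URI are taken from the DDL above; auto-detecting the schema is an assumption, and note that create_table with exists_ok=True will not replace an existing definition the way CREATE OR REPLACE does:

from google.cloud import bigquery

client = bigquery.Client()

# Same table ID and URI as in the DDL above.
table_id = "plated-field-383807.manual_input.hockey_date_ext"

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-org-processing-bucket/Output path/sampledatahockey_processed.csv"]
external_config.ignore_unknown_values = True
external_config.autodetect = True  # assumption: let BigQuery infer the schema
external_config.options.skip_leading_rows = 1
external_config.options.field_delimiter = ","
external_config.options.allow_jagged_rows = True
external_config.options.allow_quoted_newlines = True

table = bigquery.Table(table_id)
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
print(f"Created external table {table_id}")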

The data is present in the external table as well.

Follow me on LinkedIn and Medium to get more content like this!
