Gspread in Airflow

Camila Marques de Oliveira
6 min read · Jul 14, 2023


Read this post in Portuguese here

Imagine that you need to manually update a spreadsheet every day with data from a specific source. Have you ever thought about automating the update of this spreadsheet?

Yes, we know that a spreadsheet is not the most recommended place to store your data, but we also know that it happens.

So I wrote this article with this problem in mind: one way to automate the update of a spreadsheet is to create a DAG that is configured to run at regular intervals and has a task that updates that spreadsheet.

Thus, I divided this article into three sections:

1. Service Account Configuration

2. Airflow Configuration

3. DAG Creation

1. Service Account Configuration

To use the Google APIs to manipulate your spreadsheet, you need a Service Account. So, follow the steps below to create a Service Account in Google.

First, go to the Google Cloud Platform website, click on the project selector, and then on New Project, as shown in the screenshot below:

Then give a name to your project and click Create.

Note that the selector will now show the name of the project you have just created:

The next step is to enable the APIs we want to use in this project, so go to APIs & Services, and then Enable APIs & Services:

Here, we will enable two APIs:

  • Google Drive API
  • Google Sheets API

1.1 Creating Credentials for Google Sheets API

Now we need to create the credentials to use the Google Sheets API, so search again for Google Sheets API and click on Manage:

Then go to Credentials and then Manage Service Accounts:

Then click on Create Service Account:

Choose a name for your service and click Create and Continue:

Then click on Actions and Manage Keys:

Click on Add Key and Create New Key:

Choose the JSON format and click Create. Your credentials file will be downloaded automatically.
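The downloaded file is a standard Google service-account key. It looks roughly like this (all values below are placeholders, not real credentials); note the client_email field, which is the address you will share the spreadsheet with in the next step:

```json
{
  "type": "service_account",
  "project_id": "your-project-id",
  "private_key_id": "abc123",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "your-service@your-project-id.iam.gserviceaccount.com",
  "client_id": "123456789",
  "token_uri": "https://oauth2.googleapis.com/token"
}
```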

1.2 Sharing the Spreadsheet with your Service Account

Go to Service Account Details and copy the email of your Service Account:

Create a new Spreadsheet and share it with the copied email:

2. Airflow Configuration

The way I chose to run Airflow on my machine was with the official Airflow Docker image. To use it this way, you need to have Docker Compose installed on your machine according to your operating system.

Here you will find the docker-compose.yml if you want to run Airflow through Docker.

The Gspread library is not included in the image referenced by docker-compose.yml, so how do we install extra libraries into the Docker image? By creating a Dockerfile that extends it. Create a Dockerfile with the following content:

FROM apache/airflow:2.5.1

USER airflow

RUN pip install gspread

If you also need system packages on the Debian-based Airflow image, you will probably need to switch to USER root before the apt-get install lines and then switch back to USER airflow afterwards. If you want to configure your Dockerfile differently, just follow the documentation.
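As a sketch, a Dockerfile that installs a system package as well as gspread might look like this (git is just an example package here, not something this project needs):

```dockerfile
FROM apache/airflow:2.5.1

# System packages must be installed as root
USER root
RUN apt-get update \
    && apt-get install -y --no-install-recommends git \
    && rm -rf /var/lib/apt/lists/*

# Switch back so pip installs run as the airflow user
USER airflow
RUN pip install gspread
```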

Following the documentation, in order to build the image from my Dockerfile, the next step is to comment out the line with the image name in docker-compose.yml and uncomment the build line:

# image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.5.1}
build: .

Once this is done, we can run docker-compose up -d (add --build on the first run so the image is actually built from the Dockerfile).

2.1 Adding Google Connection to your Airflow

Open localhost:8080 in your browser to access the Airflow UI. The default username and password are airflow.

Go to Admin > Connections, and then click to add a new connection, filling in the fields below:

  • Connection Id: Here you will put the name you prefer for your connection
  • Connection Type: You should select the Google Cloud option
  • Project Id: You will enter the id of your project
  • Keyfile JSON: Here you will paste the credentials that you downloaded
  • Scopes: paste https://www.googleapis.com/auth/drive

And that’s it! You can click on save.

Then click on Test to check if your connection is ok:
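A note on the Scopes field: the connection above lists only the Drive scope. gspread's own auth helpers request both the Drive and the Sheets scopes by default, so if writes fail with a permission error it is worth adding the Sheets scope as well (this pairing comes from gspread's defaults, not from anything specific to this setup):

```python
# The two OAuth scopes gspread's auth helpers request by default;
# both can be pasted, comma-separated, into the connection's Scopes field
SCOPES = [
    "https://www.googleapis.com/auth/spreadsheets",
    "https://www.googleapis.com/auth/drive",
]
print(",".join(SCOPES))
```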

2.2 Errors when testing the API connection

When testing my API connection, I encountered the following error:

('invalid_grant: Invalid JWT: Token must be a short-lived token (60 minutes) and in a reasonable timeframe. Check your iat and exp values in the JWT claim.', {'error': 'invalid_grant', 'error_description': 'Invalid JWT: Token must be a short-lived token (60 minutes) and in a reasonable timeframe. Check your iat and exp values in the JWT claim.'})

Researching the error, I found it was related to a difference in time zones. After confirming that the clock on my machine was synchronized with my time zone, I found that I had to tell the containers to use my time zone in docker-compose.yml. So I added these two volume mounts to synchronize the image's clock with the host's:

volumes:
  - /etc/timezone:/etc/timezone:ro
  - /etc/localtime:/etc/localtime:ro

After making this modification in the docker-compose.yml, I was able to successfully test my connection.

3. DAG Creation

So I created a simple DAG just to test sending data from a dataframe to Google Sheets using Airflow.

The first step was to create a dataframe with product prices:

data = {
"products": ["product_1", "product_2", "product_3"],
"price": [50, 40, 45],
}

df = pd.DataFrame(data)

To use the API connection I added to Airflow, I used the GoogleBaseHook:

hook = GoogleBaseHook(gcp_conn_id="google_conn_id")
credentials = hook.get_credentials()
google_credentials = gspread.Client(auth=credentials)

And finally, using gspread, we specify the name of the Spreadsheet we are using, the name of the worksheet, and then send the dataframe data to the worksheet.

sheet = google_credentials.open("Products - Data")

worksheet = sheet.worksheet("products-data")

worksheet.update([df.columns.values.tolist()] + df.values.tolist())
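To make that update call concrete, here is the header-plus-rows conversion run on its own, with no API involved (pure pandas):

```python
import pandas as pd

data = {
    "products": ["product_1", "product_2", "product_3"],
    "price": [50, 40, 45],
}
df = pd.DataFrame(data)

# A header row followed by one list per data row --
# the list-of-lists layout that worksheet.update expects
rows = [df.columns.values.tolist()] + df.values.tolist()
print(rows)
# [['products', 'price'], ['product_1', 50], ['product_2', 40], ['product_3', 45]]
```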

This is the complete code of the DAG:

from airflow import DAG
from airflow.providers.google.common.hooks.base_google import GoogleBaseHook
import pandas as pd
import gspread
from datetime import datetime


with DAG(
    dag_id="dataframe_to_spreadsheet",
    start_date=datetime.now(),
    schedule_interval="@daily",
) as dag:

    @dag.task
    def dataframe_to_spreadsheet_task():
        data = {
            "products": ["product_1", "product_2", "product_3"],
            "price": [50, 40, 45],
        }

        df = pd.DataFrame(data)

        # Hook to Google Sheets in order to get the connection from Airflow
        hook = GoogleBaseHook(gcp_conn_id="google_conn_id")
        credentials = hook.get_credentials()
        google_credentials = gspread.Client(auth=credentials)

        # Opening a spreadsheet by its title
        sheet = google_credentials.open("Products - Data")

        # Selecting the worksheet to manipulate
        worksheet = sheet.worksheet("products-data")

        # Sending the data from df to the worksheet
        worksheet.update([df.columns.values.tolist()] + df.values.tolist())

    dataframe_to_spreadsheet_task()

Then we have the updated worksheet:

If you want to explore other functionalities of the gspread library, you can access the documentation here. You can find the whole project in my GitHub repository.
