Building an ETL Pipeline Using Google Service Accounts

Paul Nwosu
11 min read · Jan 20, 2023

In this project, financial data from Yahoo Finance is scraped using Selenium, transformed, and imported into a Google Sheets spreadsheet. The imported data will then be connected to Google Data Studio for visualization.

This is the second part of a series I started called “Building your First Google Cloud Analytics Project.” To fully understand this project, it’s important to read the first article on how to set up your environment. Find the link below:

https://medium.com/@nwosupaul141/optimizing-googles-cloud-infrastructure-for-data-engineering-and-analytics-49d1d91fe7b6

Project Structure

  • Connecting to your Remote Instance
  • Enabling the Google Sheets API
  • Setting up a service account and downloading the key
  • Basics of SFTP
  • Setting environment variables
  • Installing all the required libraries
  • Building the Scripts
  • Testing
  • Setting up a cron job
  • Loading the Google Sheet Data into Looker Studio

Connecting to your Remote Instance

In my previous article, I demonstrated how to set up a cloud environment on Google Cloud for data-related projects. In this project, we will use the remote instance we created to build a data pipeline. To connect to the remote instance from the terminal of your local machine, run ssh alias_name, where “alias_name” is the name you assigned to your remote instance when setting up your config file in the .ssh folder in your home directory.
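As a quick refresher, that config entry looks roughly like the sketch below; the values are placeholders, and your actual alias, IP address, user, and key path come from the first article.

# ~/.ssh/config (placeholder values)
Host alias_name
    HostName <external-ip-of-your-instance>
    User <your-username>
    IdentityFile ~/.ssh/<your-private-key>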

Once connected to the remote instance, the next step is to clone the repository for the project. Cloning the repository is optional, but it will make it easier for you to follow along as I explain the project.

git clone https://github.com/paulonye/Smart_Sheet

Enabling the Google Sheets API

In this project, we will be loading financial data from Yahoo Finance into Google Sheets. Google Sheets is a Software-as-a-Service tool owned by Google, so as a Google Cloud user you have access to its API. To get started, navigate to console.cloud.google.com. Make sure you have selected the project in which you created your remote machine, as all project resources are tied together. In the console, click the Navigation menu > APIs & Services > Library, search for the Google Sheets API, click on it, and enable it. While you are there, also search for the Google Drive API and enable it.
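If you prefer the terminal, the same APIs can also be enabled with the gcloud CLI; this is a sketch that assumes the Cloud SDK is installed and the correct project is currently active.

#enable the Sheets and Drive APIs for the active project
gcloud services enable sheets.googleapis.com drive.googleapis.com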

Setting up a service account and downloading the key

Google service accounts are a way for applications to authenticate to Google Cloud Platform (GCP) APIs and services using a set of predefined credentials. These credentials are managed by an associated project and can be used for a variety of purposes, such as granting permissions to specific resources and services, or for application-to-application communication within a GCP project. The authentication process is based on JSON Web Tokens (JWT), which allow applications to access resources without user intervention. In this project, we are going to use a service account to give Python authenticated access to the Google spreadsheet we will create.

To create a service account, click the Navigation menu > IAM & Admin > Service Accounts. Click Create service account, choose a name for it, grant it the Editor role (under Basic), click Continue, and then Done.

Now that the service account has been created, click the three-dot menu at the end of its row and select Manage keys. Select ADD KEY > Create new key, choose JSON, and click Create. The key automatically downloads to your local machine. Since we will be using our remote instance as our development environment, we need to transfer the key to the remote instance.

Service Account Portal
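If you prefer the command line, the same steps can be scripted with gcloud; this is a sketch in which my-scraper-sa and PROJECT_ID are placeholders, and note that the Editor role is broader than the principle of least privilege recommends (more on that below).

#create the service account
gcloud iam service-accounts create my-scraper-sa --display-name="sheets-scraper"

#grant it the Editor role on the project
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:my-scraper-sa@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/editor"

#create and download a JSON key
gcloud iam service-accounts keys create test_1324.json \
    --iam-account=my-scraper-sa@PROJECT_ID.iam.gserviceaccount.com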

Basics of SFTP

SFTP stands for Secure File Transfer Protocol. It uses SSH (Secure Shell) to provide secure authentication and encryption for data transferred over the network, and it can be used to move files between computers, servers, or even different cloud platforms using simple commands like get and put.

SFTP works much like SSH. Recall that to log into a remote instance from your local machine, you first set up a config file with the external IP address of the instance, which lets you log in with ssh alias_name. To connect over SFTP, you use sftp alias_name instead. Before running the command, make sure you are in the local directory that contains the file you want to transfer. From there, sftp into the remote instance and use put filename to copy the key file into the home directory of the remote machine. To copy a file from the remote machine to your local machine, navigate to its directory and use get filename to pull it down.
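For example, a typical transfer session might look like the sketch below, assuming the key test_1324.json was downloaded to your local Downloads folder and your SSH alias is alias_name.

#on your local machine, move to the folder that contains the key
cd ~/Downloads

#open an SFTP session to the remote instance
sftp alias_name

#inside the session: upload the key to the remote home directory, then leave
put test_1324.json
exit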

Setting Environment Variables

Environment variables are system variables that store information about the environment in which a program or process runs. They can be used to configure various aspects of the program, such as the location of files or settings that the program needs to access. Environment variables can be set and accessed from the command line or through various programming APIs.

In this project, we will use an environment variable to store the path to the service account key. This is done for security reasons. It is also important to note that in a professional setting, it is not best practice to give a service account the Editor role on a project; the principle of least privilege should be applied instead.

Let’s assume that the key file is named test_1324.json. To set it as an environment variable, run the following bash commands:

#create a hidden directory in the home directory and move the service account key there
mkdir ~/.service_account_key && mv test_1324.json ~/.service_account_key/test_1324.json

#navigate to the directory to confirm that the key is there
cd ~/.service_account_key

#return to the home directory
cd ~

#open .bashrc to set the key path as an environment variable that can be accessed from anywhere
#on Linux, the .bashrc file is located in the home directory
nano .bashrc

#the above command opens the Nano editor; scroll to the bottom and add the line below
export key_file='/home/macbook/.service_account_key/test_1324.json'

#Note: macbook is the name of my user, so change this to reflect yours

#the command below applies the change to your current shell session
source .bashrc

#to confirm that the environment variable was set correctly, run:
echo $key_file
#this should print /home/macbook/.service_account_key/test_1324.json
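Any Python script on the machine can now pick up the key path through the standard library; a minimal example:

import os

#read the path to the service account key exported in .bashrc
key_path = os.environ["key_file"]
print(key_path)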

Installing the Required Libraries

For this project, we will be using some important libraries that are not pre-installed with Python. If you cloned the git repo as instructed earlier, you can navigate into the directory and run pip install -r requirements.txt. This will install all the necessary libraries, including Selenium and Beautiful Soup.

Building the Scripts

In the current directory, you will find the following scripts: main.py, sheet_connect.py, send_mail.py, push.py, and push_append.py. The first four are used as modules to support the last script, push_append.py.

main.py: This Python script extracts the data from Yahoo Finance. It uses Selenium and Beautiful Soup, which are both web scraping tools, to pull the data from the website and save it as a data frame.

sheet_connect.py: This Python script contains the functionality that sets up the connection from Python to Google Sheets.

send_mail.py: This Python script was set up for logging and error management. It contains the functionality that notifies us when the pipeline breaks and what caused the failure.

push.py: This Python script pushes the first batch of data from Yahoo Finance to Google Sheets. It is mainly used to test whether the connection between Python and Google Sheets actually works, so run it before running push_append.py.

push_append.py: This is the Python script that combines the functionality of all the others. The other scripts are imported as modules and their functionality is combined to ensure that the data is scraped, transformed, and transferred to Google Sheets.
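To make the structure concrete, here is a simplified sketch of how such a driver script can tie the modules together; the function names scrape_data, get_worksheet, and send_failure_email are hypothetical stand-ins, so check the repository for the actual implementation.

#push_append.py (simplified sketch; the real function names live in the repository)
import main            #scraping logic (Selenium + Beautiful Soup)
import sheet_connect   #authorizes the connection and opens the spreadsheet
import send_mail       #emails an alert when something breaks

try:
    df = main.scrape_data()                    #hypothetical: returns a data frame of scraped rows
    worksheet = sheet_connect.get_worksheet()  #hypothetical: returns the 'data' worksheet
    worksheet.append_rows(df.values.tolist())  #append the new batch below the existing rows
except Exception as exc:
    send_mail.send_failure_email(str(exc))     #hypothetical: report the failure by email
    raise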

Understanding How Selenium and Beautiful Soup Work

Selenium and Beautiful Soup are both web scraping tools. I consider Selenium more powerful than Beautiful Soup because it can interact with a web page just as a human would: clicking buttons, filling out forms, and navigating between pages. For example, if you wanted to scrape data from a website, you could use Selenium to open the website, navigate to the page where the data is located, and then extract the data from the page.

Beautiful Soup works quite differently. It allows you to parse HTML and XML documents and extract useful information. The library builds parse trees from the HTML or XML, which can be used to extract data easily. You can use Beautiful Soup to find specific tags, navigate through the parse tree, and extract the data you need.

For this project, both Selenium and Beautiful Soup were used for scraping data. Selenium was used to load the webpage and save it as a static page, which is then passed to a Beautiful Soup object. From the Beautiful Soup object, I then identified the tags that contained the data needed.

Here is a snapshot of the data we will use; see the link below: https://finance.yahoo.com/crypto/?.tsrc=fin-srch&offset=0&count=15

Yahoo Finance Website

It should be noted that the URL was adjusted to display only the top 15 entries by market cap. That’s something to keep in mind when scraping: always check whether you can manipulate the URL to your advantage. To find the tag of the data you want to scrape, right-click on the element and choose “Inspect” in the browser’s developer tools.

Yahoo Finance Website

From the image above, after right-clicking on 410.19B and choosing “Inspect”, we can see the details of that element. It begins with the tag “td”, which stands for table data, and it also has another unique identifier, the “aria-label” attribute. Using these two, we can scrape that entire column, and we do the same for the price and symbol columns. You can refer to the main.py script to see the code.
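Putting the two tools together, the scraping step looks roughly like the sketch below. It is a simplified stand-in for main.py: it assumes Chrome with a matching driver is available on the instance, and the aria-label strings are examples of what the inspector shows, so confirm the exact values on the page.

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

URL = "https://finance.yahoo.com/crypto/?.tsrc=fin-srch&offset=0&count=15"

#load the page with a headless browser so the rendered content is available
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get(URL)
html = driver.page_source   #save the rendered page as a static document
driver.quit()

#parse the static page and pull columns by their td tag and aria-label attribute
soup = BeautifulSoup(html, "html.parser")
symbols = [td.text for td in soup.find_all("td", attrs={"aria-label": "Symbol"})]
caps = [td.text for td in soup.find_all("td", attrs={"aria-label": "Market Cap"})]

df = pd.DataFrame({"symbol": symbols, "market_cap": caps})
print(df.head())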

Error Management

An important step when building pipelines is to account for failures. For this project, we will set up a script that notifies us by email, sent to a Gmail account, whenever our script breaks. The smtplib module is used to set this up; refer to send_mail.py to see the code. There are a few things you need to do first:

  • Navigate to the security settings of your Gmail account
  • Enable 2-step verification
  • Set your App password

You will need the email address and the app password when setting up the function to send the mail.

Note: You will need to replace the email address and the password on the send_mail.py script with yours.
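A minimal version of the alert function could look like the sketch below; SENDER, RECIPIENT, and APP_PASSWORD are placeholders for the values you set up above, and the actual send_mail.py may be structured differently.

import smtplib
from email.message import EmailMessage

SENDER = "your.address@gmail.com"        #placeholder: your Gmail address
RECIPIENT = "your.address@gmail.com"     #placeholder: where the alert should go
APP_PASSWORD = "xxxx xxxx xxxx xxxx"     #placeholder: the app password you generated

def send_failure_email(error_text):
    """Email the error details whenever the pipeline breaks."""
    msg = EmailMessage()
    msg["Subject"] = "Yahoo Finance pipeline failed"
    msg["From"] = SENDER
    msg["To"] = RECIPIENT
    msg.set_content(error_text)

    #Gmail's SMTP server over SSL on port 465
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(SENDER, APP_PASSWORD)
        server.send_message(msg)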

Testing the Script

In this workflow, we scrape data and load it into Google Sheets using a service account as the connector. When you create a service account, Google creates an email address associated with it. Go back to the console, Navigation menu > IAM & Admin > Service Accounts, look for the service account you created, and copy its email address.

Follow these steps:

  • Create a folder in your Google Drive
  • Create a Google Sheet called ‘test_sheet’ (this is the name of the sheet I used in the code)
  • Once the sheet has been created, click Share in the upper right corner and add the email address of the service account, making sure you give it edit access to the sheet
  • Rename the worksheet tab ‘data’

To test that the connection is working, navigate to the directory that contains the script and run python sheet_connect.py on the command line. This should print ‘data’.

If you review the sheet_connect.py script you will notice that the environmental variable containing the service account key was imported and used for authorization.
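The core of that connection boils down to something like the sketch below, assuming the gspread library; the key path comes from the environment variable we exported earlier, and the actual sheet_connect.py may differ in detail.

import os
import gspread

#authorize with the service account key whose path is stored in $key_file
gc = gspread.service_account(filename=os.environ["key_file"])

#open the spreadsheet shared with the service account, then its 'data' tab
sheet = gc.open("test_sheet")
worksheet = sheet.worksheet("data")

print(worksheet.title)   #prints 'data' if everything is wired up correctly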

When you are sure that the connection between your environment and Google Sheets has been established, run the following commands:

#Push the first batch to test
python push.py

#Append new batches
python push_append.py

Setting up a cron job

The final step in building this pipeline is to set up an automated way of scraping the data and appending it to your Google Sheet continuously. You can achieve this by opening the crontab and creating a new cronjob.

To open the crontab, use:

crontab -e

Navigate to the last line and add the cron job:

* * * * * export key_file=/home/macbook/.service_account_key/test_1324.json; /home/macbook/anaconda3/bin/python3 ~/Smart_Sheet/push_append.py

Save the crontab by pressing Ctrl+O (then Enter) and exit by pressing Ctrl+X.

The cronjob above does three things:

* * * * *: schedules the script to run every minute. You can use https://crontab.guru/ to work out cron schedules. I advise you to set this to every hour with 0 * * * *.

export key_file=/home/macbook/.service_account_key/test_1324.json: sets the environment variable before running the script, because cron jobs run in a separate, minimal shell and cannot read the environment variables defined in your .bashrc file.

/home/macbook/anaconda3/bin/python3 ~/Smart_Sheet/push_append.py: uses the Python interpreter from the Anaconda directory because we didn't install Python system-wide. If we had, the command would simply have been:

python3 ~/Smart_Sheet/push_append.py

When setting cron jobs, it is important to use the absolute path of any file you reference.
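Putting that advice together, an hourly version with absolute paths everywhere would look like this (the paths are mine, so substitute your own user and install locations):

0 * * * * export key_file=/home/macbook/.service_account_key/test_1324.json; /home/macbook/anaconda3/bin/python3 /home/macbook/Smart_Sheet/push_append.py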

Loading the Data into Google Data Studio

Google Data Studio is a free data visualization tool that allows users to connect to various data sources, such as Google Sheets, Google Analytics, and BigQuery, and create interactive reports and dashboards. It has a wide range of features that allow users to easily create and customize charts, tables, and other visualizations, and it makes it easy to share reports with others and collaborate on the design and data. Additionally, Google Data Studio is fully integrated with the Google ecosystem and connects easily to other Google tools such as Google Analytics, BigQuery, and Google Ads. It’s a powerful tool for turning raw data into actionable insights and making data-driven decisions, and it’s also great for creating custom reports for clients or stakeholders.

  1. Open Google Data Studio and create a new report.
  2. Click on the “Add a data source” button on the top right corner of the screen.
  3. Select the “Google Sheets” option from the list of data sources.
  4. Select the Google Sheet that contains the data you want to use in your report.
  5. Click on the “Connect” button to connect the data source to the report.
  6. Drag and drop the fields from the data source onto the report to create charts and tables.

You can now use the data in your Google Sheets to create charts and tables in Google Data Studio, and share the report with others if you need to. In another medium article, I will delve deeply into creating dashboards in Google Data Studio.

That’s the end of the first project in this “Building Data Projects on Google Cloud” series. Here is a brief summary of everything we did.

  • We set up a service account and assigned it the appropriate permissions.
  • We scraped data from Yahoo Finance using Selenium and Beautiful Soup and applied a bit of transformation.
  • We authenticated our application to connect to Google Sheets using the service account key.
  • We set up a cron job to scrape and append the data to our Google Sheet on an hourly basis.
  • We set up an error-management and logging process so we receive emails if the pipeline script breaks.
  • Lastly, we connected the Google Sheet to Google Data Studio (Looker Studio) for data visualization.

I hope you enjoyed this project and were able to follow along. Check out the git repository for more information.

https://github.com/paulonye/Smart_Sheet

The next project is going to be more exciting as we delve into the world of containers and deployments. Stay tuned for further updates.
