Batching Web Data into a Postgres Database Hosted on Google Cloud

Paul Nwosu
7 min read · Feb 10, 2023

This article is a direct sequel, so it’s important to read the previous one before diving in. In the previous article, I explained how to set up and configure a Postgres database using both a locally hosted Docker image and the managed database service provided by Google Cloud SQL. In this article, I’ll guide you through the process of setting up a pipeline that batches data into the Postgres database.

Project Structure

  • Introduction
  • Setting up the Environment
  • Setting up PGADMIN
  • Building the Docker Image
  • Deploying the Container on Cloud Run
  • Conclusion

Previous Article:

GitHub Repo: https://github.com/paulonye/DockerXPostgres

Introduction

Batching data into a database is a common task in the industry. The key to success is to have a good understanding of how to communicate with the database and access to appropriate computing and storage options. In the previous article, we created two databases, one locally and the other on Google Cloud SQL. We also confirmed and tested the connections to these databases using pgcli. Now, in this article, we will write data into the Postgres Cloud SQL database that you should have set up previously.

Before proceeding, navigate to Navigation Menu > SQL > Instances and click on the Instance ID you created. Make a note of the following details: user, database, hostname (public IP address), and password.

You can test the connection again using:

pgcli -h 34.453.456.34 -p 5432 -u user -d database_name

Setting up the Environment

Now that the Postgres database has been configured, let’s dive into the Python scripts. We’ll be using four Python scripts in total: two act as modules, and the other two are the main scripts that perform the necessary tasks. We’ll be running two jobs, a one-time migration batch job and a batch job that writes data into the Postgres database every hour. Both jobs write into the same table.

To get started, clone this repo onto your local machine:

git clone https://github.com/paulonye/DockerXPostgres.git

Navigate into the DockerXPostgres/ directory and install the required libraries using pip install -r requirements.txt . Once this is done, you can navigate into the app/ directory to view the Python files.

Before running the Python files, you need to set up a .env file that contains the database credentials, as well as the path to the service account key used to read data from your Google Sheets.

You can do this by creating a .env file using touch .env and adding the following to the file.

PGUSER=user
PGPASS=******
HOST=**********
DB=database_name
key_file=key.json

For you to be able to access the environment variables from the .env file, you need to install a Python package called python-dotenv:

pip install python-dotenv

You don’t need to run this again because it is already included in the requirements.txt file. You can examine the connect.py script to see how the package is used, and how os.getenv reads your environment variables from the .env file.

Then add this to any script that requires the environment variables:

import os
from dotenv import load_dotenv
load_dotenv()

# To get the hostname from the .env file
host = os.getenv('HOST')
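To make this more concrete, a minimal connect.py along these lines could look like the sketch below. This is only an illustration that assumes SQLAlchemy (with the psycopg2 driver) is used to build the connection; the actual get_connection in the repo may be implemented differently.

import os
from dotenv import load_dotenv
from sqlalchemy import create_engine

load_dotenv()

def get_connection():
    # Build a SQLAlchemy engine from the credentials in the .env file
    user = os.getenv('PGUSER')
    password = os.getenv('PGPASS')
    host = os.getenv('HOST')
    db = os.getenv('DB')
    return create_engine(f'postgresql://{user}:{password}@{host}:5432/{db}')

if __name__ == '__main__':
    # Quick connectivity check: python connect.py
    with get_connection().connect() as conn:
        print('Connection successful')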

Migrating Google Sheets Data into the Postgres Database

In the second part of this series, we constructed a data pipeline that scraped data from Yahoo Finance and loaded it into Google Sheets. If you completed that project, you should have accumulated around 5,000 rows of data by now. However, I have about 30,000 rows in my Google Sheets, and I want to transfer that data to a more secure storage location, specifically the Postgres database.

Navigate to the app/ directory in the repository and open the batch.py file to see the code. The script does two things: (i) reads the data from Google Sheets and (ii) loads it into the database. It uses the get_connection function in connect.py to connect to the database. Ensure that your .env file has been configured correctly: you will need to add the path to your service account key to the .env file. I saved the service account key variable as key_file, and this is what I referenced in batch.py.
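For orientation, the two steps could look roughly like the sketch below. This assumes gspread and pandas are used, and the sheet name and table name are placeholders; the actual batch.py in the repo may differ.

import os
import gspread
import pandas as pd
from dotenv import load_dotenv
from connect import get_connection

load_dotenv()

def read_sheet():
    # Authenticate with the service account key referenced in .env
    gc = gspread.service_account(filename=os.getenv('key_file'))
    worksheet = gc.open('stock_data').sheet1   # placeholder sheet name
    return pd.DataFrame(worksheet.get_all_records())

def load_to_postgres(df):
    # Append all rows from the sheet to the target table in Cloud SQL
    engine = get_connection()
    df.to_sql('stock_prices', con=engine, if_exists='append', index=False)  # placeholder table name

if __name__ == '__main__':
    load_to_postgres(read_sheet())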

You can test that connect.py is working by running it with:

python connect.py

Once you have confirmed the connection, you can then run the batch.py file. You will only need to run this once to migrate the data into the database.

Setting up the Main Pipeline that Scrapes Data into the Postgres Database

Once the Google Sheets data has been migrated, you can proceed to set up the main pipeline script, main.py . The script uses functions from connect.py and scrape.py . Run it with python main.py to check that it works, then confirm with pgcli that the number of rows in the table has increased.
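As a rough illustration of the flow, assuming scrape.py exposes a function that returns the latest rows as a DataFrame (the function and table names here are placeholders, and the real main.py may be structured differently):

from connect import get_connection
from scrape import scrape_data   # hypothetical function name

def main():
    # Scrape the latest data and append it to the same table the migration job used
    df = scrape_data()
    engine = get_connection()
    df.to_sql('stock_prices', con=engine, if_exists='append', index=False)  # placeholder table name

if __name__ == '__main__':
    main()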

Setting up PGADMIN

In the previous article, I talked about how to set up PGADMIN using a Docker image. PGADMIN, unlike pgcli, gives you access to a web interface that allows you to communicate with the database.

To get started with PGADMIN using Docker, run the command below in your terminal:

docker run -it \
-e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
-e PGADMIN_DEFAULT_PASSWORD="root" \
-e PGADMIN_COOKIE_SECRET=53tegevd66efdvfff \
-p 8080:80 \
-d \
dpage/pgadmin4

To confirm that the container is running, run docker ps . The next step is to launch the PGADMIN web interface; for this, you have to forward the port, using VS Code as the bridge between your remote machine and your local machine.

Open VS Code and connect to the remote instance. Go to the Ports panel and click on Forward a Port. Type in 8080, as this is the host port the PGADMIN Docker container is currently mapped to.

Copy the local address that VS Code provides and paste it into your browser. This will load the PGADMIN login interface; input the email address and password you configured in your docker command. Once the page has loaded, follow the steps below:

  • Right-click on ‘Servers’ at the upper left corner and select Register > Server.
  • On the General Tab, create a Random Name of your choice.
  • On the Connection Tab, fill in the connection details you noted earlier (hostname/address, port 5432, maintenance database, and username), then input the password.
  • Click on Save.
  • Navigate to Databases > Database you created > Schemas > Tables to view the tables you created.
  • Right-click on the table you created and select the Query Tool.

You can go ahead and play around with the query tool.

Note: We are able to connect to the database using its external IP address only because we initially created a firewall rule that allows any network to connect to it. However, this is not best practice; for security reasons, it’s usually better to specify which IP addresses should have access to the database.

Deploying the Container on Cloud Run

Now that we have tested the scripts and confirmed that the connection to the database is working, the next step is to deploy the application on Google Cloud Run. You should be familiar with the process; I covered the steps to deploy an application on Google Cloud Run extensively in my third article.

Here’s a summary of the steps for your reference:

  1. Build the Docker image — you can find the Dockerfile in the repository.
  2. Push the Docker image to Google Artifact Registry.
  3. Deploy the image using a Google Cloud Run job.
  4. Set up a scheduler using Cloud Scheduler.

Conclusion

In conclusion, we have covered the steps to build a data pipeline that batches data into a Postgres database. We started by setting up and configuring a Postgres database using a local Docker image and Google Cloud SQL. Then, we wrote Python scripts to perform two jobs: a migration batch job and a batch job that writes data into the Postgres database every hour. Finally, we deployed the application on Google Cloud Run and set up a scheduler using Cloud Scheduler.

By following these steps, you should now be able to transfer data from your Google Sheets into a secure storage location, the Postgres database, in an automated and efficient manner. This pipeline can serve as a starting point for you to build upon and customize according to your needs. The possibilities are endless, and with the power of cloud computing, you can take your data processing to new heights.

Next Article: Using Prefect Cloud to Orchestrate your Data Pipelines.
