Data Migration: A Simple Way to Move Data from On-Premises to Google Cloud (Technical)

A Simple Strategy for Data Migration

Zidane Ibrahim
EDTS
7 min read · Jun 26, 2024


One common problem with On-Premises servers for Big Data is the limitation of data storage and processing. If the volume of data grows faster than expected, the On-Premises infrastructure may not be able to handle it, so it is common to move the data to a new infrastructure with larger storage. However, the high cost, the large server space, and the time-consuming implementation and migration can consume a lot of your (or your company’s) resources.

One solution worth considering for Big Data is the Cloud. When you use the Cloud, you do not need to think about implementing and maintaining the infrastructure yourself. Cloud costs are incurred as you go, for what you use now, while On-Premises costs are largely paid up front for future capacity. If you use the Cloud only as needed, you will save money. Google Cloud BigQuery (BigQuery) is one such Cloud service for data storage.

BigQuery is a fully managed, serverless data warehouse provided by Google Cloud Platform (GCP). Its advantages include the ability to store data without worrying about storage limitations and fast data processing. In this article, we will walk through a simple, technical data migration process from a PostgreSQL database on an On-Premises server to BigQuery, from the solution strategy through loading the data into BigQuery.

Data Migration Technical Strategy

Before the migration process, we need to make sure of a few things: the Google Cloud SDK is installed on the Linux On-Premises (or local) server and connected to the destination Google Cloud project, and the On-Premises server has internet access and is authenticated with your GCP account. These are our preparations before migrating.
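
As a quick check, commands like the following can be run on the On-Premises server. This is a minimal sketch, assuming the Google Cloud SDK is already installed; my-gcp-project is a placeholder project ID.

# Authenticate the On-Premises server with your GCP account
gcloud auth login

# Point the SDK at the destination project (my-gcp-project is a placeholder)
gcloud config set project my-gcp-project

# Verify that the server can reach Google Cloud Storage
gsutil ls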

The data migration from On-Premises to the Cloud is done using gsutil. Using gsutil for this migration is considered simpler because it only requires a stable internet connection, and costs are only incurred for storage in Google Cloud Storage (GCS).
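
If the destination bucket does not exist yet, you can create it and test a small upload first. This is a minimal sketch; the bucket name and region are placeholders.

# Create the destination bucket in GCS (the region is a placeholder)
gsutil mb -l asia-southeast2 gs://migration_data_result/

# Test the connection with a small upload, then clean it up
echo "connectivity test" > test.txt
gsutil cp test.txt gs://migration_data_result/
gsutil rm gs://migration_data_result/test.txt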

Figure 2. Data Migration Strategy Diagram.

The diagram of the data migration strategy is shown in Figure 2. This simple process starts by exporting the data from PostgreSQL into a .txt file. The data is then uploaded to GCS using a bash command. After that, the file from GCS is loaded into BigQuery.

The data is exported into .txt files because gsutil is not designed to move or manage databases directly. Therefore, the database contents need to be converted into objects first, such as .txt or .csv files. The export is discussed further in the next section.

To make the flow easier to illustrate, an entity relationship diagram (ERD) of the example data we will migrate is shown in Figure 3. The focus of the migration is the data from dw.trx_driver_daily.

Figure 3. Example of ERD from PostgreSQL.

Export Data from On-Premises to Google Cloud Storage

Before we go into the data export, we will first discuss connecting to PostgreSQL. You need a .pgpass file to store your PostgreSQL credentials so that you can log in automatically, without entering your password every time you connect. Each line of the .pgpass file has the following format.

hostname:port:database:username:password

Note:
hostname : hostname or IP address of the PostgreSQL server.
port : port number of the PostgreSQL server (default is 5432).
database : name of the database on the PostgreSQL server to connect to.
username : username used to connect to the PostgreSQL server.
password : password used to connect to the PostgreSQL server.

The following is an example of the contents of a .pgpass file:

0.0.0.0:5432:mydatabase:user:password

As the .pgpass file contains the login credentials of the PostgreSQL database you want to migrate, make sure that the file is readable only by its owner; PostgreSQL ignores the file if its permissions are more permissive than 0600. You can set the file permissions on Linux with the following command.

chmod 600 ~/migration_data/.pgpass

Next, the data is exported into .txt files using a template script. The template makes it easy to export any table from the On-Premises database. You can create the template and save it as get_process_file.sh. Here is an example of the script.

#!/bin/bash
# Usage: sh get_process_file.sh "<query>" "<local folder>" "<output filename>" "<GCS bucket/folder>"

# Configuration of PostgreSQL Connection
# Example
PG_HOST="0.0.0.0"
PG_PORT="5432"
PG_DBNM="mydatabase"
PG_USER="user"

# SQL Query to Export Data to CSV Format (tab-delimited, every value quoted)
QUERY="COPY ($1) TO STDOUT WITH (FORMAT CSV, HEADER TRUE, DELIMITER E'\t', FORCE_QUOTE *)"

# Point libpq to the .pgpass file created earlier
# Example
export PGPASSFILE="$HOME/migration_data/.pgpass"

# Export the query result to a local file, then move it to GCS
psql -h "$PG_HOST" -p "$PG_PORT" -d "$PG_DBNM" -U "$PG_USER" -c "$QUERY" -t -A > "$2/$3"
gsutil mv "$2/$3" "$4"

Note:
$1 : Query (example: "SELECT * FROM dw.trx_driver_daily")
$2 : On-Premises or Local Folder (example: "/migration_data/result")
$3 : Output Filename (example: "trx_driver_daily.txt")
$4 : Bucket/Folder in GCS (example: "gs://migration_data_result/")

The delimiter used is '\t' (tab). The template also adds FORCE_QUOTE * so that every value is quoted, which prevents the contents of one column from spilling into another. Next, you can run the script from a shell that has the Google Cloud SDK installed. The command can be customized according to the data you want to export. Here is an example of the command.

sh get_process_file.sh "SELECT * FROM dw.trx_driver_daily" "/migration_data/result" "trx_driver_daily.txt" "gs://migration_data_result/"

Next, you can check for the file in your destination bucket and folder in GCS. If it is there, we can move on to the next step.
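
For example, listing the destination folder should show the uploaded file. This is a minimal sketch using the bucket and file names from the example above.

# List the contents of the destination bucket/folder
gsutil ls gs://migration_data_result/

# Optionally check the size of the uploaded object
gsutil ls -l gs://migration_data_result/trx_driver_daily.txt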

Extract, Transform, and Load Data into Google Cloud BigQuery

This is the last step: loading the data from GCS into BigQuery. The final goal is to load the data into the DW table in BigQuery. The way to connect the .txt files in GCS to BigQuery is to create an External Table in BigQuery that reads the .txt files directly from GCS. This saves time and simplifies the integration between GCS and BigQuery.

The External Table in BigQuery is defined with all columns as STRING and serves as a temporary source table before the data is transformed and loaded into the DW table. Here is an example of its application to dw.trx_driver_daily. The result is shown in Figure 4.

-- Example of extract for table trx_driver_daily
CREATE EXTERNAL TABLE tmp.trx_driver_daily (
trx_id STRING,
trx_date STRING,
order_date STRING,
finished_date STRING,
from_address STRING,
from_latitude STRING,
from_longitude STRING,
to_address STRING,
to_latitude STRING,
to_longitude STRING,
mode STRING,
distance STRING,
amount_delivery STRING,
customer_id STRING,
driver_id STRING,
vehicle STRING,
vehicle_brand STRING,
load_time STRING)
OPTIONS (
format = "CSV",
uris = ["gs://migration_data_result/trx_driver_daily.txt"],
skip_leading_rows = 1,
allow_quoted_newlines = True,
allow_jagged_rows = True,
field_delimiter = "\t");
Figure 4. The result when the data extraction from GCS is successful.
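
As a quick sanity check, you can query the External Table with the bq command-line tool. This is a minimal sketch; the row count should match the source table in PostgreSQL.

# Count the rows read from the .txt file in GCS through the External Table
bq query --use_legacy_sql=false 'SELECT COUNT(*) AS total_rows FROM tmp.trx_driver_daily'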

Next is the transform and load process from the External Table to the DW table. This process adjusts the data types so that the data in BigQuery matches the data On-Premises. The PostgreSQL data types of dw.trx_driver_daily can be seen in Figure 3; each is mapped to a BigQuery equivalent. You can adjust the mapping for your own project. After this process is complete, you can compare the data between On-Premises and the Cloud. Here is an example of the adjustments made for dw.trx_driver_daily. The result is shown in Figure 5.

-- Example of transform and load for table trx_driver_daily
CREATE TABLE dw.trx_driver_daily AS
SELECT
CAST(trx_id AS INT64) AS trx_id,
PARSE_DATE("%Y-%m-%d", trx_date) AS trx_date,
PARSE_DATETIME("%Y-%m-%d %H:%M:%E*S", order_date) AS order_date,
PARSE_DATETIME("%Y-%m-%d %H:%M:%E*S", finished_date) AS finished_date,
from_address,
CAST(from_latitude AS FLOAT64) AS from_latitude,
CAST(from_longitude AS FLOAT64) AS from_longitude,
to_address,
CAST(to_latitude AS FLOAT64) AS to_latitude,
CAST(to_longitude AS FLOAT64) AS to_longitude,
mode,
CAST(distance AS FLOAT64) AS distance,
CAST(amount_delivery AS FLOAT64) AS amount_delivery,
customer_id,
driver_id,
vehicle,
vehicle_brand,
PARSE_DATETIME("%Y-%m-%d %H:%M:%E*S", load_time) AS load_time
FROM tmp.trx_driver_daily;
Figure 5. Result of Data Migration in BigQuery.
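
To compare the data between On-Premises and the Cloud, you can start with row counts on both sides. This is a minimal sketch; the connection values follow the earlier examples.

# Row count in the source PostgreSQL table
PGPASSFILE="$HOME/migration_data/.pgpass" psql -h 0.0.0.0 -p 5432 -d mydatabase -U user -c "SELECT COUNT(*) FROM dw.trx_driver_daily"

# Row count in the migrated BigQuery table
bq query --use_legacy_sql=false 'SELECT COUNT(*) AS total_rows FROM dw.trx_driver_daily'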

Note:

This data migration approach is not the best practice for migrating databases. However, if your goal is to minimize costs, it is one of the most effective options. The strategy is suitable for small data volumes and requires a stable connection when transferring files from the local server to the Cloud. The smaller the migrated data, the faster the upload process and the lower the risk of corruption.

If the volume of data being migrated is very large, it is recommended that the data be split into multiple files based on your desired filters. Alternatively, you can use migration services provided by Google Cloud Platform, such as Google Cloud Database Migration Service (DMS) or Google Cloud Data Transfer Appliance.

For example, if you have a large table and want to split it into several files, you can do so as follows.

sh get_process_file.sh "SELECT * FROM dw.trx_driver_daily WHERE trx_date = '2024-05-01'" "/migration_data/result" "trx_driver_daily_20240501.txt" "gs://migration_data/result/"
sh get_process_file.sh "SELECT * FROM dw.trx_driver_daily WHERE trx_date = '2024-05-02'" "/migration_data/result" "trx_driver_daily_20240502.txt" "gs://migration_data/result/"
sh get_process_file.sh "SELECT * FROM dw.trx_driver_daily WHERE trx_date = '2024-05-03'" "/migration_data/result" "trx_driver_daily_20240503.txt" "gs://migration_data/result/"

You just need to add a WHERE clause (or whatever filter you want) to the query and reflect the filter in the file name. These commands are then saved in a .sh file, for example collect_files.sh, which you can run with the following command.

sh collect_files.sh
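
Instead of writing one line per date, collect_files.sh could also generate the exports in a loop. This is a minimal sketch; the date range is an assumption.

#!/bin/bash
# Export dw.trx_driver_daily one day at a time (hypothetical date range)
for d in 2024-05-01 2024-05-02 2024-05-03; do
  fname="trx_driver_daily_$(echo "$d" | tr -d '-').txt"
  sh get_process_file.sh "SELECT * FROM dw.trx_driver_daily WHERE trx_date = '$d'" \
    "/migration_data/result" "$fname" "gs://migration_data_result/"
done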

Then, when extracting to BigQuery, you can add a * wildcard to the uris as follows.

...
uris = ["gs://migration_data/result/trx_driver_daily*.txt"],
...

In addition, be aware of BigQuery costs: BigQuery charges based on the amount of data queried and processed. Therefore, readers should pay attention to query optimization to stay cost-efficient. You can read this article if you want to optimize queries in BigQuery for cost efficiency: Unleashing BigQuery’s Potential.
