Loading on-prem Postgres data into Amazon S3 with server-side filtering

Sven Leiß
Published in awsblackbelt
Mar 10, 2023

Many customers want to migrate their on-prem data to the cloud or analyse it there. Most of the time they also want to keep their familiar load patterns, such as daily Extract, Transform and Load (ETL) jobs, rather than a real-time replication. With AWS Glue it is possible to create data load jobs that run on a schedule and perform the necessary extraction of on-prem data.

Within the AWS Glue in a Nutshell story you can find a detailed description of what AWS Glue really is.

In this scenario it is necessary to perform a daily load and, more importantly, to run the filtering on the database server because the table is very large. Filtering on the AWS side would mean higher traffic and more cost in the cloud. We analyzed different options and came up with an easy and cost-effective solution.

Solution with AWS Glue (partially via Visual Editor)

Glue offers a fairly new and intuitive way to load data on a schedule using the Visual Editor in Glue Studio. Although it is the most obvious solution, it comes with two obstacles that you may not expect at the beginning:
1. The Visual Editor won’t allow you to connect with Postgres using username/password authentication
2. The Visual Editor won’t allow you to perform a server side filtering

In our final solution we make use of the advantages of the Visual Editor before we switch to the script and make the adjustments there that overcome the obstacles named above. In the Visual Editor we add all the necessary sources, targets and actions. We load the data using the Glue catalog table (which has been crawled before). An SQL Query transformation adds the current timestamp to the Data Frame so that it can be analyzed later. Eventually we write to two targets in parallel: one writes partitioned Parquet files to S3, the other provides the data as unpartitioned CSV.

After doing this, the Visual Editor will look like this:

AWS Glue Studio Visual Editor
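
For orientation, the parts of the generated script that correspond to the SQL Query transform and the two targets look roughly like the sketch below (node names, the `load_ts` column, bucket paths and the partition column are placeholders, not taken from the actual job):

from awsglue.dynamicframe import DynamicFrame

spark = glueContext.spark_session

# Helper as generated by Glue Studio: register the input as a temp view,
# run the SQL and wrap the result in a DynamicFrame again
def sparkSqlQuery(glueContext, query, mapping, transformation_ctx):
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    result = spark.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)

# Add the load timestamp on top of the source node
SQLQuery_node2 = sparkSqlQuery(
    glueContext,
    query="SELECT *, current_timestamp() AS load_ts FROM source",
    mapping={"source": TestDB_node123456789},
    transformation_ctx="SQLQuery_node2",
)

# Target 1: partitioned Parquet on S3
ParquetSink_node3 = glueContext.write_dynamic_frame.from_options(
    frame=SQLQuery_node2,
    connection_type="s3",
    format="glueparquet",
    connection_options={
        "path": "s3://<bucket>/parquet/",
        "partitionKeys": ["<partition_column>"],
    },
    transformation_ctx="ParquetSink_node3",
)

# Target 2: unpartitioned CSV on S3
CsvSink_node4 = glueContext.write_dynamic_frame.from_options(
    frame=SQLQuery_node2,
    connection_type="s3",
    format="csv",
    connection_options={"path": "s3://<bucket>/csv/"},
    transformation_ctx="CsvSink_node4",
)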

If we started this job without any adjustments, we would get the following error:

An error occurred while calling o89.getCatalogSource. None.get

That's why we need to edit the script, which means cloning the job first. Later we only copy the script over to the already cloned job after making changes in the Visual Editor. To address the issue we change the source part of the script from this:

TestDB_node123456789 = glueContext.create_dynamic_frame.from_catalog(
    database="test-db",
    table_name="database_schema_data_itemised",
    transformation_ctx="TestDB_node123456789",
)

to this (username and password can be obtained from Secrets Manager, as sketched below):

TestDB_node123456789 = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://<server>:<port>/<database>",
        "user": username,
        "password": password,
        "dbtable": "schema.data_itemised",
    },
)
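
The username and password themselves should not be hard-coded. A minimal sketch of fetching them with boto3 from Secrets Manager (the secret name `postgres/etl-user` and its JSON keys are assumptions, not taken from the actual setup):

import json
import boto3

# Assumption: the secret stores a JSON document like {"username": "...", "password": "..."}
secrets_client = boto3.client("secretsmanager")
secret = json.loads(
    secrets_client.get_secret_value(SecretId="postgres/etl-user")["SecretString"]
)
username = secret["username"]
password = secret["password"]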

After making this change we can load all the data from the source table, which can take some time and cost money if the table is big. That's why we also add the pre-filtering to the source, which can be done using the `sampleQuery` option. Here is the code for it:

TestDB_node123456789 = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://<server>:<port>/<database>",
        "user": username,
        "password": password,
        "dbtable": "schema.data_itemised",
        "sampleQuery": "SELECT * FROM schema.data_itemised WHERE custom_attribute_1 = 'K' AND session_end BETWEEN (current_date - INTERVAL '1 day')::date AND (current_date)::date",
    },
)
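
Because the WHERE clause is part of the `sampleQuery`, Postgres evaluates the filter itself and only the previous day's rows are transferred to Glue, which keeps both the network traffic and the job runtime down.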

Solution Alternatives

Alternative 1 — AWS Database Migration Service (DMS)

DMS offers the option to perform a replication or a full load into AWS. A (continuous) replication would create multiple files during the day and cause unnecessary cost, as the replication instance would need to run 24/7. We could also perform a full load together with a filter transformation, but this would require a schedule with EventBridge as well as some cost optimization in the form of dynamically created replication instances: a Step Functions workflow would be necessary to create the DMS resources, run the full load and destroy the resources at the end (a rough sketch of these calls follows below). This approach also lacks flexibility when it comes to scaling, as the necessary instance size needs to be known beforehand.
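
To illustrate why this felt heavier than the Glue job, here is a rough boto3 sketch of the calls such an orchestration would have to chain (instance class, identifiers and the task ARN are placeholders):

import boto3

dms = boto3.client("dms")

# Create a replication instance only for the duration of the load
instance = dms.create_replication_instance(
    ReplicationInstanceIdentifier="daily-full-load",
    ReplicationInstanceClass="dms.t3.medium",
    AllocatedStorage=50,
)

# ... wait until the instance is available and the replication task exists,
# then start the full load against the target
dms.start_replication_task(
    ReplicationTaskArn="<replication-task-arn>",
    StartReplicationTaskType="reload-target",
)

# ... wait until the task has finished, then tear the instance down again
dms.delete_replication_instance(
    ReplicationInstanceArn=instance["ReplicationInstance"]["ReplicationInstanceArn"]
)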

Alternative 2 — AWS Glue DataBrew recipes

DataBrew gives you an easy, visual tool to develop data transformation jobs, but it lacks the possibility to pre-filter data. You can connect to your on-prem database using the Data Catalog, but DataBrew will always load the whole table before it applies any filters.

Alternative 3 — Amazon Athena Federated Query

Athena allows you to query on-premises databases as well as other data sources directly. We have not tested this alternative yet. Look here for more information and let us know if you have tried it :-)

Further serverless data processing is described in our Serverless Data Processing on AWS story.

About the Authors:

Sven Leiß

My name is Sven Leiss and I am a 5x certified AWS enthusiast and AWS Migration Blackbelt. I have been working in the AWS space for the past 7 years and have extensive knowledge of the AWS platform and its various services. I am passionate about helping customers get the most out of the cloud and have a great track record of successful implementations.

Robert Rößler

My name is Robert Rößler and I am a 3x certified AWS enthusiast. I have been working in the AWS space for the past 4 years and have extensive knowledge of the AWS platform and its various services. I am passionate about helping customers get the most out of their data migration and transformation journey into AWS.

We have extensive experience in designing and implementing cloud architectures using AWS services such as EC2, S3, Lambda and more. Our team is well versed in Data and Analytics, DevOps and AWS cloud migration journeys.

If you are looking for experienced AWS experts, we would be more than happy to help. Feel free to contact us to discuss your cloud needs and see how we can help you get the most out of the AWS cloud.
