AWS Glue — ETL Microsoft SQL Server to AWS Aurora — MySQL

Krishna Sadasivam
4 min read · Dec 29, 2019


AWS Glue can be used to extract, transform, and load data from a Microsoft SQL Server (MSSQL) database into an AWS Aurora MySQL (Aurora) database. AWS Glue has three core components: Data Catalog, Crawler, and ETL job.

Data Catalog: Serves as a metadata repository. It contains the schema of the data store, not the actual data from the data store.

Crawler: Populates the Data Catalog with metadata about a data store (source or target). When the crawler runs, it scans the data store to determine the format, schema, and other properties of the data and writes that information into the Data Catalog as metadata.

ETL job: Contains the business logic that performs the data processing. It is made up of a transformation script (Python or Scala only), connection info for the data source/target, and optional job arguments and parameters.

Overview:

The Crawler establishes a connection with the MSSQL database and scans the tables to find the table schema and properties. The MSSQL table schema and properties are then written as metadata into the Data Catalog. The ETL job uses the MSSQL metadata in the Data Catalog to define/author the transformation logic. On a job run, the ETL job establishes a connection with the MSSQL database, extracts the data, and applies the transformation logic. The transformation results are temporarily staged in S3 if needed before being bulk-loaded into the Aurora database.

Populating data catalog:

The MSSQL database and Aurora database are hosted in AWS. Connections to these databases are established using JDBC drivers. The Glue Crawler connects to the MSSQL/Aurora databases over JDBC. Since the source/target is a relational database, the Crawler uses built-in classifiers to infer the format and schema of the data present in the tables. The inferred format and schema are then written to the Data Catalog.
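A crawler over a JDBC source like this can be defined programmatically. The sketch below assembles the arguments that would be passed to `glue.create_crawler` via boto3; all names here (the crawler, role, catalog database, connection, and path) are hypothetical placeholders, not values from this setup:

```python
def build_crawler_config(name, role_arn, catalog_db, connection_name, path):
    """Assemble the arguments for glue.create_crawler targeting a JDBC data store."""
    return {
        "Name": name,
        "Role": role_arn,                 # IAM role the crawler assumes
        "DatabaseName": catalog_db,       # Data Catalog database that receives the metadata
        "Targets": {
            "JdbcTargets": [
                {"ConnectionName": connection_name, "Path": path}
            ]
        },
    }

config = build_crawler_config(
    "mssql-source-crawler",
    "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "mssql_catalog",
    "mssql-jdbc-connection",
    "sourcedb/dbo/%",   # database/schema/% crawls every table in the schema
)
# With boto3 installed and credentials configured, the crawler would be
# registered and started with:
#   glue = boto3.client("glue")
#   glue.create_crawler(**config)
#   glue.start_crawler(Name=config["Name"])
```

The `Path` uses Glue's JDBC include-path convention (`database/schema/table`, with `%` as a wildcard), which is how a single crawler can pick up every table in a schema.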

Authoring ETL job:

The ETL job contains the business logic to transform the MSSQL data into a format known to the Aurora database. To author the job, the source tables and target tables must be selected from the Data Catalog along with the database connection information. The target tables are optional; if they are missing, the job will create them. The job also accepts additional arguments/parameters that may be needed by the job-processing environment. To perform the data processing, Glue provides a set of built-in transform constructs. Multiple transforms can be chained together, with data passing from one transform to the next. The data is carried in a data structure called a DynamicFrame. A DynamicFrame is Glue's extension of the Spark SQL DataFrame; unlike a DataFrame, it does not require a fixed schema up front and can resolve schema inconsistencies as it processes records. A few important built-in transforms provided by Glue are: ApplyMapping, Filter, Join, Map, ResolveChoice, and SplitRows. Finally, the authored job script is saved to an S3 bucket and later loaded into the job-processing environment during job execution.
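To make the most commonly used transform concrete, here is a minimal pure-Python sketch of what ApplyMapping does: rename source columns and cast them to the target types. The real transform operates on a DynamicFrame inside the Spark runtime and takes the same 4-tuple mapping shape; the field names below are hypothetical:

```python
def apply_mapping(records, mappings):
    """Rename and cast fields, mimicking Glue's ApplyMapping transform.

    mappings: list of (source_field, source_type, target_field, target_type),
    the same 4-tuple shape that ApplyMapping.apply(..., mappings=...) expects.
    """
    casters = {"int": int, "string": str, "double": float}
    out = []
    for rec in records:
        row = {}
        for src, _src_type, dst, dst_type in mappings:
            if src in rec:
                row[dst] = casters[dst_type](rec[src])
        out.append(row)
    return out

# Hypothetical MSSQL rows being reshaped for an Aurora target table.
rows = [{"CustomerID": "42", "CustomerName": "Acme"}]
mapped = apply_mapping(
    rows,
    [("CustomerID", "string", "customer_id", "int"),
     ("CustomerName", "string", "customer_name", "string")],
)
# mapped == [{"customer_id": 42, "customer_name": "Acme"}]
```

In an actual Glue script the same mapping list would be handed to `ApplyMapping.apply` on a DynamicFrame read from the catalog, and the result written out through a JDBC sink to Aurora.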

ETL job execution:

Triggers are used to start the ETL job execution. The trigger conditions can be on-demand or scheduled. On a trigger, the ETL job starts by preparing an Apache Spark execution environment and loading the saved job script from the S3 bucket, along with any job parameters/arguments that were provided. The job then initializes the Glue context and Spark context, passes them to the job script, and executes the script. The script makes a connection to the MSSQL database, pulls the source table data, passes the data through the transforms, stages intermediate results in an S3 bucket, and finally loads the processed data into the Aurora database.
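A scheduled trigger of this kind can also be defined with boto3. The sketch below builds the arguments for `glue.create_trigger`; the trigger name, job name, and schedule are hypothetical examples:

```python
def build_scheduled_trigger(name, job_name, cron_expr):
    """Assemble the arguments for glue.create_trigger with Type='SCHEDULED'."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron_expr})",     # Glue uses cron(...) schedule expressions
        "Actions": [{"JobName": job_name}],   # job(s) to start when the trigger fires
        "StartOnCreation": True,
    }

# Hypothetical nightly run at 02:00 UTC.
trigger = build_scheduled_trigger("nightly-etl", "mssql-to-aurora", "0 2 * * ? *")
# With credentials configured:
#   boto3.client("glue").create_trigger(**trigger)
# An on-demand run, by contrast, is simply:
#   glue.start_job_run(JobName="mssql-to-aurora")
```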

Architecture:

The following components are needed to complete the AWS Glue ETL system.

  1. VPC
  2. Subnet
  3. Security Group
  4. Route table
  5. NAT Gateway
  6. Internet Gateway
  7. VPC peering
  8. S3 VPC endpoint

By default, AWS Glue execution instances use private IP addresses on the elastic network interfaces (ENIs) they create in the specified VPC/subnet. The Internet Gateway is used to reach internet/public addresses outside of AWS. Route tables route the traffic within the VPC and across private/public subnets. The VPC uses security groups as virtual firewalls to allow or block inbound/outbound traffic based on protocol and port.
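One security-group detail worth noting: AWS Glue requires its connection's security group to have a self-referencing inbound rule that allows all TCP traffic from the same group, so Glue workers can communicate with each other. The sketch below builds the arguments for `ec2.authorize_security_group_ingress`; the group ID is a hypothetical placeholder:

```python
def self_referencing_ingress(group_id):
    """Assemble arguments for ec2.authorize_security_group_ingress that
    allow all TCP traffic whose source is the security group itself."""
    return {
        "GroupId": group_id,
        "IpPermissions": [{
            "IpProtocol": "tcp",
            "FromPort": 0,
            "ToPort": 65535,
            "UserIdGroupPairs": [{"GroupId": group_id}],  # source = the same group
        }],
    }

rule = self_referencing_ingress("sg-0abc1234def567890")   # hypothetical SG ID
# With credentials configured:
#   boto3.client("ec2").authorize_security_group_ingress(**rule)
```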

Since the Aurora database and AWS Glue run in separate VPCs within the same AWS account, VPC peering is used to connect them instead of sending the traffic through the internet gateway. Similarly, traffic from AWS Glue to S3 flows through an S3 VPC endpoint rather than over the public internet.
