“Connecting the Dots: The Ultimate Guide to Glue Connections for Diverse Sources”

Glue connections for multiple data sources.

Ram Ambadas Dandale
Ankercloud Engineering
7 min readJan 18, 2024



In this article, we’re diving into AWS Glue connections, a crucial piece of modern data management. Think of a connection as the tool that brings all your data sources together and makes sure they work seamlessly. With information coming from so many different places, knowing how to set up these connections can make a real difference in how a business handles its data. So we’ll break it down in plain language: the strategies and tricks that make Glue connections a game-changer for smoother pipelines. Let’s get into it!

Accessing AWS Glue Connections

  • In the AWS Management Console, locate the “Find Services” search bar. Type “Glue” and select the AWS Glue service from the results. Click on it to access the console.
  • In the AWS Glue Console, look for the navigation pane on the left. Click on “Connections” under the “Data catalog” section.
AWS Glue Console

Creating Glue Connection

On the “Connections” page, click the “Add Connection” button to create a new connection.

Step 1 — Choosing Data Source

  • Choose your data source from the list (e.g., DocumentDB, OpenSearch Service, Kafka, BigQuery, MongoDB, JDBC, S3, Redshift, SAP HANA, Snowflake, Teradata). The selection process is simple and efficient.
Choosing Data Source

Step 2 — Configure Connection

  • Provide the necessary details for your connection, such as the database JDBC URL, username, and password. For added security, store your database credentials in AWS Secrets Manager first and then reference the secret from the Glue connection.
Configuring Connection Details

Step 3 — Setting up Properties

  • Complete the connection properties, including the connection name and SSL connection settings. Ensure a secure and efficient connection by providing the necessary details.
Setting up Properties

Step 4 — Reviewing and Creating

  • After entering the details, review the connection to ensure it’s configured correctly. Then, click on ‘Create’ to complete the process.
Reviewing and Creating
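The console walkthrough above can also be scripted with boto3. The sketch below is a minimal, hedged example, not the only way to do it: the connection name, JDBC URL, and secret name are placeholder assumptions, and the payload uses the `JDBC_CONNECTION_URL` and `SECRET_ID` connection property keys so no plaintext credentials land in the connection definition.

```python
def build_connection_input(name, jdbc_url, secret_id):
    """Assemble the ConnectionInput payload for glue.create_connection().

    Referencing an AWS Secrets Manager secret (SECRET_ID) keeps plaintext
    credentials out of the connection definition.
    """
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "SECRET_ID": secret_id,
        },
    }


def create_connection(payload):
    # Requires valid AWS credentials; boto3 is imported here so the
    # builder above stays usable without the AWS SDK installed.
    import boto3
    boto3.client("glue").create_connection(ConnectionInput=payload)
```

Calling `create_connection(build_connection_input(...))` then produces the same connection that the console wizard creates in Steps 1–4.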

Step 5 — Testing the connection

  • After creating the connection, we can test it by navigating to ‘Actions’ and then clicking on ‘Test Connection’.
Testing the connection
  • We need to provide an IAM role created for Glue; this role grants Glue the permissions it needs to access AWS resources. Once provided, Glue attempts to connect to the database and reports a response indicating success or failure.
Test Connection Response
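As for that IAM role, the AWS-managed AWSGlueServiceRole policy covers the Glue basics; when the credentials live in AWS Secrets Manager, the role also needs read access to that specific secret. A minimal sketch of such a statement follows, where the region, account ID, and secret name are placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:ap-south-1:111122223333:secret:prod/postgres/credentials-*"
    }
  ]
}
```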

Multiple Data Source Connection

A Glue connection opens the door to smooth data management. Using the connection, we can create a Glue crawler that catalogs the source’s metadata in the Glue Data Catalog, producing a structured database and tables. From there, a Glue job can use the same connection to transform raw data and load it into a target destination (e.g., a data lake or data warehouse).
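To make the crawler step concrete, here is a hedged boto3 sketch. The crawler name, role ARN, catalog database, and JDBC include path are placeholder assumptions; the crawler reaches the source through an existing Glue connection and writes the discovered tables into the Data Catalog.

```python
def build_crawler_config(name, role_arn, connection_name, database, path):
    """Assemble the keyword arguments for glue.create_crawler().

    The crawler uses an existing Glue connection to reach the source and
    writes the discovered table metadata into the given catalog database.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "JdbcTargets": [
                {"ConnectionName": connection_name, "Path": path}
            ]
        },
    }


def create_and_start_crawler(config):
    # Requires valid AWS credentials; imported here so the builder
    # above stays usable without the AWS SDK installed.
    import boto3
    glue = boto3.client("glue")
    glue.create_crawler(**config)
    glue.start_crawler(Name=config["Name"])
```

For a JDBC target, the `Path` follows the `database/schema/%` include-path convention, so `my_database/%` would crawl every table in that database.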

1). Connecting With MongoDB, MongoDB Atlas and DocumentDB.

To connect to MongoDB, MongoDB Atlas, or DocumentDB, we need to provide the connection URL in the appropriate format:

MongoDB Connection URL → mongodb://<hostname>:<port>/<database>
DocumentDB Connection URL → mongodb://<hostname>:<port>/<database>
MongoDB Atlas Connection URL → mongodb+srv://<cluster>.mongodb.net/<database>
Port No → 27017
DocumentDB Connection UI
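The three URL shapes above differ only in scheme and port handling, which a small helper makes explicit. This is an illustrative sketch; the hostnames and database names in the usage are placeholders.

```python
def mongo_url(hostname, database, port=27017, srv=False):
    """Build a MongoDB/DocumentDB connection URL.

    Plain mongodb:// URLs (MongoDB, DocumentDB) include an explicit port,
    27017 by default. Atlas uses the DNS-seed mongodb+srv:// scheme,
    which takes no port.
    """
    if srv:
        return f"mongodb+srv://{hostname}/{database}"
    return f"mongodb://{hostname}:{port}/{database}"
```

For example, `mongo_url("docdb-cluster.example.com", "mydb")` yields a DocumentDB-style URL, while `mongo_url("cluster0.mongodb.net", "mydb", srv=True)` yields the Atlas form.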

2). Connecting with on-premises Database using JDBC URL.

Connecting to an on-premises database over JDBC (Java Database Connectivity) is the fundamental step in linking our application to a database that lives locally. Craft the JDBC URL by specifying the appropriate protocol, host, port, and database details.

JDBC Connection URL → jdbc:<protocol>://<hostname>:<port>/<database>


Examples

MySQL JDBC Connection URL → jdbc:mysql://mysql.db.server:3306/my_database

Oracle JDBC Connection URL → jdbc:oracle:thin:@//myoracle.db.server:1521/my_servicename

PostgreSQL JDBC Connection URL → jdbc:postgresql://postgresql.db.server:5432/my_database

Microsoft SQL Server JDBC Connection URL → jdbc:sqlserver://mssql.db.server\mssql_instance:1433;databaseName=my_database
JDBC Connection UI
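Most of the examples above follow the same `jdbc:<protocol>://<host>:<port>/<database>` shape; SQL Server (semicolon-separated properties) and Oracle thin (the `@//` prefix) deviate. A hedged helper over the common cases, using the usual vendor default ports and ignoring extras such as a SQL Server named instance:

```python
# Usual vendor default ports; override with the port argument if needed.
DEFAULT_PORTS = {
    "mysql": 3306,
    "postgresql": 5432,
    "sqlserver": 1433,
    "oracle:thin": 1521,
}


def jdbc_url(protocol, host, database, port=None):
    """Build a JDBC URL, special-casing the two non-slash dialects."""
    port = port or DEFAULT_PORTS[protocol]
    if protocol == "sqlserver":
        return f"jdbc:sqlserver://{host}:{port};databaseName={database}"
    if protocol == "oracle:thin":
        return f"jdbc:oracle:thin:@//{host}:{port}/{database}"
    return f"jdbc:{protocol}://{host}:{port}/{database}"
```

For instance, `jdbc_url("mysql", "mysql.db.server", "my_database")` reproduces the MySQL example above.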

3). Connecting with AWS databases like Aurora, MariaDB, MySQL, Microsoft SQL Server, Oracle Database, and PostgreSQL.

Connecting to AWS SQL databases, including Aurora, MariaDB, MySQL, Microsoft SQL Server, Oracle Database, and PostgreSQL, is a breeze. The database instance, name, and username are automatically fetched, simplifying the connection process for these pre-configured databases on the AWS platform.

AWS SQL Database Connection UI

4). Connect to Amazon OpenSearch Service instances.

To connect to an OpenSearch cluster, provide the endpoint and OpenSearch cluster port number, so we can establish the connection.

OpenSearch Domain Endpoint → https://test-abcdefghijklmnopqrstuvw33xyz.ap-south-1.es.amazonaws.com
Port No → 443
OpenSearch Cluster Connection UI

5). Connect to Amazon Redshift.

To establish a connection with the Redshift data warehouse on the AWS platform, you need to input the cluster name, database name, username, and password. The cluster name can be conveniently selected from a dropdown menu.

Redshift Connection UI

6). Connect to streaming data in Apache Kafka.

Kafka connections come in two types: Amazon Managed Streaming for Apache Kafka (MSK) and customer-managed Apache Kafka. In both cases, the Kafka bootstrap server URLs must be provided. If the Kafka cluster lives in another AWS account, choose Amazon MSK; if the cluster is not hosted on AWS at all, choose customer-managed Apache Kafka.

Kafka bootstrap server URLs →
b-1.vpc-test-2.o4q88o.c6.kafka.ap-south-1.amazonaws.com:9094,
b-2.vpc-test-2.o4q88o.c6.kafka.ap-south-1.amazonaws.com:9094,
b-3.vpc-test-2.o4q88o.c6.kafka.ap-south-1.amazonaws.com:9094
Kafka Connection UI
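Scripted, a Kafka connection boils down to joining the broker endpoints into a single comma-separated string, as in the hedged sketch below. The connection name and broker hostnames are placeholders; `KAFKA_BOOTSTRAP_SERVERS` and `KAFKA_SSL_ENABLED` are the connection property keys this setup assumes.

```python
def build_kafka_connection_input(name, bootstrap_servers, ssl=True):
    """Assemble a ConnectionInput payload for a KAFKA-type Glue connection.

    bootstrap_servers is a list of host:port broker endpoints; Glue
    expects them as one comma-separated string.
    """
    return {
        "Name": name,
        "ConnectionType": "KAFKA",
        "ConnectionProperties": {
            "KAFKA_BOOTSTRAP_SERVERS": ",".join(bootstrap_servers),
            "KAFKA_SSL_ENABLED": "true" if ssl else "false",
        },
    }
```

The resulting payload can be passed to `glue.create_connection(ConnectionInput=...)` exactly as in the JDBC sketch earlier.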

7). Connect to Azure Cosmos DB [NoSQL].

For the connection, we need to provide the Azure Cosmos DB account endpoint URI. The username and password can be stored in Secrets Manager and then selected from a dropdown menu to establish the connection.

Cosmos DB Endpoint URI → https://ResourceName.documents.azure.com:443/
Cosmos DB Connection UI

8). Connect to Azure SQL Database.

Connecting requires the Azure SQL Server URL, with the option to store the username and password securely in Secrets Manager. This makes setup easy: just select the credentials from a dropdown menu.

Azure SQL Server URL → jdbc:sqlserver://mssql.db.server\mssql_instance:1433;databaseName=my_database
Azure SQL Server URL Connection UI

9). Connect to Google BigQuery.

For Google BigQuery, the credentials must be added to Secrets Manager first and then selected from a dropdown menu for the connection.

Google BigQuery Connection UI

10). Configure network access for Glue.

To connect to a database on a different network, we need to provide the VPC, subnet, and security group associated with that database. Here, we do not need to provide a connection URL, username, and password; these details can be supplied while reading data in a Glue job.

Network Connection UI
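A network connection can be scripted the same way as the others; only the VPC placement details are required. In this hedged sketch the subnet ID, security group IDs, and Availability Zone are placeholders.

```python
def build_network_connection_input(name, subnet_id, security_group_ids, az):
    """Assemble a ConnectionInput payload for a NETWORK-type connection.

    A NETWORK connection carries only VPC placement details; the JDBC
    URL and credentials are supplied later, inside the Glue job itself.
    """
    return {
        "Name": name,
        "ConnectionType": "NETWORK",
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": security_group_ids,
            "AvailabilityZone": az,
        },
    }
```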

11). Connect to SAP HANA Database.

For the SAP HANA database, the credentials must be added to Secrets Manager first and then selected from a dropdown menu for the connection.

SAP HANA URL → jdbc:sap://exampleDatabaseServer:30015/?databaseName=sampleDatabase
SAP HANA Database Connection UI

12). Connect to Snowflake.

Establishing a connection with Snowflake starts with adding the credentials to Secrets Manager. You can then conveniently select them from a dropdown menu to complete the connection.

Snowflake URL → jdbc:snowflake://<account_name>.snowflakecomputing.com/?db=<database_name>&warehouse=<warehouse_name>&role=<role_name>&schema=<schema_name>
Snowflake Connection UI
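Since the Snowflake URL packs its settings into query parameters, a small helper keeps the format straight. This is an illustrative sketch; the account and object names in the usage are placeholders.

```python
from urllib.parse import urlencode


def snowflake_jdbc_url(account, database, warehouse, role, schema):
    """Build a Snowflake JDBC URL with its settings as query parameters."""
    params = urlencode({
        "db": database,
        "warehouse": warehouse,
        "role": role,
        "schema": schema,
    })
    return f"jdbc:snowflake://{account}.snowflakecomputing.com/?{params}"
```

For example, `snowflake_jdbc_url("myaccount", "sales", "compute_wh", "analyst", "public")` fills in every placeholder of the template above.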

13). Connect to Teradata Vantage.

Connecting to Teradata likewise starts with adding credentials to Secrets Manager; you can then choose them from a dropdown menu to finalize the connection.

Teradata URL → jdbc:teradata://myserver.com/DATABASE=mydatabase,TMODE=TERADATA
Teradata Connection UI

14). Connect to Vertica.

Vertica follows the same pattern: store the credentials in Secrets Manager first, then select them from the dropdown menu to complete the connection.

Vertica Hostname → Public IP or hostname of our Vertica Server.
Port → 5433.
Vertica Connection UI

If you’re interested in exploring ways to write Glue jobs beyond Glue connections, I’ve written a separate blog on that topic. You can check it out here.

Conclusion

In this blog, we explored the process of establishing a Glue connection and the multifaceted nature of connecting with various data sources. We delved into the specific requirements for each source, including crucial details such as hostname, port, cluster name, instance name, username, and password. We also touched on the IAM role permissions that let Glue seamlessly read data from the specified data sources.
