“Connecting the Dots: The Ultimate Guide to Glue Connections for Diverse Sources”
Glue connections for multiple data sources.
In this article, we’re diving into AWS Glue Connections, a crucial piece of modern data management. Think of a connection as the bridge that brings your data sources together and makes them work seamlessly. With information coming from many different places, knowing how to set up these connections can make a real difference in how a business handles its data. We’ll keep the jargon to a minimum and walk through the setup, plus the specifics for each data source. Let’s get into it!
Accessing AWS Glue Connections
- In the AWS Management Console, locate the “Find Services” search bar. Type “Glue” and select the AWS Glue service from the results. Click on it to access the console.
- In the AWS Glue Console, look for the navigation pane on the left. Click on “Connections” under the “Data catalog” section.
Creating Glue Connection
On the “Connections” page, click the “Add Connection” button to create a new connection.
Step 1 — Choosing Data Source
- Choose your desired data source (e.g., DocumentDB, OpenSearch Service, Kafka, BigQuery, MongoDB, JDBC, S3, Redshift, SAP HANA, Snowflake, Teradata).
Step 2 — Configure Connection
- Provide the necessary details for your connection, such as the database JDBC URL, username, and password. For added security, consider storing the database credentials in AWS Secrets Manager and referencing them from the Glue connection.
Step 3 — Setting up Properties
- Complete the connection properties, including the connection name and SSL settings.
Step 4 — Reviewing and Creating
- After entering the details, review the connection to ensure it’s configured correctly. Then, click on ‘Create’ to complete the process.
Step 5 — Testing the connection
- After creating the connection, we can test it by navigating to ‘Actions’ and then clicking on ‘Test Connection’.
- We need to provide an IAM role that we create for Glue. This role grants Glue the permissions it needs to access AWS resources. Once the role is supplied, Glue attempts to connect to the database and reports whether the test succeeded or failed.
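Creating the connection can also be scripted instead of clicked through. Below is a minimal sketch of the request body that boto3's `glue.create_connection()` accepts; the connection name, JDBC URL, and secret name are placeholders of my own choosing, not values from this walkthrough:

```python
def build_jdbc_connection_input(name, jdbc_url, secret_id):
    """Return a ConnectionInput dict for glue.create_connection().

    Credentials are referenced from AWS Secrets Manager via SECRET_ID
    rather than being embedded directly in the connection properties.
    """
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "SECRET_ID": secret_id,  # name of the secret in Secrets Manager
        },
    }

conn_input = build_jdbc_connection_input(
    "my-postgres-conn",                                      # placeholder name
    "jdbc:postgresql://postgresql.db.server:5432/my_database",
    "my-db-credentials",
)
# With boto3, this would then be submitted as:
#   boto3.client("glue").create_connection(ConnectionInput=conn_input)
```

Keeping credentials behind `SECRET_ID` means rotating the password in Secrets Manager requires no change to the connection itself.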
Multiple Data Source Connection
A Glue connection opens the door to smooth data management. With it, we can create a Glue Crawler that catalogs metadata in the Glue Data Catalog as structured databases and tables. The connection then becomes our gateway to real data insights: a Glue Job can read through it, transform the raw data, and load it into a target destination (e.g., a data lake or data warehouse).
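As a sketch of that flow, here is roughly what the crawler request looks like when it reads through an existing Glue connection. All names, the role ARN, and the include path are hypothetical:

```python
def build_crawler_request(crawler_name, role_arn, database, connection_name, path):
    """Request body for glue.create_crawler(): a JDBC target that reads
    through the named Glue connection and writes metadata into DATABASE."""
    return {
        "Name": crawler_name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "JdbcTargets": [
                {"ConnectionName": connection_name, "Path": path}
            ]
        },
    }

req = build_crawler_request(
    "my-crawler",
    "arn:aws:iam::123456789012:role/GlueServiceRole",  # placeholder role
    "my_database",
    "my-conn",          # the Glue connection created earlier
    "my_database/%",    # include path: every table in the schema
)
# boto3.client("glue").create_crawler(**req) would create the crawler; once it
# has run, a Glue Job can read the cataloged tables and load the target store.
```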
1). Connecting with MongoDB, MongoDB Atlas, and DocumentDB.
To connect with MongoDB, MongoDB Atlas, or DocumentDB, we need to provide the connection URL in one of the following formats:
MongoDB Connection URL → mongodb://<hostname>:<port>/<database>
DocumentDB Connection URL → mongodb://<hostname>:<port>/<database>
MongoDB Atlas Connection URL → mongodb+srv://<cluster>.mongodb.net/<database>
Port No → 27017
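The three URL shapes above differ only in scheme and host, so a small helper can assemble them. This is purely illustrative; the host, cluster, and database names are made up:

```python
def mongo_connection_url(database, hostname=None, port=27017, atlas_cluster=None):
    """Build a MongoDB/DocumentDB or MongoDB Atlas connection URL."""
    if atlas_cluster:
        # Atlas SRV URLs carry no port; DNS SRV records supply the hosts.
        return f"mongodb+srv://{atlas_cluster}.mongodb.net/{database}"
    return f"mongodb://{hostname}:{port}/{database}"

print(mongo_connection_url("sales", hostname="docdb.example.com"))
# mongodb://docdb.example.com:27017/sales
print(mongo_connection_url("sales", atlas_cluster="cluster0.abc12"))
# mongodb+srv://cluster0.abc12.mongodb.net/sales
```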
2). Connecting with on-premises Database using JDBC URL.
Connecting to an on-premises database over JDBC (Java Database Connectivity) is the fundamental step in linking our application to a locally hosted database. Craft the JDBC URL by specifying the appropriate protocol, host, port, and database details.
JDBC Connection URL → jdbc:<protocol>://<hostname>:<port>/<database>
Examples
MySQL JDBC Connection URL → jdbc:mysql://mysql.db.server:3306/my_database
Oracle JDBC Connection URL → jdbc:oracle:thin:@//myoracle.db.server:1521/my_servicename
PostgreSQL JDBC Connection URL → jdbc:postgresql://postgresql.db.server:5432/my_database
Microsoft SQL Server JDBC Connection URL → jdbc:sqlserver://mssql.db.server\mssql_instance:1433;databaseName=my_database
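Since these URLs follow fixed templates, a helper can compose them from their parts. The templates below are simplified versions of the examples above (the SQL Server one omits the named-instance segment); hosts and database names are placeholders:

```python
# Simplified JDBC URL templates per engine, mirroring the examples in the text.
JDBC_TEMPLATES = {
    "mysql": "jdbc:mysql://{host}:{port}/{database}",
    "postgresql": "jdbc:postgresql://{host}:{port}/{database}",
    "oracle": "jdbc:oracle:thin:@//{host}:{port}/{database}",
    "sqlserver": "jdbc:sqlserver://{host}:{port};databaseName={database}",
}

def jdbc_url(engine, host, port, database):
    """Compose a JDBC URL for one of the supported engines."""
    return JDBC_TEMPLATES[engine].format(host=host, port=port, database=database)

print(jdbc_url("mysql", "mysql.db.server", 3306, "my_database"))
# jdbc:mysql://mysql.db.server:3306/my_database
```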
3). Connecting with AWS databases like Aurora, MariaDB, MySQL, Microsoft SQL Server, Oracle Database, and PostgreSQL.
Connecting to AWS-hosted SQL databases, including Aurora, MariaDB, MySQL, Microsoft SQL Server, Oracle Database, and PostgreSQL, is straightforward. The database instance, name, and username are fetched automatically, which simplifies the setup for databases already provisioned on the AWS platform.
4). Connect to Amazon OpenSearch Service instances.
To connect to an OpenSearch cluster, provide the domain endpoint and the cluster’s port number to establish the connection.
OpenSearch Domain Endpoint → https://test-abcdefghijklmnopqrstuvw33xyz.ap-south-1.es.amazonaws.com
Port No → 443
5). Connect to Amazon Redshift.
To establish a connection with the Redshift data warehouse on the AWS platform, you need to input the cluster name, database name, username, and password. The cluster name can be conveniently selected from a dropdown menu.
6). Connect to streaming data in Apache Kafka.
For Kafka connections, there are two types: Amazon Managed Streaming for Apache Kafka (MSK) and customer-managed Apache Kafka. In both cases, the Kafka bootstrap server URLs must be provided. If the Kafka cluster lives in another AWS account, Amazon MSK still allows the connection; if the cluster is not hosted on AWS at all, choose customer-managed Apache Kafka.
Kafka bootstrap server URLs →
b-1.vpc-test-2.o4q88o.c6.kafka.ap-south-1.amazonaws.com:9094,
b-2.vpc-test-2.o4q88o.c6.kafka.ap-south-1.amazonaws.com:9094,
b-3.vpc-test-2.o4q88o.c6.kafka.ap-south-1.amazonaws.com:9094
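In a Glue streaming job, those brokers are typically passed as one comma-separated string in the connection options. Below is a sketch using the option keys from Glue’s Kafka reader; the topic name is hypothetical:

```python
brokers = [
    "b-1.vpc-test-2.o4q88o.c6.kafka.ap-south-1.amazonaws.com:9094",
    "b-2.vpc-test-2.o4q88o.c6.kafka.ap-south-1.amazonaws.com:9094",
    "b-3.vpc-test-2.o4q88o.c6.kafka.ap-south-1.amazonaws.com:9094",
]

def kafka_options(brokers, topic, starting_offsets="earliest"):
    """Connection options for a Glue Kafka read: one comma-separated
    bootstrap-server string, the topic, and where to start consuming."""
    return {
        "bootstrap.servers": ",".join(brokers),
        "topicName": topic,
        "startingOffsets": starting_offsets,
    }

opts = kafka_options(brokers, "clickstream")  # "clickstream" is a placeholder topic
# In a Glue streaming job, opts would be passed as connection_options to
# glueContext.create_data_frame.from_options(connection_type="kafka", ...)
```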
7). Connect to Azure Cosmos DB [NoSQL].
For the connection, we need to provide the Azure Cosmos DB account endpoint URI. The username and password can be stored in Secrets Manager and then selected from a dropdown menu to establish the connection.
Cosmos DB Endpoint URI → https://ResourceName.documents.azure.com:443/
8). Connect to Azure SQL Database.
Connecting requires the Azure SQL Server JDBC URL, with the option to store the username and password securely in Secrets Manager. This makes setup easy: just select the credentials from a dropdown menu.
Azure SQL Server URL → jdbc:sqlserver://mssql.db.server\mssql_instance:1433;databaseName=my_database
9). Connect to Google BigQuery.
For connecting to Google BigQuery, the credentials must be added to Secrets Manager first and then selected from a dropdown menu for the connection.
10). Configure network access for Glue.
To connect to a database on a different network, we need to provide the VPC, subnet, and security group associated with that database. Here we do not need to provide a connection URL, username, or password; those details can be supplied when reading the data in a Glue job.
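For a Network-type connection, the request body stores only the VPC placement details described above. A minimal sketch with placeholder IDs:

```python
def build_network_connection_input(name, subnet_id, security_group_ids, az):
    """ConnectionInput for a NETWORK connection: no URL or credentials,
    only the subnet, security groups, and Availability Zone to attach to."""
    return {
        "Name": name,
        "ConnectionType": "NETWORK",
        "ConnectionProperties": {},  # URL/credentials come later, in the job
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": security_group_ids,
            "AvailabilityZone": az,
        },
    }

net_conn = build_network_connection_input(
    "net-conn", "subnet-0abc", ["sg-0abc"], "ap-south-1a"  # placeholder IDs
)
```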
11). Connect to SAP HANA Database.
For connecting to the SAP HANA database, the credentials must be added to Secrets Manager first and then selected from a dropdown menu for the connection.
SAP HANA URL → jdbc:sap://exampleDatabaseServer:30015/?databaseName=sampleDatabase
12). Connect to Snowflake.
Establishing a connection with Snowflake starts with adding the credentials to Secrets Manager. You can then conveniently select them from a dropdown menu to complete the connection.
Snowflake URL → jdbc:snowflake://<account_name>.snowflakecomputing.com/?db=<database_name>&warehouse=<warehouse_name>&role=<role_name>&schema=<schema_name>
13). Connect to Teradata Vantage.
Connecting to Teradata likewise starts with adding the credentials to Secrets Manager. You can then easily choose them from a dropdown menu to finalize the connection.
Teradata URL → jdbc:teradata://myserver.com/DATABASE=mydatabase,TMODE=TERA
14). Connect to Vertica.
Establishing a connection with Vertica begins with storing the credentials in Secrets Manager. Afterward, you can select the required credentials from a dropdown menu to complete the connection.
Vertica Hostname → Public IP or hostname of our Vertica Server.
Port → 5433.
If you’re interested in exploring ways to write Glue jobs beyond Glue connections, I’ve written a separate blog on that topic. You can check it out here.
Conclusion
In this blog, we explored the process of establishing a Glue connection, emphasizing the multifaceted nature of connecting with various data sources. We delved into the specific requirements for each data source, including crucial details such as hostname, port, cluster name, instance name, username, and password. Additionally, we addressed the necessary permissions required for the IAM role, enabling Glue to seamlessly read data from the specified data sources.