Get started with Athena federated queries and DocumentDB.
One of the main challenges in today’s architecture is obtaining insights from multiple data sources. In these architectures, it is pretty common to see a combination of different databases: in memory, key-value, document, relational, columnar, and many more. Athena is a query engine that allows us to interact and analyze structured, semi-structured, and unstructured data. With federated queries from Athena, we can now query all these databases and store their results in S3.
This post will show you how to configure federated Athena queries to interact with multiple DocumentDB clusters and centralize all the information for a more detailed analysis.
In this article, we will discuss the below points:
1- How to create the Athena connector
2- Lambda Connectivity Function
3- Deployments of our Lambda Function
4- Queries in Athena
5- Learned lessons
How to create the Athena connector
Next, I show you the connector steps; first, we must go to the Athena console.
1- Click on Data Sources
2- You should select the Query a data source option and select the origin database for your use case; in my case is Amazon DocumentDB.
3- In this step, we configure the new Lambda function to connect with our database. It will take us to configure the function with the environment variables specified later to achieve connectivity.
Lambda Connectivity Function
To create our lambda function, we need to fill in the specified fields, which I will explain each one:
Configuration Part # 1
- Application name: Name the application in SAM, and you could see it in Lambda > Applications.
- SecretNameOrPrefix: When the connection strings to the database are in the Secrets Manager service, we can specify either a prefix of the secret or the name.
- spill_bucket: Bucket that will store the results of the queries; these will remain in CSV. In the picture below, you can see an example.
- AthenaCatalogName: is the name of the Athena catalog. The Lambda function will have the same name. In the picture below, you can see an example.
- disable_spill_encryption: It can be True or False; it is to encrypt the data stored in the S3 bucket; keep in mind that you want better performance.
Configuration Part # 2
- DocDBConnectionString: In this field, we need to specify the connection string to our database or refer to the connection string's secret. If you are going to use a secret, keep in mind that the role of the Lambda function must have the necessary permissions to get it and decrypt it.
- LambdaMemory: Memory of the lambda function, keep in mind that many queries to the source DB can be complex and require execution time. For this reason, it is essential to review the function and its consumption to adjust the memory.
- LambdaTimeout: It is the timeout configured for the function. It is crucial to monitor the Lambda in production. However, I suggest starting with a high value. In this case, I started with 900. See the following logs when we query a small table in DocumentDB.
REPORT RequestId: f94b7b28–12e4–44b9-b0ee-84aca25bbc2c Duration: 417.61 ms Billed Duration: 418 ms Memory Size: 3008 MB Max Memory Used: 201 MBREPORT RequestId: 00a578c2-c043–491e-8acc-cd497bac801d Duration: 9637.66 ms Billed Duration: 9638 ms Memory Size: 3008 MB Max Memory Used: 195 MBREPORT RequestId: 6067415d-94dc-45d4-a9ed-dc22cef1a465 Duration: 8174.79 ms Billed Duration: 8175 ms Memory Size: 3008 MB Max Memory Used: 192 MB
- SecurityGroupIds: You must specify the ID of the security group that we will associate with our role.
- spill_prefix: It is the name of the subfolder in the bucket where the responses will be stored.
- SubnetIds: You need to specify the ID of the security group that we will associate with our role.
The Lambda function must be in a VPC; this is important because we must take into account several components and their configuration:
- Security Groups: it is essential in this case that the SG of the DocumentDB database allows the traffic of the SG of the Lambda function through port 27017.
- Routing Table: We have to decide how our Lambda function will access the S3 bucket to store the data.
- Lambda to S3 over the internet: For this to happen, the Lambda must be in a private subnet, and its routing table must send the traffic to the internet through a NAT Gateway.
- Lambda to S3 without going online: We need to create a VPC endpoint to access S3 and direct the traffic to S3 through this endpoint in the routing table. For this scenario, Lambda does not need to go online.
- Language: Java 8.
- Deployment: The Lambda function is deployed using SAM; this process is automatic once we fill in all the requested information.
- Role: The role of Lambda function must have access to the S3 to store the query results.
- Size: The lambda package weighs approximately 25Mb.
- Code: The lambda code is generated automatically. When we try to see the code, we can not do it, and the code comes from a bucket outside the account in the SAM Template.
Deployments of our Lambda Function
To make changes to our Lambda function, we can create a CICD with AWS tools like CodePipeline, CodeBuild, and CodeDeploy. AWS behind the scenes creates the entire infrastructure with a SAM template like the following:
To create a CICD for a serverless application, you can look at the following link. However, this is the first version of the application deployment in the console.
Queries in Athena
After doing all the necessary configuration, in Athena, we can consult our databases in DocumentDB as follows:
SELECT * FROM “source_data_name”. “Data_base”. “Table” limit 10;
From Athena, you can also query multiple data sources from different database engines. For more information, see the Athena Federated Query Documentation.
- It is an outstanding feature; however, you must create its connector for each connection to a different DB; this will give you several Lambda functions depending on the amount of data sources.
- You can integrate Athena with Quicksight to create visual dashboards with the required information.
- The security of architecture like the one shown is essential; that is why it is mandatory to secure all the components, guarantee encryption at rest, limit the permissions of the Lambda Role and always use Secrets Manager to connect to the data source.
- When configuring this feature for the first time, it is vital to always look at the execution logs of the Lambda function and, based on these results, adjust the Memory and timeout values.