Analytical insights on heterogeneous data sources in the AWS Ecosystem

Subhash Burramsetty
4 min read · Sep 10, 2020

Introduction

When building web applications and microservices today, we can choose from many kinds of databases and pick the one that best fits each use case. As a result, the data often ends up split across several stores, which makes it difficult to merge data between those databases and generate insights on a near-real-time basis.

Scenario

Assume you have to generate insights from data stored in databases such as DynamoDB and DocumentDB, which are updated very frequently, together with analytical data stored on S3 in Parquet format.

  1. What would be your approach if you want to query all of these data stores and join them using SQL-like constructs to make life easier for data analysts?
  2. What would be your approach if you have to build charts in a BI tool dashboard that is powered by data merged from these data stores?
  3. What would be your approach if changes to the DynamoDB or DocumentDB data have to be reflected in your BI dashboards easily?
  4. What would be your approach if you have to achieve the above use cases in a serverless fashion with the least overhead and without buying additional connectors from third-party providers?

This is where Amazon Athena Federated Query comes to the rescue.

Athena Federated Query

Athena Federated Query (Preview) allows you to run SQL queries across data stored in relational, non-relational, key-value, document and custom data sources. It runs federated queries through data source connectors that execute on Lambda behind the scenes, so users can JOIN data across data stores using familiar SQL constructs.
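As a rough sketch of what such a query can look like, the snippet below builds a JOIN between a table reached through a DynamoDB connector and a Parquet-backed table in the Glue catalog, then submits it. The `"lambda:<function name>"` catalog prefix follows the form used in the AWS blog post listed in the references. Everything else is an assumption made up for illustration: the connector function name `dynamo-connector`, the databases, tables and columns, and the helper names `federated_join_sql` and `start_query`.

```python
def federated_join_sql(connector_function: str) -> str:
    """Build a SQL string that joins a federated table with an S3-backed table."""
    # "lambda:<name>" addresses the connector Lambda function as a catalog.
    catalog = f'"lambda:{connector_function}"'
    return (
        "SELECT o.order_id, o.status, s.total_amount\n"
        f"FROM {catalog}.default.orders AS o\n"           # hypothetical DynamoDB table
        "JOIN awsdatacatalog.analytics.order_summary AS s\n"  # hypothetical Parquet table
        "  ON o.order_id = s.order_id\n"
        "WHERE o.status = 'SHIPPED'"
    )


def start_query(athena_client, sql: str, output_location: str) -> str:
    """Submit the query and return its execution id.

    athena_client is expected to behave like boto3.client("athena").
    """
    response = athena_client.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

With boto3 installed and credentials configured, this could be called as `start_query(boto3.client("athena"), federated_join_sql("dynamo-connector"), "s3://your-results-bucket/prefix/")`, where the bucket is again a placeholder.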

Athena Federated Query Ecosystem

Athena Federated Query uses data connectors to interact with various data sources. Each connector is specific to one data source and is deployed as a Lambda function. When a federated query runs against a data source, Athena invokes multiple Lambda functions behind the scenes to fan out, fetching metadata and data with high parallelism, and uses S3 buckets as spill storage while merging the data.
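To make the fan-out and spill idea more concrete, here is a toy simulation in plain Python. It is not Athena's implementation: threads stand in for parallel Lambda invocations, a dict stands in for the S3 spill bucket, and the threshold of 3 rows is deliberately tiny. The names `read_split`, `run_fanout` and `SPILL_THRESHOLD` are invented for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

SPILL_THRESHOLD = 3  # rows; tiny on purpose so the toy data triggers a spill


def read_split(split_id, rows, spill_store):
    """Stand-in for one connector invocation reading one split."""
    if len(rows) > SPILL_THRESHOLD:
        key = f"spill/{split_id}"
        spill_store[key] = rows      # pretend this is an object written to S3
        return {"spilled": key}      # only a pointer goes back in the response
    return {"inline": rows}          # small results are returned directly


def run_fanout(splits):
    """Read all splits in parallel, then merge inline and spilled results."""
    spill_store = {}                 # dict standing in for the spill bucket
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [
            pool.submit(read_split, split_id, rows, spill_store)
            for split_id, rows in splits.items()
        ]
        responses = [future.result() for future in futures]

    merged = []
    for response in responses:
        if "spilled" in response:
            merged.extend(spill_store[response["spilled"]])
        else:
            merged.extend(response["inline"])
    return merged, spill_store
```

Calling `run_fanout({"a": [1, 2], "b": [3, 4, 5, 6]})` returns the rows of both splits merged in submission order, with only split `b` passing through the spill store.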

Depending on the data source and connector, it also supports the following features to varying degrees:

  • Partition Pruning
  • Column Projection
  • Predicate Pushdown
  • Limited Scans
  • Congestion Control
  • Parallelized and Pipelined Reads
  • AWS Secrets Manager Integration
  • Glue DataCatalog Support
  • Federated Scalar Batch Functions
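Three of these features (predicate pushdown, column projection and limited scans) share one idea: do the filtering and trimming at the source instead of shipping everything back to the query engine. The toy function below illustrates that idea on a list of dicts. It is only a conceptual sketch, and `scan` with its parameters is a hypothetical name, not part of any connector.

```python
def scan(rows, columns, predicate=None, limit=None):
    """Toy source-side scan: filter, project, and stop early."""
    out = []
    for row in rows:
        if limit is not None and len(out) >= limit:
            break                                  # limited scan: stop reading early
        if predicate is not None and not predicate(row):
            continue                               # predicate pushdown: drop row at the source
        out.append({c: row[c] for c in columns})   # column projection: keep requested columns
    return out


orders = [
    {"id": 1, "status": "PENDING", "amount": 10},
    {"id": 2, "status": "SHIPPED", "amount": 25},
    {"id": 3, "status": "SHIPPED", "amount": 40},
]
shipped_ids = scan(orders, ["id"], predicate=lambda r: r["status"] == "SHIPPED", limit=1)
```

Here only one row with one column leaves the "source", even though the table holds three rows with three columns each.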

AWS has currently written integrations for more than 20 databases and storage formats. Athena Federated Query provides ready-made connectors for the following data sources:

  1. Redis
  2. Elasticsearch
  3. DynamoDB
  4. DocumentDB
  5. MySQL
  6. PostgreSQL
  7. HBase
  8. Redshift
  9. CloudWatch Logs
  10. CloudWatch Metrics
  11. AWS CMDB Connector (to communicate with various AWS services such as EC2, EMR, etc.)

You can also modify these connectors using the code from the awslabs GitHub repository, for example to add functionality such as UDFs, and redeploy them in AWS for use with Athena.

The list above is a starting point, and we can expect AWS to provide more connectors in the future. In the meantime, if you have to connect to a data source that is not in the list, you also have the option to create your own custom data connector.
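The connector SDK in the awslabs repository is written in Java, so the Python below is not its API. It is only a hedged illustration of the two jobs a custom connector has to do: describe the data (tables, schema, how to split the work) and read the records for one split. The class and method names (`ToyMetadataHandler`, `ToyRecordHandler`, `get_splits` and so on) are hypothetical, and an in-memory dict plays the role of the data source.

```python
class ToyMetadataHandler:
    """Answers: which tables exist, what is their schema, how is the work split?"""

    def __init__(self, source):
        self.source = source  # {table_name: [row_dict, ...]}, stand-in data source

    def list_tables(self):
        return sorted(self.source)

    def get_schema(self, table):
        rows = self.source[table]
        return sorted(rows[0]) if rows else []

    def get_splits(self, table, split_size=2):
        # Each split is a (start, end) slice that can be read independently.
        total = len(self.source[table])
        return [
            (start, min(start + split_size, total))
            for start in range(0, total, split_size)
        ]


class ToyRecordHandler:
    """Reads the rows that belong to a single split."""

    def __init__(self, source):
        self.source = source

    def read_split(self, table, split):
        start, end = split
        return self.source[table][start:end]
```

Because every split can be read on its own, each one could be handed to a separate invocation, which is what makes the parallel fan-out described earlier possible.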

Constraints and Limitations

  1. You cannot use views with the federated data sources.
  2. Athena Federated Query relies on Lambda invocations, and the maximum number of parallel invocations depends on the Lambda concurrency limits enforced in your account. If multiple queries are running at the same time, all of them share, and are capped by, those concurrency limits.
  3. An Athena query has a maximum timeout of 30 minutes, but the Lambda functions used by federated queries behind the scenes come with a maximum timeout of 15 minutes and a maximum memory of 3 GB (at the time of writing). As long as the source system you are federating to supports partitioning or parallel scans, Athena will use multiple invocations to extend the maximum runtime and memory available to the connector.
  4. If your data source is inside a VPC, make sure to deploy the connectors as Lambda functions in the appropriate VPC.
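To get a feel for how the concurrency cap in point 2 plays out, here is a back-of-the-envelope calculation. It rests on simplifying assumptions that are mine, not Athena's documented behaviour: one Lambda invocation per split, and the account's concurrency shared evenly between the queries running at the same time. The function name `invocation_waves` and the numbers in the usage example are purely illustrative.

```python
import math


def invocation_waves(total_splits: int, lambda_concurrency: int,
                     concurrent_queries: int = 1) -> int:
    """Rough number of sequential 'waves' of invocations one query needs."""
    if lambda_concurrency < 1 or concurrent_queries < 1:
        raise ValueError("concurrency and query count must be at least 1")
    # Assumption: concurrency is divided evenly across running queries.
    per_query = max(1, lambda_concurrency // concurrent_queries)
    # Assumption: one invocation handles exactly one split.
    return math.ceil(total_splits / per_query)


alone = invocation_waves(total_splits=1000, lambda_concurrency=1000)
shared = invocation_waves(total_splits=1000, lambda_concurrency=1000,
                          concurrent_queries=4)
```

Under these assumptions, a query that could read all of its splits in a single wave when running alone needs four waves when three other queries compete for the same concurrency.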

Summary

Athena Federated Query is an amazing feature that allows data analysts and engineers to derive analytical insights from heterogeneous data sources using SQL constructs, and it integrates well with other services in the AWS ecosystem.

References

  1. https://aws.amazon.com/blogs/big-data/query-any-data-source-with-amazon-athenas-new-federated-query/
  2. https://github.com/awslabs/aws-athena-query-federation
  3. https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-federation-sdk
  4. https://github.com/awslabs/aws-athena-query-federation/wiki/FAQ
