Advanced Serverless Techniques IV: AWS Athena for Serverless Data Analysis

The SaaS Enthusiast
9 min read · Feb 15, 2024

The ability to swiftly analyze events can unlock invaluable insights for businesses, driving better decision-making and enhancing customer experiences. This article dives into how integrating AWS Athena with serverless architectures provides a robust, scalable solution for dissecting vast volumes of payment data. By leveraging the serverless power of AWS combined with the analytical prowess of Athena, we explore a seamless pathway from raw event data to actionable business intelligence, setting a new standard for payment analytics efficiency and effectiveness.

In the rapidly digitalizing economy, businesses across sectors are facing an increasing need to process, analyze, and draw insights from their transactional data in real time. This capability is not just a competitive advantage but a necessity for informed decision-making, customer satisfaction, and operational efficiency. The solution outlined in this article caters to these critical business requirements by providing a framework for analyzing payment event data.

Scenarios Where This Solution Shines

  1. Financial Reporting: Financial teams require up-to-date information to monitor cash flow, track revenue, and manage budgets effectively. By analyzing payment events as they occur, companies can generate real-time financial reports, enabling swift responses to financial trends and anomalies. This scenario underscores the need for a system that can not only handle large volumes of data but also provide insights almost instantaneously.
  2. Customer Behavior Analysis: Understanding how, when, and why customers make payments is crucial for tailoring marketing strategies, improving customer experiences, and optimizing product offerings. This solution allows businesses to aggregate payment data by various dimensions, such as time, product, or customer segment, revealing patterns that can inform strategic decisions. For instance, identifying peak hours for payments can help in resource allocation for customer support, while understanding popular payment concepts can guide product development and promotional efforts.

Financial Reporting

DynamoDB Table Design for Payment Events

Consider a table named PaymentEvents that tracks each payment transaction. Here's a suggested schema:

  • Partition Key: paymentId (string) - A unique identifier for each payment.
  • Sort Key: eventTimestamp (string or number) - The timestamp when the event occurred.

Additional Attributes:

  • amount (number) - The amount of the payment.
  • concept (string) - The reason or concept for the payment.
  • createdBy (string) - Identifier of who initiated the payment.
  • updatedBy (string) - Identifier of who last updated the payment record.
  • status (string) - The current status of the payment (e.g., completed, pending, cancelled).
  • currency (string) - The currency of the payment amount.
  • type (string) — Credit card, cash, debit card
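A single item in this table might look like the following sketch (all values are illustrative):

```javascript
// An illustrative PaymentEvents item matching the schema above.
const samplePaymentEvent = {
  paymentId: 'pay_8f3a2c',      // partition key
  eventTimestamp: '1708041600', // sort key (Unix timestamp as a string)
  amount: 49.99,
  concept: 'monthly-subscription',
  createdBy: 'user_123',
  updatedBy: 'user_123',
  status: 'completed',
  currency: 'USD',
  type: 'credit card',
};

console.log(samplePaymentEvent.status); // → completed
```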

Aggregating and Storing Data in S3

Before you can query your data with Athena, you need to aggregate your DynamoDB data and store it in Amazon S3, as discussed previously. Assume this data is stored in JSON format in S3, with a file structure organized by date (e.g., payments/2024-02-15/payments.json).

Creating the bucket

You can create a new bucket in your serverless.yml. The following configuration creates the bucket and exposes its name to your functions as the environment variable process.env.INFRA_BUCKET.

provider:
  environment:
    INFRA_BUCKET: ${self:custom.infraBucket}
  iamRoleStatements:
    - Effect: Allow
      Action:
        - s3:PutObject # add any other S3 actions you need
      Resource:
        - arn:aws:s3:::${self:custom.infraBucket}
        - arn:aws:s3:::${self:custom.infraBucket}/*

custom:
  infraBucket: infra-${self:custom.stage}-uvt

resources:
  Resources:
    InfraBucket:
      Type: 'AWS::S3::Bucket'
      Properties:
        BucketName: ${self:custom.infraBucket}

Daily Payments Backup

You can schedule an AWS Lambda function that runs every day at a set time.

backup-today-payments:
  handler: functions/backup-today-payments.handler
  memorySize: 1600
  timeout: 120
  events:
    - schedule:
        name: infra-${self:custom.stage}-backup-today-payments
        description: 'Scheduled event to backup today payments every day at 5 PM PST'
        rate: cron(0 1 * * ? *) # 5 PM PST is 1 AM UTC
        enabled: true

The AWS Lambda function will query today’s payments and store them in the bucket.

Getting today’s date with the right format:

// friendlyDate is today's date in the format YYYY-MM-DD (Pacific time)
const date = new Date();
const options = { timeZone: 'America/Los_Angeles', year: 'numeric', month: '2-digit', day: '2-digit' };
const pstDate = date.toLocaleDateString('en-US', options); // MM/DD/YYYY
const [month, day, year] = pstDate.split('/');
const friendlyDate = `${year}-${month}-${day}`;
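As a sketch, this date formatting can be wrapped in a small helper (the name getFriendlyDate is illustrative, not part of the original code) so it can be checked against a fixed date:

```javascript
// Returns a YYYY-MM-DD string for the given Date in the given IANA time zone.
// getFriendlyDate is an illustrative helper name.
function getFriendlyDate(date = new Date(), timeZone = 'America/Los_Angeles') {
  const options = { timeZone, year: 'numeric', month: '2-digit', day: '2-digit' };
  const [month, day, year] = date.toLocaleDateString('en-US', options).split('/');
  return `${year}-${month}-${day}`;
}

// Noon UTC on Feb 15, 2024 is 4 AM in Los Angeles, still Feb 15.
console.log(getFriendlyDate(new Date('2024-02-15T12:00:00Z'))); // → 2024-02-15
```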

Querying the items:

const { Items: payments } = await dynamoClient.dynamodbQueryHelper({
  TableName: TesoreriaTableName,
  HashKeyName: "friendlyDate",
  HashKeyValue: friendlyDate,
  IndexName: FriendLyDateIndex,
});

Creating the JSON and uploading it to the S3 bucket:

// 2. Create JSON file with today's payments
const paymentsJson = JSON.stringify(payments);

// 3. Upload JSON file to S3 bucket, keyed by date as described above
const Key = `payments/${friendlyDate}/payments.json`;
await s3Client.putObject({
  Bucket,
  Key,
  Body: paymentsJson,
});
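One caveat worth noting: the OpenX JSON SerDe that Athena uses below expects one JSON object per line (newline-delimited JSON), not a single JSON array. A small sketch of serializing the items that way before upload (the sample items are illustrative):

```javascript
// Athena's JsonSerDe parses one JSON object per line, so serialize the
// array of payment items as newline-delimited JSON (NDJSON) rather than
// as one JSON array.
const payments = [
  { paymentId: 'p-1', amount: 25.5, concept: 'subscription', status: 'completed' },
  { paymentId: 'p-2', amount: 10.0, concept: 'one-time', status: 'pending' },
];

const paymentsNdjson = payments.map((p) => JSON.stringify(p)).join('\n');

console.log(paymentsNdjson.split('\n').length); // → 2
```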

Setting Up AWS Athena

  1. Create a Database: In Athena, start by creating a new database to organize your tables. You can do this from the Athena query editor in the AWS Management Console.
  2. Define the Table Schema: Next, you need to define a table that matches the structure of your JSON data in S3. Here’s how you might define the schema in Athena for the PaymentEvents data:
CREATE EXTERNAL TABLE IF NOT EXISTS payment_events (
  paymentId string,
  eventTimestamp string,
  amount double,
  concept string,
  createdBy string,
  updatedBy string,
  status string,
  currency string,
  `type` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  "serialization.format" = "1"
)
LOCATION 's3://your-bucket-name/payments/';

Replace your-bucket-name with the name of your S3 bucket.

Query Examples in AWS Athena

Once your data is in S3 and you’ve set up your Athena table, you can start querying it. Here are some example queries covering common reporting needs:

Total Payments Grouped by Concept:

SELECT concept, COUNT(*) AS total_payments, SUM(amount) AS total_amount
FROM payment_events
GROUP BY concept

Payments Grouped by Creator:

SELECT createdBy, COUNT(*) AS total_payments, SUM(amount) AS total_amount
FROM payment_events
GROUP BY createdBy

Finding the Peak Hour for Payments:

SELECT HOUR(from_unixtime(cast(eventTimestamp as double))) AS hour, COUNT(*) AS total_payments
FROM payment_events
GROUP BY HOUR(from_unixtime(cast(eventTimestamp as double)))
ORDER BY total_payments DESC
LIMIT 1

This query assumes eventTimestamp is stored as a Unix timestamp. Adjust the conversion if your timestamp format differs.

Payments Status Counts:

SELECT status, COUNT(*) AS status_count
FROM payment_events
GROUP BY status

Additional Considerations

  • Data Partitioning: For performance and cost optimization, consider partitioning your data in S3 based on frequently queried attributes like date or createdBy. Update your Athena table definition to reflect this partitioning.
  • Athena Costs: Be mindful of the costs associated with Athena, which are based on the amount of data scanned by each query. Partitioning and compressing your data can help reduce costs.
  • Data Types and Formats: Ensure your data types in the Athena table definition match those of your JSON data. Athena’s ability to query your data correctly depends on this alignment.
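As a sketch of the first point, the table could be declared with a date partition and pointed at the date-keyed folder layout already used for the backups (the table name, column subset, and partition name dt are illustrative):

CREATE EXTERNAL TABLE IF NOT EXISTS payment_events_partitioned (
  paymentId string,
  amount double,
  status string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-bucket-name/payments/';

-- Register a day's partition after each backup run:
ALTER TABLE payment_events_partitioned
ADD IF NOT EXISTS PARTITION (dt = '2024-02-15')
LOCATION 's3://your-bucket-name/payments/2024-02-15/';

Queries that filter on dt then scan only the matching folders, which reduces both latency and the per-query cost.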

By following this guide, you can set up AWS Athena to analyze payment event data, extending a serverless architecture with powerful querying and analytics capabilities.

AWS Glue

Why AWS Glue is Not Used Here

In our exploration of leveraging AWS Athena for analyzing payment events in a serverless architecture, AWS Glue does not play a central role. The primary reason for this is the nature and structure of our data processing and analysis needs. AWS Athena, in conjunction with AWS Lambda and Amazon S3, provides a robust and efficient solution for querying and analyzing structured data directly from S3, based on predefined schemas that match our data format (e.g., JSON from DynamoDB Streams). Since our focus is on direct, ad-hoc querying of structured data without the need for complex data transformations or the orchestration of multiple data sources, AWS Glue’s powerful ETL (Extract, Transform, Load) capabilities are not required in this context.

Scenarios Where AWS Glue is Essential

However, AWS Glue excels in scenarios where businesses face complex data integration challenges or require advanced data processing pipelines. Here are two examples where AWS Glue becomes indispensable:

  1. Integrating Multiple Data Sources: For businesses that need to analyze data coming from diverse sources and formats (e.g., logs from web servers, sales data from relational databases, customer data from NoSQL databases), AWS Glue can automate the discovery, preparation, and combination of this data into a unified format for analysis. Its ETL capabilities allow for the extraction of data from these varied sources, transformation of the data into a consistent schema, and loading of the transformed data into a data store like Amazon Redshift or S3 for analytics. This is particularly relevant for companies undergoing digital transformation initiatives that require consolidation of legacy data with new data streams for comprehensive analytics.
  2. Complex Data Processing Workflows: In scenarios where data must undergo multiple, complex transformations before analysis — such as cleansing, deduplication, normalization, or joining with other datasets — AWS Glue provides a managed environment to design, implement, and monitor these ETL jobs. For example, a marketing analytics firm might need to enrich clickstream data with demographic information from another system, requiring sophisticated processing to align the datasets on common keys and aggregate the data for segmentation and targeting purposes.

While AWS Glue is a powerful service for complex ETL jobs and data integration tasks, it is not leveraged in our current serverless architecture due to the straightforward nature of our data querying and analysis needs. However, businesses facing more intricate data challenges or those needing to orchestrate diverse data sources and transformations will find AWS Glue an essential component of their data processing and analytics strategy. By understanding when and where to employ services like AWS Glue, organizations can craft a data architecture that best fits their unique requirements, ensuring they can extract maximum value from their data assets.

Conclusion: Harnessing the Power of AWS for Seamless Serverless Architectures

The exploration of AWS services such as Athena, Glue, S3, and DynamoDB reveals a comprehensive ecosystem capable of addressing a wide array of data processing, storage, and analysis needs within a serverless architecture. Each service plays a pivotal role in enabling businesses to design highly scalable, efficient, and cost-effective solutions that cater to specific operational requirements, from straightforward data querying to complex ETL workflows.

AWS Athena stands out as a powerful query service that allows direct analysis of data stored in S3, using standard SQL without the need for server management. Its integration with serverless architectures empowers businesses to perform ad-hoc data analysis at scale, providing insights into data with simplicity and speed. Athena is particularly beneficial for scenarios requiring immediate insights from large datasets, enabling decision-makers to derive actionable intelligence without extensive data preparation.

AWS Glue, on the other hand, addresses the challenges of data integration and preparation across disparate sources and formats. Its managed ETL capabilities automate the process of data discovery, mapping, and transformation, facilitating the creation of reliable, clean data repositories ready for analysis. For businesses dealing with complex data landscapes or requiring sophisticated data processing pipelines, AWS Glue offers the tools necessary to streamline data workflows, enhancing analytics outcomes and operational efficiencies.

Amazon S3 serves as the backbone of data storage within the AWS ecosystem, providing durable, scalable, and secure object storage. Its integration with services like Athena and Glue allows for seamless data analysis and transformation workflows, ensuring data is easily accessible yet securely managed. The flexibility and cost-effectiveness of S3 make it an ideal choice for storing vast amounts of data, from logs and events to structured and unstructured data sets.

Amazon DynamoDB complements this suite of services with its fully managed NoSQL database capabilities, offering fast and predictable performance with seamless scalability. For serverless architectures, DynamoDB provides a reliable, high-performance data storage solution that supports event-driven programming models, making it suitable for a wide range of applications, from mobile backends to real-time analytics platforms.

When combined, these AWS services offer a robust framework for building serverless architectures that are not only scalable and efficient but also highly adaptable to changing business needs. The seamless interoperability between Athena, Glue, S3, and DynamoDB enables businesses to craft comprehensive data strategies that leverage the best of AWS capabilities, ensuring data is not just a resource but a strategic asset driving innovation and growth.

In conclusion, the journey through AWS’s serverless offerings highlights the transformative potential of cloud technologies in redefining how businesses approach data management and analysis. By embracing the serverless paradigm, organizations can accelerate their digital transformation initiatives, fostering a culture of data-driven decision-making and unlocking new opportunities for innovation and competitive advantage.

Empower Your Tech Journey:

Explore a wealth of knowledge designed to elevate your tech projects and understanding. From safeguarding your applications to mastering serverless architecture, discover articles that resonate with your ambition.

New Projects or Consultancy

For new project collaborations or bespoke consultancy services, reach out directly and let’s transform your ideas into reality. Ready to take your project to the next level?

Protecting Routes

Advanced Serverless Techniques

Mastering Serverless Series
