Data Integration Architecture on AWS: API Gateway, Kinesis, Firehose, Lambda, and S3 for Processing 250 Million Monthly Transactions

Daniel Alejandro Figueroa Arias
Sep 25, 2023

Note: This blog post discusses a data integration architecture used by a fintech company to process a large volume of transactions efficiently.

Introduction

The fintech company I work for provides various data integration services. This document explains how transactions are sent, at a granular level, to a data warehouse.

Goal

The goal of this document is to explain the architecture and data flow for integrating transactions, sent as events, into S3 using a combination of AWS services: API Gateway, Kinesis, Firehose (with a Lambda function for message partitioning), and S3.

Proposed Architecture

Components of the Architecture

API Gateway: This service serves as the entry point for transactions. We configured a REST API and set up an integration request with a Kinesis Data Stream. It's crucial to specify the action (PutRecord), and the execution role must both trust API Gateway as the principal that assumes it and grant write permissions to the Kinesis Data Stream.

In the HTTP headers, “Content-Type: application/x-amz-json-1.1” should be included. Finally, the mapping templates should specify “application/json” and use the following template:

{
    "StreamName": "StreamName",
    "Data": "$util.base64Encode($input.body)",
    "PartitionKey": "partitionkey"
}
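For reference, the execution role's permissions policy could look like the following sketch. The stream ARN below is a placeholder, not a value from our account; the role's trust policy must additionally allow apigateway.amazonaws.com to assume it:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "kinesis:PutRecord",
            "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/StreamName"
        }
    ]
}
```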

Amazon Kinesis: It acts as a bridge between API Gateway and Firehose. Its advantages are configurable message retention (from 24 hours up to 365 days) and scaling managed by AWS according to demand.

Amazon Kinesis Data Firehose: We configured the Kinesis Data Stream as the source and S3 as the destination, and enabled a Lambda function to transform the data.

Lambda function: Firehose invokes it with a batch of transactions, triggered by either a buffer size or a time interval. For each message, it generates partition keys from two fields, in this case partition1 and partition2, which must be included in the message. Finally, a newline is appended to each message so the resulting S3 objects contain one JSON document per line. Here's the code for the function:

import base64
import json
import copy

def create_partitions(message):
    # The two partition fields must be present in every incoming message
    partition_keys = {
        "partition1": message["partition1"],
        "partition2": message["partition2"]
    }
    return partition_keys

def lambda_handler(event, context):
    output = []
    for record in event['records']:
        # Firehose delivers each record base64-encoded
        payload = base64.b64decode(record['data'])
        json_string = payload.decode("utf-8")
        message = json.loads(json_string)
        partitions = create_partitions(message)
        # Append a newline so each line in the S3 object is one JSON document
        final_message = json.dumps(message) + "\n"
        encoded_bytes = base64.b64encode(final_message.encode('utf-8'))
        encoded_string = encoded_bytes.decode('utf-8')
        output_record = copy.deepcopy(record)
        output_record['data'] = encoded_string
        output_record['result'] = 'Ok'
        output_record['metadata'] = {'partitionKeys': partitions}
        output.append(output_record)
    return {'records': output}
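To see the transformation end to end, here is a small standalone sketch that builds a fake Firehose event and applies the same decode/partition/re-encode steps as the handler above. The sample field names beyond partition1 and partition2, and all values, are made up for illustration:

```python
import base64
import copy
import json

def transform(event):
    # Same decode / partition / newline / re-encode steps as the Lambda above
    output = []
    for record in event["records"]:
        message = json.loads(base64.b64decode(record["data"]).decode("utf-8"))
        partitions = {
            "partition1": message["partition1"],
            "partition2": message["partition2"],
        }
        data = base64.b64encode((json.dumps(message) + "\n").encode("utf-8")).decode("utf-8")
        out = copy.deepcopy(record)
        out.update(data=data, result="Ok", metadata={"partitionKeys": partitions})
        output.append(out)
    return {"records": output}

# Build a fake Firehose event with one sample transaction
sample = {"partition1": "merchantA", "partition2": "cardX", "amount": 125.50}
event = {
    "records": [
        {"recordId": "1", "data": base64.b64encode(json.dumps(sample).encode("utf-8")).decode("utf-8")}
    ]
}

result = transform(event)
print(result["records"][0]["metadata"]["partitionKeys"])
```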

Once the batch of messages is returned to Firehose, it writes to the specified bucket with dynamic partitions in the following format:

MyKey/partition1=!{partitionKeyFromLambda:partition1}/partition2=!{partitionKeyFromLambda:partition2}/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/

It's essential to clarify that the partition keys referenced in the prefix expression (partition1 and partition2) must match exactly the keys the Lambda function emits under partitionKeys in each record's metadata.
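As an illustration of where a given message lands, this sketch renders the same prefix pattern in plain Python for hypothetical partition values and a fixed timestamp:

```python
from datetime import datetime, timezone

def render_prefix(partition1, partition2, ts):
    # Mirrors the Firehose dynamic-partitioning expression shown above
    return (
        f"MyKey/partition1={partition1}/partition2={partition2}/"
        f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"
    )

ts = datetime(2023, 9, 25, 14, 30, tzinfo=timezone.utc)
print(render_prefix("merchantA", "cardX", ts))
# MyKey/partition1=merchantA/partition2=cardX/year=2023/month=09/day=25/hour=14/
```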

Amazon S3: This is the destination for the data from Firehose. A partitioned path is generated with the structure partition1/partition2/year/month/day/hour/. The files are compressed with gzip, and each line contains one message in JSON format.
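Downstream consumers can read these objects line by line. Here is a minimal sketch that writes a sample object the way Firehose delivers it (gzip-compressed, one JSON document per line) and reads it back; the file name and field values are illustrative:

```python
import gzip
import json

path = "sample-delivery.gz"  # illustrative local file name
messages = [{"partition1": "merchantA", "partition2": "cardX", "amount": 125.50}]

# Write: gzip-compressed, one JSON document per line
with gzip.open(path, "wt", encoding="utf-8") as f:
    for m in messages:
        f.write(json.dumps(m) + "\n")

# Read back line by line, as a downstream consumer would
with gzip.open(path, "rt", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(records[0]["amount"])
```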

Costs: We assume that approximately 250 million transactions will be received per month. Here’s an estimated breakdown of costs:

  • API Gateway: $870/month
  • Kinesis Data Stream: $40/month
  • Firehose: $40/month
  • Lambda: $50/month
  • S3: $1/month

The total cost of the solution is approximately $1,001/month, or roughly $0.000004 per transaction at 250 million transactions per month.

Conclusion

The proposed data integration architecture on AWS offers a robust solution for the fintech company to process a substantial volume of transactions efficiently, securely, and cost-effectively. By combining the power of API Gateway, Kinesis, Firehose, Lambda, and S3, this architecture empowers the fintech company to streamline reconciliation, gain valuable insights, and maintain data integrity.
