Streamlining Data Flow: From S3 Buckets to Validated Insights with Serverless Automation

Babajide Onamusi
5 min read · Jan 22, 2024


AWS Python Boto3 and Lambda

In today’s data-driven world, ensuring the accuracy and integrity of data is imperative for every business. Companies understand the importance of reliable data for informing investment decisions and delivering exceptional value to their clients. The volume of data being accumulated keeps increasing, and because that data is generated in divergent systems, it must be gathered into a single location for analysis and insight generation.

Scenario: A company (“NextGenCapital”) processes billing information for its clients once a month. To streamline the data quality and validation process and maintain consistency at the highest standard, the company builds a cutting-edge solution using AWS Lambda, Python Boto3, and Amazon S3.

The Challenge: Manual Data Validation Barriers

Traditionally at NextGenCapital, data validation involved manual processes that were prone to error, time-consuming, and a barrier to efficient scaling. The company receives data from clients in different time zones, and manually reviewing and validating data files stored in S3 buckets bottlenecked the data pipelines. The Development and Operations teams struggled with these challenges, which in turn affected the Finance team’s ability to make timely investment decisions.

The Solution: Automating Data Validation with AWS S3, Lambda and Boto3

The DevOps Team at NextGenCapital embraced the power of automation to overcome these bottlenecks. They built a serverless architecture using AWS Lambda and Python Boto3 to automate the validation of data files uploaded to the S3 buckets.

Amazon S3: The S3 buckets are used to store the financial data. This cloud storage provides NextGenCapital with a scalable and secure platform for data storage and access.

AWS Lambda: NextGenCapital uses a serverless function to trigger data validation as soon as new data is uploaded to an S3 bucket. This enables real-time validation without managing infrastructure.

Python Boto3 Library: The Boto3 library is leveraged to interact with S3 and Lambda from the code.

The following diagram illustrates NextGenCapital’s solution architecture.

PROCEDURE:

  • A Lambda function is set up, triggered whenever a new CSV file is uploaded to an S3 bucket.
  • The Lambda function validates the file for errors; if any discrepancies are found, the file is moved into an ‘error’ bucket.

PROCEDURE-1:

  1. Create a Lambda function.

2. The DevOps Engineering team uses VS Code as the IDE; the AWS Toolkit is installed and the ‘NextGenCapital’ Lambda function is downloaded.

3. Import the necessary modules for the date operations, then:

  • Initialise the S3 resource.
  • Extract the bucket name and CSV file key from the event (see the sketch below).
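A minimal sketch of how this step might look. The event shape follows the standard S3 put-notification structure; the handler layout itself is an illustrative assumption, not NextGenCapital’s exact code:

```python
import csv
import urllib.parse
from datetime import datetime  # needed for the date operations on billing data

import boto3

# Initialise the S3 resource once, outside the handler,
# so warm Lambda invocations can reuse the connection.
s3 = boto3.resource('s3')


def lambda_handler(event, context):
    # Extract the bucket name and CSV object key from the S3 put event.
    record = event['Records'][0]['s3']
    bucket_name = record['bucket']['name']
    key = urllib.parse.unquote_plus(record['object']['key'])
```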

4. Define the error bucket.

5. Here the object is received, then the data is extracted; a sketch covering steps 4 and 5 follows.
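Continuing the sketch inside the handler; the error bucket name is taken from the bucket mentioned later in the article:

```python
    # Step 4: define the bucket that invalid files are moved into.
    ERROR_BUCKET = 'nextgencapbillingerrors'

    # Step 5: receive the object, then extract its contents as CSV rows.
    obj = s3.Object(bucket_name, key)
    body = obj.get()['Body'].read().decode('utf-8')
    rows = list(csv.reader(body.splitlines()))
```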

6. This is the flag raised when an error is found.

7. Define the NextGenCapital products and currencies; steps 6 and 7 are sketched below.
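Continuing inside the handler. The specific product lines and currencies below are illustrative assumptions, not NextGenCapital’s real lists:

```python
    # Step 6: the flag raised when a validation error is found.
    error_found = False

    # Step 7: the product lines and currencies NextGenCapital accepts.
    # Placeholder values for illustration; substitute the company's real lists.
    VALID_PRODUCT_LINES = ['Bank Products', 'Loan Products', 'Enterprise Products']
    VALID_CURRENCIES = ['USD', 'GBP', 'EUR']
```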

In order to test the Lambda function locally, the ‘template.yaml’ and ‘event.json’ files are created in the same folder.
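A minimal ‘template.yaml’ for local testing with AWS SAM might look like this; the resource name, runtime, and handler path are assumptions to adjust to your own layout:

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  NextGenCapitalFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: lambda_function.lambda_handler  # file.function; adjust to your code
      Runtime: python3.9
      CodeUri: .
      Timeout: 30
```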

8. Then it is tested locally.

Note: The Engineers uploaded the “billing_data_nextlend_may_2023.csv” file into the S3 bucket.
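The ‘event.json’ can mimic the S3 put notification for that file. A trimmed-down version containing only the fields the handler reads might look like:

```json
{
  "Records": [
    {
      "s3": {
        "bucket": { "name": "nextgencapbilling" },
        "object": { "key": "billing_data_nextlend_may_2023.csv" }
      }
    }
  ]
}
```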

  • Next, run the code against the sample event to process the CSV file:
sam local invoke -e event.json

PROCEDURE-2:

  1. Here, the Engineers build on the code from Procedure 1, adding the validation logic.

2. Check whether the product_line and currency values are valid.

3. Check whether any errors were found; a sketch covering steps 1–3 follows the list.

  • Check your S3 console to confirm the file was transferred into the error bucket.
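Putting steps 1–3 together, the validation and error handling might look like the following sketch. The column positions are assumptions to adjust to the real CSV layout, and the bucket names come from the buckets named later in the article:

```python
    # Inside lambda_handler, after the rows have been extracted.
    for row in rows[1:]:  # skip the header row
        product_line, currency = row[4], row[5]  # assumed column positions

        # Step 2: check whether the product_line and currency are valid.
        if product_line not in VALID_PRODUCT_LINES:
            print(f"Invalid product line: {product_line}")
            error_found = True
        if currency not in VALID_CURRENCIES:
            print(f"Invalid currency: {currency}")
            error_found = True

    # Step 3: if errors were found, move the file into the error bucket.
    if error_found:
        s3.Object(ERROR_BUCKET, key).copy_from(
            CopySource={'Bucket': bucket_name, 'Key': key}
        )
        obj.delete()  # remove the original so only the error copy remains
```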

PROCEDURE-3:

Finally, the Lambda function is uploaded into the Lambda environment, then tested by creating a trigger and setting the proper IAM policies.

  1. To upload, select the directory in the IDE and select Upload Lambda.

2. Fill in the necessary credentials and select Continue when prompted.

3. The code is now in AWS Lambda.

NextGenCapital’s Lambda

4. Next, select Configuration and Permissions; then select the arrow to update the permissions.

5. Once there, the “AmazonS3FullAccess” policy was attached. Then add a trigger: select S3, the billing bucket, and the PUT event type.

  • Then, the trigger was created.
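For reference, an S3 full-access policy like the one attached grants permissions along these lines; this is shown for illustration, and a production setup would typically scope the actions and resources down to the two billing buckets:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": "*"
    }
  ]
}
```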

6. Upload the Nextwell CSV file into the nextgencapbilling bucket, and the file will be pushed into the nextgencapbillingerrors bucket.

Benefits of this AWS Architecture to NextGenCapital

This automated data validation system transformed the way NextGenCapital handles its data.

  • Accelerated Decision-Making: NextGenCapital reported faster decision-making processes, carried out with greater confidence.
  • Enhanced Data Quality: Real-time validation ensures that NextGenCapital identifies anomalies in data instantly, limiting exposure to errors and downstream issues.
  • Increased Scalability: The serverless architecture deployed by the Engineering team allows NextGenCapital to easily scale its data processing capabilities as the company’s data volume grows.

Summary

In conclusion, NextGenCapital demonstrated how it was able to automate its data validation system. We walked through the code that uses the Boto3 library to implement the validation logic, discussed the Lambda function, and shared insights into the overall performance of this architecture at NextGenCapital.

See the links to understand the AWS services better!

I believe that this approach can be valuable for any business that relies on accurate and reliable data for its operations.
