Serverless Pepperfry wallet using AWS QLDB

Divya Arya
Pepperfry Tech
Sep 15, 2022

Inception of Idea for using AWS QLDB:

For the past five years, we had been using a third-party service to manage customer wallet transactions at Pepperfry. It had its pros and cons, and the cons led us to decide to bring it in-house. The two major deciding factors were-

  1. Performance — for any issue, we had to wait longer than desired or acceptable for its resolution. The request hit rate was also limited.
  2. Cost — per-transaction charges had to be paid to the third party, so the more transactions we had, the more cost we had to bear.

After weighing the pros and cons of build vs. buy, we decided to build the wallet system in-house, on the condition that we would piggyback on an immutable database. For this, we chose AWS QLDB.

For more information on AWS QLDB, you may refer to the links below-

QLDB Introduction and Core concepts

QLDB High Level Overview

Why QLDB is a perfect fit in our case:

Earlier, we relied on the third party to manage customer wallet transactions. The logical separation of the different buckets (Promotional and Cashback) and their corresponding credit/debit transaction history was managed both in the third party's DB and in our MySQL DB. On top of that, we cached data in Redis to reduce load on our database and access it with microsecond latency.

Even with these layers added for performance, this setup raised multiple concerns around data tampering and verifiability for audit purposes.

Data Tampering — Wallet data, though kept secure, is stored in a MySQL DB, which still relies on a human to prevent tampering.

Data verification — The only way to assure data verifiability (for any customer) was the wallet resync procedure provided by the third party: we fetched information from their APIs and synced wallet data into our DB. But this did not guarantee verification, because there was no cross-check mechanism at our end to validate the data.

To address all the above concerns, we decided to use AWS QLDB, where data is completely immutable. This does not mean that we can't delete or update data in QLDB; it means the journal keeps the entire history of every change. This ensures an audit trail of all changes made to the data, however they were made, including SQL updates and deletes.

Secondly, when a transaction is committed in QLDB, a SHA-256 hash is created, combined with the hashes of previous transactions, and appended to the journal. We can cryptographically verify that a specific document is in the same location in the journal and has not been altered in any way. This ensures data verifiability. For more info, please refer to the AWS QLDB verification documentation.
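As an illustration, here is a minimal sketch of pulling the pieces needed for verification through the QLDB API. The ledger name, document id and block address below are placeholders (in practice they come from querying the committed view of the table), and the Merkle-proof recomputation itself is left to the AWS-documented procedure.

```python
import boto3

qldb = boto3.client("qldb")

LEDGER = "pf_wallet"                      # hypothetical ledger name
DOC_ID = "ExampleDocumentId1234567890"    # placeholder document id
BLOCK_ADDRESS = '{strandId: "ExampleStrandId", sequenceNo: 14}'  # placeholder

# 1. Get the latest ledger digest (the tip of the journal's hash chain).
digest = qldb.get_digest(Name=LEDGER)

# 2. Get the revision together with a proof linking it to that digest.
revision = qldb.get_revision(
    Name=LEDGER,
    BlockAddress={"IonText": BLOCK_ADDRESS},
    DocumentId=DOC_ID,
    DigestTipAddress=digest["DigestTipAddress"],
)

# Recomputing the hash chain from revision["Proof"] up to digest["Digest"]
# (as described in the AWS verification docs) proves the document revision
# has not been altered.
print(digest["Digest"])
print(revision["Proof"]["IonText"])
```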

Security at QLDB:

There are two states of data — data at rest and data in transit. QLDB enforces encryption in both states.

Data at rest security — All data stored in AWS QLDB is fully encrypted at rest by default, using the 256-bit Advanced Encryption Standard (AES-256). This helps secure our data against unauthorized access to the underlying storage.

Data in transit security — QLDB only accepts secure connections that use the HTTPS protocol, which protects network traffic using Secure Sockets Layer (SSL)/Transport Layer Security (TLS). Encryption in transit provides an additional layer of protection by encrypting our data as it travels to and from QLDB. This covers the data security aspect as well.

But Why QLDB?

I think the question remains the same: why QLDB and not blockchain? Blockchain also provides immutability, privacy, security, and transparency, just like QLDB.

The answer lies in the non-complex nature of our use case. Unlike blockchain, there is no need for multi-party consensus in QLDB (a consensus algorithm is a core part of any blockchain network). Also, blockchain is a distributed, decentralized network, but in our case we don't want to share the ledger with any third party, and QLDB provides a central, trusted ledger.

Figure a. Selection of Database Technology

Also, QLDB has an auto-scaling feature, so we don't have to worry about provisioning capacity or increasing read or write limits. QLDB scales automatically with the demands of the application.

Data Design of QLDB:

Event Carried State Transfer pattern

QLDB supports event-driven workloads through QLDB Streams, a feature that continuously writes changes made to the journal, in near real time, to a destination Kinesis data stream. Consumers can subscribe to the stream and trigger downstream events.

QLDB Streams carry the full state of a document revision in the events that are streamed, using the Event-Carried State Transfer pattern.

General Pattern-

Figure b. Streaming QLDB data to trigger downstream events

In figure b above, QLDB Streams captures all document revisions committed to the journal and delivers them in near real time to a Kinesis data stream. Consumers can read data from the stream to drive event-based logic for real-time analytics, and to feed journal data into data warehouses, relational databases, Elasticsearch and other downstream systems.
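For reference, creating such a stream is a single API call. Below is a minimal sketch using boto3; the ledger name, stream name, Kinesis stream ARN and IAM role ARN are placeholders for illustration.

```python
import boto3
from datetime import datetime, timezone

qldb = boto3.client("qldb")

# Placeholder names/ARNs; replace with your own ledger, stream and role.
response = qldb.stream_journal_to_kinesis(
    LedgerName="pf_wallet",
    StreamName="pf-wallet-journal-stream",
    RoleArn="arn:aws:iam::123456789012:role/qldb-stream-role",
    InclusiveStartTime=datetime(2022, 9, 1, tzinfo=timezone.utc),
    KinesisConfiguration={
        "StreamArn": "arn:aws:kinesis:ap-south-1:123456789012:stream/pf-wallet-journal",
        "AggregationEnabled": False,  # simpler consumers; no KPL de-aggregation needed
    },
)
print(response["StreamId"])
```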

Our Use Case:

In our case, a customer wallet needs to maintain a record of every transaction made against it. We use QLDB to store a permanent and complete record of credit/debit transactions, rather than building complex record-keeping functionality at the application level. We decided to keep our audit log tables in QLDB to benefit from its inherent immutability, completeness, and verifiability.

This simplifies keeping secure audit logs.

Figure c. Streaming of transactions from QLDB to RDS

High level flow for our system:

Figure d. High Level Architecture for smooth flow of data from QLDB to RDS
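To make the flow in figure d concrete, here is a minimal sketch of the consumer Lambda, assuming aggregation is disabled on the QLDB stream so that each Kinesis record carries one Ion-encoded stream record. The write_revision_to_rds helper is hypothetical and stands in for the RDS persistence logic.

```python
import base64
import amazon.ion.simpleion as ion

def write_revision_to_rds(revision):
    # Hypothetical helper: map the revision's data and metadata onto the
    # wallet transaction tables in RDS (MySQL).
    pass

def handler(event, context):
    for record in event["Records"]:
        # Kinesis delivers each QLDB stream record base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        stream_record = ion.loads(payload)

        # QLDB streams emit control, block summary and revision records;
        # only revision details carry document state that belongs in RDS.
        if stream_record["recordType"] == "REVISION_DETAILS":
            write_revision_to_rds(stream_record["payload"]["revision"])
```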

Failure cases during streaming and their handling:

As we did not know anyone who had used QLDB in production for a use case like ours, we encountered our share of issues.

Failure is a part of learning:

Initially, during the migration of 30 lakh (3 million) customers on the staging platform, we didn't face any issue or data loss, so we decided to proceed with the same approach in the live environment.

A day before go-live, after running the scripts for migration of user wallets, we checked the data at the RDS end and found that only 50% of the wallet info had migrated. We looked at the CloudWatch dashboard and found multiple issues: API Gateway errors, Kinesis shard over-usage (caught by keeping enhanced monitoring on), Lambda max retries exhausted, and so on. This is where we understood that our system was not ready for the load we were aiming for. We worked on those issues in detail, and within a week, after load-testing each component in the flow, our system was ready. :)

I have tried to explain each issue and its respective solution in the points below.

  1. Data loss from QLDB to RDS

Initially we encountered various cases where data loss occurred from QLDB to RDS.

If the cause of the loss is not known, it's better to create a new QLDB stream with a start date from which the journal will replay the data afresh into a new stream. The end date should also be specified; otherwise the current and new streams will stream the same data. Also, logic for removing duplicate data should sit at the consumer end, before feeding data into RDS, as in the sketch below.
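One simple way to keep the replay idempotent is to upsert on a uniqueness key. A minimal sketch, assuming a hypothetical wallet_transactions table in MySQL with a unique key on the QLDB document id and version (column names are illustrative):

```python
def upsert_revision(conn, revision):
    # conn is a MySQL connection (e.g. pymysql). The QLDB document id plus
    # version uniquely identify a revision, so replaying the same revision
    # twice leaves the table unchanged.
    doc_id = revision["metadata"]["id"]
    version = revision["metadata"]["version"]
    data = revision["data"]
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO wallet_transactions (doc_id, version, wallet_id, amount, txn_type)
            VALUES (%s, %s, %s, %s, %s)
            ON DUPLICATE KEY UPDATE version = VALUES(version)
            """,
            (doc_id, version, data["walletId"], data["amount"], data["txnType"]),
        )
    conn.commit()
```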

2. Kinesis stream data throttling

If data throttling occurs at the Kinesis end (visible in the CloudWatch metrics), it means the number of records being pushed from QLDB Streams to Kinesis is greater than 1,000 per second per shard, or the bytes written are greater than 1 MB per second per shard. Data loss can occur from QLDB to Kinesis in this case.

Note: We are using Kinesis data stream in provisioned capacity mode. Please refer to Kinesis Data Stream Capacity Mode for more details.

In such a case, increase the number of shards to handle the load and create a new QLDB stream to re-trigger data from QLDB to Kinesis. (For calculating the exact shard count required, you may refer to the Kinesis provisioned mode shard calculation; a rough sketch is given below.)
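For reference, a back-of-the-envelope version of that calculation, using the provisioned-mode per-shard quotas of 1 MB/s (or 1,000 records/s) for writes and 2 MB/s for reads:

```python
import math

def estimate_shards(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    # Each provisioned shard accepts up to 1 MB/s or 1,000 records/s of writes
    # and serves up to 2 MB/s of reads, so take the most demanding dimension.
    return max(
        math.ceil(write_mb_per_sec / 1.0),
        math.ceil(records_per_sec / 1000.0),
        math.ceil(read_mb_per_sec / 2.0),
    )

# e.g. 3 MB/s of revisions at 2,500 records/s, read once by the consumer:
print(estimate_shards(3, 2500, 3))  # -> 3 shards
```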


3. Data loss from Kinesis to Consumer End

The scenarios in which data loss may occur from Kinesis to the consumer end are given below.

A. Lambda concurrency increased

i) Lambda provisioned concurrency increased beyond 1,000 at a time

ii) Reserved concurrency exceeded the limit we specified for the Lambda.

B. Lambda exhausted Max Event Age

C. Lambda exhausted Max Retry Attempts

(Refer to AWS Lambda Asynchronous invocations handling to understand how Lambda handles asynchronous invocations through max event age and max retry attempts.)

For these cases, we set up an on-failure destination queue, so that failure messages are pushed to SQS and trigger another Lambda function.

Please note that the messages received by the SQS-triggered Lambda are in a different format: we get the stream and shard information describing where the data is actually stored. The data records need to be retrieved using this information before pushing them to RDS, as sketched below.
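A minimal sketch of that recovery Lambda, assuming the on-failure destination payload carries the usual KinesisBatchInfo metadata (shard id and sequence-number range); the stream name and write_revision_to_rds helper are the same hypothetical ones used in the consumer sketch above.

```python
import json
import boto3
import amazon.ion.simpleion as ion

kinesis = boto3.client("kinesis")

def write_revision_to_rds(revision):
    # Hypothetical helper, same as in the consumer sketch above.
    pass

def handler(event, context):
    for message in event["Records"]:
        # The failure destination tells us *where* the failed batch lives,
        # not the data itself.
        info = json.loads(message["body"])["KinesisBatchInfo"]

        iterator = kinesis.get_shard_iterator(
            StreamName="pf-wallet-journal",   # assumed Kinesis stream name
            ShardId=info["shardId"],
            ShardIteratorType="AT_SEQUENCE_NUMBER",
            StartingSequenceNumber=info["startSequenceNumber"],
        )["ShardIterator"]

        # A full implementation would page with NextShardIterator until it
        # passes info["endSequenceNumber"]; one call is shown for brevity.
        for record in kinesis.get_records(ShardIterator=iterator)["Records"]:
            stream_record = ion.loads(record["Data"])  # boto3 returns raw bytes
            if stream_record["recordType"] == "REVISION_DETAILS":
                write_revision_to_rds(stream_record["payload"]["revision"])
```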


4. Data loss from Consumer to RDS

A rare scenario we can hit is a DB max connections exceeded issue. This happens when the load coming to RDS is more than the maximum number of connections it can accept.

It happened in our case (refer to point 1) when we created a new stream with a specified start and end date to replay data from the QLDB journal. When the data volume between those dates is huge, the consumer end suddenly sees a spike in load, and consequently more connections are attempted on RDS to store that data. To solve this, we shortened the duration between the start and end dates and did the activity in batches.

(Suppose we need to stream the past 6 hours; it's better to do this in 3 batches of 2 hours each rather than streaming it in one single go, as in the sketch below.)
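A minimal sketch of that batching, reusing the same hypothetical ledger, role and Kinesis stream names as earlier:

```python
import boto3
from datetime import datetime, timedelta, timezone

qldb = boto3.client("qldb")

start = datetime(2022, 9, 14, 0, 0, tzinfo=timezone.utc)  # assumed replay window start
window = timedelta(hours=2)

for i in range(3):  # 3 batches of 2 hours each covers the 6-hour window
    # In practice, wait for each replay stream to complete before starting
    # the next one, to keep the load on the consumer and RDS bounded.
    qldb.stream_journal_to_kinesis(
        LedgerName="pf_wallet",
        StreamName=f"pf-wallet-replay-{i}",
        RoleArn="arn:aws:iam::123456789012:role/qldb-stream-role",
        InclusiveStartTime=start + i * window,
        ExclusiveEndTime=start + (i + 1) * window,
        KinesisConfiguration={
            "StreamArn": "arn:aws:kinesis:ap-south-1:123456789012:stream/pf-wallet-journal",
            "AggregationEnabled": False,
        },
    )
```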

In the end it was more of a coding problem, as we were creating a new RDS connection every time. We adopted connection pooling, and every data sync since then has been stable. We haven't faced any max-connections-exceeded issues yet.
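In a Lambda consumer, the simplest form of this is to create the connection once per execution environment and reuse it across invocations. A minimal sketch with pymysql, with hypothetical host and credentials taken from environment variables:

```python
import os
import pymysql

# Created once per Lambda execution environment and reused across
# invocations, instead of opening a new connection for every batch.
connection = pymysql.connect(
    host=os.environ["RDS_HOST"],        # assumed environment variables
    user=os.environ["RDS_USER"],
    password=os.environ["RDS_PASSWORD"],
    database=os.environ["RDS_DB"],
    autocommit=False,
)

def handler(event, context):
    connection.ping(reconnect=True)  # re-establish if the connection went stale
    with connection.cursor() as cur:
        for record in event["Records"]:
            # ... decode the record and upsert as shown earlier ...
            pass
    connection.commit()
```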

Still, to cater for this rare scenario, we can either increase the max connections setting on RDS or set up an SQS queue that pushes failed records into RDS through a Lambda function.

Conclusion:

As part of the e-commerce industry, multiple promotions are mapped to customer segments, and numerous credit/debit wallet transactions (wallet points) occur on the platform every day, both before placing an order (during customer registration) and after. It is very useful to have our own wallet service and to manage and scale it at our end.

This article explains our whole journey of migrating the wallet service from a third party to in-house. During development, we faced many hurdles and tried our best to resolve them. As time goes by, we'll add more points to the failure handling section.

Also, to minimize operational overhead and maximize scalability, we use a serverless approach built on Amazon API Gateway, AWS Lambda, Amazon Quantum Ledger Database (QLDB), Amazon Kinesis Data Streams and Amazon SQS.
