Refactoring our Payment Processor

Alex Cusack
Engineering at Earnest
4 min read · Oct 23, 2017

At Earnest, the core piece of infrastructure responsible for money movement is what we refer to internally as our ‘Payment Processor’. The Payment Processor is responsible for creating all monetary transactions, including generating deposits for newly originated loans as well as processing payments for our actively serviced loans.

The Payment Processor as originally architected met our needs well, but as the number of loans we service has increased, the time it takes to execute and the frequency of failure has also increased. Payment Processor failures involve money movement and accounting, so any failure triggers an incident that requires our Servicing engineering team to react immediately.

As our portfolio has grown over the last four years, the Payment Processor has evolved into a task that executes three times a day, with each execution taking ≈60 minutes and the overall execution time increasing linearly as a function of the number of active loans in our portfolio.

Given this steady, linear degradation in performance, our Servicing engineering team decided it was time to revisit the architecture.

Our existing architecture could be outlined as:

  1. Read all loans and related attributes
  2. Bucket loans into one or more of four buckets, based on the actions that need to be applied to the loan
  3. Apply actions (creating deposits, payments, tax transactions, etc.)

An important point of this architecture is that the design assumed the process would execute as a single, non-parallelizable task that addressed our entire loan portfolio in one run.
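The three steps above can be sketched roughly as follows (a minimal illustration with hypothetical names; the real processor is far more involved and, in the original design, wrapped everything in a single database transaction):

```python
# Hypothetical sketch of the original single-pass architecture.
# Bucket and function names are illustrative, not Earnest's actual code.

def run_payment_processor(loans):
    """Process the entire portfolio in one non-parallelizable pass."""
    buckets = {"deposit": [], "payment": [], "tax": [], "fee": []}

    # Step 2: bucket loans by the action(s) they need
    for loan in loans:
        for action in loan["pending_actions"]:
            buckets[action].append(loan)

    # Step 3: apply actions; in the old design, all of this happened
    # in one run over the whole portfolio
    applied = []
    for action, bucketed in buckets.items():
        for loan in bucketed:
            applied.append((action, loan["id"]))
    return applied

loans = [
    {"id": 1, "pending_actions": ["deposit"]},
    {"id": 2, "pending_actions": ["payment", "tax"]},
]
print(run_payment_processor(loans))
```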

Our refactoring objectives were:

  1. Increased reliability: A single loan exception should not require the entire process to retry all loans
  2. Future scalability: The solution should support 10x our current loan volume. (“design for ~10x growth, but plan to rewrite before ~100x”)
  3. Cost of implementation: Earnest’s Engineering is a small team responsible for managing over a billion dollars in loans, so we’re constantly evaluating the cost / benefit of technical projects to ensure we are working on the right things.

The solution model we found best suited to meet these goals was to refactor the Payment Processor for horizontal scalability — we wanted a system that would allow multiple Payment Processor nodes to execute in parallel, with each node addressing a different shard of our loan portfolio.

Addressing our architectural objectives in order, we first refactored for increased reliability. Because the Payment Processor’s overall execution is idempotent, we can choose to optimize for harvest rather than yield. In this case we’re thinking of harvest as the fraction of the portfolio successfully processed in a given run: harvest = loans successfully processed / total loans targeted.

(the harvest vs. yield framing comes from Fox and Brewer’s paper on harvest, yield, and scalable tolerant systems)

To achieve this, we refactored the way the existing Payment Processor writes transactions from a function that writes all transactions for all loans as a single Serializable Postgres Transaction to one that writes transactions for each loan as a distinct Postgres Transaction. This means a single loan can fail to write, and be safely rolled back, while the rest of the loans in the batch can still succeed. In the prior state, a single loan failure would cause the entire execution to fail and be retried (still a 60-minute process at this point). Now we can rollback and retry individual loans as they fail.
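The per-loan transaction pattern can be illustrated as below. This is a sketch only: production uses Postgres, but sqlite3 is substituted here so the commit-per-loan and rollback-on-failure behavior is self-contained; the schema and amounts are invented for the example.

```python
# Illustration of one database transaction per loan, using sqlite3 in
# place of Postgres. A failing loan rolls back alone; the batch continues.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (loan_id INTEGER, amount REAL CHECK (amount > 0))"
)

def write_loan_transactions(conn, loan_id, amount):
    """Write one loan's transactions in its own transaction."""
    try:
        with conn:  # commits on success, rolls back this loan only on error
            conn.execute("INSERT INTO transactions VALUES (?, ?)", (loan_id, amount))
        return True
    except sqlite3.IntegrityError:
        return False  # this loan failed and was rolled back; others proceed

results = {
    loan_id: write_loan_transactions(conn, loan_id, amount)
    for loan_id, amount in [(1, 100.0), (2, -5.0), (3, 250.0)]
}
print(results)  # loan 2 violates the CHECK constraint and is rolled back
```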

Next, we refactored the processor to optionally target a set of loans rather than operating over all loans with each execution. The target loan selection was implemented as a function of modulus: a processor instance configured with a modulo N and a bucket b targets every loan whose id satisfies loan_id mod N = b.

Using modulus instead of fixed ranges allows the processor to automatically redistribute loans into buckets as our portfolio grows and allows our parallelization to be easily adjusted based on modulus read from a configuration file.

The Payment Processor task is triggered via an Amazon SQS queue, and though the Payment Processor as a whole operates idempotently, undesired concurrent executions result in a failure in the final stage of execution.

To avoid undesired concurrent executions we needed some sort of mutex that could be used to lock loans that were being operated on by a specific Payment Processor instance. To achieve this, we decided to use Postgres’s Advisory Locks.

For those unfamiliar, a Postgres advisory lock is an application-level lock: a client requests a lock on an integer key, and once the lock is granted, any subsequent lock request for the same integer from another client is rejected or blocked until the lock is released. Advisory locks worked well for us since we wanted to lock entire ‘buckets’ of loans, rather than taking out individual locks on each loan.

To lock on the entire set of loans being targeted by our modulo operation, each Payment Processor instance takes out a lock on f(payment-processor-<bucket>) where f is a function that hashes the String to an Integer.

If a requested Lock has already been granted, the Processor instance will log and exit without making any changes.
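A sketch of the lock key derivation and the non-blocking lock attempt is below. The post doesn’t specify the hash f, so crc32 stands in as one example; the execute_sql helper is hypothetical, while pg_try_advisory_lock is Postgres’s real non-blocking variant, which returns immediately with true or false.

```python
# Sketch: derive an advisory-lock key from the bucket name, then attempt
# a non-blocking lock. crc32 stands in for the unspecified hash f; the
# execute_sql helper is hypothetical.
import zlib

def advisory_lock_key(bucket: int) -> int:
    """Hash the string 'payment-processor-<bucket>' to a 32-bit integer."""
    return zlib.crc32(f"payment-processor-{bucket}".encode("utf-8"))

def try_lock(execute_sql, bucket: int) -> bool:
    # pg_try_advisory_lock returns immediately: true if granted, false if
    # another session already holds a lock on the same key
    return execute_sql(
        "SELECT pg_try_advisory_lock(%s)", (advisory_lock_key(bucket),)
    )

# Deterministic: every processor instance derives the same key per bucket
print(advisory_lock_key(3) == advisory_lock_key(3))  # True
```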

After making these changes we added a lightweight supervisor task responsible for taking in a modulo and enqueuing new Payment Processor tasks, each configured to target one of the loan buckets 0 through modulo - 1.
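The supervisor’s fan-out can be sketched as follows (a minimal illustration; enqueue stands in for the SQS client call and the task payload shape is invented):

```python
# Sketch of the supervisor: given a modulo, enqueue one Payment Processor
# task per bucket. `enqueue` is a stand-in for the real SQS client call.
def supervise(modulo, enqueue):
    for bucket in range(modulo):  # buckets 0 .. modulo - 1
        enqueue({"task": "payment-processor", "modulo": modulo, "bucket": bucket})

queued = []
supervise(4, queued.append)
print(len(queued))  # one task per bucket
```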

After completing the refactor, the Payment Processor’s execution time has decreased from ≈60 minutes to ≈14 (we’re currently running it with a parallelization factor of four), and the frequency of failures has decreased significantly: we’ve yet to have a failure since rolling out the new version (knock on wood!).

Interested in projects like this and the future of banking? Our team is hiring! Feel free to email me at ‘alex [dot] cusack [at] earnest.com’ or visit our careers page.
