An on-call developer’s worst nightmare (red indicates errors)


How I Scaled a Software System’s Performance By 35,000%

Dive into how I resolved a platform’s scaling, stability, and performance issues through caching, jobification, queue separation, and more.

Joseph Gefroh

Published in

The Startup

21 min read · Jul 22, 2020

Processing over $20,000,000 in a single day

A company I previously worked for built payments systems and giving day software intended for massive giving days, where we would receive tens of thousands of donations for a single campaign.

One of my responsibilities at that company was to scale the system and ensure it didn’t topple over. At its worst, it would crash on just 3–5 requests per second.

Due to inefficient architecture, questionable technology choices, and rushed development, it had many constraints and was a patchwork of band-aids and gaping performance gaps. A combination of magical spells and incantations would keep the server running throughout the day.

By the time I was done with the platform, it had the potential to manage several thousand requests per second and run thousands of campaigns simultaneously, all for roughly the same operational cost.

How? I’ll tell you!

Analyzing the usage patterns

Before we dive into how I optimized this system, we have to understand its usage patterns and the specific circumstances and constraints we are optimizing under. To do otherwise would be to shoot in the dark.

Giving days have defined starts and stops

RPS: Giving days started and ended suddenly.

Giving days are massive planned events, scheduled months in advance. They start and stop at very specific dates and times. Sometimes these dates are movable; other times they are not.

There’s an emphasis on sharing

During the campaign, the effort to get the word out to donate can be intense.

Our system might send out hundreds of thousands of emails at the very beginning of the day, with…
