Understanding the Raison D’être of the Mighty Queue
Knowing when and how to deploy this simple and versatile tool can be an extremely powerful option in data system design.
The following scenario actually occurred in my career, and it was the moment the queue’s purpose in a system “clicked” for me. I hope that sharing this problem, the resulting struggle, and how I eventually solved it has a similar effect on you.
It’s a Friday afternoon at the office (remember those?) and you are getting ready for the weekend. You have already closed your laptop and are mingling with your co-workers to see where the happy hour spot is this week.
Suddenly, Brett from the marketing department approaches you — a respected senior engineer at the company — and says, “Sorry, but I have a favor to ask. We need to send a $15 off in-app coupon to all our customers in the past year before Monday, can you do it? The leadership team is concerned about sales numbers this quarter and really wants this promotion to go out.”
Although you’ve already accomplished several impressive things during your time at the company, you remain eager to further prove yourself, especially on something with such senior visibility. Before rushing to agree to the task, you realize there are a few things that need to be clarified.
“How many people is that?” you inquire.
“It’s 50,000 in total. I’ll send you a list of the user_id’s in a single-column CSV file. Make sure each user gets one coupon, and no users get more than one… does that work?”
“Yup,” you naïvely confirm that in fact, it does. “Shouldn’t be a problem.” And out you head to happy hour, figuring you’ll handle it the next morning after an iced coffee.
You wake up early and get your typical eggs, hash browns, sausage, and extra-large iced coffee.
You open your laptop and look over the promo.csv file sent to you at 5:56PM yesterday. It takes several seconds to open.
As promised, there are 50,000 rows in the file. Sweet.
You pull up a Jupyter Notebook, and begin composing a simple script to loop through each row, and send a POST request to the internal Rewards API.
Interlude for the Beloved Reader:
How would you do this? Think for a second how your script would look.
Thought a bit?
Let’s continue on.
After only 10 minutes of coding you produce the following:
You are performing 3 steps:
- Read in the csv file into a pandas dataframe
- Loop through the dataframe rows and make a POST request to your company’s internal promotions endpoint, passing in each user_id
- Raise an error if the status_code of the request does not equal 200
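The steps above might be sketched like this (the endpoint URL and payload shape are assumptions, not the real internal API; the `post` parameter is injectable only so the sketch is easy to dry-run):

```python
import pandas as pd
import requests

# Hypothetical internal promotions endpoint
PROMO_URL = "https://internal.example.com/rewards/promo"

def send_coupons(csv_path, post=requests.post):
    """Read the CSV and POST each user_id to the promo endpoint."""
    df = pd.read_csv(csv_path)
    for row in df.itertuples():
        response = post(PROMO_URL, json={"user_id": row.user_id})
        if response.status_code != 200:
            raise RuntimeError(f"Request failed for user {row.user_id}")
```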
You put your fingers over the “shift” and “return” buttons on your keyboard to execute the notebook cell.
Only now — with your fingers hovering over the fateful keys — do a series of concerns pop into your head.
“Hold on a sec…how long will this run for? I better put some log statements every 100 rows so I know it’s running…”
Phew, good call.
Next, you remember that it is unwise to overload an API with requests, especially your company’s hastily developed promotion service. So you decide to put a time.sleep(1) in between each request:
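With both additions, the loop might look like this (same hypothetical endpoint as before; `sleep` is injectable only so the sketch can be dry-run without actually waiting):

```python
import logging
import time

import pandas as pd
import requests

logging.basicConfig(level=logging.INFO)

# Hypothetical internal promotions endpoint
PROMO_URL = "https://internal.example.com/rewards/promo"

def send_coupons_throttled(csv_path, post=requests.post, sleep=time.sleep):
    df = pd.read_csv(csv_path)
    for i, row in enumerate(df.itertuples(), start=1):
        response = post(PROMO_URL, json={"user_id": row.user_id})
        if response.status_code != 200:
            raise RuntimeError(f"Request failed for user {row.user_id}")
        if i % 100 == 0:
            logging.info("Sent %d coupons so far", i)  # progress heartbeat
        sleep(1)  # throttle: one request per second
    return i
```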
Looks great. Okay, each request will take 1 second… multiplied by 50,000 requests… equals nearly 14 hours!
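The back-of-the-envelope math checks out:

```python
seconds = 50_000 * 1          # 50,000 requests, 1 second each
hours = seconds / 3600
print(f"{hours:.1f} hours")   # 13.9 hours
```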
Woah! You didn’t realize it would take that long.
Moments ago you were about to kick off this puppy, now you aren’t sure if you can run it at all.
You’ve done things like this before on 10, 20, even up to 100 users. But you didn’t realize the challenges that arise when scaling it to thousands…
Still, Brett and everyone else are expecting you to complete this. You said you could do it. You decide to make one final change before running:
Instead of raising an error mid-run, let’s keep track of all user_id’s from unsuccessful requests and save them to an error.csv file at the end. Fair?
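As a sketch, the final version might swap the raise for a growing list of failures (endpoint still hypothetical, `post` and `sleep` injectable for dry-running):

```python
import time

import pandas as pd
import requests

# Hypothetical internal promotions endpoint
PROMO_URL = "https://internal.example.com/rewards/promo"

def send_coupons_final(csv_path, post=requests.post, sleep=time.sleep):
    df = pd.read_csv(csv_path)
    failed = []
    for row in df.itertuples():
        response = post(PROMO_URL, json={"user_id": row.user_id})
        if response.status_code != 200:
            failed.append(row.user_id)  # record the failure and keep going
        sleep(1)
    # Dump every failed user_id for a later retry pass
    pd.DataFrame({"user_id": failed}).to_csv("error.csv", index=False)
    return failed
```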
Before doubt can creep in, you hit run!
4 hours later……
You check up on your laptop. The script stopped running an hour ago and only made it through the first couple thousand users before getting stuck.
With no other options, you bite the bullet and manually split the 50k row csv file into 50 parts of 1k users each.
The rest of your weekend is spent queueing up your script 50 times. Surely there must be a better way?
You arrive at your office Monday morning and tell your favorite co-worker the weekend’s harrowing tale of sending coupons to 50,000 users.
After hearing of the struggle, he matter-of-factly asks, “Why didn’t you just use a queue?”
“Huh?” you reply.
“It’s simple. Instead of looping through the users and making the promo service requests in the same script, it’s better to place each user_id as an individual message on a queue. Then run another script to read from the queue and make the service requests.
If a request succeeds, the message gets deleted from the queue. If it fails, the message is not deleted and is automatically retried, which is the default behavior of something like AWS’s SQS queue service.”
Your mind is abuzz. Why didn’t you realize that queues provide exactly the behavior needed to solve the problem you faced?!
You head to the coffee machine where you see Brett. “Got those promotions sent?” he asks.
Armed with newfound confidence from your understanding of queues you reply, “Yep, got any more?”
Queues work by sitting between producers and consumers of data. In our case, the CSV file of user_ids is the producer of data, and the Promo API service is the consumer.
Instead of interacting directly with each other, they both interact with the queue itself. This allows for asynchronous processing of messages: messages placed onto the queue by the producer are durably stored until picked up by the consumer.
Perhaps most critical is the way a queue provides fault tolerance. Messages become “invisible” while being processed by a consumer, but if they are never explicitly deleted, they eventually become visible in the queue again.
In this way, failed messages get retried (up to a configured number of times) while successful messages are deleted and never seen again.
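A toy in-memory queue (a stand-in for SQS, not its real API) makes the delete-on-success, retry-on-failure pattern concrete:

```python
from collections import deque

class ToyQueue:
    """Minimal queue with SQS-like semantics: a message must be
    explicitly deleted after processing, otherwise it reappears."""
    def __init__(self):
        self._messages = deque()

    def send(self, body):
        self._messages.append(body)

    def receive(self):
        return self._messages[0] if self._messages else None

    def delete(self, body):
        self._messages.remove(body)

# Producer: place each user_id on the queue as its own message
queue = ToyQueue()
for user_id in [1, 2, 3]:
    queue.send(user_id)

# Consumer: delete on success; leave the message on failure so it retries
fails_first_try = {2}
processed = []
while (msg := queue.receive()) is not None:
    if msg in fails_first_try:
        fails_first_try.discard(msg)  # simulated failure: no delete, stays queued
        continue
    processed.append(msg)
    queue.delete(msg)
```

Real SQS hides an in-flight message for a visibility timeout rather than leaving it at the head, but the contract is the same: no delete means eventual redelivery.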
To see specific code examples of this process in action with the SQS and Lambda AWS services, stay tuned for Part II!