“We’ll do that later”: how to improve your HTTP response cycle
When we started developing inventid in Ruby on Rails we quickly realized several things:
- Ruby is not particularly fast,
- Nor is doing stuff inside an HTTP request,
- Yet we need ridiculously fast and concurrent HTTP responses to function.
That sounds like a dilemma. And at first it was, until at some point we grasped that not everything needs to be done right now. This is true for a lot of business decisions, but it is even “truer” in IT development. Moreover, some stuff simply needs to be done later!
So we started to separate our HTTP request cycle from background jobs.
Due to the nature of our business, we often need to do some work now and validate its result in the future. A basic example is checking for payment expirations: most notably, the Dutch iDEAL payment system does not push status changes to you, but imposes a so-called “haalplicht” (Dutch for an “obligation to fetch”): you are required to poll for changes yourself, at the moment it is likely that changes have been made.
Hence imagine the following scenario:
- A user starts an order,
- A user starts the payment for the order,
- A user completes the payment,
- And the user closes the screen.
At this moment, there is no way for an e-commerce platform to get notified of the payment completion. Instead, the platform is required to:
- Hold the order and,
- Poll periodically (on a schedule, or after a time-out) for the current payment status.
Obviously, this is ill-suited for any HTTP request cycle! Early on in development we therefore adopted the (quite lovely) Sidekiq library, so we could schedule these kinds of re-checks cheaply during the HTTP cycle. At execution time, the check would verify the payment status (if and only if it was still open) and save the result.
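To illustrate the pattern, here is a toy sketch in plain Ruby. The class and field names are made up, and this in-memory queue only mimics what Sidekiq’s `perform_in` does durably via Redis and separate worker processes:

```ruby
# Toy sketch of deferred re-checks: remember a job plus the earliest time
# it may run, then execute it later. Sidekiq does this durably via Redis
# and separate worker processes; this version only shows the idea.
class DelayedQueue
  def initialize
    @jobs = []
  end

  # Mimics MyWorker.perform_in(delay, args): schedule work for the future.
  def perform_in(delay_seconds, &job)
    @jobs << [Time.now + delay_seconds, job]
  end

  # Run every job whose scheduled time has passed.
  def drain(now: Time.now)
    due, @jobs = @jobs.partition { |run_at, _| run_at <= now }
    due.each { |_, job| job.call }
  end
end

# Hypothetical payment re-check: act iff the payment is still open.
payment = { status: "open" }
queue = DelayedQueue.new
queue.perform_in(300) do
  payment[:status] = "paid" if payment[:status] == "open" # stand-in for the poll
end

queue.drain                       # nothing due yet; the HTTP cycle moves on
queue.drain(now: Time.now + 301)  # five minutes "later": the re-check runs
```

In the real system the block body would ask the payment provider for the current status and persist the result; only the scheduling cost lands inside the HTTP request.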
That works for so many things!
Indeed! Scheduling work for the future means that we can do an awful lot of things. After this revelation, we adopted sidekiq throughout our stack to ensure we always do the absolute minimum for a user to continue.
Examples may include:
- Sending an email
- Generating PDF tickets
- Updating analytics data in Elasticsearch
- Firing Kafka events
- …and a whole lot more, but you get the gist
Save sanity and money
Additionally, this gives you the possibility to spread workload over time. inventid may regularly see traffic spikes of 10,000x our regular traffic (especially when popular sales hit the market). For basic business reasons, we cannot continuously keep worst-case 10,000x capacity online for usage that is incidental (perhaps 10 minutes a day, for example).
As a result, cheaply deferring actions (note the emphasis on cheaply) is extremely cost effective and crucial to maintain our pricing. Let me walk you through an example of high-traffic ticket sales:
- You order your tickets, which get reserved during the HTTP request
- You start your payment
- You complete your payment and return to our platform (for sake of simplicity)
- Suddenly, multiple things happen instantly!
What’s that? Lemme show you!
- Your confirmation email is scheduled for generation, only passing your order_id along. No reference, just a primitive.
- Your etickets are scheduled for generation, only passing your order_id along. Another primitive.
- Kafka events are fired for your completed orders, using only a static version_id, which in our system uniquely identifies the state of a specific entity in time. Yet another primitive!
- Elasticsearch updates are performed based on your version_id. At this point you probably get this is just a primitive too
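The four steps above can be sketched as plain enqueues. The worker names below are illustrative, not our real classes; the point is that what crosses the queue is JSON carrying only primitives (order_id, version_id), never a live object, so the worker re-fetches fresh state when it actually runs:

```ruby
require "json"

# Sidekiq serializes job arguments to JSON, which is exactly why we pass
# only primitives: a queued payload must survive being run later, elsewhere.
def enqueue(queue, worker, *args)
  queue << JSON.generate("class" => worker, "args" => args)
end

jobs = []
order_id   = 199_217    # hypothetical ids, shaped like the ones in this post
version_id = 1_234_567

enqueue(jobs, "ConfirmationEmailWorker", order_id)
enqueue(jobs, "EticketWorker",           order_id)
enqueue(jobs, "KafkaEventWorker",        version_id)
enqueue(jobs, "ElasticsearchWorker",     version_id)
# jobs now holds four small JSON strings, e.g.
# {"class":"ConfirmationEmailWorker","args":[199217]}
```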
Each of these actions may (and likely will) be executed on different machines at different times. In case of high traffic, we are able to start these background workers in a matter of seconds. (Thanks #someCloud! Not taking credit for this.)
But the important thing: it frees our precious HTTP requests from doing anything at all! Yes, your email may take 2 seconds to be sent, and e-ticket generation could be queued for an additional 5 seconds… But the fact of the matter is, a user will not go from completing a payment to checking their inbox and be disappointed by a 7-second delay, because other factors will dominate that time.
But what about what we want to do now?
Basically, we regard doing anything immediately that is not required during the HTTP cycle as waste. Instead, we ensure that every version of every entity can be requested by any service. Therefore, we can even fire Kafka CUD (create, update, delete) events with a delay.
```
$ curl -H 'X-Session-Token: xxxxxx' https://api.inventid.nl/versions/1234567
{
  "timeout_after": [null, "2015-11-13T15:17:02.944Z"],
  "status": [null, "open"],
  "user_id": [null, 221244],
  "shop_id": [null, 2],
  "fee": [null, 0],
  "created_at": [null, "2015-11-13T15:12:03.027Z"],
  "updated_at": [null, "2015-11-13T15:12:03.027Z"],
  "id": [null, 199217]
}
```
Using this approach, everything can be inferred. Each field stores an [old, new] pair; since every old value here is null, one can see that version_id 1234567 was the creation of an order, in November 2015. It was scheduled to time out at 15:17:02 UTC, and we even know the exact code tag which was in effect on that server at the time!
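Because every version stores per-field [old, new] pairs like the response above, any historical state can be rebuilt by replaying versions in order. A minimal sketch (the replay helper is ours for illustration, not part of the API):

```ruby
# Fold a list of version diffs ({ field => [old, new] }) into the
# entity state after the last version: later values win.
def replay(versions)
  versions.each_with_object({}) do |diff, state|
    diff.each { |field, (_old, new_value)| state[field] = new_value }
  end
end

versions = [
  { "status" => [nil, "open"], "fee" => [nil, 0] }, # creation
  { "status" => ["open", "paid"] }                  # payment completed
]
replay(versions) # => { "status" => "paid", "fee" => 0 }
```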
In hindsight, I’d like to convey that there is very little you actually want to do now. The above works great for auditing purposes, and ensures you decouple your HTTP cycle from everything else. The latter helps scalability.
No matter how hard you get hammered with HTTP requests (which you do not control), you remain in control of your workers: you can scale them up, or even cancel work.
I’d like to end with a very famous quote:
By taking that kind of control, you are taking control of your uptime, stability, and reliability as well. And you can build a business on that!
Yes, I wrote it myself ;)