“We’ll do that later”: how to improve your HTTP response cycle

When we started developing inventid in Ruby on Rails we quickly realized several things:

  1. Ruby is not particularly fast,
  2. Nor is doing stuff inside an HTTP request,
  3. Yet we have to maintain a ridiculous level of fast and concurrent HTTP responses to function.

That sounds like a dilemma. And at first it was, until we grasped that not everything needs to happen right now. This is true for a lot of business decisions, but it is even truer for software development. Moreover, some things simply need to be done later!

So we started to separate our HTTP request cycle from background jobs.

Handle expirations

Imagine the following scenario:

  1. A user starts an order,
  2. A user starts the payment for the order,
  3. A user completes the payment,
  4. And the user closes the screen.

At this moment, there is no way for an e-commerce platform to be notified of the payment completion. Instead, the platform is required to:

  1. Hold the order and,
  2. Poll the payment provider periodically (on a schedule, or after a time-out) for the current payment status.

Obviously, this is ill-suited for any HTTP request cycle! Early on in development, we therefore adopted the (quite lovely) Sidekiq library so we could schedule these kinds of re-checks cheaply during the HTTP cycle. At execution time, the check would verify the payment status (if and only if the order was still open), and save the result.
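The re-check pattern can be sketched as follows. This is a minimal, self-contained illustration: Order, ORDERS, and PaymentStatusWorker are hypothetical names, and the provider call is stubbed. In a real Rails app the worker would include Sidekiq::Worker and be enqueued with something like PaymentStatusWorker.perform_in(300, order_id).

```ruby
# Hypothetical sketch of a deferred payment re-check; not inventid's actual code.
Order = Struct.new(:id, :status)

ORDERS = { 42 => Order.new(42, "open") }  # stands in for the database

class PaymentStatusWorker
  # With Sidekiq this would run later, on any worker machine
  def perform(order_id)
    order = ORDERS[order_id]
    # Re-check if and only if the order is still open
    return unless order && order.status == "open"
    order.status = poll_payment_provider(order_id)  # save the result
  end

  private

  # Stubbed; the real check would hit the payment provider's status
  # endpoint over HTTP.
  def poll_payment_provider(_order_id)
    "paid"
  end
end

PaymentStatusWorker.new.perform(42)
```

Note that only the order_id crosses the queue boundary; the worker re-fetches the order at execution time, so a stale snapshot can never be saved.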

That works for so many things!

Examples may include:

  1. Sending an email
  2. Generating PDF tickets
  3. Updating analytics data in Elasticsearch
  4. Firing Kafka events
  5. …and a whole lot more, but you get the gist

Save sanity and money

Cheaply deferring actions (note the emphasis on cheaply) is extremely cost effective and crucial to maintaining our pricing. Let me walk you through an example of high-traffic ticket sales:

  1. You order your tickets, which get reserved during the HTTP request
  2. You start your payment
  3. You complete your payment and return to our platform (for sake of simplicity)
  4. Suddenly, multiple things happen instantly!

What’s that? Lemme show you!

  1. Your confirmation email is scheduled for generation, only passing your order_id along. No reference, just a primitive.
  2. Your etickets are scheduled for generation, only passing your order_id along. Another primitive.
  3. Kafka events are fired for your completed orders, using only a static version_id, which in our system uniquely identifies the state of a specific entity in time. Yet another primitive!
  4. Elasticsearch updates are performed based on your version_id. At this point you probably get that this is just a primitive too.
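The fan-out above can be sketched like this. The worker class names and the QUEUE constant are illustrative, not our actual code; Sidekiq's real perform_async pushes a JSON payload into Redis, which is exactly why only primitives (ids) should be passed.

```ruby
# Hedged sketch of fanning out background jobs with primitive arguments.
require "json"

QUEUE = []  # stands in for the Redis-backed job queue

module Deferrable
  def perform_async(*args)
    # Arguments are serialized as JSON: primitives survive the round
    # trip, object references would not.
    QUEUE << JSON.generate("class" => name, "args" => args)
  end
end

class ConfirmationEmailWorker; extend Deferrable; end
class EticketWorker;           extend Deferrable; end
class KafkaEventWorker;        extend Deferrable; end
class ElasticsearchWorker;     extend Deferrable; end

# Fired at the end of the HTTP request that handles payment completion
def complete_payment(order_id, version_id)
  ConfirmationEmailWorker.perform_async(order_id)
  EticketWorker.perform_async(order_id)
  KafkaEventWorker.perform_async(version_id)
  ElasticsearchWorker.perform_async(version_id)
end

complete_payment(199217, 1234567)
```

Enqueueing four jobs is four small Redis pushes, so the HTTP response returns almost immediately while the heavy lifting happens elsewhere.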

Each of these actions may (and likely will) be executed on different machines at different times. In case of high traffic, we can spin up these background workers in a matter of seconds (thanks #someCloud! Not taking credit for this).

But the important thing is: it frees our precious HTTP requests from doing anything at all! Yes, your email may take 2 seconds to be sent, and e-ticket generation could be queued for an additional 5 seconds… But the fact of the matter is, a user will not go from payment completion to their inbox and be disappointed by a 7-second delay, because other factors will dominate that time.

But what about what we want to do now?

curl -H 'X-Session-Token: xxxxxx' https://api.inventid.nl/versions/1234567

{
  "id": 1234567,
  "item_type": "Order",
  "item_id": 199217,
  "event": "create",
  "whodunnit": "221244",
  "object": {},
  "created_at": "2015-11-13T15:12:03.027Z",
  "ip": "149.210.xxx.xxx",
  "host": "srv3",
  "git_tag": "deploy_2015.11.12.20.04",
  "object_changes": {
    "timeout_after": [null, "2015-11-13T15:17:02.944Z"],
    "status": [null, "open"],
    "user_id": [null, 221244],
    "shop_id": [null, 2],
    "fee": [null, 0],
    "created_at": [null, "2015-11-13T15:12:03.027Z"],
    "updated_at": [null, "2015-11-13T15:12:03.027Z"],
    "id": [null, 199217]
  }
}

Using this approach, everything can be inferred. We can see that version_id 1234567 was the creation of an order in November 2015. It was scheduled to time out at 15:17:02 UTC, and we even know the exact code tag which was deployed on that server at the time!
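A version record like the one above can also be replayed to reconstruct the entity's state, in the style of audit-log gems such as paper_trail: object holds the state before the event, and each object_changes entry is a [before, after] pair. The trimmed-down record below is illustrative.

```ruby
# Sketch of replaying a version record's deltas; field set is abridged.
require "json"

version = JSON.parse(<<~JSON)
  {
    "object": {},
    "object_changes": {
      "status": [null, "open"],
      "user_id": [null, 221244],
      "id": [null, 199217]
    }
  }
JSON

# Start from the pre-event state and apply the "after" side of each delta
state = version["object"].dup
version["object_changes"].each do |field, (_before, after)|
  state[field] = after
end
```

Chaining every version of an entity in order reconstructs its full history, which is why a single static version_id is enough for the downstream workers.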

Conclusion

No matter how hard you get hammered with HTTP requests (which you do not control), you can control your workers. You can scale them up, or even cancel work.

I’d like to end with a very famous quote:

By taking that kind of control, you are taking control of your uptime, stability, and reliability as well. And you can build a business on that!

Yes, I wrote it myself ;)

Software Engineer. Lead software engineer @ Magnet.me, former CTO @ inventid.nl. General nerd. github.com/rogierslag