Faster Ruby APIs with Postgres

Leonardo Brito
Goiabada
9 min read · Jan 14, 2020


Say you have a Rails app (just for the sake of example; this article applies just the same to pure Ruby apps) and want to build a JSON API. Simple enough: set up an api/v1/my_resources endpoint, write a serializer for MyResource, and serve that through a controller.

Perhaps you're one of those party people. Let's build a real simple app with a Party model as an example:

$ rails new faster-json-api --database=postgresql 
$ cd faster-json-api
$ rails g migration CreateParties name:string description:string starts_at:datetime ends_at:datetime

Pretty simple: a name, a description, start and end times.
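
For reference, that generator produces roughly the migration below; the model class itself still needs to be written by hand. This is a sketch, and the migration superclass version (6.0 here) is just an assumption about your Rails version:

# db/migrate/XXXXXXXXXXXXXX_create_parties.rb
class CreateParties < ActiveRecord::Migration[6.0]
  def change
    create_table :parties do |t|
      t.string :name
      t.string :description
      t.datetime :starts_at
      t.datetime :ends_at
    end
  end
end

# app/models/party.rb
class Party < ApplicationRecord
end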

Now onto the API. Let's keep things super simple: there will be a single endpoint, api/parties, which responds with a JSON array of all the parties in the database.

A typical Ruby-based serializer could look something like this:

module PartySerializer
  def ruby_to_json
    all.to_json(only: %i(name description starts_at ends_at))
  end
end

A few notes: we could (should, actually) use pluck in the above example, but we're going to go with the :only option in to_json just to highlight the differences between doing this kind of processing on the Ruby side and doing it on the database side. Also note that ActiveRecord has its own serialization helper that can be used to achieve the same results as above.
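
For the curious, here is a rough sketch of what that pluck-based variant might look like (the method and constant names below are ours; we'll stick with ruby_to_json for the rest of the article). pluck skips model instantiation and returns plain arrays of column values, which we then zip back into hashes:

module PartySerializer
  PARTY_COLUMNS = %i(name description starts_at ends_at)

  def ruby_to_json_with_pluck
    # Each row is an array of column values in PARTY_COLUMNS order
    pluck(*PARTY_COLUMNS).map { |row| PARTY_COLUMNS.zip(row).to_h }.to_json
  end
end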

Okay, then we can use PartySerializer like so:

class Api::PartiesController < ApplicationController
  def index
    render json: Party.extend(PartySerializer).ruby_to_json
  end
end
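
To actually hit that endpoint we still need a route and some seed data. A minimal sketch, assuming the Api:: namespace used above (the seed record just mirrors the curl output below):

# config/routes.rb
Rails.application.routes.draw do
  namespace :api do
    resources :parties, only: :index
  end
end

# db/seeds.rb
Party.create!(
  name: "My awesome party",
  description: "Incredible awesomeness",
  starts_at: "2022-01-01 23:00:00 UTC",
  ends_at: "2022-01-02 02:00:00 UTC"
)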

Fair enough. With the route and some seed data in place, let's do a quick test with curl to see what we get:

$ curl localhost:3000/api/parties | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   144    0   144    0     0    662      0 --:--:-- --:--:-- --:--:--   663
[
  {
    "name": "My awesome party",
    "description": "Incredible awesomeness",
    "starts_at": "2022-01-01T23:00:00.000Z",
    "ends_at": "2022-01-02T02:00:00.000Z"
  }
]

(jq, by the way, is a command-line JSON processor that doubles as a pretty-printer)

Seems to work just fine. You might even choose to ditch Ruby core-based JSON formatting and use some fancy super-fast gem to speed things up.

Regardless of whether you use a gem to serialize the model into JSON or not, here’s a rough breakdown of how this Ruby-based serialization approach works. Again, we're using Rails as an example but something very similar will happen regardless of the framework (or lack thereof) that you use:

  1. ActiveRecord translates your query to SQL. In our example it was a pretty simple query: Party.all, which translates to SELECT * FROM parties;
  2. ActiveRecord queries the database with that SQL statement and waits for the response;
  3. The database runs the query and responds with the raw result rows;
  4. ActiveRecord receives that data and builds ActiveRecord::Base models (in our example it builds Party models);
  5. PartySerializer#ruby_to_json calls #to_json, which serializes each of the Party models built in the previous step;
  6. Finally, the controller returns that JSON string to whoever requested it.

Whew! Looks like a lot of work. What if we could skip those intermediate steps entirely and just have the database do all the work for us? Here's what that would look like:

  1. We provide a SQL statement to ActiveRecord::Base.connection.exec_query;
  2. The database runs the query and responds with JSON;
  3. The controller returns that JSON string to the client.

Looks much simpler, right?

Let's analyze the differences between those lists. In the second list we’re running a SQL statement right from the start, essentially skipping step 1 from the first list, so there’s a little less overhead in translating ActiveRecord syntax to SQL.

But the biggest difference here is that we skipped steps 4 and 5 from the previous list, which load the query results into memory and build ActiveRecord models. In the end we're delivering JSON, so there is really no need for that (unless we actually need something from the Ruby world, e.g. if we were doing some kind of ETL where the Transform step happens in the Ruby process).

As you might guess, the magic happens in steps 1 and 2: we need to build a SQL query that the RDBMS will run and answer with JSON. Postgres has some pretty awesome built-in JSON functions, and they're just perfect for this use case.

json_build_object to the rescue

json_build_object takes a list of keys and values and returns a JSON object. In our case, the keys are the JSON keys in the API response, and the values come from the corresponding columns in the database. So we can expect this:

json_build_object(
  'id', id,
  'name', name,
  'description', description,
  'starts_at', starts_at,
  'ends_at', ends_at
)

To return objects like these:

{
  "id": 1,
  "name": "My awesome party",
  "description": "Incredible awesomeness",
  "starts_at": "2022-01-01T23:00:00.000Z",
  "ends_at": "2022-01-02T02:00:00.000Z"
}

But we want to return a list of objects, right? That's the cue for json_agg.

json_agg aggregates values as a JSON array. Those values are, in our case, the JSON objects we just built. So let's glue the two together.

json_agg(
  json_build_object(
    'id', id,
    'name', name,
    'description', description,
    'starts_at', starts_at,
    'ends_at', ends_at
  )
)

Those expressions need to be SELECTed, of course. Here's the final query:

SELECT
  json_agg(
    json_build_object(
      'id', id,
      'name', name,
      'description', description,
      'starts_at', starts_at,
      'ends_at', ends_at
    )
  )
FROM parties

That produces essentially the same output as our Ruby-based JSONification, with the addition of the id column.
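
Wiring that query into the app is straightforward. Here is a minimal sketch of a Postgres-backed counterpart to PartySerializer (the method and constant names are ours; COALESCE returns an empty JSON array when the table has no rows, and the ::text cast just makes sure the adapter hands the JSON back to Ruby as a plain string):

module PartySerializer
  PARTIES_JSON_SQL = <<~SQL
    SELECT COALESCE(json_agg(
      json_build_object(
        'id', id,
        'name', name,
        'description', description,
        'starts_at', starts_at,
        'ends_at', ends_at
      )
    ), '[]'::json)::text
    FROM parties
  SQL

  def pg_to_json
    # The result set has a single row with a single column holding the
    # whole JSON document.
    ActiveRecord::Base.connection.exec_query(PARTIES_JSON_SQL).rows.dig(0, 0)
  end
end

The controller would then do render json: Party.extend(PartySerializer).pg_to_json, with no ActiveRecord models built along the way.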

We should JOIN forces

The examples we used are nice enough, but they seem awfully simple: they're entirely flat, that is, there is no nesting at all.

If you want to nest values that are already part of the model, that is pretty simple: just use json_build_object again. So something like:

SELECT
  json_agg(
    json_build_object(
      'id', id,
      'name', name,
      'description', description,
      'dates', json_build_object(
        'starts_at', starts_at,
        'ends_at', ends_at
      )
    )
  )
FROM parties

Would produce what you'd expect:

[
  {
    "id": 1,
    "name": "My awesome party",
    "description": "Incredible awesomeness",
    "dates": {
      "starts_at": "2022-01-01T23:00:00",
      "ends_at": "2022-01-02T02:00:00"
    }
  }
]

But what if you want to nest another table within that JSON? Say our Parties also have many Sweepstakes, each with a name and a description (both strings), and we want the API to return those as a JSON array inside each Party JSON. So we want the final result to look like this:

[
  {
    "id": 1,
    "name": "My awesome party",
    "description": "Incredible awesomeness",
    "dates": {
      "starts_at": "2022-01-01T23:00:00",
      "ends_at": "2022-01-02T02:00:00"
    },
    "sweepstakes": [
      {
        "name": "My awesome Sweepstake",
        "description": "Pure incredibleness"
      },
      {
        "name": "My second awesome Sweepstake",
        "description": "Purer incredibleness"
      }
    ]
  }
]
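
The examples from here on assume a Sweepstake model along these lines (a hypothetical sketch; the exact generator invocation is ours):

# $ rails g migration CreateSweepstakes name:string description:string party:references

class Party < ApplicationRecord
  has_many :sweepstakes
end

class Sweepstake < ApplicationRecord
  belongs_to :party
end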

The thing is, you can't nest aggregate functions such as json_agg in Postgres, so you can't just JOIN on sweepstakes and aggregate them within the ongoing Party aggregation.

What we need to do is run json_agg inside a joined subquery, and then pull those values into the Party aggregation:

SELECT
  json_agg(
    json_build_object(
      'id', parties.id,
      'name', parties.name,
      'description', parties.description,
      'dates', json_build_object(
        'starts_at', parties.starts_at,
        'ends_at', parties.ends_at
      ),
      'sweepstakes', s.sweepstakes
    )
  )
FROM parties
LEFT JOIN (
  SELECT
    party_id,
    json_agg(
      json_build_object(
        'name', name,
        'description', description
      )
    ) AS sweepstakes
  FROM sweepstakes
  GROUP BY party_id
) s ON s.party_id = parties.id

I know, I know: it seems awfully complicated compared to a regular ActiveRecord join. But it will be worth it, I promise.
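
For comparison, a rough sketch of building the same payload the plain Ruby way might look like this (our own sketch; includes preloads the sweepstakes to avoid N+1 queries):

Party.includes(:sweepstakes).map do |party|
  {
    id: party.id,
    name: party.name,
    description: party.description,
    dates: { starts_at: party.starts_at, ends_at: party.ends_at },
    sweepstakes: party.sweepstakes.map do |sweepstake|
      { name: sweepstake.name, description: sweepstake.description }
    end
  }
end.to_json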

Give me numbers!

You promised me more performance, give me some numbers! We're going to do some benchmarking now using this sample app.

First we need to understand a few things about Ruby's benchmark module. As far as we're concerned there are two important numbers in action here: total and real. The former is the total CPU time your code took, and the latter is the “wall clock” time, i.e. how much time actually went by during the execution.

Since our optimization essentially offloads work from Ruby's process to Postgres's process, we should expect the Ruby approach to have greater CPU time relative to real time, while the Postgres approach should have less (Postgres does, of course, use CPU cycles, but they aren't counted in Ruby's benchmark because they belong to another process). In other words, the Ruby approach should spend most of its real time on CPU operations, while the Postgres approach should spend most of its real time on IO operations.
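
To make that concrete, here is a rough sketch of the kind of harness behind the numbers below, using the serializer methods sketched earlier (our own approximation, not the exact script from the sample app):

require "benchmark"

Party.extend(PartySerializer)

Benchmark.bm(10) do |x|
  x.report("ruby")     { Party.ruby_to_json } # builds AR models, serializes in Ruby
  x.report("postgres") { Party.pg_to_json }   # lets Postgres build the JSON
end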

With that said, here's sample benchmark output measuring both approaches (pure Ruby x Postgres) with 100 Parties in our database:

               user     system      total        real
ruby       0.128114   0.000183   0.128297 (  0.129308)
postgres   0.000642   0.000046   0.000688 (  0.015426)

As we expected, the Postgres approach consumes considerably (~200x) less CPU time than the Ruby approach. Also as expected, IO time (real minus total) matters much more in the Postgres approach than in the Ruby one: roughly 0.0147 of the 0.0154 seconds of real time, or ~95%, was spent on IO, against ~0.7% for the Ruby approach.

So was the optimization worth it? It definitely seems so: the sample indicates something on the order of a 10x improvement in real time spent.

Nice, but a benchmark with only 100 elements in the database seems too small to matter. Let's try measuring the same code from 1k up to 10k elements:

The chart helps us see, graphically, what we already talked about a few paragraphs above: for the Ruby approach, total time matches real time almost exactly, while IO time for Postgres (the area between the yellow and green lines) is much larger than for Ruby (the area between the blue and red lines, which is nearly nonexistent). We can also easily see the orders of magnitude in play: real time spent in the Postgres approach is close to one order of magnitude below the equivalent time in the Ruby approach. Note that the charts have a log-scale vertical axis.

The chart above tested the simple case, the very first one we discussed (flat models). How about a slightly more complex query, like the LEFT JOIN one we built above?

Seems more or less the same, except that the distance between Postgres's and Ruby's real times is slightly smaller overall. It also looks like, toward the end of the lines, Postgres is catching up with Ruby. Let's extend those lines a little by repeating the test with up to 25k elements:

We're well past the 10k elements from the previous benchmark and the lines haven't caught up, so it is pretty safe to assume that they are indeed parallel and that the ~10x time difference holds for at least that sample size of 25k elements. If we really wanted to put this to the test we might repeat the benchmark with, say, 1M elements and smooth the lines with some kind of moving average, but what we have is enough for this article.

Another really important thing to note here is that our ActiveRecord model is extremely simple — it has a single has_many relationship and nothing else. That is seldom the case for real-world models. If we have a model that has tons of post-initialization code and many virtual (computed) columns, then the performance penalty for initializing them in the Ruby approach will be even greater than what we've seen in the benchmarks.
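
As a hypothetical illustration of that kind of per-instance work (the callback and attribute below are made up for the example):

class Party < ApplicationRecord
  has_many :sweepstakes

  # Runs for every record loaded from the database, so the Ruby approach
  # pays this cost once per row; the Postgres approach never does.
  after_initialize :compute_duration_hours

  def compute_duration_hours
    @duration_hours = (ends_at - starts_at) / 3600 if starts_at && ends_at
  end
end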

Wrapping it up

Our little benchmark suggests that using Postgres to build JSON responses offers a ~10x performance increase compared to letting Ruby do the work, varying a bit with the complexity of the query and the model. As we mentioned before, there are many gems that claim to be many times faster than Ruby-based serialization. Are they faster than native Postgres serialization? They may well be; I'd have to put that to the test, though.

Anyway, if you're building an API with Ruby and Postgres, you now know you have this option available. But is the improved speed worth the extra effort of building and maintaining those queries?

Now that is an entirely different matter. While there are businesses that would die for a 10% speedup in a critical bottleneck of their software, there are plenty of situations where people won't care about our 10x performance boost (and where the next developer who takes over the project after you're gone might curse you and haunt you in your sleep for those long SQL queries). So I guess the answer is, boring as it may be: it depends.
