Cratejoy list view performance: The technical deep-dive

This is the second part of our series on Cratejoy performance. While our first post focused mainly on the user experience and goals, today we’ll share more about the technical aspects of how we accomplished our performance goal of a sub-1,000-millisecond response time.

We will identify the problems we had with our previous implementation and go through the various steps we took in order to solve those problems. Hopefully you’ll pick up some general tips and advice that can be applied to similar situations.

Identifying the problem

Too generalized

We had a lot of general-purpose functions that were used for almost all our list views. While reusing code is generally good practice, it comes with a trade-off: we lost the ability to fine-tune the functionality for specific list views. A change that made one list better could have adverse effects somewhere else.

The layers of abstraction also made it harder to get a good overview of what was going on and what side effects any change could have. The query was passed around to functions in different files, so keeping the mental context while working with this code was extremely difficult, making changes harder.

Difficult to optimize

The list views are paginated, so there is only a limited number of results we need back from the query. This makes it possible to have millions of rows, and still be able to quickly select only a few of them to display to the user. However, we also need to display the total count of all the rows with whatever filters are applied to the list view.

We were using the same query to get both the counts and the rows, which prevented us from optimizing the two cases individually. Attempting to optimize a single query on a large dataset that provides two very different types of results can be futile. For example, the small subset needs to be ordered, but the counts do not.
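The split described above can be sketched as two independent queries. This is an illustrative example (using an in-memory SQLite table as a stand-in for our Postgres tables; the `shipments` schema and page size are made up), not our actual code:

```python
# Two separate queries: a fast, ordered, paginated query for the visible
# rows, and an independent COUNT for the total.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipments (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO shipments (status) VALUES (?)",
    [("shipped" if i % 3 else "unshipped",) for i in range(1000)],
)

PAGE_SIZE = 25

def fetch_page(conn, status, page):
    # Only the visible page needs an ORDER BY and a LIMIT/OFFSET.
    return conn.execute(
        "SELECT id FROM shipments WHERE status = ? "
        "ORDER BY id LIMIT ? OFFSET ?",
        (status, PAGE_SIZE, page * PAGE_SIZE),
    ).fetchall()

def fetch_count(conn, status):
    # The count query has no ordering or pagination, so it can be
    # optimized (or deferred) independently of the page query.
    return conn.execute(
        "SELECT COUNT(*) FROM shipments WHERE status = ?", (status,)
    ).fetchone()[0]

rows = fetch_page(conn, "shipped", page=0)
total = fetch_count(conn, "shipped")
```

Because the two queries no longer share a shape, each can get its own indexes and its own query plan.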

A giant query

We used one giant query that joined on a lot of tables and added a bunch of different filters depending on the situation. Analyzing such a query is very difficult, simply because spotting where the performance problem lies is a mess. Even if you do find something to fix, chances are that a different combination of filters or sort order will be negatively affected by the fix.

The Postgres query planner does its best, but with so many variables the result can be very unpredictable. For example, you might want to join on a table because you need to filter out specific items, but doing so can cause the query planner to rethink its strategy and choose a sequential scan on another table (which is very slow). Instead, if you know there are only a handful of items you need to filter by, you can query those out beforehand in a separate query and use the results in your main query.
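The pre-query technique might look like the following sketch. The schema (`products`, `shipments`, a `category` filter) is purely illustrative:

```python
# Instead of JOINing products just to filter shipments, first fetch the
# handful of matching product IDs in a tiny standalone query, then use
# them as a plain IN list in the main query -- nothing for the planner
# to misjudge.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE shipments (id INTEGER PRIMARY KEY, product_id INTEGER);
""")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(1, "box"), (2, "box"), (3, "addon")])
conn.executemany("INSERT INTO shipments VALUES (?, ?)",
                 [(i, (i % 3) + 1) for i in range(9)])

# Step 1: a small, predictable query for the filter values.
product_ids = [r[0] for r in conn.execute(
    "SELECT id FROM products WHERE category = ?", ("box",))]

# Step 2: the main query filters on the pre-fetched IDs, no JOIN needed.
placeholders = ",".join("?" * len(product_ids))
shipments = conn.execute(
    f"SELECT id FROM shipments WHERE product_id IN ({placeholders})",
    product_ids,
).fetchall()
```

This trades one complex query for two simple ones whose plans are far easier to reason about.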

Given all of the above, we decided to rewrite the list view from scratch, highly optimized for one single purpose: delivering the data in the least amount of time possible. But before we came to that conclusion, we first considered a number of other possible solutions.

Outsource or in-house

All of the third-party solutions we came up with had one big drawback: we’d have to come up with a way to make sure that the data was always perfectly in sync. Because of this, we couldn’t rely solely on an external service; we’d still have to fill in the gaps whenever all the data wasn’t available remotely. Another big factor was that we would always need a fallback solution, which means we would have to maintain two different systems.

After some testing, we decided that we would be able to make an in-house solution that was both fast and reasonably scalable.

The solution

Breaking it up

Instead of one giant query, we split it up into a few logical parts, each of which gets intermediate values that we later feed into the main query. These queries return few rows and are tested to make sure that they are fast and use the correct indices. Since the queries are small (few, if any, JOINs), the chance of one suddenly, for non-obvious reasons, starting to perform badly is slim.

As mentioned earlier, we also split the query up into two separate parts: one that gets the paginated results, and another that does the counts. This allows us to optimize them individually, with great benefit. For example, a CTE (common table expression, or WITH clause) makes sense for the counts query, because the CTE will select all the matching rows anyway, while for the paginated results a subquery is the faster alternative.
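The two shapes side by side might look like this sketch (SQLite again stands in for Postgres, where — at least before CTEs became inlinable in Postgres 12 — a CTE acted as an optimization fence and always materialized its rows):

```python
# Counts via a CTE, page via a subquery with LIMIT.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipments (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO shipments (status) VALUES (?)",
                 [("shipped",)] * 90 + [("cancelled",)] * 10)

# Counts: the CTE costs nothing extra here, since COUNT must visit every
# matching row regardless.
total = conn.execute("""
    WITH filtered AS (SELECT id FROM shipments WHERE status = 'shipped')
    SELECT COUNT(*) FROM filtered
""").fetchone()[0]

# Page: a subquery with LIMIT lets the engine stop after one page
# instead of materializing every matching row first.
page = conn.execute("""
    SELECT id FROM (
        SELECT id FROM shipments WHERE status = 'shipped' ORDER BY id
    ) LIMIT 25
""").fetchall()
```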

Still, getting the counts takes between 10x and 100x as long as getting the paginated results, but with the two queries separated we have the option to defer the counts and have them appear after the user has already received the results.

Optimized queries

A good approach to writing a query is to first write it in straight SQL and test it out on the command line, using EXPLAIN ANALYZE <query> to figure out how to make it as fast as possible. Once you’re happy with the results, you can translate it into SQLAlchemy Core and make sure that the query that’s executed is identical to the one you wrote.
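One way to check the translation is to compile the Core statement and eyeball the generated SQL against the hand-tuned version. A sketch, with an invented `shipments` table (not our actual schema):

```python
# Translate a hand-written, EXPLAIN-ANALYZE-tested query into SQLAlchemy
# Core, then compile it to inspect the SQL that will actually run.
import sqlalchemy as sa

metadata = sa.MetaData()
shipments = sa.Table(
    "shipments", metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("status", sa.String),
)

stmt = (
    sa.select(shipments.c.id)
    .where(shipments.c.status == "shipped")
    .order_by(shipments.c.id)
    .limit(25)
)

# literal_binds inlines the parameter values so the output is directly
# comparable to the SQL tested on the command line.
compiled = str(stmt.compile(compile_kwargs={"literal_binds": True}))
print(compiled)
```

For an exact match against Postgres output, compile with the `postgresql` dialect instead of the default.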

There are multiple ways of accomplishing the same task: for instance, you can JOIN a table, or filter with IN / ANY / EXISTS given a subquery or CTE, and the difference between them can vary greatly depending on the application. Try out a lot of different ways to get the same data back and settle on the one that’s most performant. A lot of trial and error is often required for this part, unless you are very accustomed to how the specific database works.
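As a concrete illustration of "the same data, multiple ways," here is one filter written with JOIN, IN, and EXISTS (schema invented for the example; which variant wins depends entirely on your data and database):

```python
# Three equivalent formulations of "shipments whose product is a box."
# All return identical rows; benchmark each against your real data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE shipments (id INTEGER PRIMARY KEY, product_id INTEGER);
    INSERT INTO products VALUES (1, 'box'), (2, 'addon');
    INSERT INTO shipments VALUES (1, 1), (2, 1), (3, 2);
""")

join_q = """SELECT s.id FROM shipments s
            JOIN products p ON p.id = s.product_id
            WHERE p.category = 'box'"""
in_q = """SELECT id FROM shipments
          WHERE product_id IN
              (SELECT id FROM products WHERE category = 'box')"""
exists_q = """SELECT id FROM shipments s WHERE EXISTS (
                  SELECT 1 FROM products p
                  WHERE p.id = s.product_id AND p.category = 'box')"""

results = [sorted(r[0] for r in conn.execute(q))
           for q in (join_q, in_q, exists_q)]
```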

Specialized indices

An often overlooked performance gain is the use of partial indices. Consider a table of shipments that has millions of rows, where each shipment has a status of shipped, unshipped, or cancelled. Cancelled shipments make up less than 1% of the rows, but filtering for them can be very slow since they are so spread out. This is a perfect candidate for a partial index on status == cancelled. There’s no real downside: the index will be tiny and the gains are huge. As general advice, inspect your tables and see if you have any boolean or status fields that could benefit from such an index. Use your knowledge of the shape of the dataset to inform how you design your indexes.
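A minimal sketch of the partial-index idea, using SQLite, which (like Postgres) supports `CREATE INDEX ... WHERE`; the table and numbers are made up:

```python
# A partial index that only covers the rare 'cancelled' rows, so it
# stays tiny while making the rare-status filter fast.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipments (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO shipments (status) VALUES (?)",
                 [("shipped",)] * 990 + [("cancelled",)] * 10)

conn.execute("""
    CREATE INDEX idx_shipments_cancelled
    ON shipments (id) WHERE status = 'cancelled'
""")

# The planner can use the partial index because the query's WHERE clause
# matches the index's WHERE clause.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM shipments WHERE status = 'cancelled'"
).fetchall()
print(plan)
```

In Postgres, the equivalent is `CREATE INDEX ... ON shipments (id) WHERE status = 'cancelled'`, verified with EXPLAIN ANALYZE.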

Another type of index to consider is a composite (multi-column) index, one that indexes two or more columns. If you’re always filtering by the same two columns, a composite index on those will greatly improve performance; without it, the database can typically use an index for only one of the columns and must then filter the remaining rows one by one. Such an index can, however, get quite large, so only create one if the gains are worth the size of the index.
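For example, if a list view always filters by store and status together (column names here are illustrative), a two-column index serves the combined filter in a single index lookup:

```python
# A composite index on (store_id, status) for a filter that always uses
# both columns together.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE shipments (
    id INTEGER PRIMARY KEY, store_id INTEGER, status TEXT)""")
conn.executemany(
    "INSERT INTO shipments (store_id, status) VALUES (?, ?)",
    [(i % 5, "shipped" if i % 2 else "unshipped") for i in range(100)],
)
conn.execute(
    "CREATE INDEX idx_store_status ON shipments (store_id, status)")

# Both predicates are satisfied by one walk of the composite index.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT id FROM shipments WHERE store_id = 3 AND status = 'shipped'
""").fetchall()
print(plan)
```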

Removing ORM

At Cratejoy, we use Python and PostgreSQL for the backend, with the SQLAlchemy suite to provide ORM (Object Relational Mapping) functionality.

The old code relied heavily upon the ORM. The ORM is really great for writing easy-to-comprehend queries, but not ideal when performance is key. We decided to ditch the ORM for most of this task because the Python object creation is wasted time when we only need raw values, and because there could be other magic going on behind the scenes. We wanted to eliminate every possible performance inhibitor.

Instead we went with SQLAlchemy’s Core functionality, which is much faster and uses a lot less memory than the ORM. Core is very advantageous when you are only interested in receiving a list of values, not whole objects, such as when querying for the IDs of the items we eventually want to display. We still use the ORM when fetching the final paginated results, but the time that takes is negligible.
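Fetching bare IDs with Core might look like the following sketch (an invented `shipments` table over an in-memory SQLite engine, not our production setup):

```python
# SQLAlchemy Core: execute a select and get plain values back, with no
# ORM identity map and no per-row object construction.
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")
metadata = sa.MetaData()
shipments = sa.Table(
    "shipments", metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("status", sa.String),
)
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(sa.insert(shipments),
                 [{"status": "shipped"} for _ in range(50)])

with engine.connect() as conn:
    # .scalars() unwraps the single-column rows into plain integers.
    ids = conn.execute(
        sa.select(shipments.c.id).order_by(shipments.c.id).limit(10)
    ).scalars().all()
```

Those IDs can then be handed to a final ORM query that hydrates only the one page of objects actually being displayed.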

The results

Arriving at the final results was an iterative process. We made a simple script that benchmarked a few dozen common and uncommon list view filter combos. We would save the results to disk so future runs could be compared to the previous results. This way we made sure that we were always progressing in the right direction, and it allowed us to get quick feedback whenever we made a change.
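A minimal version of such a benchmarking loop could look like this (the filter combos, file name, and timed function are all stand-ins):

```python
# Time each filter combo, compare against the previous run on disk, then
# persist the new results for the next run.
import json
import time
from pathlib import Path

RESULTS_FILE = Path("benchmark_results.json")

def run_list_view(filters):
    # Stand-in for executing a list view query with the given filters.
    time.sleep(0.001)

def benchmark(combos, runs=3):
    results = {}
    for name, filters in combos.items():
        start = time.perf_counter()
        for _ in range(runs):
            run_list_view(filters)
        results[name] = (time.perf_counter() - start) / runs
    return results

combos = {"shipped": {"status": "shipped"},
          "cancelled_recent": {"status": "cancelled", "days": 30}}
current = benchmark(combos)

# Report the delta versus the last saved run, if one exists.
if RESULTS_FILE.exists():
    previous = json.loads(RESULTS_FILE.read_text())
    for name, secs in current.items():
        if name in previous:
            delta = (previous[name] - secs) / previous[name] * 100
            print(f"{name}: {secs * 1000:.1f} ms ({delta:+.1f}% vs last run)")
RESULTS_FILE.write_text(json.dumps(current))
```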

When we were finally done tweaking and testing, we were quite happy with the results. We had achieved speedups between 100% and 10,000% in our benchmarks, with an average of about 700% faster than the previous implementation. Things were finally feeling really snappy!

Below is a chart that shows the response time improvements for one of our larger merchants who has over 1M shipments.

Our old implementation wasn’t necessarily bad; it just couldn’t keep up with the rapid growth we have been experiencing lately. With a thorough measuring strategy, a well-defined goal, and a deep understanding of the bottlenecks, we were able to apply these solutions to multiple lists effectively.