Improving Insights Latency

Luigi Kapaj (PuppyKhan)
Compass True North

--

One of the services we provide our agents at Compass is some basic analytics on their listings and marketing, which we call Insights. The Insights back end is mostly a series of overnight data processing jobs, but the API endpoints do some of the work on the fly, such as aggregating over selected date ranges, and the main Insights API endpoint is among the slowest at Compass.

Insights above the fold view (sample dev data)

Having been at Compass for just under three months and having taken over responsibility for maintaining the Insights back end from my predecessor, I was given the task to “Improve Insights endpoint latency.” Speeding up endpoints is a nice high-profile fix and one of the general goals of our CTO. At an average of 2.7 seconds according to our monitoring tools, with the worst cases being a few multiples of that, there is (hopefully) room for improvement.

But since Insights is an analytics-heavy service, I reacted immediately with some expectation management. If the underlying db calls are already well indexed, and at first glance it seemed they were, then further improvement would require major architecture changes such as further preprocessing and caching. This task is an investigation first, and any actions are based upon its findings.

Let’s start with the API endpoint itself and work backwards.

The main listings Insights page is loaded with data all returned by a single endpoint. It’s already doing a lot of work: all aggregate listings Insights for a given agent in one API call. It returns about 9,000 lines of JSON, when pretty printed, including data built by 16 different DataFrames.

Chart data from listingViewsBySocialTrafficDataframe

The scope of this task was getting pretty large, so I decided to choose one DataFrame at random: listingViewsBySocialTrafficDataframe. The idea is that this one query will give a sampling of how fixable the problem is.

Identifying the query is somewhat more difficult than you might expect. Searching the source code brings you to a YAML file where that DataFrame is defined, including the main SQL query used. These queries are constructed dynamically with some hidden preprocessing which is not trivial to reverse engineer. Because of an ID scheme migration last year, some tables are dynamically renamed from the version in the YAML file before the query is run. Other tables in the query are really embedded DataFrames themselves. Due to our use of PostgreSQL, these embedded DataFrames can be in different schemas.

Insights uses YAML configs extensively

Ultimately, it took a little while to reconstruct the actual query being run. It takes a few hundred milliseconds to execute with a random test account, but the real information is how the db makes use of indexing. Time to run EXPLAIN.
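
The check itself is plain PostgreSQL. As a minimal sketch, with placeholder schema, table, and column names standing in for the real reconstructed query:

EXPLAIN
SELECT traffic_source, COUNT(*) AS views
FROM some_schema.some_dataframe_table  -- placeholder, not the real embedded DataFrame table
WHERE agent_id = 12345                 -- placeholder column and test account id
GROUP BY traffic_source;
-- EXPLAIN ANALYZE would also execute the query and report actual timings.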

EXPLAIN results before changes

The results are a little scary for someone new to PostgreSQL. We are looking for two things. First is the cost. A line like “cost=19700.57..19700.65” is on an arbitrary scale, but you want to find where this number grows. The biggest growth is where the most time is spent. Drilling down finds the section we are looking for:

-> Nested Loop (cost=0.29..18578.27 rows=542 width=18)

This is the section where the db spends the most time generating results. Now that we’ve narrowed down the most costly part of the query, it’s time to figure out why. This is the second thing we are looking for: the indexing.

Bullseye:

-> Seq Scan on agent_insights__property_by_user (cost=0.00..11641.69 rows=395901 width=50)

That “Seq Scan” means this part of the plan is not using any index, and most of the cost sits right there in that sequential scan. So despite the system being well indexed overall, there are some sections which are not. Notice this table is not in our source; it’s in an embedded DataFrame. It wasn’t just luck though: this embedded DataFrame is used repeatedly, so at least half the choices from the initial API call would have led us here.
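
A quick way to confirm what the planner is working with is to list the table’s indexes directly; pg_indexes is a standard PostgreSQL view, and only the table name here comes from the plan above:

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'agent_insights__property_by_user';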

We have a big fish on the line, time to reel it in.

In most places, you would just have your friendly neighborhood DBA add an index and be done with it, but Insights is not so simple. The overnight jobs reconstruct the data in its entirety, recreating the tables anew each night. This allows for easy backup and restore, so in case of a failed job we can reuse yesterday’s data unaltered on a table-by-table basis. But it also means the index has to be added where the table is generated.
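
To make that constraint concrete, here is a rough sketch of the rebuild-and-swap pattern with illustrative names and an illustrative source query; the real job is driven by config, not hand-written SQL like this:

-- Illustrative only: table names and the source query are made up.
DROP TABLE IF EXISTS agent_insights__property_by_user_backup;
ALTER TABLE IF EXISTS agent_insights__property_by_user
  RENAME TO agent_insights__property_by_user_backup;  -- keep yesterday's data for easy restore
CREATE TABLE agent_insights__property_by_user AS
  SELECT p.property_id, p.user_id, v.view_count       -- stand-in for the DataFrame query
  FROM properties p
  JOIN property_views v ON v.property_id = p.property_id;
-- The freshly created table starts with no indexes at all, so any index
-- has to be (re)created here by the job, not added once by hand.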

Insights is an ETL system designed in-house at Compass, configured with a series of YAML files whose structure is defined in a series of Thrift files. The config was simple to find but introduced another curveball.

It’s already indexed, but not for this query

These tables are already heavily indexed, but something wasn’t quite right. This is where it helps to be thorough. Looking through other configs, I found that other tables had a section called “indices”, in the plural, holding a list of columns, and the syntax puts that columns level inside a list as well, but that outer list was never used. I double-checked the Thrift file definition and the implementation in the source code. Sure enough, it would make an index for each column group. But it looks like when this config was built, the columns list was incorrectly used on its own for the fields queried against.

I suspect this was an oversight and a widespread issue in the system. But first I need to finish our testing sample and prove this fix really works.

1 index we needed here, and 1 more index for a different query

Adding an index for the fields queried against does improve performance. Since this was given importance from up top, I kept track of how much. Time to run EXPLAIN again and compare.
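
Mechanically, the config change boils down to the nightly job issuing something like the statement below when it rebuilds the table; the column name is a stand-in for the fields actually queried against:

CREATE INDEX ON agent_insights__property_by_user (user_id);  -- illustrative column name
-- Re-running EXPLAIN afterwards should show an index scan in place of the Seq Scan.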

Those numbers, of unknown scale, got nice and small

Going from a cost of 19700.65 to 1102.46 is roughly an 18 times improvement. By indexing another part of this query as well, I eliminated all sequential scans.
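
As a side note, PostgreSQL keeps per-table counters that make this kind of claim easy to double-check: if the seq_scan counter in pg_stat_user_tables stops climbing for a table after the change, the sequential scans really are gone.

SELECT relname, seq_scan, idx_scan
FROM pg_stat_user_tables
WHERE relname = 'agent_insights__property_by_user';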

The test sample showed phenomenal results. Now for the other 15 DataFrames in the main API call, plus a check of the other API calls used for different Insights pages, such as Ads and Public Collections. Tracing each use and reconstructing the queries to run EXPLAIN on each would take time.

Lots. Of. Time.

So I decided to go with a shortcut.

All the DataFrames are in the same folder, and the problem comes down to ensuring each table is indexed on the columns being queried against. So I simply went through each file sequentially and added indices as needed.

Going through the other queries, I realized that it was not just the WHERE clauses which needed to be indexed, but the JOIN, GROUP BY, and ORDER BY clauses as well, as these essentially trigger more scans. This led to a few elaborate indices, but I’m optimizing for SELECT speed only, not space or INSERT/UPDATE speed, so I added whatever appeared useful.

Fortunately, no index got more crazy than this
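
For a sense of the pattern, here is a hypothetical example (made-up table, columns, and query shape, not one of the actual Insights indices) where the filtering, grouping, and ordering columns all land in one composite index:

-- For a hypothetical query shaped like this:
--   SELECT source, SUM(views)
--   FROM listing_traffic
--   WHERE agent_id = 42 AND day >= DATE '2019-06-01'
--   GROUP BY source
--   ORDER BY source;
-- one composite index covers the filter, the grouping, and the ordering:
CREATE INDEX ON listing_traffic (agent_id, day, source);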

Out of 69 files, I found a need for 38 new or modified indices, plus five or six queries simple enough to just modify the existing WHERE clause without digging into the logic.

I’m not certain that ordering a WHERE clause to match the index’s column order matters so long as all the columns are present, but I took no chances. The expanded indices, though, were for missing columns being scanned against.
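
For what it’s worth, this is easy to check on a scratch table: the planner cares about which indexed columns are constrained, not the textual order of the WHERE predicates, so both queries below can use the same (hypothetically named) index:

CREATE INDEX ON listing_traffic (agent_id, day);
EXPLAIN SELECT * FROM listing_traffic WHERE agent_id = 42 AND day = DATE '2019-06-01';
EXPLAIN SELECT * FROM listing_traffic WHERE day = DATE '2019-06-01' AND agent_id = 42;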

This meant applying fixes without investigating the exact needs or speedup of each API call component, but with a high level of confidence that they address real issues.

Prod results, before on the left, after indexing on the right

We can, however, measure the resulting speedup. Using a few choice test accounts with a high volume of data, the change cuts the API endpoint’s load time roughly in half.

The before and after; the black line marks roughly when the new indices were applied

Checking the actual metrics for a more accurate overall picture, we still see a strong improvement in the week after compared to the week before. The minimum latency (except for one 3 ms outlier) dropped by 150 ms, or 25%; the maximum latency dropped roughly in half; and the average latency is hovering around 1.5 seconds, a 1.8 times speedup.

While I’m sure I’ll eventually get asked the harder task of bringing that average below one second, it gets Insights off the CTO’s slow query radar for now. I’d call that a win.
