Don’t cry, Mongo: analyzing index use on MongoDB.
As the volume of data stored in PlanGrid grows, it becomes increasingly important to make smart decisions about how we query our data: bad queries can degrade the performance of the entire database, making various parts of PlanGrid slow.
While careful index management is crucial on any database, it can sometimes fall by the wayside amid the fires being extinguished at a rapidly growing startup. We noticed the performance of our database degrading, but getting the information about our query patterns needed to actually address the issue proved difficult. To understand what was slowing our database down, we needed to understand how our services were querying for data.
Our MongoDB instance was generating a vast repository of log files containing detailed information about what lookups were being made, how data was being sorted, which indexes were being used, and more, so this was the obvious place to turn for answers.
Getting Going
In order to mount a full-on investigation, we retrieved a regular business day’s log of traffic and teased out information about the queries we made. For brevity, we will skip the details of our text-parsing adventures on Mongo log files. If you need to go down a similar road, you might look towards mtools.
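For a flavor of what that extraction involves, here is a minimal sketch (in Python, and not our production parser) of pulling a few fields out of a slow-query log line. The regex and field names are illustrative and will need adjusting for your Mongo version's log format.

import re

# Illustrative pattern for a MongoDB slow-query log line; real log formats
# differ across versions, so treat this as a starting point only.
LINE_RE = re.compile(
    r"planSummary: (?P<plan>\w+).*?"
    r"docsExamined:(?P<scanned>\d+).*?"
    r"nreturned:(?P<returned>\d+).*?"
    r"(?P<ms>\d+)ms$"
)

def parse_line(line):
    """Return query stats from a log line, or None if it isn't a query entry."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return {
        "lookup_type": match.group("plan"),         # e.g. IXSCAN or COLLSCAN
        "tuples_scanned": int(match.group("scanned")),
        "tuples_returned": int(match.group("returned")),
        "exec_time": int(match.group("ms")),        # milliseconds
    }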
Once we could reliably extract information from the logs, we put information about our data lookups into a single SQLite table:
CREATE TABLE queries (
    db text,
    collection text,
    query_keys text,         -- fields the query filtered on
    hint text,
    sort text,
    lookup_type text,        -- access method, e.g. IXSCAN or COLLSCAN
    lookup_index text,       -- index used for the lookup, if any
    tuples_returned integer,
    tuples_scanned integer,
    exec_time integer        -- execution time in milliseconds
);
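Loading the parsed entries is then only a handful of lines. The sketch below assumes the table above has already been created in a scratch file (queries.db here is a placeholder name), and uses a single made-up row where the real tool feeds in every parsed log entry.

import sqlite3

# One made-up row standing in for the stream of parsed log entries.
rows = [{
    "db": "plangrid", "collection": "collection_1", "query_keys": "updated_on",
    "hint": None, "sort": None, "lookup_type": "COLLSCAN", "lookup_index": None,
    "tuples_returned": 12, "tuples_scanned": 480000, "exec_time": 132,
}]

conn = sqlite3.connect("queries.db")  # assumes the CREATE TABLE above was already run
conn.executemany(
    """INSERT INTO queries
         (db, collection, query_keys, hint, sort, lookup_type,
          lookup_index, tuples_returned, tuples_scanned, exec_time)
       VALUES (:db, :collection, :query_keys, :hint, :sort, :lookup_type,
               :lookup_index, :tuples_returned, :tuples_scanned, :exec_time)""",
    rows,
)
conn.commit()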
With the data in a queryable format, we could quickly aggregate the results and target some specific problems: finding unused indexes, finding unindexed queries, and finding poorly indexed queries.
1. Finding Unused Indexes
In order to speed up our data access, we use indexes liberally. However, as requirements change and code gets refactored, indexes fall out of use. At the time of this investigation, PlanGrid had several hundred gigabytes of indexes, so removing unused ones meant greatly reducing the volume of data we store and maintain. It is also a boon for memory use: with fewer indexes competing for RAM, more of the indexes we actually use fit in memory; more indexes in memory means less disk access, which means better performance!
Using our SQLite DB, we ran a query to find all of the indexes we were actually using:
SELECT lookup_index FROM queries GROUP BY lookup_index
Diffing this list against the full set of indexes on the database gave us the unused indexes, which are candidates for removal.
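A sketch of that diff, assuming the lookup_index column holds the index name (adapt the comparison if yours stores the key pattern from planSummary instead) and using placeholder connection details:

import sqlite3
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
mongo_db = client["plangrid"]                      # placeholder database name

# Every index Mongo knows about, keyed by (collection, index name).
all_indexes = set()
for coll_name in mongo_db.list_collection_names():
    for index_name in mongo_db[coll_name].index_information():
        all_indexes.add((coll_name, index_name))

# Every index the logs show us actually using.
conn = sqlite3.connect("queries.db")
used_indexes = set(conn.execute(
    "SELECT collection, lookup_index FROM queries GROUP BY collection, lookup_index"
))

for coll_name, index_name in sorted(all_indexes - used_indexes):
    print(f"candidate for removal: {coll_name}.{index_name}")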
2. Finding Queries that Make our Hotlines Bling (or our Pagers Ring)
Aggregating information about our indexes is cool, but there was more work to do: hunting down poorly performing, long-running queries. Long-running queries usually mean scanning is going on, which is indicative of a missing index.
The lookup_type column in our SQLite database stores the access method Mongo used to find data. The common values we see are IXSCAN, a lookup using an index, and COLLSCAN, a linear scan over the collection.
💡What’s a COLLSCAN? What’s an IXSCAN?
A linear file scan (or COLLSCAN) is the slowest way of looking up data. If you had a stack of books, a collection scan would be the equivalent of looking at each individual book until you found the one you wanted.
Indexes sort data, so when we use an index scan (IXSCAN) we can locate records quickly. Going back to the book-searching analogy: your books are sorted by author and/or title, so finding a book is fast. See Indexing and Performance Tuning for a visualization of the two access methods (first 20 slides).
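If you want to check a single query's access method without combing through logs, you can also ask Mongo for its plan directly; the collection, filter, and connection details below are placeholders.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
cursor = client["plangrid"]["collection_1"].find({"updated_on": {"$gte": 0}})

# The winning plan names the stages used; look for COLLSCAN vs. IXSCAN in here.
print(cursor.explain()["queryPlanner"]["winningPlan"])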
Knowing that COLLSCANs are inherently slow, an easy win was to check whether we had any unindexed queries slaughtering performance.
SELECT
collection,
query_keys,
exec_time
FROM queries
WHERE lookup_type = 'COLLSCAN';
One simple query later…
collection   | query_keys | exec_time (ms)
-------------+------------+---------------
collection_1 | updated_on |        132,844
collection_2 | project    |         17,300
Okay, cool, this information was both useful and actionable. Fixing this class of problem typically involves either changing the query (to use an indexed field), or adding an index to speed up these slow queries.
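For the first row above, for example, one fix would be an index on the field being scanned; a sketch with pymongo (connection details are placeholders):

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
client["plangrid"]["collection_1"].create_index([("updated_on", ASCENDING)])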
3. Just keep diggin’
Since most of our data was being accessed through indexes, there was still a lot of execution time we hadn't accounted for, so we continued to dig. We ran a query that summed the execution times of similar queries together. While aggregating total execution time, we collected some other information as well:
- Which data values we were querying on
- Which index was used to perform the lookup
- How many tuples were scanned to fulfill the request
- Minimum/maximum/average execution time of a particular query
- Number of times a given query was run
SELECT
    collection,
    query_keys,
    lookup_index,
    sum(tuples_scanned) as tuples_scanned,
    ceil(avg(exec_time)) as avg,
    count(1) as executions,
    min(exec_time),
    max(exec_time),
    sum(exec_time) as total_impact
FROM queries
GROUP BY
    collection,
    query_keys,
    lookup_index
ORDER BY total_impact DESC;
Aggregating this way surfaced some pretty bad offenders; the chart below compares the total execution times of various queries.
One of our worst queries ran fewer than a hundred times but ate 17 hours of execution time! 🚨🚨🚨 Yikes. The reason for such poor performance was that the query scanned a massive collection, so every execution spent far, far too long examining documents. Fixing just a couple of these bad queries freed up a lot of CPU time for something more useful.
Conclusion (aka “profit”)
With the information from above, we went into action, immediately adjusted our indexes, and, magically…
To get a better idea of what this actually did to our overall query time, we grabbed fresh logs covering the same window of traffic and looked at what kinds of queries were running. Dividing the total query execution time by the number of queries gives an average per-query impact, and as you can see below, knocking off a few of the worst offenders made a big, big difference.
Timeslice 1: (total_execution_time_1 / total_queries_1) = 1158 ms/q
Timeslice 2: (total_execution_time_2 / total_queries_2) = 629 ms/q
It is best practice to look over multiple time slices both before and after optimization to control for outliers, but a simple calculation on two representative timeslices provides an example of impact.
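That average is just total execution time over total query count; against the SQLite table for a given timeslice, it comes out to something like this (queries.db remains a placeholder file name):

import sqlite3

conn = sqlite3.connect("queries.db")
total_ms, total_queries = conn.execute(
    "SELECT sum(exec_time), count(1) FROM queries"
).fetchone()
print(f"{total_ms / total_queries:.0f} ms/q")  # average per-query impact for this timeslice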
Bonus points: alerting and monitoring
After building the tools and running the analysis on our indexes manually, we got our database into a much more performant state. To ensure our index utilization does not get out of hand, we deployed the tool to AWS Lambda and configured it to send results to New Relic. We even surfaced a Worst-Query leaderboard in our dashboard, right alongside other key performance indicators for our API.
Having built tools to extract meaningful performance data out of Mongo’s logs, we’ve armed ourselves with the data we need to optimize our queries. A happier database means happier engineers, and lets us save time, money, and maybe a bit of frustration, too.