Speeding up our Listing Pages by 50%

Syed Mohsin
Engineering @ BuildZoom
6 min readFeb 25, 2019

On BuildZoom, contractor listing pages attract the majority of page views across our site from eager homeowners searching for builders in their area. Filled with rich data including reviews, aggregated project details, and high-quality images, these pages can be very taxing on our backend infrastructure, especially for listings in popular cities.

Top contractor results in San Francisco

In this post, we focus on techniques to identify and patch inefficiencies that plagued some of our slowest listing pages.

TL;DR: We discovered that often >50% of page load time was spent in slow application code, and we performed garbage collection analysis to diagnose and refactor the inefficient logic.

Why it matters

Nowadays, waiting longer than a second for a page to load will test the patience of many users. Besides delivering a poor user experience, slow pages can cause search engine crawlers to complain, negatively impacting SEO. And if your servers live in the cloud and a component of your bill is CPU utilization, slow pages also have a direct, quantifiable cost.

To combat these issues, we use a caching system called Varnish, which shields our servers from repeatedly running intensive queries and computations. However, it is important not to develop an over-reliance on Varnish; otherwise, expensive queries and non-performant logic can build up to an unmanageable state.

Discovery

Recently, our listing pages ranked in our top 5 “Most time-consuming” pages, with listings for high-traffic locations like Los Angeles and San Francisco taking around 10 seconds to load. It was clearly time to jump into action. We discovered this issue using New Relic, our go-to performance monitoring software, which gives us valuable insights into the performance of our Ruby on Rails app.

New Relic: some of our slowest load times
New Relic: Listing page speeds over 1 week

Alarmingly, New Relic was reporting that a large portion of time was spent in application runtime (the brown section: SearchContractorsController#index) which meant unnecessary CPU burn, higher AWS bills, and potential disruptions to other parts of our application.

Generally, slow pages are caused by functionality that requests data over the network, such as communicating with a database or a third-party API. Here, however, we see the reverse: more time is spent in application code than in loading data.

While there are certainly database operations we could improve, such as speeding up queries or removing N+1 calls, we will focus on fixing the slow application logic, which will provide the greatest ROI.

I, [2019-01-19T01:14:51.261468 #24840]  INFO -- : Completed 200 OK in 4572ms (Views: 523.0ms | ActiveRecord: 355.5ms | Elasticsearch: 210.5ms)

Hopping onto my development environment and grepping the logs, I confirmed that the majority of load time, about 3.5 seconds in this case, was spent in the Rails runtime rather than in hitting the MySQL database, calling external services, or rendering views.

This was about where New Relic stopped being useful: it was great at finding slow queries, but not so great at identifying bad logic.

Locating the offending code

In a typical Rails environment, slow application logic is often associated with inefficient processing of ActiveRecord objects. ActiveRecord is the Rails ORM which features a clean, object-oriented way to interact with a relational database and manipulate records. Unfortunately, the tradeoff is that ActiveRecord may require extra resources to provide that convenient functionality, which means a performance hit when processing large volumes of data.
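For example, a single line of ActiveRecord hands you fully materialized Ruby objects; the model and attribute names below are hypothetical, just to illustrate the convenience.

# Each row returned is instantiated as a full ActiveRecord object, carrying
# associations, validations, and change tracking along with the raw data.
contractors = Contractor.where(city: 'San Francisco').limit(10)
contractors.each do |contractor|
  puts contractor.business_name   # attribute name is illustrative
end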

Since a problem like this can be hard to track down, and I wanted to be confident that I had actually found the issue(s), I benchmarked every method and service called in the controller:
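Here is a minimal sketch of that approach using Ruby's Benchmark module; the helper names and log messages are illustrative placeholders rather than our actual code.

require 'benchmark'

class SearchContractorsController < ApplicationController
  def index
    # Time each step and write the result to the Rails log
    elapsed = Benchmark.realtime { @contractors = load_contractors }      # placeholder helper
    Rails.logger.info "load_contractors: #{(elapsed * 1000).round(1)}ms"

    elapsed = Benchmark.realtime { @photos = load_photos(@contractors) }  # placeholder helper
    Rails.logger.info "load_photos: #{(elapsed * 1000).round(1)}ms"

    # ...and so on for every method and service the action calls
  end
end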

Though this simple logging approach can be tedious and time-consuming, it is exhaustive and effectively drills down to the target code. With enough patience and log-parsing, I found the slow code existed in a single method called best_photos_for_contractors.
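Reconstructed in simplified form, the method looked roughly like the sketch below; the actual query scopes and photo-quality filters are omitted, and the exact association access may have differed.

def best_photos_for_contractors(contractor_ids)
  # Load all relevant images for the given contractors
  images = Image.joins(:contractor_galleries)
                .where(contractor_galleries: { contractor_id: contractor_ids })
                .distinct

  # Index each image under its contractor's id
  keyed_images = Hash.new { |hash, key| hash[key] = [] }
  images.each do |image|
    # Loads and instantiates the image's contractor_galleries
    # just to read contractor_id off the first one
    contractor_id = image.contractor_galleries.to_a.first.contractor_id
    keyed_images[contractor_id] << image if contractor_ids.include?(contractor_id)
  end
  keyed_images
end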

This method is responsible for batch-loading and indexing high-quality contractor photos on our listing pages. The first thing we do here is query our database for all relevant images based on the input contractor_ids. Then, we iterate through each image, adding the image to a hash of keyed_images if it belongs to a certain contractor. Not the prettiest code, but not terribly complicated.

best_photos_for_contractors was called by 3 different services and took a whopping 1–1.5 seconds with each execution. Although the query to load relevant images was fast, the application logic for building the keyed_images hash single-handedly accounted for ~75% of load time for some of our slowest listing pages.

Why is this line so slow? Our Image model is generic, so image records do not have contractor_id as a foreign key that we can easily reference. Instead, we derive the contractor_id via an association called contractor_galleries. With each iteration of the loop, this line is instantiating EVERY contractor_gallery associated with each image just to grab the contractor_id off the first gallery. As we cycle through potentially hundreds of images, the constant allocation and deallocation of objects heavily taxed Ruby’s garbage collection system.

Analyzing Garbage Collection

For a deeper dive into Ruby’s garbage collection, check out this great post.

High-level languages like Ruby perform garbage collection, allowing developers to rapidly produce code without being burdened by manual memory management. The tradeoff is that garbage collection systems can consume additional resources and slow down or halt an application when overexerted.

To test our hypothesis that the garbage collection system was being overwhelmed in this block of code, I ran Ruby's garbage collection profiler, GC::Profiler:
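The profiling wrapper was along these lines (a minimal sketch; the profiled call is abbreviated):

# Enable Ruby's GC profiler, exercise the suspect code, and log the time spent in GC
GC::Profiler.enable
GC::Profiler.clear

best_photos_for_contractors(contractor_ids)

Rails.logger.info "GC::Profiler.total_time: #{GC::Profiler.total_time} seconds"
GC::Profiler.disable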

The results were as expected:

I, [2019-01-23T22:34:38.641000 #17935]  INFO -- : GC::Profiler.total_time: 1.1358461729998859 seconds

Across multiple tests, this code block accounted for 1–1.5 seconds spent performing garbage collection as it rapidly instantiated and discarded ActiveRecord objects. In processor time, this is an eternity.

The Solution

The goal of best_photos_for_contractors is to build an index of images by contractor_id. A simpler approach is to bypass ActiveRecord and write a raw SQL query that returns the (contractor_id, image) pairs directly:
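A sketch of that refactor is below; the table and column names are assumptions based on the associations described above, and the helper name is illustrative.

# Fetch (contractor_id, image_id) pairs with a raw query, then group them
# into a hash without instantiating any ActiveRecord objects.
def best_photo_ids_for_contractors(contractor_ids)
  return {} if contractor_ids.empty?

  ids = contractor_ids.map(&:to_i).join(',')   # sanitize before interpolating
  rows = ActiveRecord::Base.connection.select_rows(<<-SQL)
    SELECT contractor_galleries.contractor_id, images.id
    FROM images
    INNER JOIN contractor_galleries ON contractor_galleries.image_id = images.id
    WHERE contractor_galleries.contractor_id IN (#{ids})
  SQL

  # rows => [[contractor_id, image_id], ...]
  rows.each_with_object(Hash.new { |h, k| h[k] = [] }) do |(contractor_id, image_id), keyed|
    keyed[contractor_id] << image_id
  end
end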

After some formatting, this raw query gives us an array of (contractor_id, image_id) tuples that, when looped over, is not memory-intensive and does not instantiate any ActiveRecord objects. Another set of tests running GC::Profiler indicated that this solution was successful in giving Ruby's garbage collection system a break. The refactored method spends a fraction of the time in garbage collection:

I, [2019-01-23T23:02:02.512583 #21905]  INFO -- : GC::Profiler.total_time: 0.01238478140000085 seconds

After deploying this fix on January 29th, our production application showed great results, with time spent in application code flat-lining to a negligible level.

New Relic: 50% faster load time after deployment on Jan 29

In Conclusion

As our Rails application and business grow, we will certainly come across more instances where convenient high-level logic can come at the price of performance. Hopefully, this can serve as a cautionary tale to always have an understanding of what’s going on under the hood.

In addition, we saw how basic benchmarking, log parsing, and garbage collection analysis are valuable tools for locating and refactoring inefficient code. I hope this knowledge serves you well as you optimize your own applications!

If you made it this far and enjoy hunting for performance optimizations, consider our team at BuildZoom — we are hiring!
