Back in July and August, I was looking into a performance issue in Treeherder. Treeherder is a Django app running on Heroku with a MySQL database hosted on RDS. This post covers some of the knowledge gained while investigating the performance issue and the solutions for it.
NOTE: Some details have been skipped to help the readability of this post. It’s a long read as it is!
Treeherder is a public site mainly used by Mozilla staff. It’s used to determine whether engineers have introduced code regressions in Firefox and other products. The performance issue that I investigated would make the site unusable for a long period of time (a few minutes to 20 minutes) multiple times per week. An outage like this would require blocking engineers from pushing new code, since it would be practically impossible to determine the health of the code tree during an outage. In other words, the outages would keep “the trees” closed for business. You can see the tracking bug for this work here.
The smoking gun
On June 18th, during Mozilla’s All Hands conference, I received a performance alert and decided to investigate it. I used New Relic; it was my first time using it, and also my first time investigating a performance issue on a complex website. New Relic made it easy and intuitive to get to what I wanted to see.
The UI slowdowns came from API slowdowns (and timeouts), which in turn came from database slowdowns. The API most affected was the JobsViewSet API, which is heavily used by the front-end developers. The spike shown on the graph above was rather anomalous. After some investigation I found that a developer had unintentionally pushed code with a command that triggered an absurd number of performance jobs. A developer would normally request one performance job per code push rather than ten. As these jobs finished (very close together in time), their performance data was inserted into the database, making the DB crawl.
Since I was new to the team and the code base, I tried to get input from the rest of my coworkers. We discussed using Django’s bulk_create to reduce the impact on the DB. I was not completely satisfied with that solution because we did not yet understand the root issue. From my Release Engineering years I remembered that you need to find the root cause, or you’re just putting on a band-aid that will fall off sooner or later. Treeherder’s infrastructure had a limitation somewhere, and a code change might only solve the problem temporarily; we would hit a different performance issue down the road. A fix at the root of the problem was required.
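To give an idea of what we were discussing: Django’s bulk_create collapses many single-row INSERTs into a handful of batched ones. The sketch below shows the batching idea in plain Python; the model name and batch size in the comment are hypothetical, not Treeherder’s actual code.

```python
# Sketch of the batching idea behind Django's bulk_create.
# In Django the call would look roughly like:
#   PerformanceDatum.objects.bulk_create(rows, batch_size=1000)
# (model name and batch size are hypothetical), which issues one
# INSERT per batch instead of one INSERT per row.

def batches(rows, batch_size):
    """Split rows into chunks; each chunk becomes a single INSERT."""
    for i in range(0, len(rows), batch_size):
        yield rows[i : i + batch_size]

rows = list(range(10_000))           # stand-in for model instances
inserts = list(batches(rows, 1000))  # 10 INSERT statements instead of 10,000
print(len(inserts))                  # → 10
```

The DB does less round-trip and transaction work per row, which is why it reduces pressure during ingestion spikes — but, as noted above, it doesn’t remove the underlying capacity limit.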
I knew I needed proper insight as to what was happening plus an understanding of how each part of the data ingestion pipeline worked together. In order to know these things I needed metrics, and New Relic helped me to create a custom dashboard.
Similar set-up to test fixes
I made sure that the Heroku and RDS set-up between production and stage were as similar as possible. This is important if you want to try changes on stage first, measure them, and compare them with production.
For instance, I requested EC2 instance type changes, upgrading to the current M5 instance types. I can’t find the exact Heroku changes that I made, but I adjusted the various ingestion workers to be similar in type and in number.
I had a very primitive knowledge of MySQL at scale, and I knew that I would have to lean on others to understand the potential solution. I want to thank dividehex, coop and ckolos for all the time they spent listening and all the knowledge they shared with me.
The cap you didn’t know you have
After reading a lot of documentation about Amazon’s RDS set-up, I determined that the slowdowns in the database were related to IOPS spikes. Amazon gives you 3 IOPS per GB, and with a storage of 1 terabyte we had 3,000 IOPS as our baseline. The graph below shows that at times we would get above that max baseline.
To increase the IOPS baseline we could either increase the storage size or switch from General Purpose SSD to Provisioned IOPS storage. The cost of the other storage type was much higher, so we decided to double our storage, thus doubling our IOPS baseline. You can see in the graph below that we’re constantly above our previous baseline. This change helped Treeherder’s performance a lot.
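The arithmetic behind that decision is simple; here’s a quick sanity check of the numbers above, assuming RDS General Purpose SSD’s 3-IOPS-per-GB rule:

```python
# RDS General Purpose (gp2) storage earns 3 IOPS per provisioned GB.
IOPS_PER_GB = 3

baseline = IOPS_PER_GB * 1_000  # 1 TB of storage -> 3,000 IOPS baseline
doubled = IOPS_PER_GB * 2_000   # 2 TB of storage -> 6,000 IOPS baseline

print(baseline, doubled)  # → 3000 6000
```

Doubling storage was the cheaper lever: it buys the extra IOPS headroom without paying Provisioned IOPS prices.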
In order to prevent getting into such a state in the future, I also created a CloudWatch alert. We would get alerted if the combined IOPS exceeded 5,700 for 6 datapoints within 10 minutes.
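For reference, an alert like this can be defined in code with boto3’s put_metric_alarm; the sketch below is an assumption of how it might look, not the actual alarm we created (the alarm name and DB instance identifier are made up). “Combined IOPS” is expressed as a metric-math sum of the RDS ReadIOPS and WriteIOPS metrics.

```python
# Sketch of the CloudWatch alarm parameters (names are made up).
# The actual call would be:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
alarm = {
    "AlarmName": "treeherder-rds-combined-iops",  # hypothetical name
    "ComparisonOperator": "GreaterThanThreshold",
    "Threshold": 5700.0,
    "EvaluationPeriods": 10,  # look at the last 10 one-minute datapoints
    "DatapointsToAlarm": 6,   # alarm when 6 of them breach the threshold
    "Metrics": [
        # "combined" sums the two RDS metrics below via metric math.
        {"Id": "combined", "Expression": "r + w", "ReturnData": True},
        {"Id": "r", "ReturnData": False,
         "MetricStat": {"Period": 60, "Stat": "Average",
                        "Metric": {"Namespace": "AWS/RDS",
                                   "MetricName": "ReadIOPS",
                                   "Dimensions": [{"Name": "DBInstanceIdentifier",
                                                   "Value": "treeherder-prod"}]}}},
        {"Id": "w", "ReturnData": False,
         "MetricStat": {"Period": 60, "Stat": "Average",
                        "Metric": {"Namespace": "AWS/RDS",
                                   "MetricName": "WriteIOPS",
                                   "Dimensions": [{"Name": "DBInstanceIdentifier",
                                                   "Value": "treeherder-prod"}]}}},
    ],
}
```

Keeping the alarm definition in code (rather than only in the console) makes it reviewable and easy to recreate on stage.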
One of the problems with Treeherder’s UI is that it hits the backend quite heavily. The load on the backend depends on the number of users using the site, the number of pushes in view, and the number of jobs each push has.
Fortunately, Heroku allows auto scaling for web nodes. This required upgrading from the Standard 2x nodes to the Performance nodes. Configuring the auto scaling is very simple as you can see in the screenshot below. All you have to do is define the minimum and maximum number of nodes, plus the threshold after which you want new nodes to be spun up.
Troubleshooting this problem was quite a learning experience. I learned a lot about the project, the monitoring tools available, the RDS set up, Treeherder’s data pipeline, the value of collaboration and the importance of measuring.
I don’t want to end this post without mentioning that this investigation would have been excruciating without the great New Relic set-up. This is something that Ed Morley accomplished while at Mozilla, and we should be very grateful that he did.