Elasticsearch Performance Tuning to Handle Traffic Spikes
By: Scott Chiang
In this blog post, I will go over how we overcame performance issues on a production Elasticsearch cluster that powers TrueCar’s core business.
TrueCar is a digital automotive marketplace that makes the car buying process simple, fair, and fun. We offer millions of vehicles for car shoppers to choose from, which means that it is important for us to store and search large volumes of data in real time. Elasticsearch is the leading application search solution, because it is built to scale, works fast, and allows complex queries that produce precise results. We use Amazon Elasticsearch Service because it’s fully managed, scalable, and secure. There are many services at TrueCar that use this Elasticsearch cluster as a data source, which means that we need to set up a cluster for high availability even during big traffic spikes.
At TrueCar, we try to make the car buying experience as easy as possible. It is important to have up-to-date vehicle listings so car buyers can see what’s available, so we keep our inventory fresh. Our solution is to run a bulk listing update job multiple times a day to refresh the price, ratings, and availability of our inventory.
Our listings inventory is updated either from direct updates via our web application or a Hadoop job that processes our listings inventory for changes in price and availability. Both updates write new listings records to a Kinesis stream we will call Listings Stream. Next, a Lambda job, which we will call Index Lambda, consumes data from Listings Stream and writes the updated records to Elasticsearch. For updates coming from our web application, Index Lambda publishes a notification event in another Kinesis stream, which we will call Listings Status Stream, once it finishes writing to Elasticsearch. Listings Status Stream is consumed by our main backend service. This way, our main website knows about all listings records that were updated in Elasticsearch.
Problems we faced
Our Elasticsearch cluster performed very well for search, with important metrics like CPU utilization and garbage collection metrics all under key thresholds. But we noticed that those metrics spiked during inventory update job runs while the site was at daily peak traffic. To keep the cluster in a usable state, we had to slow down the indexing rate (how fast documents are being indexed) of the cluster. Without slowing down the indexing rate, the cluster’s CPU utilization spiked from 15 to 80 percent and garbage collection metrics increased fivefold, resulting in 503 Service Unavailable errors for any TrueCar services that made requests to the Elasticsearch cluster.
Figuring out the right batch size for batch requests, number of threads
We spun up a test cluster for tuning Elasticsearch cluster settings. We took a snapshot of the index from the production cluster and restored it to our duplicate cluster, simulating production search load using JMeter (an open-source application designed for load testing). Since our Index Lambda function consumes listings from Listings Stream to update records in ES, we took advantage of Kinesis’ ability to replay the stream. We set up a duplicate Lambda function that also consumed data from Listings Stream but wrote to our duplicate Elasticsearch cluster. Now we had an identical Elasticsearch cluster with production search and index rates for tuning.
We used bulk requests to index multiple updated listings records at once because doing so yields better performance than single-document requests. While it is tempting to index a high number of documents in each bulk request to speed up the indexing rate, too large a bulk size will put the cluster under memory pressure. The optimal size of bulk requests can be deduced by running benchmarks of varying document counts. The indexing rate will plateau as you approach the optimal size of a bulk request. In our case, we were using a bulk record count of 1000 records and a max chunk size of 64 mb. These settings worked fine when we initially set up the cluster but started to fail as the number of consumers increased. We discovered that the best bulk request setting was to cut both bulk record count and max chunk size by half. CPU utilization and GC metrics showed big improvements, with a decrease in the duration of a listings update job from about two hours down to 25 minutes without seeing spikes in key metrics.
The refresh interval is how often Elasticsearch will refresh your index. By default, Elasticsearch will refresh your index every 1 second. This is fine if you have low traffic and need newly indexed documents to be visible right away. In our case, we want up-to-date data on our site but don’t need it to be visible immediately. We were able to increase the refresh interval to 20 seconds for better performance.
One major problem of our existing system is that long old generation garbage collection pauses occur under a heavy load. When this happens, all requests going to shards on that node are frozen until garbage collection is complete. Under a heavy indexing load, these collections can take a few seconds or longer. To solve this, we wanted to investigate different GC and JVM settings on the cluster to reduce the GC impact during peak usage. Unfortunately, AWS Elasticsearch does not provide access to these settings, so we spun up our own cluster running Open Distro for Elasticsearch on EC2 instances. We set our JMeter test plan to simulate a high load to put more pressure on garbage collection.
In Elasticsearch, the heap memory is made up of the young generation and the old generation. The young generation needs less garbage collection because its contents tend to be short-lived. The default NewRatio for AWS Elasticsearch is about 15, which means the ratio between the old and young generation is 15 to 1. Increasing young generation size will improve performance because it will result in fewer GC processes happening. We decreased our NewRatio value to 1 and noticed drastic improvements. For simple search requests, P90 latency had a threefold improvement in response time, and aggregation request response times improved tenfold — a huge win with respect to the responsiveness of the cluster.
The default garbage collector in Elasticsearch is Concurrent Mark and Sweep (CMS). CMS won’t start until the old generation’s occupancy rate reaches the value set in CMSInitiatingOccupancyFraction. Problems arise when this value is too high, which results in delayed GC. There will be a lot of long-lived objects, which means that CMS will need more time to clear old gen. A recent GC option in newer versions of Java is Garbage First Garbage Collector (G1GC), which aims to minimize the amount of time the garbage collector must stop all application threads. G1GC divides the heap into smaller regions and each region can be young or old gen. The GC can decide to analyze regions where there is more garbage, reducing GC pause time by avoiding collecting the entire old generation at once.
Switching from CMS to G1GC resulted in huge performance gains. P90 latency improved approximately tenfold for requests. Search response time dropped from 31,649 ms to 226 ms. Aggregation search response time decreased from 6,076 ms to 758 ms.
Return of the timeouts
We were still seeing Elasticsearch timeouts after tuning our cluster to handle our peak traffic as well as the inventory update job. The problem occurred when we got even higher spikes of traffic, such as can occur, for example, when we get hit by bots scraping our site. Instead of letting the cluster timeout, we decided to throttle the Index Lambda function.
Under normal conditions, Index Lambda’s concurrency is set to 12. We have a separate Lambda function that monitors Elasticsearch cluster metrics and sets the appropriate concurrency. The throttle Lambda function uses young/old GC count, young/old GC duration, thread pool queue size, and thread pool rejected count. Each metric has to stay below a certain threshold, and when it breaks it, the throttle function decreases Index Lambda concurrency by a preset value. For our example, we will use a value of 4. If a metric does not drop below the threshold, the throttle function will again decrease Index Lambda’s concurrency value by 4. This can repeat until the value is 0. Once the violating metric drops below the threshold, the throttle function will slowly increase Index Lambda’s concurrency until it is back to 12. While this does not eliminate timeouts, it immediately relieves pressure on the cluster, allowing it to recover and continue serving critical requests.
In this blog post, we presented Elasticsearch performance issues that occur when our inventory update job runs during peak cluster usage. Rather than slowing down our index rate to alleviate pressure on the cluster, we tuned bulk request parameters and the refresh interval so the job could finish in 30 minutes without degrading the cluster. Unfortunately, the cluster still degraded if another unexpected spike in traffic occurred when an inventory update job was running. We solved this by adding a throttle job to decrease the concurrency of Index Lambda’s function. This combination of solutions has stabilized the Elasticsearch cluster. We know there are additional performance gains possible through other JVM and GC settings, and hope that AWS Elasticsearch offers alternate GC algorithms or tuning options in the future.
We are hiring! If you love solving problems, please reach out. We would love to have you join us!