How to Optimize Elastic APM

Kunal Yadav
SquadStack Engineering
8 min read · Nov 11, 2020
Credits — elastic.co

An APM (Application Performance Management) tool is used to monitor and manage the performance and availability of software applications.

You can use it to figure out things like the time taken by an API call, a database query, a function, and so on.

Recently at SquadStack, we replaced New Relic APM with Elastic APM. The main problem we faced wasn’t the integration but optimizing the APM server and Elasticsearch so that they don’t go down or drop the APM events reported by our servers.

We spent days searching for and testing different optimizations.

In this article, I will discuss all the optimizations we considered/made to make sure that Elastic APM and Elasticsearch work smoothly.

APM Agent Configuration

Let’s first discuss the different changes that you can make to your APM agent configuration.

TRANSACTION_MAX_SPANS

This specifies the maximum number of spans that can be collected by the APM agent in a single transaction.

The higher the value, the more data is collected by the agent, and the more load it puts on your application servers and the APM server (in terms of RAM).

Depending on your use case, you can change its value, or monitor the average number of spans you get per transaction and tweak it accordingly. The default value is 500.

STACK_TRACE_LIMIT

This refers to the number of frames captured for each stack trace. A higher value results in more data being collected by the agent, and hence more load on your application and APM server.

The default value is 500.

SPAN_FRAMES_MIN_DURATION

Specifies the minimum duration of a span for which the stack trace will be collected.

E.g., if the value is 100 ms, then no stack trace will be collected for spans shorter than 100 ms.

A higher value results in less data being collected, and hence less load on your application and APM server.

The default value is "5ms".

API_REQUEST_SIZE

The APM agent works smartly and collects data in a buffer before sending it to the APM server. With this parameter, you can limit the maximum size of that buffer.

If you observe a spike in RAM on your application servers after integrating the APM agent, you may want to tweak this parameter, since multiple buffers may be held in RAM at any time.

Sometimes your servers may even go down because of this. In that case, decrease the value of this parameter so that data is sent more quickly in smaller chunks.

The default value is "768kb".

API_REQUEST_TIME

This refers to the maximum time the request buffer is queued before being sent to the APM server.

This parameter works well together with API_REQUEST_SIZE: decrease both of them and you can reduce the load on your application servers.

But since this increases the number of API requests, the load shifts to the APM server.

The default value is "10s".

TRANSACTION_SAMPLE_RATE

If you are receiving more data than you can store, consider changing the sample rate of transactions.

By default, the data of all transactions is reported to the APM server. You can change the value of this parameter to sample only a percentage of transactions.

E.g., a value of 0.8 means that data for only 80% of transactions is sent to the APM server. The default value is 1.0.

CAPTURE_HEADERS

If you don't need to store the headers of your HTTP requests, you can disable their collection. This will take some load off the APM agent and server.

The default value is "true".

SERVER_TIMEOUT

This specifies the timeout for requests to the APM server. If your APM server is under heavy load, the agent may not be able to establish a connection quickly.

If you get exceptions like “Server timed out” or “connection failed to APM server”, consider increasing this value.

The default value is "5s" (seconds).
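
To put this together, here's a minimal sketch of what these settings could look like for the Python agent in a Django app (the setting names above match the Python agent's configuration; SERVICE_NAME, SERVER_URL, and the tweaked values below are illustrative placeholders, not our production numbers):

# settings.py (Django) — Elastic APM Python agent configuration sketch
ELASTIC_APM = {
    # Identify the service and point the agent at your APM server (placeholders)
    "SERVICE_NAME": "my-django-app",
    "SERVER_URL": "http://localhost:8200",

    # Limit how much data is collected per transaction
    "TRANSACTION_MAX_SPANS": 200,
    "STACK_TRACE_LIMIT": 100,
    "SPAN_FRAMES_MIN_DURATION": "50ms",

    # Control how data is buffered before being shipped to the APM server
    "API_REQUEST_SIZE": "768kb",
    "API_REQUEST_TIME": "10s",

    # Sample only half of the transactions and skip HTTP headers
    "TRANSACTION_SAMPLE_RATE": 0.5,
    "CAPTURE_HEADERS": False,

    # Give the agent more time to reach a busy APM server
    "SERVER_TIMEOUT": "10s",
}

Outside Django, the Python agent also accepts the same options as environment variables prefixed with ELASTIC_APM_ (e.g. ELASTIC_APM_TRANSACTION_SAMPLE_RATE).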

APM Server Configuration

Now that you are done tweaking the APM agent config let’s look at the changes that we can make to the apm-server.yml configuration file of the APM server.

You can run locate apm-server.yml to find the location of this file on the instance on which your APM server is running.

If your APM server cannot keep up with the rate at which the agents are sending events, then you can tweak the following parameters —

apm-server.max_event_size

This denotes the max. size of a single event that can be processed by the APM server. If you observe an exception related to the max. event size then increasing its value may resolve it.

The default value is 307200 bytes.

apm-server.idle_timeout

This denotes the max. amount of time to wait for the next incoming request before the underlying connection is closed.

You can increase this value to limit connection failures to the server. The default value is 45s (seconds).

apm-server.read_timeout | apm-server.write_timeout

They denote the max. duration for reading an entire request and writing a response.

Their value can be increased like the idle timeout to minimize connection failures. They are particularly useful when your APM server is under a heavy load and cannot process incoming requests quickly.

The default value of both parameters is 30s (seconds).

queue.mem.events

If events arrive faster than Elasticsearch can process them, they can be queued in memory. Higher values prevent events from being lost but may consume a large amount of RAM during high traffic.

This is one of the parameters you will change most often, depending on the load on your APM server.

The more servers you integrate with APM, the more events the APM server will receive. If you don't want to scale up your Elasticsearch cluster, you can increase the queue size to hold events temporarily on the APM server.

The default value is 4096, which is quite low.

If your Elasticsearch cluster cannot keep up with the rate at which the APM server is receiving events then you can tweak the following parameters —

output.elasticsearch.worker

It represents the number of APM processing workers per Elasticsearch host. More workers can keep the APM queue from filling up, as long as Elasticsearch can keep up with the indexing load.

The default value is 1.

output.elasticsearch.bulk_max_size

It represents the maximum number of events to batch into a single Elasticsearch bulk API index request.

It's always recommended to send APM events to Elasticsearch in bulk; however, the default value is just 50.

Consider increasing this to a significant number depending on your load.

output.elasticsearch.timeout

If requests from the APM server to Elasticsearch are failing then consider increasing its value.

The default value is 90 (seconds).

Here’s an example configuration file —

apm-server.yml
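
A minimal sketch of such a file, assuming a 7.x APM Server — the host and the tuned values below are illustrative, not our production numbers:

apm-server:
  host: "0.0.0.0:8200"
  # Allow bigger events and slower clients when agents report under load
  max_event_size: 409600
  idle_timeout: 60s
  read_timeout: 60s
  write_timeout: 60s

# Buffer more events in memory while Elasticsearch catches up
queue.mem.events: 8192

output.elasticsearch:
  hosts: ["http://localhost:9200"]
  # More workers and bigger bulk requests, as long as Elasticsearch keeps up
  worker: 2
  bulk_max_size: 1024
  timeout: 120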

Notes

  • You can set the logging level to “error” so that only errors are logged in the APM server’s log file (assuming you don’t have compliance requirements to keep all logs). By default the log level is “info” and every API hit from the APM agents is logged.
    If you are receiving 1000s of API requests per minute, you may observe log files taking 10s of GBs of space in a single day.
  • It’s recommended that you enable x-pack monitoring for APM to monitor your APM server in Kibana’s Stack Monitoring page (see the snippet after these notes).
    You can observe some pretty cool metrics there like — Request rate, System load, CPU and Memory Utilization, Processed Events rate, etc.
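
Both of these can be set in apm-server.yml as well; a minimal sketch (on older stack versions the monitoring flag may be xpack.monitoring.enabled instead):

# Log only errors instead of every agent request
logging.level: error

# Report APM Server metrics for Kibana's Stack Monitoring page
monitoring.enabled: true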

Optimizing Elasticsearch

There are a few changes that you can make to Elasticsearch to optimize it for APM.

Set appropriate JVM Heap Size

Elasticsearch is built on Java and runs on the JVM. You can limit how much RAM Elasticsearch can consume via the jvm.options file.

You can run locate jvm.options to find the location of this file on the Elasticsearch node.

It’s recommended to set this value to half of the server’s RAM.

For example, if your server has 32 GB of RAM, you should allocate 16 GB to the Elasticsearch heap.

# You should always set the min and max JVM heap
# size to the same value.
# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
-Xms16g
-Xmx16g

The rest of the RAM is left for the OS, Kibana, and the filesystem cache that Elasticsearch relies on for fast searching.

Set Number of Replicas

The number of replicas directly affects the indexing speed of an index in Elasticsearch: the fewer replicas you have, the faster the indexing.

If you have only a single node in the Elasticsearch cluster, there is no need for replicas and you can set them to zero.

You can set this using the Elasticsearch APIs on the ES node.

curl -XPUT 'http://127.0.0.1:9200/_all/_settings?preserve_existing=true' -H "Content-Type: application/json" -d '{
  "index.number_of_replicas" : 0,
  "index.auto_expand_replicas": "0-1"
}'

Note — Only set replicas to zero if you have a single node in the Elasticsearch cluster.

Update Index Refresh Interval

As data is indexed in Elasticsearch, it becomes available for search after some amount of time. By default this time is 1 second, i.e. the index is refreshed every second so that new data is queryable.

A shorter refresh interval puts more pressure on Elasticsearch, so if your cluster is write-heavy rather than read-heavy, you can set this value to something like 30 seconds. Data will then become searchable up to 30 seconds after being indexed.

curl -XPUT 'http://127.0.0.1:9200/_all/_settings?preserve_existing=true' -H "Content-Type: application/json" -d '{ "index.refresh_interval" : "30s" }'

The above API calls apply the changes only to existing indexes and not to future ones. To apply these changes to future indexes as well, you can create an index template like this —

curl -XPUT "http://127.0.0.1:9200/_template/zeroreplicas" -H "Content-Type: application/json" -d '{    "template" : "*",
"settings" : {
"number_of_replicas" : 0,
"refresh_interval": "30s"
}
}'

Increase Max. File Descriptors

Elasticsearch uses a lot of file descriptors, and if it runs out of them, data can be lost. Make sure to always keep this limit much higher than Elasticsearch's average usage.

You can check this setting for all of your nodes by running the following command in the console under Kibana's Dev Tools sidebar:

GET _nodes/stats/process?filter_path=**.max_file_descriptors

To see the current file descriptor usage as a percentage, run the following command:

GET _cat/nodes?v&h=fileDescriptorPercent

If you observe high usage, consider increasing this value in the elasticsearch.service file. You can find this file inside the /etc/systemd/system or /etc/systemd/system/multi-user.target.wants folder.

# Specifies the maximum file descriptor number that can be opened by this process
LimitNOFILE=120000
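
After editing the unit file, reload systemd and restart Elasticsearch so the new limit takes effect (assuming the default elasticsearch.service unit name):

sudo systemctl daemon-reload
sudo systemctl restart elasticsearch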

Setting up and optimizing Elastic APM isn't so simple. It takes time to observe your transaction patterns and tweak the configuration until the system becomes stable.

If your APM server or Elasticsearch cluster still cannot handle the traffic, consider scaling them vertically (increasing CPUs and RAM on the same server) or horizontally (increasing the number of machines running the APM server and Elasticsearch).

Also, Elasticsearch is highly CPU- and I/O-intensive, so make sure your instance has enough CPUs for maximum performance. You may consider the m5 or c5 instance families (on AWS) for this. Although Elastic Cloud uses the io family for ES, those are quite costly.

On the other hand, the APM server uses minimal CPU but may require more RAM because of the events queue, so t3 or r5 instances (on AWS) could be a better choice for it.

If you have any further questions or suggestions, feel free to leave a comment or reach out to me at kunal.yadav (at) squadrun (dot) co. If you'd like to work with us, check out squadstack.com/careers/ for more.


Thanks a lot for reading this article. If you liked it, please give a few claps so it reaches more people who would benefit from it!

Kunal Yadav — SquadStack Engineering
Product Engineer at Intercom, AWS Certified Professional who loves Cloud and reading books.