Optimizing Elasticsearch for High-Volume Data Ingestion: A Use Case at EXFO

Yann Vanderstockt
17 min read · Jul 24, 2023


At EXFO, we are proud to be a leading provider of service assurance solutions for telecom operators worldwide. Our service assurance product is designed to help operators monitor, troubleshoot, and optimize their networks, ensuring top-notch service for their customers. One of the key features of our product is the ability to index and analyze massive amounts of CDR (Call Detail Record) data. This data, generated by telecom networks, contains information on every call, text, and data transfer that takes place on the network. By leveraging this data, EXFO empowers operators with valuable insights into the performance and reliability of their networks.

Introduction

At EXFO, we handle a large volume of CDR data and index it at a high rate using Elasticsearch, an open-source search and analytics engine. The sheer scale of this telecom data has presented several performance challenges along the way. By implementing settings and configurations tailored to our use case, we were able to overcome these challenges and achieve high performance and scalability in our Elasticsearch cluster.

During the implementation of Elasticsearch, we initially struggled to identify performance bottlenecks. We were ultimately able to optimize performance with techniques such as filtering, caching, index aliases, and batch replication. Additionally, we deployed our cluster on Kubernetes and used RAID 6 together with a hot/warm/cold architecture to improve data durability while minimizing the required hardware.

As a result of these strategies, we were able to achieve high performance and scalability in our Elasticsearch cluster, allowing us to index and analyze CDR data in real-time at a rate of up to one million CDRs per second.

In this article, we will delve deeper into our use of Elasticsearch and share the technical considerations and optimized configurations that we used, including performance tests for designing and sizing an Elasticsearch cluster on Kubernetes for this workload. We will also discuss best practices for designing and scaling an Elasticsearch cluster for similar use cases.

1. Elasticsearch Cluster Design and Configuration

This section will explore the design and configuration of an Elasticsearch cluster for high-volume data ingestion. We will discuss the architecture, design and sizing considerations, hot/warm/cold architecture, configuration settings for fast query response times, the maximum number of shards limit in Elasticsearch per node, avoiding queries on the master node, document mapping, index shrinking, and Kubernetes deployment. We will also explain how we have implemented these techniques to optimize our Elasticsearch cluster for high-volume data ingestion.

Architecture

We have a standard Kappa architecture in place to effectively handle and analyze CDR data for telecom operators. It all starts with our network probes, which are deployed throughout the operator’s network. These probes capture various types of telecom network data and send it to our systems for processing in the form of CDR data.

Once the data is collected, it is sent to our Kafka cluster for storage and processing. We then use Apache Spark to fetch the data from Kafka and perform various transformations and enrichments on it. This transformed and enriched data is then stored in Elasticsearch, where it is indexed and queried.

Elasticsearch plays a crucial role in our process, as it allows us to query the data flexibly across hundreds of indexed columns. By utilizing this architecture, we are able to effectively handle and analyze massive volumes of CDR data at a high rate, providing valuable insights to our clients.

Designing and Sizing the Elasticsearch Cluster

When it comes to designing and sizing our Elasticsearch cluster for high-volume data ingestion, we take into consideration the specific requirements of the workload and the resources available on the cluster. We have deployed our Elasticsearch cluster on Kubernetes, which provides a flexible and scalable platform for running containerized applications.

One key consideration when designing the Elasticsearch cluster is the number of nodes, shards, and indices to use. The number of nodes in the cluster determines the amount of parallelism and the ability to scale horizontally. The number of shards determines the degree of parallelism in indexing and searching, and the number of indices determines the number of parallel indexing and searching operations that can be performed. It is important to strike a balance between these factors to ensure optimal performance and scalability.

To get a precise sizing of the Elasticsearch cluster, we conduct performance tests to know how many documents can be indexed per second on a given shard and on specific hardware. Each data set is unique, and the number of indexed fields per document and the data type of these fields have an impact on the ingestion performance. To ensure proper sizing of the cluster for the target input rate and allow for many concurrent queries while indexing, we start with a small cluster and push it to the limits.

Hot/Warm/Cold Architecture

The next generation of our solution uses a hot/warm/cold data architecture. A hot/warm/cold architecture is a way of organizing data in an Elasticsearch cluster to optimize for different access patterns. In this architecture, “hot” data is data that is accessed frequently and requires low latency, “warm” data is data that is accessed less frequently and can tolerate slightly higher latency, and “cold” data is data that is accessed infrequently and can tolerate higher latency.

In a high-volume data ingestion scenario like EXFO’s, a hot/warm/cold architecture can be used to improve the performance of the Elasticsearch cluster. For example, CDR data that is accessed frequently could be indexed in the “hot” tier, while older CDR data that is accessed less frequently could be moved to the “warm” or “cold” tier. This would allow the cluster to prioritize the indexing and searching of “hot” data, while still providing access to “warm” and “cold” data when needed.

This can also significantly reduce hardware cost, as the warm and cold tiers do not need powerful hardware, only a large amount of disk capacity.
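For readers who want to experiment with tiering, Elasticsearch’s index lifecycle management (ILM) is one way to express a hot/warm/cold policy. The sketch below is only an illustration using the elasticsearch-py 8.x client; the policy name, timings, node attributes, and connection details are placeholders, not the exact values we run in production.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

# Illustrative ILM policy: keep indices on hot nodes first, move them to warm
# nodes after 1 day and to cold nodes after 7 days, then delete after 30 days.
# Assumes data nodes are tagged with a custom attribute, e.g. node.attr.data: hot|warm|cold.
es.ilm.put_lifecycle(
    name="cdr-tiering-example",
    policy={
        "phases": {
            "hot": {"actions": {"set_priority": {"priority": 100}}},
            "warm": {
                "min_age": "1d",
                "actions": {
                    "allocate": {"require": {"data": "warm"}},
                    "set_priority": {"priority": 50},
                },
            },
            "cold": {
                "min_age": "7d",
                "actions": {
                    "allocate": {"require": {"data": "cold"}},
                    "set_priority": {"priority": 0},
                },
            },
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }
    },
)

Note that a policy like this does not require the rollover feature; the phase transitions can be driven purely by index age.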

Configuration for Fastest Query Response Time

We understand that the configuration settings play a crucial role in ensuring the fastest possible query response time in an Elasticsearch cluster.

To achieve this, we have found that using the appropriate number of shards and indices is crucial. We use enough shards to parallelize the search process, but not so many that the search process becomes inefficient.

For each index, we use one primary shard on almost all Elasticsearch data nodes. This allows us to use all the processing power available on the cluster for indexing while keeping the number of shards to a minimum. We allocate only one primary shard of a given index per data node, in order to spread the processing load evenly. However, we do not place a primary shard on every data node; this allows us to still create an index with all its shards on the Elasticsearch cluster, even in the case of a node or host failure. By using only 95% of the Elasticsearch data nodes for a given index, we can tolerate up to 5% of the data nodes failing while still being able to create an index with all its shards. Similarly, we use enough indices to parallelize the search process, but not so many that searching becomes inefficient. On some large deployments, we also switched from hourly indices to multi-hour indices to further reduce the total number of shards in Elasticsearch. This is explained in more detail in the section of this article titled ‘Maximum Number of Shards Limit in Elasticsearch per Node’.
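As an illustration of that allocation scheme, the following sketch (elasticsearch-py 8.x, hypothetical index name and node count) creates a time-based index with one primary shard on roughly 95% of the data nodes and caps allocation at one shard of the index per node.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

NUM_DATA_NODES = 40                      # hypothetical cluster size
primaries = int(NUM_DATA_NODES * 0.95)   # leave ~5% headroom for node failures

es.indices.create(
    index="cdr-2023.07.24-10",           # hypothetical time-based index name
    settings={
        "number_of_shards": primaries,
        "number_of_replicas": 0,         # replica copies are handled separately in our case
        # At most one shard of this index per data node, to spread the indexing load evenly.
        # Raise this value if replicas are enabled on the index.
        "index.routing.allocation.total_shards_per_node": 1,
    },
)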

We also make use of filtering to narrow down the search results and improve query performance. By using the appropriate filters for the specific search requirements, we have been able to achieve faster query response time. With these configuration settings, we can maintain high performance and scalability in our Elasticsearch cluster and provide our customers with the best service possible.
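As a simple illustration of filtering, the query below (elasticsearch-py 8.x; index pattern and field names are hypothetical) puts the time range and subscriber criteria in a bool filter clause. Filter clauses do not compute relevance scores, and their results can be cached by Elasticsearch, which is what makes them cheap for repeated queries.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

resp = es.search(
    index="cdr-2023.07.24-*",            # hypothetical time-based index pattern
    size=100,
    query={
        "bool": {
            "filter": [                   # filter context: no scoring, cacheable
                {"term": {"imsi": "208150123456789"}},
                {"range": {"start_time": {"gte": "now-1h", "lte": "now"}}},
            ]
        }
    },
)
print(resp["hits"]["total"])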

Maximum Number of Shards Limit in Elasticsearch per Node

When it comes to managing our Elasticsearch cluster, one of the key considerations is the number of shards assigned to each node. While there isn’t a hard limit on the number of shards, it’s important to keep in mind that each shard requires a certain amount of heap space to store data.

If we assign too many shards to a node, it can become unstable or crash. Additionally, assigning too many shards to a node can lead to resource contention and negatively impact performance.

We keep a close eye on resource usage and adjust the number of shards as needed to ensure optimal performance. One approach we use to size our Elasticsearch cluster is the “20 shards per GB of heap” rule.

This rule suggests that the total number of shards in the cluster should not exceed 20 per GB of JVM heap across the Elasticsearch data nodes.

The recommended maximum number of shards (primary + replica) on the ES cluster is:

NB_ES_DATA_NODES x ES_DATA_HEAP_SIZE x 20

NB_ES_DATA_NODES => number of ES data nodes in the ES cluster.

ES_DATA_HEAP_SIZE => JVM heap size (in GB) of each ES data node, not to be confused with the ES pod size. The ES pod size is usually twice the ES JVM heap size. The ES JVM heap should not exceed 32 GB, so cap it at 31Gi in the container spec and set the pod request/limit to twice the JVM heap size.

20 is the maximum recommended number of shards (primary or replica) per GB of ES heap.

Note: ES replication (real-time or batch) doubles the shard count.

The shard count is driven by the number of shards required for ingestion (usually one shard, or fewer, per ES data node per index), the data retention period, the index time period (hourly or multi-hour), and the number of index types.

This is based on the idea that the Elasticsearch heap size should be about 50% of the available RAM on each node, and that each shard requires roughly 250–300 MB of heap space. For example, a data node with 32 GB of RAM would typically run a 16 GB heap, which gives a recommended maximum of 320 shards (primary and replica) for that node (16 x 20 = 320).

Put another way, a node hosting 320 shards should have about 16 GB of heap (roughly 32 GB of RAM) to hold them comfortably. It’s important to keep in mind that the “20 shards per GB of heap” rule is just a rough guideline, and the actual number of shards a node can support will depend on the specific workload and the resources available on that node.
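This budget is easy to check with a back-of-the-envelope script. The helper below is only a sketch of the arithmetic described above; all input values are hypothetical.

def max_recommended_shards(num_data_nodes: int, heap_gb_per_node: int) -> int:
    """Upper bound on shards (primary + replica) per the 20-shards-per-GB-of-heap guideline."""
    return num_data_nodes * heap_gb_per_node * 20

def required_shards(shards_per_index: int, indices_per_day: int,
                    retention_days: int, index_types: int,
                    replication_factor: int = 2) -> int:
    """Shards needed for a retention window; replication (real-time or batch) doubles the count."""
    return shards_per_index * indices_per_day * retention_days * index_types * replication_factor

budget = max_recommended_shards(num_data_nodes=40, heap_gb_per_node=16)   # 12,800
needed = required_shards(shards_per_index=38, indices_per_day=4,          # 6-hour indices
                         retention_days=14, index_types=3)                # 12,768
print(budget, needed, needed <= budget)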

We constantly monitor the Elasticsearch cluster and adjust the number of shards as needed to ensure optimal performance. The number of shards we use is dependent on the volume of CDR data that needs to be indexed, the hardware resources of the Elasticsearch nodes, and the query workload we expect. We strive to strike a balance between having enough shards to process the input traffic and not having too many shards that create too much overhead.

One technique we have explored to optimize the performance and scalability of our Elasticsearch cluster is the use of the rollover feature. The rollover feature allows for the management and limitation of the total number of shards on the cluster by optimizing the size of each index and its associated shards. This helps to ensure that shards are of an optimal and recommended size. However, due to the specific requirements of our application layer, we were unable to utilize the rollover feature. Our application layer heavily relies on index naming to optimize search queries, specifically by selecting the appropriate indices based on a specified time range. This constraint made it impossible for us to implement the rollover feature without disrupting the performance and inner workings of our application layer.

Optimizing Query Performance: Avoiding Queries on Elasticsearch Master Node

As we embarked on our project, we encountered an opportunity to fine-tune the query performance in our Elasticsearch cluster. During this phase, we observed a challenge with data nodes not accurately reporting to the elected master node. As we delved deeper into the issue, we recognized that the problem was more pronounced when an increasing number of queries were directed to the Elasticsearch cluster. It was at this juncture that we realized the significance of an optimized query distribution strategy, particularly when deploying Elasticsearch on Kubernetes.

One well-known recommendation is to configure dedicated master-eligible nodes, exclusively assigned with the master role. By adhering to this best practice, we ensure that the master node remains unburdened by additional tasks, allowing it to focus solely on cluster management. This approach becomes even more pivotal in the early stages of a project when fine-tuning performance is of utmost importance.

To address the challenge and improve query performance, we refined our approach. Rather than sending queries directly to the master node, which is primarily responsible for cluster coordination, we redirected our queries to the data nodes. These nodes are specifically designed to handle search operations efficiently, resulting in improved query response times. Additionally, we leveraged load balancing techniques, such as round-robin or least connections, to distribute the query workload evenly across the search nodes.
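In practice this can be as simple as pointing the client at the data (or coordinating) nodes rather than at the masters. A minimal sketch with the elasticsearch-py 8.x client, using hypothetical host names; the client round-robins requests across the configured nodes by default.

from elasticsearch import Elasticsearch

# Hypothetical data-node endpoints; the dedicated master nodes are deliberately omitted.
es = Elasticsearch(
    [
        "http://es-data-0.es-data.svc:9200",
        "http://es-data-1.es-data.svc:9200",
        "http://es-data-2.es-data.svc:9200",
    ],
    request_timeout=30,
)

On Kubernetes, the same effect is usually achieved by querying a Service that selects only the data (or coordinating) pods, so the master pods never receive search traffic.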

This strategic adjustment yielded remarkable results. We experienced a substantial boost in both the performance and stability of our Elasticsearch cluster. Queries executed seamlessly, and the cluster effortlessly handled the workload. This positive outcome underscored the significance of adhering to the best practice of avoiding queries on the Elasticsearch master node, particularly when deploying on Kubernetes. By optimizing our query distribution strategy, we harnessed the full potential of Elasticsearch, maximizing its search capabilities and delivering an exceptional user experience.

Document mapping

At EXFO, we have a specific use case where we need to allow search on all fields, which means that reducing the number of indexed fields is not an option for us. However, we have still implemented optimization strategies to improve ingestion performance and reduce hardware requirements. One such strategy is to ensure that the data type used for each field matches the field value type. For example, we select the lowest numerical type possible based on the min and max values, and we also select the right field type for alpha-numerical values based on the queries we want to perform. This allows us to optimize query performance while still maintaining the ability to search on all fields.
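To make this concrete, here is a small, hypothetical slice of such a mapping (elasticsearch-py 8.x). The field names and types are illustrative: numeric fields use the narrowest type that fits their value range, identifiers that are only filtered or aggregated on are keyword, and only free-text fields pay the cost of text analysis.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

es.indices.create(
    index="cdr-mapping-example",          # hypothetical index
    mappings={
        "properties": {
            "cause_code":   {"type": "short"},     # small bounded range -> short instead of long
            "duration_ms":  {"type": "integer"},
            "imsi":         {"type": "keyword"},   # exact match / aggregations only
            "cell_id":      {"type": "keyword"},
            "start_time":   {"type": "date"},
            "failure_text": {"type": "text"},      # the only field needing full-text search
        }
    },
)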

Index Shrinking

At EXFO, we’ve found that implementing index shrinking in our Elasticsearch cluster has helped to optimize query performance. By reducing the number of shards in an index, we’re able to improve the speed of queries by searching fewer shards. One way we’ve achieved this is by combining multiple shards into a single shard and reducing the number of Lucene segments within each shard to one.

However, it’s important to note that during the index shrinking operations, the impact on performance on the Elasticsearch cluster can be quite noticeable. To mitigate this, we’ve shifted our approach from creating new indices on an hourly basis to creating them on multi-hour intervals. This has helped to minimize the impact of index shrinking on the overall performance of our Elasticsearch cluster.
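For reference, shrinking a multi-hour index down to one shard and one segment looks roughly like this (elasticsearch-py 8.x, hypothetical index and node names). The source index must first be made read-only and have a copy of every shard on a single node before it can be shrunk.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

src, dst = "cdr-2023.07.24-00-05", "cdr-2023.07.24-00-05-shrunk"   # hypothetical names

# 1. Relocate one copy of every shard to a single node and block writes.
es.indices.put_settings(
    index=src,
    settings={
        "index.routing.allocation.require._name": "es-data-0",    # hypothetical node name
        "index.blocks.write": True,
    },
)

# 2. Shrink to a single primary shard.
es.indices.shrink(index=src, target=dst, settings={"index.number_of_shards": 1})

# 3. Merge the new index down to one Lucene segment per shard.
es.indices.forcemerge(index=dst, max_num_segments=1)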

Kubernetes Deployment

Migrating our Elasticsearch cluster to a Kubernetes-based deployment has been a significant step forward for us. Initially, our solution was running on a Hadoop cluster, with Elasticsearch running separately on virtual machines. However, as we transitioned to a cloud-native architecture, we decided to move Elasticsearch to Kubernetes. This has allowed us to greatly improve the scalability and reliability of our cluster, as we are now able to leverage the power of container orchestration in ways that were not possible before. Kubernetes has allowed us to easily scale up or down the number of Elasticsearch nodes as needed, which has greatly improved the overall performance of the cluster. Additionally, features such as automatic failover, rolling updates, and self-healing have helped to ensure that our Elasticsearch cluster is always available and performing at its best. Managing and updating the cluster has also become much easier thanks to Kubernetes’ ability to handle deployment and configuration of the Elasticsearch nodes. Overall, using Kubernetes has been a huge benefit for EXFO in terms of managing and optimizing our Elasticsearch cluster.

2. Elasticsearch Cluster Performance

Elasticsearch is a powerful tool for indexing and querying large volumes of data in real-time. However, to achieve high performance and scalability, careful consideration must be given to the hardware requirements and the strategies used to optimize the performance of the cluster. In this section, we will explore the performance challenges we faced when setting up our Elasticsearch cluster, the methods we used to identify and troubleshoot performance issues, and the strategies we implemented to achieve high performance and scalability.

Batch Replication vs Real-Time Replication

Designing a system for high-volume data ingestion requires careful consideration of the ingestion method, whether it’s batch replication or real-time ingestion. Batch replication involves replicating data in batches at regular intervals, while real-time replication involves replicating data as it is generated. The primary purpose of replication is to improve availability, resilience, and fault tolerance, but it also affects the hardware footprint, as real-time replication may require more resources than batch replication. We’ve chosen to implement our own batch replication to index CDR data, to minimize the impact on the network and reduce the risk of data loss. However, this approach can lead to some data loss if a node fails before the CDRs generated since the last batch have been replicated.

To mitigate this risk, we’ve implemented a process to replay missing data from Kafka using timestamps. With Kafka, you can replay missing data from a specific timestamp, which can help to ensure that all data is indexed in Elasticsearch. We use the Kafka Consumer API to read the data from Kafka and the Elasticsearch Bulk API to index the data into Elasticsearch. The process works by reading data once again from Kafka starting at a specific timestamp and indexing all of the data that was generated after that timestamp. This allows us to ensure that all of the lost data is indexed, even if some data was not yet replicated due to the batch replication process running at regular intervals. It’s a process that we’ve fine-tuned over time to make sure that we’re able to handle high-volume data ingestion with minimal data loss, in case of a node failure, by ensuring that all data is indexed and replicated.
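The replay itself can be sketched as follows, using the kafka-python consumer and the Elasticsearch bulk helper; the broker address, topic name, timestamp, and document-building logic are simplified placeholders for what our ingestion layer actually does.

import json
from kafka import KafkaConsumer, TopicPartition
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")            # placeholder connection
consumer = KafkaConsumer(
    bootstrap_servers=["kafka:9092"],                   # placeholder broker
    enable_auto_commit=False,
    value_deserializer=lambda v: json.loads(v),
)

topic, replay_from_ms = "cdr-enriched", 1690180000000   # hypothetical topic and timestamp (ms)
partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
consumer.assign(partitions)

# For each partition, find the earliest offset whose record timestamp >= replay_from_ms.
offsets = consumer.offsets_for_times({tp: replay_from_ms for tp in partitions})
for tp, offset_and_ts in offsets.items():
    if offset_and_ts is not None:
        consumer.seek(tp, offset_and_ts.offset)

def to_actions(records):
    for rec in records:
        yield {"_index": "cdr-2023.07.24-10", "_source": rec.value}   # hypothetical index

while True:
    batch = consumer.poll(timeout_ms=1000)
    records = [rec for recs in batch.values() for rec in recs]
    if not records:
        break          # no more records; in practice, stop once the missing window is covered
    helpers.bulk(es, to_actions(records))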

Use caching

Optimizing query response time in Elasticsearch is essential for providing a satisfactory user experience, especially when high volumes of data are being indexed and queried in real-time. There are several factors to consider when striving for the fastest query response times, including the size of the Elasticsearch heap and the allocation of memory for the Operating System cache.

One common rule of thumb is to allocate 50% of the available RAM to the Elasticsearch JVM heap and the remaining 50% to the OS cache. However, the optimal ratio may vary depending on the specific use case. We have found that a ratio of 1:124, with 1 part RAM to 124 parts disk storage, provides a good balance between hardware requirements and query performance. This ratio was determined through performance testing, which involved indexing data at a high rate and concurrently querying the data with multiple queries to measure the response times.

It’s important to note that the type of disk storage used can also impact query response times. While SSDs tend to be faster than HDDs, they can also be more expensive, especially when storing large volumes of data. Careful consideration should be given to the cost versus performance trade-offs when deciding on a storage medium.

Another optimization strategy is the use of index aliases, which can be used to abstract away the underlying indices and simplify queries. This can make it easier to perform maintenance on the indices. By carefully evaluating all these factors, and conducting performance testing, we were able to achieve fast query response times while minimizing hardware requirements.
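As a small illustration of that last point, an alias lets queries target one stable name while the underlying hourly or multi-hour indices come and go (elasticsearch-py 8.x, hypothetical alias and index names).

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

es.indices.update_aliases(
    actions=[
        # Queries keep hitting "cdr-last-24h" while indices rotate behind it.
        {"add":    {"index": "cdr-2023.07.24-*", "alias": "cdr-last-24h"}},
        {"remove": {"index": "cdr-2023.07.23-*", "alias": "cdr-last-24h"}},
    ]
)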

3. Elasticsearch Cluster Availability and Reliability

Elasticsearch can provide fast and reliable data storage and retrieval for a variety of applications. To ensure that our customers can access their data with confidence, EXFO has implemented a number of measures to ensure the availability and reliability of our Elasticsearch cluster. One of these measures is the use of RAID 6, which provides an extra layer of redundancy and improves the write throughput performance of the cluster. In this article, we will discuss the benefits of RAID 6 for our Elasticsearch cluster and how it can help to ensure the availability and reliability of our data.

High-Availability

At EXFO, we understand that our solutions are mission-critical for many of our customers, which is why ensuring data durability is of the utmost importance.

One of the ways we do this is by implementing host awareness in our Elasticsearch cluster. Even though data replication (real-time or batch) is a fundamental feature of Elasticsearch, it is not enough to avoid data loss. As we have multiple co-located Elasticsearch data nodes per host, we must ensure that we do not end up with a primary shard and its replica shard on the same host.

By using host awareness, we can prevent a single node or host failure from resulting in data loss. In the future, we plan to implement rack awareness in our deployments where we have access to information about the host placement within racks. By implementing these measures, we can increase the overall data durability of our Elasticsearch cluster.
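Host awareness is configured through Elasticsearch’s shard allocation awareness. A minimal sketch, under the assumption that each data node is started with a custom attribute identifying its physical host (the attribute name host_id below is hypothetical): the cluster is then told to take that attribute into account when placing primaries and replicas (elasticsearch-py 8.x).

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

# Assumes every Elasticsearch data node was started with a custom attribute such as
#   node.attr.host_id: <name of the physical host the pod runs on>
# (set through elasticsearch.yml or an environment variable in the pod spec).
es.cluster.put_settings(
    persistent={
        "cluster.routing.allocation.awareness.attributes": "host_id",
    }
)

With this in place, Elasticsearch avoids allocating a primary shard and its replica on data nodes that share the same host_id value.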

Why RAID Can Help in Such a Solution

RAID (Redundant Array of Independent Disks) is a technology that we use to improve the write throughput performance and reliability of our Elasticsearch cluster. By using RAID, we can parallelize disk operations and provide an additional layer of protection against disk failures, which can improve the speed at which data is written to disk and ensure the availability of the data.

In a high-volume data ingestion scenario like ours, using RAID 6 is particularly beneficial as it provides an extra level of redundancy. This is especially important when using HDDs (hard disk drives) as the primary storage medium, as HDDs are more prone to failures than SSDs (solid-state drives). With RAID 6, we can withstand the failure of two disks in the array without losing any data. This helps to improve the reliability and availability of the data, ensuring that it is always accessible and protected against potential failures.

In addition to improving reliability, the use of RAID also allows us to achieve a much higher write throughput per host. By writing to multiple disks at the same time, we can take advantage of the combined write speed of all the disks in the array. This can significantly improve the write throughput of the Elasticsearch cluster, as it allows us to process data more efficiently.

Overall, the use of RAID has been a key factor in our ability to achieve high performance and reliability in our Elasticsearch cluster. By parallelizing disk operations and providing an additional layer of protection against disk failures, we have been able to significantly improve the write throughput and reliability of our Elasticsearch cluster. This has helped us to optimize the indexing process and ensure the availability of the data, resulting in a more efficient and reliable service for our clients.

Conclusion

In conclusion, Elasticsearch is a powerful tool for indexing and searching large volumes of data, and it has been effectively implemented at EXFO to handle the high volume and rate of CDR data. By carefully designing and sizing the Elasticsearch cluster on Kubernetes, using batch replication and timestamps to replay missing data from Kafka, and following configuration and options for optimizing query performance, we were able to achieve high performance and scalability in our Elasticsearch cluster. The use of RAID 6 and a hot/warm/cold architecture has also helped to improve the reliability and availability of the data.

It is important to note that this is an ongoing process, and we are still actively working to further improve the ingestion rate as part of our continuous improvement strategy.

While we have shared our experience and the solutions that have worked for us, it is important to note that every use case and environment is unique, and what works for us may not work for others. It is crucial to carefully plan and design the Elasticsearch cluster based on the specific requirements of the workload and to monitor and adjust the cluster as needed to ensure optimal performance.
