At Tictail, a marketplace and e-commerce platform, Insights is a built-in analytics tool that gives shop owners a detailed view of how their shop is performing on Tictail.
When we set out to launch a new feature, we wanted to show content based on data collected in the backing service of Insights.
However, we noticed that performance was an issue: the 95th percentile latency was 400ms for requesting the data we were interested in, while our goal is to keep 95th percentile latency below 100ms for user-facing requests.
The backing service of Insights utilizes Elasticsearch to query the collected data. Naturally, we set out to see if we could optimize how we query Elasticsearch.
We index documents in time-series indexes in Elasticsearch using Logstash. There is one index per document type and week.
Every incoming request to the service specifies a time span. From the time span, and any additional arguments, we construct a query that we send off to Elasticsearch.
In Elasticsearch, one can search across multiple indexes by enumerating the index names as comma-separated values, or by wildcard expressions. Elasticsearch even supports naming indexes by date math expressions.
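To make the three addressing styles concrete, here is a small sketch of how the index portion of a search request path can be built. The index names, the "orders" document type, and the date math format are hypothetical examples, not Tictail's actual naming scheme.

```python
# Hypothetical weekly index names for an "orders" document type.
indexes = ["orders-2018.01", "orders-2018.02", "orders-2018.03"]

# 1. Enumerate the index names as comma-separated values:
search_path = "/%s/_search" % ",".join(indexes)

# 2. A wildcard expression matching every weekly index of the type:
wildcard_path = "/orders-*/_search"

# 3. A date math expression resolving to the current week's index
#    (the <...> syntax must be URL-encoded in a real request path;
#    the "yyyy.ww" format here is an illustrative choice):
date_math_path = "/<orders-{now/w{yyyy.ww}}>/_search"

print(search_path)
print(wildcard_path)
```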
One of the first things we noticed was that we searched across multiple indexes using a wildcard expression. The wildcard expression matched all indexes for the requested document type.
For our particular use case, we wanted to retrieve data for the last thirty days. It would suffice to search five weekly indexes rather than all.
We rewrote the logic to take the supplied time span as input for calculating which weekly indexes we need to search. Since all other queries in the service also used wildcard expressions to search their respective indexes, we rewrote their logic as well.
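The calculation itself can be sketched as follows. This is a minimal version assuming an index naming scheme of `<type>-<ISO year>.<ISO week>` (e.g. `orders-2018.05`); the actual scheme and function names in our service differ.

```python
from datetime import date, timedelta

def weekly_indexes(doc_type, start, end):
    """Return the weekly index names covering the span [start, end].

    Assumes a hypothetical naming scheme of <type>-<ISO year>.<ISO week>,
    zero-padded, e.g. "orders-2018.05".
    """
    # Snap the start date to the Monday of its ISO week, then step
    # forward one week at a time until we pass the end date.
    current = start - timedelta(days=start.weekday())
    names = []
    while current <= end:
        iso_year, iso_week, _ = current.isocalendar()
        names.append("%s-%d.%02d" % (doc_type, iso_year, iso_week))
        current += timedelta(weeks=1)
    return names

# A thirty-day span touches at most five weekly indexes:
end = date(2018, 2, 7)
start = end - timedelta(days=30)
print(",".join(weekly_indexes("orders", start, end)))
```

The joined result can then replace the wildcard expression in the search request path, so the query only fans out to the indexes that can actually contain matching documents.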
When the changes had been rolled out to all instances in production, we observed a 95th percentile latency of 90ms for requesting our data, a 77.5% reduction from the previous 400ms.
Searching across fewer indexes resulted in fewer queries in the Elasticsearch cluster:
During the query phase in Elasticsearch, the coordinating node that receives the search request forwards it to a copy of every shard in each targeted index. As we reduced the number of indexes in our request, fewer shard requests had to be forwarded. The cluster load decreased and throughput increased.
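The fan-out effect is easy to see with some back-of-the-envelope arithmetic. The shard and index counts below are illustrative examples, not our actual cluster configuration.

```python
# Hypothetical figures, for illustration only.
shards_per_index = 5

# A wildcard query fans out to every weekly index of the type:
all_indexes = 100
wildcard_shard_requests = all_indexes * shards_per_index

# Restricting the query to the five indexes covering the last
# thirty days shrinks the fan-out proportionally:
needed_indexes = 5
targeted_shard_requests = needed_indexes * shards_per_index

print(wildcard_shard_requests, targeted_shard_requests)
```

With these numbers, the coordinating node forwards 25 shard requests instead of 500 for a single search, which is where the reduced cluster load comes from.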
As a result, all HTTP endpoints in the service saw a vast improvement:
As can be seen in the graph above, not all HTTP endpoints serve requests within our goal of 100ms. There is still room for improvement.
Why did we not notice the performance issues earlier, and how can we prevent it from happening again?
Performance was not an issue in the beginning. The number of indexes was small, and adoption of Insights was low. As we moved on to new priorities, we did not notice that performance was becoming an issue as more indexes were being added.
Since then, we’ve added monitors that trigger on performance anomalies for this particular service.
For any new initiative we launch, we should have a process that ensures key performance metrics are monitored.
Would you like to be a part of our community and help build a global marketplace where independent brands can reach millions of shoppers?
Check out tictail.com/careers