Real-Time Log Analysis and Cost-Efficient Log Storage for a Data-Driven Future @ NoBroker

Vicknesh Rethinavelu
5 min read · Dec 22, 2023

NoBroker, a leading tech-driven real estate platform, embarked on a journey to revolutionize its logging infrastructure to accommodate rapid growth, enhance performance, and optimize costs.

At NoBroker.in, we were spending a good ₹10 lakh across 30+ instances just for logging and tracing data.

As our verticals expand their digital footprint, managing and analyzing massive amounts of log data becomes a crucial challenge.

The Need for a Modern Logging Infrastructure

NoBroker’s rapid expansion meant its logging infrastructure had to handle enormous amounts of data from various sources, including application logs, system metrics, and user interactions. The traditional Elastic-Kibana stack, though popular, was beginning to exhibit limitations:

  1. Performance Bottlenecks: As data volumes surged, the Elastic-Kibana stack struggled to maintain real-time processing and analysis capabilities, leading to delayed insights.
  2. Scalability Challenges: Scaling the Elastic-Kibana stack was complex and often required significant hardware and infrastructure investments, resulting in higher costs.
  3. Cost Inefficiencies: Elastic-Kibana’s licensing costs and infrastructure expenses became a significant overhead, inhibiting resource allocation for other critical business functions.

We were running the Filebeat, Elasticsearch, and Kibana stack, and it served us well: Elasticsearch handled full-text search comfortably while log ingestion was low. But we have since scaled more than 10X, and the setup could no longer keep up with our speed by any means.

And that is only part of the story: we ingest more than 5,000 log lines every second, and at that rate both ingestion latency and search latency degraded visibly.

Even after pumping in so much money and so many resources, performance was not great: the data was not analytics-ready, and we could not look at large result sets, because queries either took too long or were rejected by Elasticsearch.

So we started looking for alternatives that could balance both cost and performance.

While NoBroker ultimately chose Fluent Bit and ClickHouse for its logging infrastructure, we also experimented with other solutions such as Loki, the popular log aggregation system from Grafana Labs, and we even tried Google BigQuery for a few use cases.

We carefully studied how Zerodha, Cloudflare, and Uber use ClickHouse, looked at ClickBench (a benchmark for analytical DBMSs) to see how ClickHouse compares against the alternatives, and noted its very active community (600+ PRs merged per month).

After spending a considerable amount of time researching ClickHouse, we decided to give it a try as the replacement for the Elastic stack in our logging infrastructure.

Fluent Bit: tiny little log shipper

We seamlessly replaced Filebeat with Fluent Bit, a lightweight and efficient log tailer. We parse our custom logging format with Fluent Bit's parser and filter plugins, add tags to the output JSON records, and forward them to ClickHouse using the HTTP output plugin. We have also explored Vector, which ships a native ClickHouse sink, to optimize our log management process further.
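
As a rough illustration (the host name, tags, parser regex, and table name below are placeholders rather than our production values), a Fluent Bit pipeline of this shape looks roughly like this:

```
# parsers.conf: a parser for a custom application log format (illustrative regex)
[PARSER]
    Name        app_log
    Format      regex
    Regex       ^(?<time>[^ ]+) (?<level>[A-Z]+) (?<message>.*)$
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L

# fluent-bit.conf
[INPUT]
    Name    tail
    Path    /var/log/app/*.log
    Parser  app_log
    Tag     app.logs

[FILTER]
    # add static tags to every record before shipping
    Name    record_modifier
    Match   app.*
    Record  service payments
    Record  env production

[OUTPUT]
    # POST newline-delimited JSON straight into ClickHouse's HTTP interface
    Name              http
    Match             app.*
    Host              clickhouse.internal
    Port              8123
    URI               /?query=INSERT%20INTO%20logs.app_logs%20FORMAT%20JSONEachRow
    Format            json_lines
    Json_date_key     timestamp
    Json_date_format  iso8601
```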

We also hit challenges with high-rate, large Nginx logs: with Fluent Bit's inotify-based tail input, continuous ingestion into ClickHouse would stall. We resolved the issue by disabling Inotify_Watcher and falling back to the older stat-based file watching, as detailed in this GitHub discussion: https://github.com/fluent/fluent-bit/issues/1108.
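
A minimal sketch of that change in the tail input (the path and buffer sizes here are illustrative, not our exact values):

```
[INPUT]
    Name              tail
    Path              /var/log/nginx/access.log
    Tag               nginx.access
    # disable the inotify watcher and fall back to stat-based polling,
    # which fixed stalled ingestion on high-rate, large Nginx log files
    Inotify_Watcher   false
    Buffer_Chunk_Size 1MB
    Buffer_Max_Size   8MB
```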

Storage Footprint:

We optimized our ClickHouse table by applying LZ4HC compression at level 9, which significantly reduced the storage footprint. Using the LowCardinality data type for suitable columns also noticeably improved query performance. To streamline log retention, we added TTL expressions so that ClickHouse manages log expiration on its own.
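
A simplified sketch of what such a table can look like (the column names, ordering key, and 30-day TTL are illustrative, not our exact schema):

```sql
CREATE TABLE logs.app_logs
(
    timestamp  DateTime CODEC(Delta, LZ4HC(9)),
    service    LowCardinality(String),   -- few distinct values: dictionary-encoded
    level      LowCardinality(String),
    message    String CODEC(LZ4HC(9)),   -- LZ4HC level 9 on the raw log line
    log        String CODEC(LZ4HC(9))    -- nested JSON payload stored as a string
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp)
ORDER BY (service, level, timestamp)
TTL timestamp + INTERVAL 30 DAY;         -- ClickHouse drops expired rows on its own
```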

To consume the data and make it usable, we built dashboards with Redash, an open-source tool with good community support and a native ClickHouse data source (https://redash.io/data-sources/clickhouse), so we can plot our log metrics for business and analytical use cases.
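
As an example of the kind of metric we chart, a Redash query against the illustrative table above can be as simple as:

```sql
-- error volume per service per hour, suitable for a Redash chart widget
SELECT
    toStartOfHour(timestamp) AS hour,
    service,
    count() AS errors
FROM logs.app_logs
WHERE level = 'ERROR'
  AND timestamp >= now() - INTERVAL 1 DAY
GROUP BY hour, service
ORDER BY hour;
```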

We also use Grafana to monitor ClickHouse's own performance.

We have been on ClickHouse for a couple of months now, ingesting each log as nested JSON, and the results are already quite satisfying: we are getting a compression ratio close to 10.
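
That ratio is easy to verify from ClickHouse's own system tables; a query along these lines (assuming the tables live in a database named logs) reports it per table:

```sql
-- compressed vs. uncompressed size of active parts, per table
SELECT
    table,
    formatReadableSize(sum(data_compressed_bytes))   AS on_disk,
    formatReadableSize(sum(data_uncompressed_bytes)) AS raw,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.parts
WHERE active AND database = 'logs'
GROUP BY table;
```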

Elasticsearch has offered improved compression since version 7.10, but ClickHouse scores much higher when it comes to fetching the data, and we also could not scale Filebeat well.

ClickHouse's asynchronous insert support, combined with Fluent Bit, has helped us achieve real-time parsing and ingestion. The chart below shows the insert rate in ClickHouse over a month, and a sketch of the relevant settings follows it.

(Chart: a month-long span of read-and-write performance in ClickHouse)
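
Asynchronous inserts are enabled through ClickHouse settings; a minimal sketch of turning them on for a dedicated ingestion user (log_writer is a placeholder name) looks like the following, and the same settings can also be appended as URL parameters to the HTTP endpoint Fluent Bit posts to:

```sql
-- let the server batch many small inserts in memory instead of creating a part per request
ALTER USER log_writer SETTINGS
    async_insert = 1,
    wait_for_async_insert = 0,            -- acknowledge as soon as data is buffered
    async_insert_busy_timeout_ms = 1000;  -- flush the buffer at least once per second
```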

References:

https://www.highlight.io/blog/how-we-built-logging-with-clickhouse

https://news.ycombinator.com/item?id=26316401

https://pixeljets.com/blog/clickhouse-vs-elasticsearch/

https://pixeljets.com/blog/clickhouse-as-a-replacement-for-elk-big-query-and-timescaledb/

https://altinity.com/blog/clickhouse-for-time-series

https://blog.cloudflare.com/how-cloudflare-analyzes-1m-dns-queries-per-second/

https://www.percona.com/blog/massive-parallel-log-processing-clickhouse/

https://www.uber.com/en-IN/blog/logging/

https://zerodha.tech/blog/logging-at-zerodha/

https://mrkaran.dev/posts/clickhouse-replication/

https://mrkaran.dev/posts/coredns-vector-clickhouse/

https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/

https://clickhouse.com/docs/en/sql-reference/data-types/lowcardinality

https://altinitydb.medium.com/reducing-clickhouse-storage-cost-with-the-lowcardinality-type-lessons-from-an-instana-engineer-8ac7f65b486b

https://altinity.com/blog/2019/3/27/low-cardinality
