Real Time Log Management System

It’s really important to know what’s happening in your application. The most common example of this is being notified when something goes wrong, from infrastructure issues like CPU, memory or disk usage through to in-application warnings, errors and exceptions. These events are highly important as they may directly impact your business and customers.
Every event that takes place in an application carries an associated data payload, and therefore produces real-time data. This information may be available in a number of ways, but one of the most common ways is through some sort of logging solution: system stat logs, access logs and application logs.

For that, a real time log management system is required. To break it down into steps —

  • Ingest — How to bring in all the logs
  • Index and Querying — Efficient storage and Unified queries
  • Wiring it up — How data flows through the system


Ingest
Data ingestion is the first step before analytics, predictive modeling or reporting. Ingesting data fast, as well as standardizing it, is a real challenge.
The core requirements of a real time log management system are scalability and resiliency, and Kafka, used here as the ingestion unit, is one of the best on both counts.

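  • No single-point-of-failure: Kafka replicates each partition across multiple brokers, and its age based retention of data lets consumers re-read messages after a failure.
  • Performance: In one benchmark, Kafka processed 33k messages/sec with a 4 node cluster and 25 threads, compared with 3k messages/sec for RabbitMQ with a 4 node cluster. Refer here.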
  • Scalability: Its ability to increase the partition count per topic and downstream consumer threads provides flexibility to increase throughput when desired.
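
To make the scalability point concrete, here is a toy model of how keyed messages map to partitions and how partitions are shared among consumer threads. It is only a sketch: the function names are made up for illustration, CRC32 stands in for Kafka's own key hashing, and the round-robin assignment is a simplification of what a real consumer group does.

```python
import zlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition using a stable hash.

    CRC32 is an assumption made for this sketch; it is not the hash
    Kafka itself uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int, num_consumers: int) -> Dict[int, List[int]]:
    """Spread partitions over consumers round-robin.

    Each partition gets exactly one owner, so consumers beyond the
    partition count receive nothing and do not appear in the result.
    """
    assignment: Dict[int, List[int]] = {}
    for partition in range(num_partitions):
        owner = partition % num_consumers
        assignment.setdefault(owner, []).append(partition)
    return assignment
```

In this model, `assign_partitions(2, 4)` leaves two of the four consumers idle, which is why adding consumer threads only helps once the partition count of the topic is raised as well.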

For more Kafka benchmarks, visit here.

Index and Querying 
All this data needs to sit somewhere, preferably in a database that will easily scale as data needs grow. And the more analytics-like querying that database supports, the better. The obvious approach to storing log data is to use a Time-Series Database (TSDB). If you are new to the concept of TSDB, read here why TSDB performs better than relational databases.
Cassandra seems like an ideal choice for storing and querying logs.

  • Architecture: Cassandra has a distributed architecture with homogeneous nodes. Cassandra uses the concept of a commit log for writes to ensure data durability. Each write is appended to the commit log on disk and also stored in an in-memory structure called the memtable. The contents of the memtable are eventually flushed to disk in an SSTable data file.
  • Data Ingestion: Cassandra offers a variety of methods for data ingestion depending upon the size of the data, such as the COPY FROM command for ingesting CSV data, sstableloader for loading large amounts of data efficiently and BulkOutputFormat for streaming Hadoop data.
  • Querying: Cassandra is a hybrid between a key-value and a column-oriented database management system. The data model is a partitioned row store with tunable consistency. Cassandra provides a query language (CQL) which has a similar syntax to SQL.
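
The write path described in the Architecture point can be sketched with a toy in-memory model. This is not Cassandra code: `ToyNode`, `flush_threshold` and the plain Python lists standing in for on-disk files are all assumptions made for illustration, and real reads merge data by timestamp rather than simply preferring the newest table.

```python
class ToyNode:
    """Toy model of a commit log, a memtable and SSTables on one node."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # stands in for the append-only log on disk
        self.memtable = {}     # in-memory structure holding recent writes
        self.sstables = []     # immutable, sorted snapshots "on disk"
        self.flush_threshold = flush_threshold  # hypothetical tuning knob

    def write(self, key, value):
        # Durability first: record the write in the commit log,
        # then apply it to the memtable.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the memtable as a sorted, immutable table. The commit
        # log entries covering it are no longer needed for recovery.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log = []

    def read(self, key):
        # Newest data wins: check the memtable, then the newest SSTable first.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for stored_key, value in table:
                if stored_key == key:
                    return value
        return None
```

Writes never touch existing SSTables in this model, which is the property that makes this layout a good fit for append-heavy log data.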

For Cassandra benchmarks, refer here.

Wiring it up
The figure shown below describes how all the components required to build a real time log management system are wired together.

All the applications generate logs. There are two ways to push them into the queuing system. The first is to write them to a log file and keep tailing that file, pushing every new line into the queue as a message. This approach is inefficient because each log is first written to the file and only then tailed and pushed to the queue. The second approach is to override the logger so that logs are pushed to the queue as soon as they are created and also written to a separate log file in parallel. Kafka producers add up to 3ms of delay while publishing to the queue, and this can be reduced further with configuration changes such as adjusting acks and other producer config. The disadvantage of this approach is the additional latency in application code.
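
The second approach could look roughly like the sketch below, which uses Python's standard `logging` module. The names `QueueForwardingHandler`, `build_logger` and the `app-logs` topic are hypothetical, and `send` is a placeholder for whatever function publishes bytes to the queue (for example, a thin wrapper around a Kafka producer).

```python
import logging


class QueueForwardingHandler(logging.Handler):
    """Forward each formatted log record to a queue via a `send` callable."""

    def __init__(self, send, topic="app-logs"):
        super().__init__()
        self.send = send      # hypothetical: send(topic, payload_bytes)
        self.topic = topic    # assumed topic name

    def emit(self, record):
        try:
            self.send(self.topic, self.format(record).encode("utf-8"))
        except Exception:
            # Never let a queue failure crash the application.
            self.handleError(record)


def build_logger(name, send, log_path):
    """Return a logger that pushes to the queue and writes to a file in parallel."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    for handler in (QueueForwardingHandler(send), logging.FileHandler(log_path)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

Because `emit` runs in the calling thread, any time spent inside `send` is added to the application code path, which is the latency trade-off mentioned above.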

Once messages are published to the queue, Spark Streaming keeps listening for them. This layer acts as the processing and alerting engine. It validates the messages read from the queue and pushes them to their corresponding locations. In this system, processed messages are indexed in Cassandra and also stored in S3 for persistent storage. If an anomaly is detected in the messages, this layer can notify the alerting system and store those messages separately for further processing.
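
The validate-and-route logic of this layer can be sketched in plain Python, independent of Spark. Everything here is an assumption for illustration: the required fields, the rule that treats `ERROR` and `CRITICAL` as anomalies, and the four lists standing in for Cassandra, S3, the alerting system and a store for rejected messages.

```python
import json

REQUIRED_FIELDS = ("timestamp", "level", "service", "message")  # assumed schema
ALERT_LEVELS = {"ERROR", "CRITICAL"}  # assumed anomaly rule


def process(raw, index, archive, alerts, rejected):
    """Validate one raw message and route it to the matching sinks."""
    try:
        event = json.loads(raw)
    except (ValueError, TypeError):
        rejected.append(raw)  # not valid JSON
        return
    if not isinstance(event, dict) or any(f not in event for f in REQUIRED_FIELDS):
        rejected.append(raw)  # valid JSON, but not in the expected shape
        return
    index.append(event)       # stands in for the Cassandra write
    archive.append(event)     # stands in for the S3 write
    if event["level"] in ALERT_LEVELS:
        alerts.append(event)  # stands in for notifying the alerting system
```

In a streaming job, a function like this would be applied to each message of every micro-batch, with the lists replaced by real writers.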

Cassandra helps with querying over logs and provides faster but limited analytics. If more advanced analytics are required, another Spark job can be run over the persistent storage in S3.

Any suggestions or thoughts? Let me know:
Insta + Twitter + LinkedIn + Medium | @shivama205