Log Management on LOGIQ vs ELK Stack

Amit Ashwini
Mar 9 · 4 min read
Machine log collection, management, and analytics are essential for a company to thrive in today's cloud-driven information economy. I recently learned how a top-performing cloud company architects its log management infrastructure. The underlying corporate log service evolves continuously with business needs and generates critical business value. This write-up first describes the log system implementation on AWS. It then discusses the architecture decisions made to address resource limitations, control the operations budget, and reduce operational disruption.

Centralized log management is the preferred approach for handling enterprise logging. Application, IT system, and cloud microservice logs are collected and managed in one central location, for example, Amazon Elasticsearch Service (Amazon ES).

DevOps and automation teams play a central role in developing and maintaining the log management infrastructure. The log collection and analytics data preparation phases are all in place; see [1]. As elegant as the system looks from a high level, there are engineering overheads in addressing the infrastructure's compute, storage, and budget limitations.

The ingestion pipeline first filters the log data, extracting the useful portion of each log to reduce logging noise — deciding which logs or log portions to retain or remove. The trimming directives usually come from the log data end user, such as a data scientist or system analyst, who works with the DevOps/automation team to create customized log extraction filters for deployment. The goal is to control the logging volume and content. The DevOps/automation team builds ad-hoc, best-effort log filters, and the process is laborious and error-prone.
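As a minimal sketch of such a filter (the deny patterns and field names below are hypothetical illustrations, not the company's actual trimming directives), an end user's rules might be expressed as deny-list regexes plus a pattern for the one field worth retaining:

```python
import re

# Hypothetical trimming directives from a data scientist or system analyst:
# drop known-noisy lines; from the rest, retain only the useful portion.
DENY_PATTERNS = [re.compile(p) for p in (r"health[- ]?check", r"\bDEBUG\b")]
KEEP_FIELD = re.compile(r"latency_ms=\d+")

def reduce_log(lines):
    """Return the retained portion of each log line, skipping noise."""
    kept = []
    for line in lines:
        if any(p.search(line) for p in DENY_PATTERNS):
            continue  # noise: remove the whole line
        m = KEEP_FIELD.search(line)
        kept.append(m.group(0) if m else line)  # keep the useful portion
    return kept
```

In practice each functional unit supplies its own pattern set, which is why the process is laborious: every rule change round-trips through the DevOps/automation team.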

This log-reduction filter control is in place because of the log data infrastructure's limitations. In this example, the infrastructure uses Amazon Elasticsearch Service (Amazon ES). Both the compute and storage resources used for indexing need to be monitored for a healthy logging system. To maintain a stable, performant operating state, total log ingestion is capped at 30GB/day (see [8]), and logs are retained for two weeks. After that, data are backed up to economical Amazon S3 storage.

Ingested log data are critical for company operations. Different business functional units deal with different log data and different usage. For example, the performance and capacity team extracts metrics from logs to model system usage trends and forecast future demand. The customer business unit extracts metrics to derive customer insight and create business value, for example, predicting customer churn. DevOps maintains the system's operating state above SLA (Service Level Agreement) requirements. The log metrics extraction and the subsequent analytics are highly customized, flexible, and fluid; each pipeline is created to solve a specific business problem. It is highly desirable to utilize AI/ML techniques and methodology here; see [7]. For example, the log ingestion pipeline can be enhanced with tags or labels to facilitate later AI/ML analysis, with all the mutable fields in the logs automatically extracted for analysis. This subject will be revisited in another blog post.
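A sketch of that tag-and-extract idea (the log line format, field names, and labels here are hypothetical assumptions, not the article's actual schema) might parse each line's mutable fields into a record and attach the requesting unit's labels:

```python
import re

# Assumed line format: "<timestamp> <LEVEL> <unit> <message>"
LOG_LINE = re.compile(r"(?P<ts>\S+) (?P<level>[A-Z]+) (?P<unit>\S+) (?P<msg>.*)")

def tag_record(line, labels):
    """Extract mutable fields from a log line and attach labels for later AI/ML analysis."""
    m = LOG_LINE.match(line)
    if m is None:
        return None  # unparseable line: leave for the generic pipeline
    record = m.groupdict()
    record["labels"] = dict(labels)  # e.g. {"team": "capacity", "use": "forecast"}
    return record
```

Downstream analytics can then select records by label instead of re-parsing raw logs per functional unit.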

As mentioned earlier, the deployed Amazon ES resources need to be monitored and controlled for a performant log system. Amazon ES can scale out, but the effort is no walk in the park, and it often requires some degree of trial and error. New log ingestion is capped at 30GB/day with a two-week retention period to maintain the operating budget and system performance, so the system always holds about 500GB of recent log records for processing. Overflow logs are backed up to Amazon S3 at $25/TB-month for an indefinite period. AWS hosting for such a setup runs around $60k/year, not including engineering and operator costs.
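The figures above can be sanity-checked with a little arithmetic. 30GB/day over a two-week retention window is 420GB of raw hot data in Amazon ES, consistent with the roughly 500GB figure once index and replica overhead are counted. The archive-growth model below is illustrative only (a linear ramp from zero is my assumption, not the article's):

```python
INGEST_GB_PER_DAY = 30      # ingestion cap from the article
RETENTION_DAYS = 14         # two-week retention window
S3_USD_PER_TB_MONTH = 25    # S3 archival price cited above

hot_data_gb = INGEST_GB_PER_DAY * RETENTION_DAYS        # raw GB held in Amazon ES
archived_tb_per_year = INGEST_GB_PER_DAY * 365 / 1000   # TB/year rolling off to S3

# Illustrative first-year archive bill, assuming the archive grows linearly
# from 0 to its year-end size (so it averages half that size each month).
first_year_s3_usd = archived_tb_per_year / 2 * S3_USD_PER_TB_MONTH * 12
```

The archive cost is a small fraction of the roughly $60k/year total, which is dominated by the Amazon ES cluster itself.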

Here is a similar log infrastructure setup using LOGIQ building blocks. The system becomes simpler because the resource-limitation problem is alleviated by the use of S3 storage. See the figure below.

The LOGIQ log management infrastructure removes the built-in infrastructure engineering overhead. The new construct is efficient and straightforward. The table below lists the infrastructure resource engineering overheads; the list tags refer to the earlier drawing.

In summary, this article describes an operating log infrastructure setup on AWS and presents a similar setup built on LOGIQ. The LOGIQ log management infrastructure removes the induced engineering overhead because of its competitive advantage: the native use of S3 storage.

There are plenty of references on the web about scaling Elasticsearch, and the common consensus among them is that such a task is not for the faint of heart.

Originally published at https://logiq.ai.
