Real-time Big-data Analytics in Incident Response: A Practical Review of Streaming and Automated Data Warehousing

Aan Kasman
Published in Analytics Vidhya
6 min read · Dec 14, 2019

For the last two years I have been working on choosing the right big-data technology for incident response. A significant reason to discuss real-time analytics in incident response is the need for real-time detection of, and protection against, cybersecurity threats. Real-time detection becomes a substantial challenge when several streams of real-time telemetry, logs, and time-series data are pulled together from sensors, network and security devices, or a SIEM into an analytics dashboard.

Many organizations today rely heavily on security information and event management (SIEM) systems to monitor incidents and threats. Although most SIEM stacks have been modernized with sockets, connectors, and filters, they still fall short at maintaining both historical and real-time streaming data. Moreover, scalability, elasticity, and data life-cycle management become vital issues as data size grows exponentially.

This article shares thoughts on data warehousing to help organizations choose worthwhile technologies, tools, and architectures, starting from their current incident-response architecture, so that they have an opportunity to make cybersecurity threat detection more adequate. Importantly, this article focuses only on what a real-time analytics architecture looks like, which is the part I found least well covered during my research. Instead of treating the broad subject of incident-response frameworks, this writing covers just the practical viewpoint of big data and the data warehouse. It also addresses these needs with an "out-of-the-box" approach since, in my experience, many vendors and technology suppliers focus notably on how big-data and data-warehouse architecture looks on their own stack. They persuade users with subscriptions and feature change requests, which later ends up in "vendor lock-in."

Finally, the reason I wrote this article is that I was inspired by Chris Riccomini's talk "The current state-of-the-art in data pipelines and data warehousing" on InfoQ [https://www.infoq.com/presentations/data-engineering-pipelines-warehouses/].

Overview

Conceptually, incident response is an IT organization's approach to managing and addressing IT-related security incidents, handling the situation quickly in a way that limits the aftermath and reduces recovery time and costs.

Big data, on the other hand, which is the primary focus of this writing, refers to large, diverse sets of information that grow at very significant rates, especially since the rise of the internet, IoT, and social media. Big data is characterized by volume, the vast amount of data; velocity, the rapid rate at which it is created and collected; and variety, the range of sources and formats to retrieve and consume. A data warehouse, consequently, refers to a system that retrieves data from many different origins for reporting, analysis, decision making, and incident response.

Requirements

When it comes to practical considerations, tearing the architecture down into phases helps make it more reasonable. The first phase is a legacy approach to IT-related incident response; the second is big-data warehousing; and the final phase mixes big-data warehousing with automated batch processing, data streaming, and data-warehouse management. This combination has become the core foundation of a big-data warehouse. From this point, we can start selecting the right architecture for building the data warehouse our purpose requires.

First Phase (failure!)

To get started, let me explain why a centralized architecture fails to support today's IT needs. Although it is simple and easy to deploy, it cannot keep up with the increasing number of users, services, and demands for real-time analysis. Telemetry data coming from sensors is stored directly in a relational database (RDBMS) or a columnar database. Both databases can use legacy SQL to load the data and serve queries from the front-end application, which has become the de-facto standard in a legacy ecosystem for serving data for ETL and data visualization. However, this phase is typically obsolete by now, since it cannot support the incremental growth of data size and users. At the same time, simultaneous users performing SQL queries and accessing the dashboard slow the system to an inadequate level. In addition, the real-time response to threat detection becomes slower as data is pulled more frequently from the sensors, as shown in Figure 1.
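To make the bottleneck concrete, here is a minimal sketch of this centralized phase. SQLite stands in for the shared RDBMS, and the table and column names are illustrative only; the point is that every sensor writes into, and every dashboard user queries, the same single database.

```python
import sqlite3
import time

# SQLite stands in for the central RDBMS; in the legacy setup this would
# be one shared relational/columnar database serving every dashboard user.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE telemetry (sensor_id TEXT, metric TEXT, value REAL, ts REAL)"
)

def ingest(sensor_id: str, metric: str, value: float) -> None:
    """Each sensor writes straight into the central table."""
    conn.execute(
        "INSERT INTO telemetry VALUES (?, ?, ?, ?)",
        (sensor_id, metric, value, time.time()),
    )
    conn.commit()

def dashboard_query(metric: str) -> list:
    """Every dashboard user runs ad-hoc SQL against the same table --
    exactly what degrades once users and data volume grow."""
    cur = conn.execute(
        "SELECT sensor_id, value FROM telemetry WHERE metric = ? ORDER BY ts",
        (metric,),
    )
    return cur.fetchall()

ingest("fw-01", "dropped_packets", 42.0)
ingest("ids-02", "alerts", 3.0)
print(dashboard_query("alerts"))  # every query contends on one database
```

Every ingest and every query serializes through one connection, which is why concurrent dashboard users and faster sensor polling make this design fall over.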

Figure 1: Centralized Monolithic Architecture

Second Phase

In this phase, adding more data-replication nodes to both databases does not hold up for long as data size and the user base keep growing. Although the solution was to deploy Redis to improve SQL query and ETL performance, and then to deploy Debezium [https://debezium.io/], a distributed platform for change data capture, to stretch toward a decentralized solution, the performance improvement was not as large as expected. We are still dealing with data latency, data-pipelining issues, and complicated workflows as the number of nodes starts to grow.
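The Redis piece of this phase boils down to a cache-aside pattern in front of the database. In this hedged sketch a plain dict stands in for Redis, and the table, keys, and data are illustrative; only repeated queries are taken off the primary database, which is why the improvement is limited.

```python
import sqlite3

# A plain dict stands in for Redis; the point is the cache-aside pattern
# used in the second phase to absorb repeated SQL queries.
cache: dict = {}
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (src TEXT, severity TEXT)")
db.executemany("INSERT INTO events VALUES (?, ?)",
               [("fw-01", "high"), ("ids-02", "low")])
db.commit()

def events_by_severity(severity: str) -> list:
    key = f"events:{severity}"
    if key in cache:                      # cache hit: no SQL round-trip
        return cache[key]
    rows = db.execute(
        "SELECT src FROM events WHERE severity = ?", (severity,)
    ).fetchall()
    cache[key] = rows                     # populate on miss
    return rows

events_by_severity("high")   # miss: hits the database
events_by_severity("high")   # hit: served from the cache
```

Note what the cache does not solve: freshly ingested rows are invisible until the cached key is invalidated, which is exactly the data-latency and pipelining problem that pushed this design toward change data capture.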

Figure 2: Architecture of Distributed Analytics

Third Phase

At this stage, I started to add batch and real-time automation into the architecture. This approach addresses data latency, pipelining, and complicated workflows. Still, as data size and the number of users, services, and apps keep growing, new challenges unfold. These are the issues you would see:

  • Real-time threat-detection response
  • Data latency
  • Data pipeline issues
  • Complicated workflows
  • Data debugging issues
  • Data health
  • Compliance
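The data-latency issue above is easiest to reason about once you measure it. One common way, sketched here under my own assumptions (field names are illustrative), is to tag each record with its event time at the sensor and compute the lag when it reaches the analytics layer:

```python
import time

# Sketch: surface "data latency" by comparing event time (when the sensor
# observed something) with processing time (when the pipeline handled it).
def record(sensor_id: str, value: float, event_ts: float) -> dict:
    return {"sensor_id": sensor_id, "value": value, "event_ts": event_ts}

def ingestion_lag(rec: dict, now: float = None) -> float:
    """Seconds between when the event happened and when we processed it."""
    now = time.time() if now is None else now
    return now - rec["event_ts"]

r = record("fw-01", 1.0, event_ts=100.0)
print(ingestion_lag(r, now=103.5))  # → 3.5 seconds of pipeline latency
```

Tracking this lag per pipeline stage is what makes the later debugging and data-health issues tractable, since you can see exactly where records stall.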

As more and more applications and services become distributed, this complexity becomes prevalent, as shown in Figure 3.

Figure 3 — Source: https://cdn.confluent.io/wp-content/uploads/etl_mess-768x559.png

Solution

Integrating various tools and solutions into the architecture makes it far more complicated, especially for data governance, compliance, and regulation. The motivation for breaking the architecture down is to split one big, complicated thing into parts, so that we can deploy the various tools and solutions starting from the easy ones and moving toward the complex ones. Here they are:

  • Apache Kafka for asynchronous messaging
  • Apache AirFlow for managing complex workflow
  • Data Analytics
  • Data catalogs with Lyft's Amundsen
  • Role-Based Access Control
  • Automated Data Management
  • Cloudera CM for managing Hadoop infrastructure
  • Streamsets for Data Pipeline, Data Masking and Management

Tools

Here are some tools that can be managed by Cloudera CM, followed by standalone solutions for DWH orchestration and ELT deployed inside the cluster:

  • Apache Kafka
  • Apache Airflow
  • Apache Hadoop
  • Apache Hive
  • Apache Cassandra
  • Apache Sentry
  • Apache Arrow
  • Apache Ranger
  • Terraform for Data Warehouse Orchestration
  • Dremio
  • MySQL
  • MongoDB

Finally, Figure 4 shows the architecture of distributed real-time analytics. The architecture consists of four layers, where each component is decoupled and works as an independent module relevant to its layer.

Figure 4: Decentralized real-time architecture

Architecture

  1. Messaging Service layer: This layer takes care of the data-ingestion point. Structured, unstructured, and time-series data from sensors, network security devices, SIEM, IoT, or apps are pushed into Hadoop, the RDBMS, and the TSDB through Kafka, as topics. This layer often holds raw information in a database, while the other layers work on streams instead.
  2. ELT layer: This layer takes care of data extraction, loading, and transformation. It is ELT instead of ETL because extraction and loading already happen in the messaging-service layer. This layer is responsible for data life-cycle management, batch and streaming pipelines, data masking, and monitoring run-time process performance.
  3. Data Warehouse layer: This is the part where I started designing a decentralized data warehouse on big data. It is a mix of a legacy DWH and distributed storage in a Hadoop ecosystem. Taking advantage of Hadoop's decentralized architecture makes the design more reliable and elastic in terms of automation and scalability.
  4. Analytics layer: This layer is responsible for measuring how vital and valuable the information collected in the data warehouse is.
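The data-masking responsibility of the ELT layer deserves a concrete illustration. This is a hedged sketch of one common approach, pseudonymizing source IPs with a salted hash before records reach the warehouse; the salt, field names, and regex here are my own illustrative choices, not part of the architecture above.

```python
import hashlib
import re

# Sketch of ELT-stage data masking: replace IPv4 literals in log lines
# with stable pseudonyms so analysts can still correlate events without
# seeing raw addresses. Salt and naming are illustrative only.
SALT = b"rotate-me-per-deployment"
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def mask_ip(ip: str) -> str:
    digest = hashlib.sha256(SALT + ip.encode()).hexdigest()[:12]
    return f"ip-{digest}"

def mask_record(raw: str) -> str:
    """Replace every IPv4 literal in a log line with its pseudonym."""
    return IP_RE.sub(lambda m: mask_ip(m.group()), raw)

line = "DENY tcp 10.0.0.5 -> 192.168.1.9"
masked = mask_record(line)
print(masked)  # both IPs replaced; the same IP always maps to the same token
```

Because the pseudonym is deterministic for a given salt, incident responders can still group events by source across the warehouse, while compliance requirements on raw addresses are easier to meet.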

Conclusion

Big data and the data warehouse have become major enablers of real-time distributed big-data analytics, especially in incident response. Although the proposed architecture might not work in other ecosystems, it can serve as a beneficial kick-starter for the next experiment in my incident-response analytics toolkit. I hope this thought is worth sharing with everyone.

References

[1] N. Srivastava and U. Chandra Jaiswal, “Big Data Analytics Technique in Cyber Security: A Review,” 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 2019, pp. 579–585. doi: 10.1109/ICCMC.2019.8819634 URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8819634&isnumber=8819605

[2] X. Jin, Q. Wang, X. Li, X. Chen and W. Wang, “Cloud virtual machine lifecycle security framework based on trusted computing,” in Tsinghua Science and Technology, vol. 24, no. 5, pp. 520–534, October 2019. doi: 10.26599/TST.2018.9010129, URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8698209&isnumber=8698205

[3] https://github.com/lyft/amundsen

[4] https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9

[5] https://wecode.wepay.com/posts/wepays-data-warehouse-bigquery-airflow

[6] https://prestosql.io/Presto_SQL_on_Everything.pdf

[7] https://debezium.io/

[8] https://martinfowler.com/articles/data-monolith-to-mesh.html

[9] https://www.infoq.com/presentations/data-engineering-pipelines-warehouses/
