Cyber Security in the Age of AI

Murali Balcha
12 min read · Jun 19, 2024


1. Cybersecurity and Advanced Persistent Threats

Protecting corporate assets against cyberattacks requires multidisciplinary, multi-dimensional approaches, and an exhaustive list of them is beyond the scope of this article. One effective defense many organizations employ is analyzing security logs to respond to threats in real time. Security Information and Event Management (SIEM) is a security solution that helps organizations identify and address potential security threats before they disrupt business operations. SIEM tools monitor an organization’s IT environment by correlating log and event data across systems, generating prioritized alerts, and enabling automated responses to potential security incidents based on customized policies and data analytics. SIEM solutions have been a staple of the cybersecurity analyst’s toolkit for decades. Still, most existing solutions rely on traditional techniques, such as mining log templates, normalization, log storage, and correlated queries, to identify threat activity.

In the age of AI, these SIEM solutions are inadequate. The rise of generative AI presents both benefits and risks: in the hands of cyber threat actors, it can help create new threats, exponentially increasing zero-day attacks and rendering traditional SIEM solutions ineffective. Most new threats rely on subtle techniques to infiltrate and carry out attacks, including sophisticated social engineering, which generative AI makes far easier to execute. Furthermore, because these threats use very little code, SIEM tools struggle to detect them. A more effective approach is to use AI techniques and large language models (LLMs) to analyze logs, establish a baseline of normal behavior, and detect anomalies in real time.

2. What Is Wrong with Traditional Log Analysis?

Traditional log analysis involves normalization, parameter extraction, storage, and correlated queries. These methods are not adaptive and must be repeated for every new type of log encountered. Correlated queries are statements crafted by cybersecurity experts to identify known threats, but they are rule-based and fragile. They are inadequate for detecting new threats that use social engineering and Command and Control tactics to infiltrate a network and execute an attack. Adversaries, who are typically knowledgeable about system operations, often mimic regular traffic to avoid detection and may change techniques as the situation demands. Zero-day attacks, the most devastating kind, are becoming more frequent with the rise of generative AI, as threat actors leverage it to create new threats with minimal effort.
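
For illustration only, here is a hypothetical, simplified rule of the kind a correlated query encodes: flag a host after five failed logins within a minute. The names and thresholds are invented. An adversary who paces attempts just below the threshold, or rotates source hosts, slips past it entirely.

```python
# A hypothetical rule-based detection sketch, illustrating why correlated
# queries are fragile: anything below the fixed threshold goes unnoticed.
import re
from collections import defaultdict

FAIL = re.compile(r"authentication failure;.*rhost=(\S+)")
THRESHOLD, WINDOW = 5, 60  # five failures within 60 seconds

def alert_hosts(events):
    """events: list of (epoch_seconds, log_line) tuples."""
    hits = defaultdict(list)
    alerts = set()
    for ts, line in events:
        m = FAIL.search(line)
        if not m:
            continue
        host = m.group(1)
        # keep only failures still inside the sliding window
        hits[host] = [t for t in hits[host] if ts - t < WINDOW] + [ts]
        if len(hits[host]) >= THRESHOLD:
            alerts.add(host)
    return alerts
```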

3. The Large Language Model Revolution

Large language models are popular right now for good reason. They have proven effective in fields as varied as generative AI, sentiment analysis, and anomaly detection. Language is the primary way intelligent species communicate, and language is not limited to human tongues like English, Hindi, Arabic, or French. Whales and elephants, for example, communicate over long distances using clicks and groans, and many other species likely do the same; these languages have their own vocabulary and structure. The AI community has harnessed the power of language models to understand and derive insights from the enormous body of information humans have produced since the beginning of civilization. These models are highly capable and can address a wide range of real-world problems. One such use case is using LLMs to analyze computer system logs. Every system and application communicates through its logs, which describe the tasks and processes it performs; each set of log entries offers a window into the inner workings of the system. We can think of a sequence of log entries as a story, with log parameters as the characters’ names. Just as a story remains the same when we rename its characters, the narrative conveyed by a sequence of log entries doesn’t change when its parameters change. Large language models can capture these narratives effectively, and we can use them to build efficient log-analysis solutions.

4. Why LLMs for Log Analysis?

Large Language Models (LLMs) are based on the transformer architecture, which introduced the concept of self-attention. Self-attention is a critical mechanism that allows the model to focus on different parts of the input sequence when processing each element. LLMs provide a foundation for generative AI, text classification, and anomaly detection. They capture the deeper context of an input sequence and detect anomalies effectively, which is impossible with correlated queries. Another critical aspect of transformers is that they can be trained without labeled data; more precisely, they learn through self-supervised training. Through this process, transformers absorb basic grammar, language structure, and knowledge. Self-supervised training is highly scalable, making it well suited to training LLMs for log analysis.
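
As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product self-attention, the mechanism described above. Shapes and weights are illustrative, not from any production model.

```python
# Minimal scaled dot-product self-attention: each output position is a
# weighted mix of every position in the sequence.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # similarity of every token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ V                          # mix values by attention weight

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))                    # 8 tokens, 16-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)      # (8, 16)
```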

5. Challenges Extending LLMs for Log Analysis

However, we cannot simply extend LLM foundation models to log analysis using zero-shot or few-shot methods, for the following reasons.

5.1 Log Entries Are Not English Statements

A typical log entry may look like an English sentence, but from an LLM’s point of view it is not one. Traditional tokenization methods, built for natural language, handle log entries poorly.
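
For example (assuming the Hugging Face transformers package is available), a tokenizer trained on natural language shreds a syslog line into sub-word fragments that carry little meaning on their own:

```python
# A natural-language tokenizer fragments log-specific tokens such as
# timestamps, PIDs, and field names into many small, low-signal pieces.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
line = "Jan 22 05:51:09 combo sshd(pam_unix)[24935]: check pass; user unknown"
print(tok.tokenize(line))
# e.g. ['Jan', 'Ġ22', 'Ġ05', ':', '51', ':', '09', 'Ġcombo', 'Ġssh', 'd', ...]
```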

5.2 Log Entries Don’t Follow the Same Sequence as Sentences in a Language

A typical log file contains a chronologically ordered collection of log entries. However, logs from different systems and applications are interspersed because those systems run in a distributed, parallel fashion. For example, the second log entry of application 1 may appear only after ten log entries from application 2.
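
One practical mitigation, sketched below, is to demultiplex interleaved logs into per-process streams before sequence modeling. The regex matches the syslog format shown in the next subsection; the field names are illustrative.

```python
# Group interleaved syslog lines into per-(application, PID) streams.
import re
from collections import defaultdict

SYSLOG = re.compile(r"^(?P<ts>\w{3}\s+\d+ [\d:]+) (?P<host>\S+) "
                    r"(?P<app>[\w().-]+)\[(?P<pid>\d+)\]: (?P<msg>.*)$")

def split_streams(lines):
    streams = defaultdict(list)
    for line in lines:
        m = SYSLOG.match(line)
        if m:  # lines without an app[pid] prefix are skipped in this sketch
            streams[(m["app"], m["pid"])].append(m["msg"])
    return streams
```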

5.3 Different Applications Follow Different Conventions for Logging

Linux syslog entries look something like this:

Jan 22 05:51:09 combo sshd(pam_unix)[24935]: check pass; user unknown
Jan 22 05:51:09 combo sshd(pam_unix)[24935]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=server3.sugolan.hu
Jan 22 05:51:10 combo sshd(pam_unix)[24937]: check pass; user unknown
Jan 22 05:51:10 combo sshd(pam_unix)[24937]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=server3.sugolan.hu
Jan 22 06:08:26 combo sshd(pam_unix)[24968]: check pass; user unknown
Jan 22 06:08:26 combo sshd(pam_unix)[24968]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=server3.sugolan.hu
Jan 22 16:37:08 combo sshd(pam_unix)[25820]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=61.129.113.52 user=root
Jan 22 16:37:09 combo sshd(pam_unix)[25823]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=61.129.113.52 user=root
Jan 22 16:37:09 combo sshd(pam_unix)[25822]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=61.129.113.52 user=root
Jan 22 16:37:09 combo sshd(pam_unix)[25826]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=61.129.113.52 user=root
Jan 22 16:37:14 combo sshd(pam_unix)[25830]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=61.129.113.52 user=root
Jan 22 16:37:15 combo sshd(pam_unix)[25832]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=61.129.113.52 user=root
Jan 22 16:37:16 combo sshd(pam_unix)[25835]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=61.129.113.52 user=root
Jan 22 16:37:16 combo sshd(pam_unix)[25834]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=61.129.113.52 user=root
Jan 22 16:37:16 combo sshd(pam_unix)[25828]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=61.129.113.52 user=root
Jan 22 16:37:17 combo sshd(pam_unix)[25838]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=61.129.113.52 user=root
Jan 22 20:01:14 combo sshd(pam_unix)[26136]: check pass; user unknown
Jan 22 20:01:14 combo sshd(pam_unix)[26136]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220.75.233.249
Jan 22 20:01:14 combo sshd(pam_unix)[26138]: check pass; user unknown
Jan 22 20:01:14 combo sshd(pam_unix)[26138]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220.75.233.249
Jan 22 20:01:14 combo sshd(pam_unix)[26137]: check pass; user unknown
Jan 22 20:01:14 combo sshd(pam_unix)[26137]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220.75.233.249
Jan 22 20:01:14 combo sshd(pam_unix)[26133]: check pass; user unknown
Jan 22 20:01:14 combo sshd(pam_unix)[26133]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220.75.233.249
Jan 22 20:01:14 combo sshd(pam_unix)[26135]: check pass; user unknown
Jan 22 20:01:14 combo sshd(pam_unix)[26135]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220.75.233.249
Jan 22 20:01:14 combo sshd(pam_unix)[26134]: check pass; user unknown
Jan 22 20:01:14 combo sshd(pam_unix)[26139]: check pass; user unknown
Jan 22 20:01:14 combo sshd(pam_unix)[26139]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220.75.233.249
Jan 22 20:01:14 combo sshd(pam_unix)[26134]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220.75.233.249
Jan 22 20:01:14 combo sshd(pam_unix)[26140]: check pass; user unknown
Jan 22 20:01:14 combo sshd(pam_unix)[26140]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220.75.233.249
Jan 22 20:01:14 combo sshd(pam_unix)[26146]: check pass; user unknown
Jan 22 20:01:14 combo sshd(pam_unix)[26146]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220.75.233.249
Jan 22 20:01:14 combo sshd(pam_unix)[26148]: check pass; user unknown
Jan 22 20:01:14 combo sshd(pam_unix)[26148]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220.75.233.249
Jan 23 04:04:10 combo su(pam_unix)[27145]: session opened for user cyrus by (uid=0)
Jan 23 04:04:11 combo su(pam_unix)[27145]: session closed for user cyrus
Jan 23 04:04:12 combo logrotate: ALERT exited abnormally with [1]
Jan 23 04:10:38 combo su(pam_unix)[28378]: session opened for user news by (uid=0)
Jan 23 04:10:39 combo su(pam_unix)[28378]: session closed for user news
Jan 23 05:28:15 combo sshd(pam_unix)[28526]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=ztc2-f.nas.tiscali.de user=root
Jan 23 05:28:15 combo sshd(pam_unix)[28521]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=ztc2-f.nas.tiscali.de user=root
Jan 23 05:28:15 combo sshd(pam_unix)[28534]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=ztc2-f.nas.tiscali.de user=root
Jan 23 05:28:15 combo sshd(pam_unix)[28532]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=ztc2-f.nas.tiscali.de user=root
Jan 23 05:28:15 combo sshd(pam_unix)[28524]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=ztc2-f.nas.tiscali.de user=root

Windows event logs, by contrast, look like this:

<?xml version="1.1" encoding="utf-8" standalone="yes" ?>
<Events>
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event"><System><Provider Name="Microsoft-Windows-Eventlog" Guid="{fc65ddd8-d6ef-4962-83d5-6e5cfe9ce148}"></Provider>
<EventID Qualifiers="">104</EventID>
<Version>1</Version>
<Level>4</Level>
<Task>104</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime="2024-01-15 04:00:10.209133"></TimeCreated>
<EventRecordID>1180</EventRecordID>
<Correlation ActivityID="" RelatedActivityID=""></Correlation>
<Execution ProcessID="2288" ThreadID="27864"></Execution>
<Channel>System</Channel>
<Computer>mb-windows-11-l</Computer>
<Security UserID="S-1-5-21-2113447201-4214472215-34991859-500"></Security>
</System>
<UserData><LogFileCleared xmlns="http://manifests.microsoft.com/win/2004/08/windows/eventlog"><SubjectUserName>murali.balcha</SubjectUserName>
<SubjectDomainName>mb-windows-11-l</SubjectDomainName>
<Channel>System</Channel>
<BackupPath>\\mb-windows-11-l\C$\Users\murali.balcha\Documents\apt3-system-discovery-2.evtx</BackupPath>
<ClientProcessId>22796</ClientProcessId>
<ClientProcessStartKey>1125899906850495</ClientProcessStartKey>
</LogFileCleared>
</UserData>
</Event>
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event"><System><Provider Name="Microsoft-Windows-Eventlog" Guid="{fc65ddd8-d6ef-4962-83d5-6e5cfe9ce148}"></Provider>
<EventID Qualifiers="">104</EventID>
<Version>1</Version>
<Level>4</Level>
<Task>104</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime="2024-01-15 04:00:39.080488"></TimeCreated>
<EventRecordID>1181</EventRecordID>
<Correlation ActivityID="" RelatedActivityID=""></Correlation>
<Execution ProcessID="2288" ThreadID="22040"></Execution>
<Channel>System</Channel>
<Computer>mb-windows-11-l</Computer>
<Security UserID="S-1-5-21-2113447201-4214472215-34991859-500"></Security>
</System>
<UserData><LogFileCleared xmlns="http://manifests.microsoft.com/win/2004/08/windows/eventlog"><SubjectUserName>murali.balcha</SubjectUserName>
<SubjectDomainName>mb-windows-11-l</SubjectDomainName>
<Channel>Microsoft-Windows-Sysmon/Operational</Channel>
<BackupPath>\\mb-windows-11-l\C$\Users\murali.balcha\Documents\apt3-sysmon-discovery-2.evtx</BackupPath>
<ClientProcessId>22796</ClientProcessId>
<ClientProcessStartKey>1125899906850495</ClientProcessStartKey>
</LogFileCleared>
</UserData>
</Event>
.
.
.
</Events>

Despite these differences, every log entry shares a common set of attributes: a timestamp, application name, process ID, user ID, severity, and a log message. The log message contains a textual description of the event plus a set of parameters specific to that event.

6. Training an LLM from Scratch for a New Language

Training an LLM for a new language is quite involved and requires four distinct steps:

6.1 Data Curation

6.1.1 Data Gathering

Data scientists gather training data for large language models (LLMs) from various text sources, such as web pages, books, scientific articles, codebases, and conversational data.

6.1.2 Data Cleanup

Data cleaning is essential to any machine learning model, especially in NLP. Without cleaning, the dataset is a jumble of words the model cannot make sense of. Essential steps include the following (a minimal sketch follows the list):

  1. Remove punctuation
  2. Remove stop words
  3. Remove URLs
  4. Remove HTML tags
  5. Remove emoji
  6. Remove numbers
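
Here is a minimal cleanup sketch covering those steps with only the standard library; the patterns and the tiny stop-word set are deliberately simple illustrations, not production-grade filters.

```python
import re
import string

STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in"}  # illustrative

def clean(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)          # remove URLs
    text = re.sub(r"<[^>]+>", " ", text)               # remove HTML tags
    text = re.sub(r"[^\x00-\x7F]+", " ", text)         # remove emoji / non-ASCII
    text = re.sub(r"\d+", " ", text)                   # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    words = [w for w in text.lower().split() if w not in STOP_WORDS]  # stop words
    return " ".join(words)

print(clean("Visit <b>https://example.com</b> 🚀 3 tips to clean data!"))
# -> "visit tips clean data"
```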

6.1.3 Preparing the data

Once the data is collected, it must be prepared for LLM training. Preparation involves quality filtering, deduplication, privacy redaction, lemmatization, tokenization, and embedding calculation.
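
As a minimal sketch of one of those steps, here is exact-match deduplication by content hashing; real pipelines typically layer fuzzy deduplication (e.g., MinHash) on top of this.

```python
import hashlib

def dedupe(docs):
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:      # keep only the first copy of each document
            seen.add(digest)
            unique.append(doc)
    return unique
```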

6.2 Model Architecture

Use cases and vocabulary size determine the model architecture. Assuming a transformer-based model, we may choose a decoder-only, encoder-only, or encoder-decoder design, depending on whether the goal is to generate, classify, or translate text. The vocabulary size partly determines the parameter count, since the embedding table scales with it, and the amount of training data constrains the model size: there is no hard and fast rule, but academic research on compute-optimal training recommends roughly 20 training tokens per model parameter. For example, a 1-billion-parameter model would call for on the order of 20 billion training tokens.

6.3 Training at Scale

Large language models (LLMs) are trained via self-supervised (sometimes loosely called unsupervised) learning, most commonly by predicting the next token in a sequence. Because this objective requires no human labels, it scales to vast corpora.

6.4 Evaluation

A vital part of this iterative process is model evaluation, which measures model performance on a set of tasks. While the task set depends largely on the intended application, many standard benchmarks are commonly used to evaluate LLMs.

The Open LLM leaderboard hosted by Hugging Face aims to provide a general performance ranking for open-access LLMs. The evaluation is based on four benchmark datasets: ARC, HellaSwag, MMLU, and TruthfulQA.

7. Training LLMs for Log Analysis

To extend LLMs to log analysis, we follow steps similar to those for training on a new language. However, data preparation poses the biggest challenge: we cannot use traditional methods to curate the training data. For example, a vast corpus of text is readily available for training on human language, but nothing comparable exists for log analysis. Assuming we can collect enough logs, our initial focus will be anomaly detection in logs using LLMs; later, we can develop a model for threat classification.

7.1 Data Gathering

The most challenging part of creating a model is gathering diverse, representative data from different applications and systems. Luckily, logs are machine-generated: once you gather a sufficient number of logs, you can combine or synthesize them to create a large volume of training data.

7.1.1 Template Mining

The practice of mining templates from log data has been an industry standard for decades, and various vendors excel in this area. Traditional log mining involves carefully analyzing the logs and extracting information such as timestamps, process names, process IDs, task IDs, log messages, and parameters. In most cases, log parsing amounts to little more than applying an sscanf-style pattern to each log statement, but it is a critical step in model building.

A scalable template-mining pipeline uses heuristic and AI-based algorithms to extract parameters and mine templates dynamically. Drain3 is a reliable log-mining tool that mines templates in real time using a fixed-depth parse tree; other AI-based tools can also be deployed for efficient log mining.
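
Here is a short example with the Drain3 library (pip install drain3): add_log_message() clusters each line and returns the mined template with varying parameters masked out. The sample lines are invented.

```python
from drain3 import TemplateMiner

miner = TemplateMiner()
lines = [
    "connected to 10.0.0.1 port 22",
    "connected to 10.0.0.7 port 2222",
]
for line in lines:
    result = miner.add_log_message(line)  # returns a dict describing the cluster
    print(result["template_mined"])
# Both lines converge on one template, e.g. "connected to <*> port <*>"
```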

7.1.2 Synthesizing Logs

Enhancing the size of the log corpus through synthetic data generation should be part of data collection. While collecting a sufficiently large corpus of real-world logs is often impractical, synthetic log data can be produced easily.
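
A minimal sketch of the idea: fill mined templates with randomized parameters. The templates and value pools below are illustrative, loosely modeled on the syslog sample earlier.

```python
import random

TEMPLATES = [
    "sshd(pam_unix)[{pid}]: authentication failure; rhost={host}",
    "su(pam_unix)[{pid}]: session opened for user {user} by (uid=0)",
]
HOSTS = ["server3.sugolan.hu", "61.129.113.52", "220.75.233.249"]
USERS = ["cyrus", "news", "root"]

def synth_line():
    t = random.choice(TEMPLATES)
    # str.format ignores keyword arguments a template doesn't use
    return t.format(pid=random.randint(1000, 99999),
                    host=random.choice(HOSTS),
                    user=random.choice(USERS))

print("\n".join(synth_line() for _ in range(3)))
```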

7.1.3 Preparing the Data

Building a vocabulary is essential when training any language model. A vocabulary comprises the unique set of words that form a language; in log analysis, the vocabulary consists of the distinct log templates.

7.1.4 Calculating Embeddings

NLP offers well-established text representations, from classical TF-IDF and bag-of-words vectors to learned embeddings such as word2vec and BERT. These techniques map each word in the language to a vector in an n-dimensional space. Typically, word embeddings exhibit semantic similarity, meaning that words with related meanings sit close to each other in the embedding space; this proximity can be measured with cosine similarity. We can utilize existing models to create similar embeddings for log templates, so that related log templates are positioned close to each other in the embedding space.
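
Here is a sketch using the sentence-transformers package (one assumption; any sentence-embedding model would do), with cosine similarity measuring how related two templates are. The templates are taken loosely from the syslog sample above.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
templates = [
    "authentication failure; logname= uid=<*> rhost=<*>",
    "check pass; user unknown",
    "session opened for user <*> by (uid=<*>)",
]
emb = model.encode(templates, convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1]))  # related auth templates should score higher
print(util.cos_sim(emb[0], emb[2]))  # than an unrelated session template
```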

7.2 Model Architecture

The latest LLMs are based on the transformer model, which, despite its strengths, has limitations. One is context length, which commonly ranges from 2,048 to 4,096 tokens, making transformer models inefficient for log analysis. Log analysis often requires larger context windows because the relevant log entries are spread over long sequences; for example, tracking one activity from start to finish may require analyzing thousands of log entries, with many unrelated entries mixed in. Because self-attention relates the current token to every token that precedes it, its cost grows quadratically with sequence length, making very long log sequences impractical to process. This limitation highlights the need for alternative models such as Mamba, a relatively new foundation model that addresses transformer limitations like bounded context length and quadratic complexity.

7.2.1 A New Class of Selective State Space Models

Mamba proposes an advanced class of SSMs that match the modeling capabilities of Transformers while scaling linearly in sequence length. The basic tenets of Mamba are as follows:

Selection Mechanism:

  • Identifies the inability of prior models to selectively focus on or ignore inputs based on their relevance as a key limitation.
  • Introduces a selection mechanism in which SSM parameters are input-dependent, allowing the model to selectively retain relevant information and discard irrelevant data.

Scaling linearly in sequence length and the ability to handle long sequences make Mamba models ideal for log analysis.
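
As a rough sketch, here is how a single Mamba block from the mamba-ssm package could process a long sequence of log-event embeddings. The hyperparameters are illustrative, not tuned for log analysis, and the package requires a CUDA GPU.

```python
import torch
from mamba_ssm import Mamba

block = Mamba(
    d_model=256,   # embedding dimension of log-template tokens
    d_state=16,    # SSM state size
    d_conv=4,      # local convolution width
    expand=2,      # inner expansion factor
).to("cuda")

x = torch.randn(1, 8192, 256, device="cuda")  # one sequence of 8,192 log events
y = block(x)                                   # cost grows linearly in sequence length
print(y.shape)                                 # torch.Size([1, 8192, 256])
```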

8. Threat Classification

An advanced persistent threat (APT) is a sustained, stealthy cyberattack in which an intruder or group gains unauthorized access to a computer network and remains undetected for an extended period. Unlike traditional cyberattacks, APTs aim to steal sensitive data rather than damage the network. They are often executed manually through meticulous planning and may employ innovative hacking methods. An APT campaign unfolds over a lifecycle that can last months or even years, with the ultimate aim of stealing sensitive data or disrupting the target’s operations, and it may eventually culminate in a full attack for financial gain.

MITRE ATT&CK® has identified over 130 advanced persistent threat (APT) groups affecting enterprises, and this number is expected to grow with the rising prevalence of generative AI. A practical application of LLM-based log analysis is determining which threat is present and which stage of its lifecycle it has reached. This is a multi-class classification problem, with classes corresponding to the known threats and the phases of the threat lifecycle, and the LLM must be trained to identify each threat’s stage accurately from the logs.
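
A minimal sketch of that multi-class setup: a linear classification head over a pooled sequence embedding. The encoder is a stand-in for whatever log LLM (transformer or Mamba) produced the sequence representation, and the class counts are illustrative.

```python
import torch
import torch.nn as nn

NUM_THREATS, NUM_PHASES = 130, 6   # illustrative: APT groups x lifecycle phases

class ThreatClassifier(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.threat_head = nn.Linear(d_model, NUM_THREATS)
        self.phase_head = nn.Linear(d_model, NUM_PHASES)

    def forward(self, seq_emb):           # seq_emb: (batch, seq_len, d_model)
        pooled = seq_emb.mean(dim=1)      # mean-pool the log-event sequence
        return self.threat_head(pooled), self.phase_head(pooled)

logits_t, logits_p = ThreatClassifier()(torch.randn(2, 1024, 256))
print(logits_t.shape, logits_p.shape)     # (2, 130) and (2, 6)
```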

9. Conclusion

Large Language Models (LLMs) have gained popularity for their broad applicability across NLP tasks. One of their primary uses is generating text on any topic they have been trained on, but they can also be applied to tasks like the one described in this post. With cyber threats becoming increasingly sophisticated, traditional log analysis methods are proving inadequate in the age of AI. LLMs can make log analysis more adaptive by capturing the contextual meaning in logs, thereby enhancing the detection of zero-day threats.
