Leveraging BERT and HuggingFace Transformers for Software Log Anomaly Detection in an Unsupervised Setting


Author: Eduardo Toledo | GitHub Repository

BERT-generated log embeddings visualized with TensorFlow’s embedding projector using PCA. Image by Author

Detecting unusual patterns or anomalies in log data is crucial for maintaining the health of systems and applications. In the fintech industry, where our software components generate vast amounts of logs, identifying these anomalies is even more critical to ensure security and operational efficiency. Recently, transformer models like BERT from HuggingFace have shown great promise in this area. BERT itself doesn't detect anomalies; rather, its self-attention mechanism produces detailed, context-aware embeddings of log entries, and those embeddings give a comprehensive representation of the input text that downstream algorithms can work with.

In this article, I’ll explain how BERT helps in detecting anomalies. We’ll break down the role of self-attention in BERT and how it creates these detailed embeddings during the embedding generation phase. Then, we’ll discuss how these embeddings are used by various anomaly detection algorithms to spot irregularities in log data. My goal is to make it clear how BERT can enhance log analysis and make anomaly detection more effective.

The workflow used here is as follows:

Step 1: Log Preprocessing

  • Clean and preprocess log entries to remove unwanted elements such as log-level tags and other irrelevant characters.

Step 2: Generate BERT Embeddings

  • Convert log entries into embeddings using a pre-trained BERT model.

Step 3: Apply DBSCAN

  • Apply the DBSCAN algorithm to the BERT embeddings to identify anomalies.

Step 4: Prepare Data for Embedding Projector

  • Save embeddings and metadata for visualization, including the log entries and their anomaly labels.

Step 5: Create Projector Config

  • Create a configuration file for the TensorFlow Embedding Projector to specify the paths to the embeddings and metadata files.

Step 6: Visualize

  • Use TensorFlow Embedding Projector to visualize the embeddings and detect anomalies by uploading the embeddings, metadata, and configuration files.

Detailed Steps

  1. Log Preprocessing:
  • Clean and preprocess log entries to remove timestamps, log level tags, and other irrelevant characters.

import re
from nltk.tokenize import word_tokenize

def preprocess_logs(log_entries):
    '''
    Function: preprocess_logs
    Objective: This function preprocesses log entries by removing timestamps, operation codes, special characters, and numbers.
    It also tokenizes the log entries into words and filters out non-readable words.
    Input: log_entries. List of log entries to preprocess.
    Output: processed_logs. List of preprocessed log entries.
    '''
    processed_logs = []
    for entry in log_entries:
        # Simplify whitespace and strip newlines
        entry = entry.strip().replace("\n", "")
        # Remove leading timestamps and operation codes
        entry = re.sub(r"^\s*\d{2}:\d{2}:\d{2}\s+\d+\s+", "", entry)
        # Remove special formatting characters and digits clustered as error codes or IDs
        entry = re.sub(r"\d{2,}", "", entry)  # removes long sequences of digits
        entry = re.sub(r"[<>{}()\[\]]", "", entry)  # removes special characters
        # Remove sequences of backslashes and alphanumeric characters
        entry = re.sub(r'\\[0-9A-Za-z]+', '', entry)
        # Tokenize the entry into words
        words = word_tokenize(entry)
        # Filter out non-readable words (e.g., punctuation, numbers)
        readable_words = [word for word in words if word.isalpha() and len(word) > 4]
        # Reconstruct the entry from readable words
        processed_entry = " ".join(readable_words)
        processed_logs.append(processed_entry)
    return processed_logs
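
To see what this does, here is a quick usage sketch. The sample entries below are made up for illustration, and word_tokenize requires NLTK's punkt tokenizer to be downloaded first:

# Hypothetical sample entries for illustration only
# Requires: nltk.download('punkt') before the first use of word_tokenize
sample_logs = [
    "12:04:31 4821 Transaction engine started <pid 9912>",
    "12:04:32 4822 Connection refused \\x0A retrying payment gateway",
]
processed_entries = preprocess_logs(sample_logs)
print(processed_entries)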

2. Generate BERT Embeddings:

  • Convert log entries into embeddings using a pre-trained BERT model. As a reminder, this model's default embedding dimension is 768, so the resulting tensor has shape [num_log_entries, max_length, 768].
import torch

def get_embeddings(log_entries, model, tokenizer, max_length=128):
    '''
    Function: get_embeddings
    Objective: This function returns the embeddings of the input log entries.
    Input: log_entries, model, tokenizer, max_length
        log_entries: list of log entries
        model: BERT model
        tokenizer: BERT tokenizer
        max_length: maximum length of the input log entries. If a log entry
        is longer than max_length, it will be truncated to avoid memory issues.
    Output: embeddings
    '''
    # Tokenize the input, padding to the longest entry in the batch
    inputs = tokenizer(log_entries, return_tensors='pt', padding=True,
                       truncation=True, max_length=max_length)
    # Get embeddings without tracking gradients
    with torch.no_grad():
        output = model(input_ids=inputs['input_ids'],
                       attention_mask=inputs['attention_mask'],
                       token_type_ids=inputs['token_type_ids'])
    embeddings = output.last_hidden_state
    return embeddings
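
The article doesn't show the model setup, so here is a minimal sketch using the HuggingFace transformers API. The bert-base-uncased checkpoint is an assumption on my part, since the exact checkpoint isn't named:

import torch
from transformers import BertModel, BertTokenizer

# bert-base-uncased is assumed here; any BERT checkpoint with
# 768-dimensional hidden states works the same way
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()  # evaluation mode: disables dropout

embeddings = get_embeddings(processed_entries, model, tokenizer, max_length=128)
print(embeddings.shape)  # (num_log_entries, padded_seq_len, 768)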

3. Apply DBSCAN:

  • Apply DBSCAN to the BERT embeddings to identify anomalies. I prefer DBSCAN (Density-Based Spatial Clustering of Applications with Noise) because it effectively identifies and handles noise and outliers, labeling points that do not belong to any cluster as noise. It can also find arbitrarily shaped clusters, which is useful for data with complex shapes, and it does not require specifying the number of clusters beforehand, since it determines them from the data.
  • To apply DBSCAN, we calculate the mean of the embeddings along the sequence length dimension, resulting in a numpy array with the shape: [num_entries, 768].
# Apply DBSCAN for anomaly detection
import numpy as np
from sklearn.cluster import DBSCAN

# Calculate the mean of the embeddings along the sequence length dimension,
# giving one 768-dimensional vector per log entry
embeddings_np = embeddings.mean(dim=1).numpy()

dbscan = DBSCAN(eps=0.2, min_samples=2)
labels = dbscan.fit_predict(embeddings_np)

# Identify anomalies (entries with label -1)
anomalies = np.where(labels == -1)[0]
# Count of anomalies
print(len(anomalies))
print("Anomalous log entries:")
for idx in anomalies:
    print(log_entries[idx])
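
The values eps=0.2 and min_samples=2 are dataset-specific choices, not universal defaults. A common way to pick eps, sketched below, is to plot each point's distance to its k-th nearest neighbor (with k matching min_samples) and look for the elbow of the sorted curve:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# Distance from each embedding to its 2nd nearest neighbor (k = min_samples)
neighbors = NearestNeighbors(n_neighbors=2).fit(embeddings_np)
distances, _ = neighbors.kneighbors(embeddings_np)

# Sort the k-distances; the elbow of this curve suggests a value for eps
k_distances = np.sort(distances[:, 1])
plt.plot(k_distances)
plt.xlabel('Points sorted by 2nd-NN distance')
plt.ylabel('Distance')
plt.show()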

Afterward, visualize the clusters in a 2-component PCA space to observe the behavior of the anomalies.

PCA projection of the clustered embeddings, with anomalies highlighted. Image by Author
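
The original article shows the plot rather than the code that produced it; here is a minimal sketch that reproduces a comparable 2-component PCA scatter from embeddings_np and the DBSCAN labels:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the 768-dimensional embeddings onto 2 principal components
points_2d = PCA(n_components=2).fit_transform(embeddings_np)

# Color anomalies (DBSCAN label -1) red and normal entries grey
is_anomaly = labels == -1
plt.scatter(points_2d[~is_anomaly, 0], points_2d[~is_anomaly, 1], c='grey', s=10, label='normal')
plt.scatter(points_2d[is_anomaly, 0], points_2d[is_anomaly, 1], c='red', s=10, label='anomaly')
plt.legend()
plt.show()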

4. Prepare Data for Embedding Projector:

  • Save embeddings and metadata for visualization, including the log entries and their anomaly labels.
# Save embeddings and metadata
np.savetxt('embeddings.tsv', embeddings_np, delimiter='\t')
with open('metadata.tsv', 'w') as f:
    f.write("log_entry\tis_anomaly\n")
    # processed_entries is the output of preprocess_logs
    for i, entry in enumerate(processed_entries):
        is_anomaly = 'yes' if i in anomalies else 'no'
        f.write(f"{entry}\t{is_anomaly}\n")
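
Step 5 of the workflow (Create Projector Config) isn't shown in the detailed steps. For the standalone Embedding Projector at projector.tensorflow.org, the config is a small JSON file that points at the two TSV files; below is a sketch that writes one, with placeholder URLs you would replace with wherever the TSVs are actually hosted:

import json

# The URLs are placeholders; the standalone projector fetches tensorPath
# and metadataPath over HTTP, so the TSV files must be publicly reachable
config = {
    "embeddings": [
        {
            "tensorName": "Log embeddings",
            "tensorShape": [len(processed_entries), 768],
            "tensorPath": "https://example.com/embeddings.tsv",
            "metadataPath": "https://example.com/metadata.tsv",
        }
    ]
}
with open("projector_config.json", "w") as f:
    json.dump(config, f, indent=2)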

5. Visualize:

  • Use TensorFlow Embedding Projector to visualize the embeddings and detect anomalies by uploading the embeddings, metadata, and configuration files.
Image by Author

Log entry embeddings projected in TensorBoard

In this interactive view, different clusters of embeddings are visible. Most notably, the anomalies fall within the cluster associated with transactional contexts, indicating that BERT has focused on entries related to the transaction engine while ignoring entries irrelevant to the system's functioning.

https://vimeo.com/955287760?share=copy

Conclusion

Exploring the results in the TensorFlow Embedding Projector makes it easy to judge how accurately anomalies are detected. Using a pre-trained BERT model to obtain contextual embeddings greatly improves the ability to spot anomalies in log entries. These embeddings, which capture rich and detailed language context, allow us to identify unusual patterns and outliers without extensive retraining, saving both time and computational resources while taking full advantage of BERT's language understanding. Applying DBSCAN to cluster these embeddings then distinguishes normal log entries from anomalies, giving a reliable and robust detection method.

Author: Eduardo Toledo, Machine Learning Engineer, part of Passport Technology’s analytical cell. This cell is dedicated to providing data-driven insights crucial for informed decision-making.

Follow me on [GitHub](https://github.com/etechoptimist), [LinkedIn](https://linkedin.com/in/etechoptimist), and [Twitter](https://twitter.com/etechoptimist) for more updates.
