Anomaly Detection in NLP Using Levenshtein Distance

Fatima Mubarak
Published in Tech Blog · 7 min read · Apr 7, 2023

The idea of anomalies has been around for centuries, but it wasn’t until the development of contemporary data science and statistical techniques that anomaly detection was standardized as a strategy for finding out-of-the-ordinary patterns in data. It has its roots in statistics, where the study of outliers and extreme values has long been a significant subject.

Photo by Randy Fath on Unsplash

This article outlines a new method for dealing with anomalous data that uses Levenshtein distance.

What is Anomaly Detection?

Anomaly detection is a way to detect unexpected patterns called outliers. This method includes analyzing data to find unusual patterns and identifying the thresholds and boundaries that specify the anomalous data.

Unexpected pattern (github.com)

Anomaly detection can be used for various applications such as spam detection, fraudulent transactions, and cybersecurity. The methods for solving anomalies differ depending on the data types and use cases. Anomaly detection is a powerful tool for finding risks in data, and flagging or removing anomalies can improve a model’s accuracy and precision.

Anomaly Detection and Natural Language Processing

Natural language processing (NLP) is one of today’s most active fields. NLP is a branch of artificial intelligence that works to understand text and speech, the languages of humans. It helps convert unstructured data into structured data that can be used for modeling.

Natural Language Processing schema(expersight.com)

Natural language processing can be used in various applications, including summarizing text, performing sentiment analysis, and translating text.

Due to the complexity of natural language and anomalies, it might be challenging to identify the best strategy for improving decision-making. Numerous methods exist for handling anomalies, including the Levenshtein method, isolation forests, local outlier factors, and one-class SVM.

This article will concentrate on resolving anomaly detection issues using the Levenshtein approach.

What is Levenshtein and how is it computed?

The Levenshtein distance is a metric for comparing two given strings. It counts the minimum number of character insertions, deletions, or substitutions required to transform one string into the other.

Levenshtein work (ideserve.co.in)

The algorithm for computing the Levenshtein distance builds a matrix with the two strings as rows and columns, filling each cell with the minimum number of edits needed to transform the prefix of one string into the corresponding prefix of the other. The Levenshtein distance between the two strings is the final value in the matrix’s bottom-right corner.
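The matrix construction described above can be sketched in plain Python (a minimal illustration, not the optimized implementation from the `Levenshtein` package used later in this article):

```python
def levenshtein(a: str, b: str) -> int:
    """Compute the Levenshtein distance between a and b by filling
    a (len(a)+1) x (len(b)+1) matrix of minimum edit costs."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # cost of deleting i characters from a
    for j in range(cols):
        d[0][j] = j  # cost of inserting j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1]  # bottom-right corner holds the distance

print(levenshtein("kitten", "sitting"))  # 3
```

For example, transforming “kitten” into “sitting” takes three edits: substitute k→s, substitute e→i, and insert g.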

Measure of anomalies (analyticsvidhya.com)

Based on the examined data, a threshold can be established after measuring the Levenshtein distance between two texts and converting it into a normalized match score. A text can be categorized as an anomaly if its match score is below the threshold. On the other hand, it can be categorized as regular text if the match score exceeds the threshold.

It’s crucial to remember that threshold value selection can significantly affect anomaly detection accuracy. The threshold should be carefully chosen and fine-tuned to get the best performance for the application and the dataset being analyzed.
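One common way to make the threshold comparable across texts of different lengths is to normalize the raw distance into a similarity score between 0 and 1. The sketch below shows this idea; the normalization formula and the 0.8 threshold are illustrative choices, to be tuned per dataset:

```python
def levenshtein(a: str, b: str) -> int:
    """Two-row dynamic-programming version of the Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def match_score(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]: 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

THRESHOLD = 0.8  # tuned according to the application and dataset

def classify(new_msg: str, history_msg: str) -> str:
    return "normal" if match_score(new_msg, history_msg) > THRESHOLD else "anomalous"

print(classify("your verification code is 1234",
               "your verification code is 9876"))  # normal
```

Here the two messages differ only in the four digits, so the score is 1 − 4/30 ≈ 0.87, above the threshold; a completely unrelated message from the same sender would score far lower and be flagged.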

Steps in NLP for Anomaly Detection

  • Import the NLP and data processing libraries that are required.
  • Load history data and new data.
  • Preprocess the text data by lowercasing it, removing stop words, and removing punctuation.
  • Find a unique column that can serve as the primary key for comparing the text data in the old and new datasets. For instance, you can compare the messages sent by the same user at different points in time.
  • Specify a threshold for the Levenshtein-based match score.
  • Compare the match score of each new message with the given threshold to decide whether it is anomalous.
  • Label the new message accordingly.

Python code sample

Loading libraries

import re
import warnings
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from Levenshtein import distance

warnings.filterwarnings('ignore')

# Download the NLTK resources used below (only needed once)
# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('wordnet')

Loading the dataset

new_data = pd.read_csv('new_data.csv')
data = pd.read_csv('history.csv')

Show sample of the data

data.head()
new_data.head()
History data sample
New data sample

Data Cleaning Function

We should create a data cleaning function that does the following:

  • Lowercase the text
  • Remove punctuation
  • Remove digits
  • Remove stop words that commonly occur in the documents
  • Lemmatize, i.e., reduce each word to its base form (e.g., cats → cat)
# Define a function to preprocess text
def preprocess_text(text):
    # Lowercase the text
    text = text.lower()

    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)

    # Remove digits
    text = re.sub(r'\d+', '', text)

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text)
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize tokens
    lemmatizer = nltk.WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    # Join the lemmas back into a single string
    preprocessed_text = ' '.join(lemmas)
    return preprocessed_text

Apply the preprocess_text() function to the message column in both datasets.

data['message'] = data['message'].apply(preprocess_text)
new_data['message'] = new_data['message'].apply(preprocess_text)

Levenshtein Distance Function for anomaly detection

Now, we should create the Levenshtein distance function that does the following:

  • Take the originating address as the unique key.
  • Give a value for the threshold that is tuned according to your case and data.
  • Inside a try block, find the messages that share the same originating address in the historical data and the new data.
  • Compute the match score between each new message and the old messages from the same originating address.
  • Label the new message according to the match score: if the score is higher than the threshold, label it “normal,” and if it is lower, label it “anomalous.”
  • Add a label column to the data with a value of 1 if the message is anomalous and 0 if it is normal.
  • Handle any errors that occur and print them to a file.
  • Specify the columns of the data frame that are needed in the result.
def levenshtien_anomalies(data, new_data):
    # Normalized match score derived from the Levenshtein distance
    # (the article relies on such a score; this normalization is one common choice)
    def match_score(a, b):
        return 1 - distance(a, b) / max(len(a), len(b), 1)

    # Threshold tuned according to your case and data
    threshold = 0.8

    # Loop over each originating address in the new data
    origins = pd.unique(new_data['Originaddress'])
    for origin in origins:
        try:
            # Get the messages for the current originating address in the new data
            new_data_msgs = new_data.loc[new_data['Originaddress'] == origin, 'message'].values

            # Get the messages for the current originating address in the historical data
            hist_data_msgs = data.loc[data['Originaddress'] == origin, 'message'].values

            # Compute the match score of each new message against the historical messages
            scores = []
            for new_msg in new_data_msgs:
                msg_scores = [match_score(new_msg, hist_msg) for hist_msg in hist_data_msgs]
                scores.append(np.max(msg_scores))

            # Label the new messages based on the match scores
            labels = [0 if score > threshold else 1 for score in scores]

            # Add the labels to the new data
            new_data.loc[new_data['Originaddress'] == origin, 'label'] = labels

        except Exception as e:
            # Log any error to a file
            with open('error.csv', 'w') as f:
                f.write(str(e))

    # Keep only the columns needed in the result
    new_data_result = new_data.reset_index(drop=True)
    new_data_result = new_data_result[['Originaddress', 'Message', 'message', 'label']]
    new_data_result = new_data_result.rename(columns={'message': 'cleaned_message'})

    return new_data_result

Print the result

A message labeled 1.0 is an anomaly, and a message labeled 0 is a normal message, based on the sender’s historical behavior.

levenshtien_anomalies(data, new_data)
Output table with the label

What happened after labeling the data?

Messages that are processed and found to be normal are added to the historical data for future comparisons. For any anomalous messages, a report is created and sent to the relevant team for further inquiry into their cause.
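This feedback loop can be sketched as follows, assuming the labeled frame returned above; the report file name is a placeholder:

```python
import pandas as pd

def update_history(history: pd.DataFrame, labeled_new: pd.DataFrame) -> pd.DataFrame:
    """Append messages labeled normal (0) to the historical data;
    route anomalous messages (1) to a report for the relevant team."""
    normal = labeled_new[labeled_new['label'] == 0]
    anomalies = labeled_new[labeled_new['label'] == 1]

    # Report anomalies for further investigation
    anomalies.to_csv('anomaly_report.csv', index=False)

    # Grow the history with the newly confirmed normal messages
    return pd.concat([history, normal.drop(columns=['label'])],
                     ignore_index=True)
```

Growing the history this way means each sender’s baseline keeps adapting, so the threshold should be revisited periodically as behavior drifts.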

Conclusion

Anomaly detection in natural language processing (NLP) is a challenging problem that entails finding out-of-the-ordinary patterns or behaviors in text data. This article covered the Levenshtein distance as one possible way to discover such anomalies in NLP.

Anomaly detection in NLP is a critical task for numerous applications, including fraud detection, cybersecurity, and spam detection. Investigators and users should weigh the trade-offs between various approaches and techniques to select the best methodology for their unique use case.



Data scientist @montymobile | In my writing, I explore the fields of data science, machine learning and related topics.