Anomaly Detection in NLP Using Levenshtein Distance

Fatima Mubarak
Published in Tech Blog · 7 min read · Apr 7, 2023

The idea of anomalies has been around for centuries, but it wasn’t until the development of contemporary data science and statistical techniques that anomaly detection was standardized as a strategy for finding out-of-the-ordinary patterns in data. It has its roots in statistics, where the study of outliers and extreme values has long been a significant subject.

Photo by Randy Fath on Unsplash

This article outlines a new method for dealing with anomalous data that uses Levenshtein distance.

What is Anomaly Detection?

Anomaly detection is a way to detect unexpected patterns called outliers. This method includes analyzing data to find unusual patterns and identifying the thresholds and boundaries that specify the anomalous data.

Unexpected pattern (github.com)

Anomaly detection can be used for various applications such as spam detection, fraudulent transactions, and cybersecurity. The methods for solving anomalies differ depending on the data types and use cases. Anomaly detection is a powerful tool for finding risks in data, and flagging or removing anomalies can improve a model’s accuracy and precision.

Anomaly Detection and Natural Language Processing

Natural language processing (NLP) is one of today’s most active fields. NLP is a branch of artificial intelligence that works to understand text and speech, the languages of humans. It helps convert unstructured data into structured data that can be used for modeling.

Natural Language Processing schema(expersight.com)

Natural language processing can be used in various applications, including summarizing text, performing sentiment analysis, and translating text.

Due to the complexity of natural language and anomalies, it might be challenging to identify the best strategy for improving decision-making. Numerous methods exist for handling anomalies, including the Levenshtein method, isolation forests, local outlier factors, and one-class SVM.

This article will concentrate on resolving anomaly detection issues using the Levenshtein approach.

What is Levenshtein and how is it computed?

The Levenshtein distance is a metric for comparing two given strings. It counts the minimum number of character insertions, deletions, or substitutions required to transform one string into the other.

Levenshtein work (ideserve.co.in)

The algorithm for computing the Levenshtein distance builds a matrix with the two strings as rows and columns, filling each cell with the minimum number of edits needed to transform the prefix of one string into the corresponding prefix of the other. The Levenshtein distance between the two strings is the final value in the matrix’s bottom-right corner.
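The matrix construction described above can be sketched in plain Python (a minimal illustration, not the optimized implementation from the `Levenshtein` package used later in this article):

```python
def levenshtein(a: str, b: str) -> int:
    """Compute the Levenshtein distance between a and b by filling
    a (len(a)+1) x (len(b)+1) matrix of minimum edit costs."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # cost of deleting i characters from a
    for j in range(cols):
        d[0][j] = j  # cost of inserting j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1]  # bottom-right corner holds the distance

print(levenshtein("kitten", "sitting"))  # 3
```

For example, transforming “kitten” into “sitting” takes three edits: substitute k→s, substitute e→i, and insert g.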

Measure of anomalies (analyticsvidhya.com)

Based on the examined data, a threshold can be established after measuring the Levenshtein distance between two texts and converting it into a normalized match score. A text can be categorized as an anomaly if its match score is below the threshold. On the other hand, it can be categorized as regular text if the match score exceeds the threshold.

It’s crucial to remember that threshold value selection can significantly affect anomaly detection accuracy. The threshold should be carefully chosen and fine-tuned to get the best performance for the application and the dataset being analyzed.
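One common way to make the threshold comparable across texts of different lengths is to normalize the raw distance into a similarity score between 0 and 1. The sketch below shows this idea; the normalization formula and the 0.8 threshold are illustrative choices, to be tuned per dataset:

```python
def levenshtein(a: str, b: str) -> int:
    """Two-row dynamic-programming version of the Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def match_score(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]: 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

THRESHOLD = 0.8  # tuned according to the application and dataset

def classify(new_msg: str, history_msg: str) -> str:
    return "normal" if match_score(new_msg, history_msg) > THRESHOLD else "anomalous"

print(classify("your verification code is 1234",
               "your verification code is 9876"))  # normal
```

Here the two messages differ only in the four digits, so the score is 1 − 4/30 ≈ 0.87, above the threshold; a completely unrelated message from the same sender would score far lower and be flagged.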

Steps in NLP for Anomaly Detection

  • Import the NLP and data processing libraries that are required.
  • Load history data and new data.
  • Preprocess the text data by lowercasing it, removing stop words, and removing punctuation.
  • Find a unique column that can serve as the primary key for comparing the text data in the old and new datasets. For instance, you can compare the messages sent by the same user at different points in time.
  • Specify a threshold for the Levenshtein-based match score.
  • Compare the match score of each new message with the given threshold to decide whether it is anomalous.
  • Label the new message accordingly.

Python code sample

Loading libraries

import re
import warnings
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from Levenshtein import distance

warnings.filterwarnings('ignore')

# Download the NLTK resources used below (only needed once)
# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('wordnet')

Loading the dataset

new_data = pd.read_csv('new_data.csv')
data = pd.read_csv('history.csv')

Show sample of the data

data.head()
new_data.head()
History data sample
New data sample

Data Cleaning Function

We should create a data cleaning function that does the following:

  • Lowercase the text
  • Remove punctuation
  • Remove digits
  • Remove stop words that commonly occur in the documents
  • Lemmatize, i.e., reduce each word to its base form (e.g., cats → cat)
# Define a function to preprocess text
def preprocess_text(text):
    # Lowercase the text
    text = text.lower()

    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)

    # Remove digits
    text = re.sub(r'\d+', '', text)

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text)
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize tokens
    lemmatizer = nltk.WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    # Join the lemmas back into a single string
    preprocessed_text = ' '.join(lemmas)
    return preprocessed_text

Apply the preprocess_text() function to the message column in both datasets.

data['message'] = data['message'].apply(preprocess_text)
new_data['message'] = new_data['message'].apply(preprocess_text)

Levenshtein Distance Function for anomaly detection

Now, we should create the Levenshtein distance function that does the following:

  • Take the originating address as the unique key.
  • Give a value for the threshold that is tuned according to your case and data.
  • Inside a try block, find the messages that share the same originating address in the historical data and the new data.
  • Compute the match score between each new message and the old messages from the same originating address.
  • Label the new message according to the match score: if the score is higher than the threshold, label it “normal,” and if it is lower, label it “anomalous.”
  • Add a label column to the data with a value of 1 if the message is anomalous and 0 if it is normal.
  • Handle any errors that occur and print them to a file.
  • Specify the columns of the data frame that are needed in the result.
def levenshtien_anomalies(data, new_data):
    # Normalized match score derived from the Levenshtein distance
    # (the article relies on such a score; this normalization is one common choice)
    def match_score(a, b):
        return 1 - distance(a, b) / max(len(a), len(b), 1)

    # Threshold tuned according to your case and data
    threshold = 0.8

    # Loop over each originating address in the new data
    origins = pd.unique(new_data['Originaddress'])
    for origin in origins:
        try:
            # Get the messages for the current originating address in the new data
            new_data_msgs = new_data.loc[new_data['Originaddress'] == origin, 'message'].values

            # Get the messages for the current originating address in the historical data
            hist_data_msgs = data.loc[data['Originaddress'] == origin, 'message'].values

            # Compute the match score of each new message against the historical messages
            scores = []
            for new_msg in new_data_msgs:
                msg_scores = [match_score(new_msg, hist_msg) for hist_msg in hist_data_msgs]
                scores.append(np.max(msg_scores))

            # Label the new messages based on the match scores
            labels = [0 if score > threshold else 1 for score in scores]

            # Add the labels to the new data
            new_data.loc[new_data['Originaddress'] == origin, 'label'] = labels

        except Exception as e:
            # Log any error to a file
            with open('error.csv', 'w') as f:
                f.write(str(e))

    # Keep only the columns needed in the result
    new_data_result = new_data.reset_index(drop=True)
    new_data_result = new_data_result[['Originaddress', 'Message', 'message', 'label']]
    new_data_result = new_data_result.rename(columns={'message': 'cleaned_message'})

    return new_data_result

Print the result

A message labeled 1.0 is an anomaly, and a message labeled 0 is a normal message, based on the sender’s historical behavior.

levenshtien_anomalies(data, new_data)
Output table with the label

What happened after labeling the data?

Messages that are processed and found to be normal are added to the historical data for future comparisons. For any anomalous messages, a report is created and sent to the relevant team for further inquiry into their cause.
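This feedback loop can be sketched as follows, assuming the labeled frame returned above; the report file name is a placeholder:

```python
import pandas as pd

def update_history(history: pd.DataFrame, labeled_new: pd.DataFrame) -> pd.DataFrame:
    """Append messages labeled normal (0) to the historical data;
    route anomalous messages (1) to a report for the relevant team."""
    normal = labeled_new[labeled_new['label'] == 0]
    anomalies = labeled_new[labeled_new['label'] == 1]

    # Report anomalies for further investigation
    anomalies.to_csv('anomaly_report.csv', index=False)

    # Grow the history with the newly confirmed normal messages
    return pd.concat([history, normal.drop(columns=['label'])],
                     ignore_index=True)
```

Growing the history this way means each sender’s baseline keeps adapting, so the threshold should be revisited periodically as behavior drifts.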

Conclusion

Anomaly detection in natural language processing (NLP) is a challenging problem that entails finding out-of-the-ordinary patterns or behaviors in text data. This article covered the Levenshtein distance as one possible way to discover such anomalies in NLP.

Anomaly detection in NLP is a critical task for numerous applications, including fraud detection, cybersecurity, and spam detection. Investigators and users should weigh the trade-offs between various approaches and techniques to select the best methodology for their unique use case.



Data scientist @montymobile | In my writing, I explore the fields of data science, machine learning and related topics.