An LSTM-based Approach for Email Signature Identification

Jean Baptiste Polle
8 min read · Sep 24, 2021


Email signature identification is a common task in natural language processing, with applications ranging from email management to spam detection. Automatically identifying signature lines in emails can help organize and categorize emails, filter out spam or unwanted messages, and even improve the accuracy of sentiment analysis. However, accurately identifying signature lines in emails can be challenging due to the variability in the formatting and content of signatures, as well as the presence of other types of text or content that may resemble a signature.

In this article, I propose a new approach for identifying signature lines in emails using Long Short-Term Memory (LSTM) neural networks. LSTMs are a type of recurrent neural network that is well-suited for modeling sequential data. This makes them a good choice for identifying email signatures, as emails can be analyzed line by line, taking into account how each line affects the classification of the next one. I will describe the process of training and evaluating an LSTM model on a dataset of emails, and present the results and conclusions. The approach shows promising results and offers a new perspective on the use of LSTMs for email signature identification.

Dataset

For this study, a limited dataset of French emails was used, consisting of approximately 5000 lines. Each line was manually classified as being part of a signature or not:

  • 2294 lines were not part of a signature (46%)
  • 2625 lines were part of a signature (54%)

This result illustrates a potential problem when working with emails: the inclusion of signatures, which can comprise a significant portion of the text (in this instance, more than 50% of the total lines). If these signatures are not properly identified and separated from the main body of the text, the accuracy of any analysis (such as clustering or topic modeling) may be affected. It is crucial to accurately identify and distinguish signatures in emails to ensure the reliability of the results.

As I continue with the process, I split the dataset into a training set and a testing set. It’s important to consider the possibility of data leakage when dividing the data in this way, especially as I am working with a small number of sources. Data leakage can occur if signatures from the same person appear in both the training and test data, as this could artificially inflate the model’s performance. To address this issue, I split the dataset by source, ensuring, for example, that all emails from sender “A” go to the test set while all emails from sender “B” go to the training set. Performance is therefore evaluated on signatures never seen during training.
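The by-source split can be sketched with scikit-learn’s GroupShuffleSplit, treating each sender as a group (the data below is a hypothetical toy stand-in, not the actual dataset):

```python
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in: one entry per email line, "sources" identifies the sender.
lines = [f"line {i}" for i in range(10)]
sources = ["A", "A", "A", "B", "B", "C", "C", "C", "D", "D"]

# Split by group so that no sender appears in both train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(lines, groups=sources))

train_sources = {sources[i] for i in train_idx}
test_sources = {sources[i] for i in test_idx}
# Signatures seen during training never appear at evaluation time.
assert train_sources.isdisjoint(test_sources)
```

This guarantees the leakage-free property at the level of the split itself rather than relying on a random shuffle.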

Features

For each line of the email, I computed the following features:

  1. Words count for the line
  2. Position of each line in the email (number between 0 and 1, where 0 is the first line in email and 1 is the last line)
  3. Special characters count
  4. Number of empty lines before
  5. Cosine distance between the embedding of the line and the embedding of “thank you” (using a sentence-transformer model). This should give a small distance for the closing of an email and a larger distance otherwise. I then take the inverse of this value (capped at a maximum).
  6. Using a named entity recognition model and regular expressions, I count the occurrences of person names, organizations, locations, dates, telephone numbers, email addresses, and website addresses for each line.

This gives us a total of 12 features. Let’s take a small example with the following email:

Bonjour Vincent,

Merci de m’avoir rappelé hier.

Seriez vous disponible pour un rendez vous la semaine prochaine?

Merci,

Jean-Baptiste

The list of features we obtain is (with only relevant values):
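The purely rule-based part of this feature extraction can be sketched as follows (the NER counts and the sentence-transformer distance come from external models and are omitted; the regular expressions are illustrative, not the exact ones used in the study):

```python
import re

# Illustrative patterns for phone numbers, email addresses, and websites.
TEL_RE = re.compile(r"\+?\d[\d .-]{7,}\d")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
WEB_RE = re.compile(r"(?:https?://|www\.)\S+")

def line_features(line, index, total_lines, empty_lines_before):
    return {
        "word_count": len(line.split()),
        # 0 for the first line of the email, 1 for the last.
        "position_line": index / max(total_lines - 1, 1),
        "special_char_count": sum(not (c.isalnum() or c.isspace()) for c in line),
        "empty_lines_before": empty_lines_before,
        "TEL": len(TEL_RE.findall(line)),
        "EMAIL": len(EMAIL_RE.findall(line)),
        "WEB": len(WEB_RE.findall(line)),
    }

# The last (non-empty) line of the example email above.
feats = line_features("Jean-Baptiste", index=4, total_lines=5, empty_lines_before=1)
```

For the signature line “Jean-Baptiste”, this yields a word count of 1, a line position of 1.0, and one special character (the hyphen).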

Correlation analysis

A high-level correlation analysis shows some interesting correlations between the output variable (“is_signature”) and some of the features.

We can notice a positive correlation for:

  • “position_line”: Position of the line in the email
  • “PER”, “ORG”, “LOC”, “TEL”, “EMAIL”, “WEB”: Count of entities such as person names, organizations, locations, etc.
  • “inv_distance_to_merci_previous”: for the non-LSTM models, I added for each line the previous value of the inverse of cosine distance to “thank you”. The reason is that a line following a sentence similar to “thank you” has a good chance to be part of a signature and this should help the simple models make a correct prediction.
Correlation heatmap

On the other hand, we can notice a negative correlation for:

  • Word count
  • Date
  • “inv_distance_to_merci”

This seems to make sense as we wouldn’t expect a date or long lines in a signature. Neither would we expect “thank you” to be included in a signature.
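This kind of analysis reduces to a single pandas call; here is a minimal sketch on a made-up mini-dataset (the values are hypothetical, chosen only to mirror the correlations observed above):

```python
import pandas as pd

# Hypothetical mini-dataset: the last two lines are the signature.
df = pd.DataFrame({
    "position_line": [0.0, 0.25, 0.5, 0.75, 1.0],
    "word_count":    [3, 6, 9, 1, 1],
    "is_signature":  [0, 0, 0, 1, 1],
})

# Pearson correlation of each feature with the label.
corr = df.corr()["is_signature"].drop("is_signature")
```

Even on this toy table, line position correlates positively with being a signature and word count correlates negatively, matching the heatmap.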

Simple models

To establish a benchmark against which to compare the performance of the LSTM model, I evaluated a range of standard models by averaging the results of multiple training runs with different train/test splits. Before training the models, I applied feature scaling using either scikit-learn’s MinMaxScaler or StandardScaler, depending on the distribution of values for each feature.
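The scaling-plus-model setup can be sketched as a scikit-learn pipeline; the data here is a random toy stand-in for the real 12-feature matrix:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in: 200 lines x 12 hand-crafted features, with fake
# "is_signature" labels driven by two of the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
y = (X[:, 1] + X[:, 2] > 0).astype(int)

# Scaler and classifier live in one pipeline, so the scaler is fit
# only on training data when cross-validating.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X, y)

# One coefficient per feature: sign and magnitude hint at importance.
coefs = clf.named_steps["logisticregression"].coef_[0]
```

Swapping in SVC, DecisionTreeClassifier, or RandomForestClassifier for the last pipeline step reproduces the rest of the benchmark.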

  • Logistic Regression
Metrics logistic regression

This model is interesting as it allows us to have an idea of the most important features used by the model:

As previously shown in the correlation analysis, features such as the count of organization names, the count of phone numbers, and the position of the line within the email increase the likelihood of a line being a signature. On the other hand, features like the count of dates and the number of words in the line decrease that probability. The inverse distance to “thank you” in the current and previous lines has opposite effects: a high value on the current line makes it less likely to be a signature, while a high value on the previous line makes it more likely, as expected.

  • SVC
Metrics SVC
  • Decision Tree
Metrics decision tree
  • Random forest
Metric random forest

The best F1 score achieved was approximately 85%. While this score could likely be improved by tuning the model parameters, the primary objective of this study was to establish a baseline using standard models before attempting an LSTM model on the same task. Overall, the results suggest that these models are capable of accurately identifying signatures in emails, but further optimization may be needed to achieve higher performance.

LSTM

The use of LSTM in signature detection aims to leverage information from previous lines of the email to predict whether a subsequent line is a signature. This is particularly useful in cases where the previous line contains a closing phrase such as “thank you”, “thanks”, or “regards”. LSTM is also effective in identifying multi-line signatures, as the identification of a previous line as a signature should impact the prediction of subsequent lines.

I used a bi-directional LSTM model to take advantage of the predictive power of both preceding and following lines when classifying the current line. This again enables the model to consider the context of a line, such as its proximity to the closing of the email (for a line preceding the closing), in determining whether it is likely to be part of a signature.

In the LSTM model, each line of the email is treated as a single step. To standardize the input to the model, I fixed the number of steps at 50 and retained only the last 50 lines of emails that exceeded this number. For shorter emails, I padded the remaining steps with zeros. This allows me to create batches of uniform size: 50 steps (the maximum number of lines) multiplied by the number of features.
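The truncate-or-pad step can be sketched in NumPy, left-padding with zeros so that the real lines always end at the last step:

```python
import numpy as np

MAX_STEPS = 50    # fixed number of steps (lines) per email
N_FEATURES = 12   # hand-crafted features per line

def to_fixed_length(email_features):
    """Keep the last MAX_STEPS lines; left-pad shorter emails with zeros."""
    arr = np.asarray(email_features, dtype=float)[-MAX_STEPS:]
    pad = MAX_STEPS - len(arr)
    return np.vstack([np.zeros((pad, N_FEATURES)), arr])

# A 5-line email and a 60-line email become one uniform batch.
batch = np.stack([
    to_fixed_length(np.ones((5, N_FEATURES))),
    to_fixed_length(np.ones((60, N_FEATURES))),
])
# batch.shape is (2, 50, 12)
```

A masking layer could be added so the network ignores the padded steps, but with zero-valued padding and so few features the simple approach already works in practice.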

If we look again at the email example above, it would look like this:

Each column is a step in input of the LSTM network. Columns 0 to 3 are the padding that we added.

Finally, a dense layer with one unit and a sigmoid activation produces an output between 0 and 1.

Therefore, the model is composed of three layers:

  • The input (dimension = Max number of lines in emails x Number of features)
  • Bidirectional LSTM layer
  • Dense output (Dimension = 1)
Model

The state size was 40 for each LSTM direction, which led to a small model of 16,721 parameters.
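As a sanity check, the quoted parameter count is consistent with 11 input features per step (assuming the previous-line distance feature, which was added only for the non-LSTM models, is excluded here) and a state size of 40 per direction:

```python
def lstm_params(n_features, units):
    # Each of the 4 gates has input weights, recurrent weights, and a bias.
    return 4 * (units * (n_features + units) + units)

N_FEATURES = 11  # assumption: the 12 features minus the previous-line distance
UNITS = 40

bidirectional = 2 * lstm_params(N_FEATURES, UNITS)  # forward + backward passes
dense = 2 * UNITS + 1  # concatenated states -> 1 sigmoid unit (plus bias)
total = bidirectional + dense  # = 16721
```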

The results obtained were:

Metrics LSTM

The LSTM model improved the F1 score by more than 10% over the best simple model.

In practical applications, this model has proven to be highly effective in accurately identifying most signatures, which in turn significantly improves the results of downstream analysis.

However, I have observed some recurring issues, such as the failure to detect signatures automatically added to emails sent from phones (“Sent from my iPhone”). I believe that adding a new feature to measure the distance to “Sent from” could easily resolve this issue.

Another potential solution is to include the full embedding of each line in the model, which would provide the model with a sense of the meaning of each line and likely improve performance. However, this approach would significantly increase the size of the model and could present some logistical challenges.

In conclusion, this approach to signature detection shows promising results and has the potential for further improvement. The technique demonstrates a high level of accuracy and efficiency in detecting signatures, making it a valuable tool for a variety of applications. If you have tried similar approaches for signature detection or have any questions about this article, please do not hesitate to contact me or leave a comment. Thank you for reading.
