Machine Learning for Email Signature Detection
As a Data Scientist at snapADDY, improving the ability of our products to capture leads and contacts automatically is one of my main tasks. Our most recent product, the snapADDY Assistant, helps you maintain your Outlook 365 address book by checking incoming emails for contained contact information. In this article I will show you how we employ machine learning to achieve this.
In business communication, emails often contain signatures with contact information — in fact, in some countries this is required by law. As these signatures represent the company to the outside, they are usually up-to-date and a reliable source of contact information. We want to use this information to suggest new or updated contacts to the user.
The data team at snapADDY has been working on a system for contact recognition for the past four years. Given clean input text, our system is able to extract contact information quite successfully. However, this system was designed to work with input text that mainly contains contact data. Input with a lot of irrelevant information (noise) can confuse the system and lead to low recognition quality. In the context of emails, such noise could be a long chain of replies above the signature of the original sender.
In order to avoid misleading lines in the body of an email, we need to extract the signature before feeding the text into our contact recognition pipeline. While the email header is specified in the Internet Message Format standard RFC 5322 and can be extracted using a regular expression, we still need to separate the email body and the signature. But how?
A first approach could be a valediction-based heuristic: everything after ‘best regards’ is likely to be the signature. Because the valediction is typically the end of an email this could be a good baseline. The implementation of this idea, however, is somewhat challenging: it has to deal with different languages, unusual valedictions, or simply typos. How can machine learning help in this case?
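To make the baseline concrete, here is a minimal sketch of such a valediction-based heuristic. The `VALEDICTIONS` list and the function name `split_at_valediction` are assumptions made for illustration, not part of our production code.

```python
import re

# Hypothetical, deliberately short list; a real one would need many
# languages and spelling variants.
VALEDICTIONS = ("best regards", "kind regards", "sincerely", "mit freundlichen grüßen")
PATTERN = re.compile("|".join(re.escape(v) for v in VALEDICTIONS), re.IGNORECASE)


def split_at_valediction(body: str) -> tuple[list[str], list[str]]:
    """Return (text_lines, signature_lines), splitting after the last valediction."""
    lines = body.splitlines()
    # Search from the bottom, because the signature sits at the end of the email.
    for index in range(len(lines) - 1, -1, -1):
        if PATTERN.search(lines[index]):
            return lines[: index + 1], lines[index + 1:]
    # No known valediction found: the heuristic finds no signature at all.
    return lines, []


email = "Hi Anna,\nsee the attached offer.\nBest regards,\nMax Mustermann\nPhone: +49 931 0000000"
text_lines, signature_lines = split_at_valediction(email)
```

A single typo such as "Bset regards" is enough to make this sketch return an empty signature, which illustrates why the heuristic is fragile.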
Machine Learning for Signature Detection
Before we start to work on a prototype, we should clearly define our objective:
Extract all lines in the signature containing relevant contact information for further processing.
This can be considered a classification problem, where relevant is defined via examples (i.e. supervised learning). A supervised learning algorithm learns a function f that maps an input X to a target vector y. In the following sections we will discuss how to define these components based on a set of emails used as training data.
Creating the Target Vector Y
Because the model f requires input-output examples (x_i, y_i) for training, we need to extract the signature from each email in our data set manually. For every email, we mark each line as either regular text or signature, i.e. simply 0 or 1. All lines together form our target vector y. After labeling some test cases we are able to plot the signature-lines-to-text ratio and further explore our data set.
As expected, the (large) majority of the lines in our test emails are regular text lines, while only relatively few lines belong to signatures. This imbalance can cause problems when training machine learning algorithms: if the objective function is not carefully defined, a classifier that (trivially) labels everything as regular text will achieve high accuracy but fail to solve the actual problem. An objective function that rewards a low false negative rate (i.e. few signature lines classified as regular text) can solve this issue. An example of such a function is recall.
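A small worked example shows the effect. The labels below are made up for illustration (90 text lines, 10 signature lines) and are not our actual data set.

```python
def accuracy(y_true, y_pred):
    """Fraction of lines that received the correct label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true signature lines (label 1) that were found."""
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    false_neg = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    found_or_missed = true_pos + false_neg
    return true_pos / found_or_missed if found_or_missed else 0.0


# Illustrative labels: 90 regular text lines (0) and 10 signature lines (1).
y_true = [0] * 90 + [1] * 10
# A trivial classifier that labels every line as regular text.
y_trivial = [0] * 100

print(accuracy(y_true, y_trivial))  # 0.9
print(recall(y_true, y_trivial))    # 0.0
```

The trivial classifier looks good on accuracy but finds no signature line at all, which recall exposes immediately.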
Extracting Features — Crafting the X
A human observer is generally able to find the lines belonging to a signature without reading the complete email carefully. We seek to capture this ‘intuition’ in an encoding that transforms the raw text into our data matrix X. We use the following reasoning to create several features. As the signature is usually located at the bottom of an email, it seems plausible to use the line number as a feature. Also, the signature is often preceded by a valediction, so we can create a feature based on a list of common valedictions. If we take a look at the number of words per line, we see that the final lines of the email body are noticeably shorter (compare the image above), so we use the number of words in a line as another feature. This feature matrix can be extended with other features like line length (in characters) or counts of special characters, e.g. the number of digits per line. In contrast to the valediction-based approach, we no longer rely on a single binary indicator but on a whole set. Using this set of features, we can convert each line of text into a (row) vector x_i and feed it into our model.
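The features described above could be computed roughly as follows. This is a sketch under assumptions: the valediction list is hypothetical, and encoding the valediction feature as "this line is at or below a valediction" is one possible choice, not necessarily the one used in our system.

```python
# Hypothetical list of valedictions, lower-cased for matching.
VALEDICTIONS = ("best regards", "kind regards", "sincerely")


def line_features(lines: list[str]) -> list[list[float]]:
    """Turn each line of an email into one row x_i of the feature matrix X."""
    rows = []
    seen_valediction = False
    for number, line in enumerate(lines):
        if any(v in line.lower() for v in VALEDICTIONS):
            seen_valediction = True
        rows.append([
            float(number),                          # line number
            float(seen_valediction),                # at or below a valediction
            float(len(line.split())),               # words per line
            float(len(line)),                       # characters per line
            float(sum(c.isdigit() for c in line)),  # digits per line
        ])
    return rows


X = line_features(["Hello Tom,", "Best regards,", "Max Mustermann", "Tel. 0931 123"])
```

Dividing the line number by the total number of lines is a possible variation that makes short and long emails comparable.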
Building the Model
In order to iterate quickly, we start with some simple, traditional machine learning algorithms. If they already provide good results, we have a solid starting point and can proceed. As a first shot, we tried Logistic Regression, Stochastic Gradient Descent, Random Forest, and XGBoost.
We first discuss logistic regression, because it is easy to interpret and has the nice property of being able to display feature significance. Our results showed that every feature was significant at the p = .01 level, thus we keep all our features. If we plot the ROC curve, we expect our model to (at least) outperform a randomly guessing classifier and (ideally) be close to the (0, 1) corner. The red diagonal line in the figure above represents the result for randomly guessing. The logistic regression model (blue line) performs significantly better than the random classifier. With that result and an area under the curve of 0.84, we have a good first model.
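For intuition about the area under the curve: it equals the probability that a randomly chosen signature line receives a higher score than a randomly chosen text line, so a randomly guessing classifier ends up at 0.5. The following sketch computes it directly from that definition. The scores are toy values, not our model's output.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: one signature line is scored below one text line.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise loop is quadratic in the number of lines, which is fine for an illustration but too slow for large data sets.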
In the end XGBoost performed best, i.e. it had the lowest number of false negatives while having a good overall performance (according to its average F1-score). Thanks to the xgboost package, training the model takes only a few lines of code.
Using Scikit-Learn we can generate a confusion matrix for evaluating our model. Again, signature lines are labeled as 1 and regular text as 0.
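To show how such a matrix is read, here is a dependency-free sketch that uses the same layout as scikit-learn's `confusion_matrix` (rows are true labels, columns are predicted labels). The function name and the labels are made up for illustration.

```python
def binary_confusion_matrix(y_true, y_pred):
    """2x2 matrix: rows are true labels (0, 1), columns are predicted labels (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


# Illustrative labels: 1 = signature line, 0 = regular text.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

(true_neg, false_pos), (false_neg, true_pos) = binary_confusion_matrix(y_true, y_pred)
print(true_neg, false_pos, false_neg, true_pos)  # 4 1 1 2
```

The bottom-left cell (false negatives) is the one we care about most, because it counts signature lines that are lost for all further steps.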
Finding the Best Hyper Parameters
Let’s step back and rethink our initial goal: we are particularly interested in finding every relevant line, as any missed line and its information is lost for all further steps, e.g. the contact recognition pipeline. This means we prefer a high recall rather than a high precision. With this objective function we can now use a grid search for finding the best hyper parameters. As there are multiple parameters, we progress by fixing all parameters except one or two. We start by optimizing for
We add the optimized parameters to fix_params, choose an unoptimized parameter from fix_params as our new cv_params and repeat the process. After optimizing every parameter, we only need to find a good threshold for the final classification. Again, we use a grid search (in this case a simple for loop) and plot the precision/recall against each threshold.
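The parameter-by-parameter search can be sketched without any library. In this sketch `toy_recall` is a hypothetical stand-in for "train a model with these parameters and return its cross-validated recall", and the parameter names and value grids are illustrative assumptions rather than the ones we actually tuned.

```python
from itertools import product


def grid_search(score, fix_params, cv_params):
    """Try every combination in cv_params while fix_params stay constant."""
    names = list(cv_params)
    best_params, best_score = None, float("-inf")
    for values in product(*(cv_params[name] for name in names)):
        candidate = dict(zip(names, values))
        current = score({**fix_params, **candidate})
        if current > best_score:
            best_params, best_score = candidate, current
    return best_params, best_score


def toy_recall(params):
    """Hypothetical score with its optimum at max_depth=5, learning_rate=0.1."""
    return 1.0 - 0.05 * abs(params["max_depth"] - 5) - abs(params["learning_rate"] - 0.1)


# Start from initial values and tune one parameter at a time.
fix_params = {"max_depth": 3, "learning_rate": 0.3}
search_space = {"max_depth": [3, 5, 7], "learning_rate": [0.3, 0.1, 0.01]}

for name, values in search_space.items():
    cv_params = {name: values}
    best, _ = grid_search(toy_recall, fix_params, cv_params)
    fix_params.update(best)  # freeze the tuned value before the next round

print(fix_params)
```

Tuning one or two parameters at a time keeps the number of trained models small, at the price of possibly missing interactions between parameters.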
As the recall declines linearly but the precision increases asymptotically, we choose 0.1 as our threshold and get the final classification report.
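The threshold search itself really is just a loop. The following sketch uses made-up probabilities to show the trade-off. The helper `precision_recall_at` is an illustrative name, and the numbers are not our model's output.

```python
def precision_recall_at(y_true, probabilities, threshold):
    """Precision and recall when every score >= threshold counts as signature."""
    y_pred = [1 if p >= threshold else 0 for p in probabilities]
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    false_pos = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    false_neg = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 1.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Made-up labels and predicted signature probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
probabilities = [0.02, 0.05, 0.15, 0.40, 0.08, 0.30, 0.70, 0.95]

for step in range(1, 10):
    threshold = step / 10
    precision, recall = precision_recall_at(y_true, probabilities, threshold)
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

A low threshold keeps more true signature lines at the cost of some extra text lines, which matches our preference for recall over precision.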
The classification recall increased to 0.96, while still maintaining a high precision.
In this blog post, we started with the initial goal of extracting contact information from emails in order to automatically update address books. We specified an objective, ‘extract signature lines’, created a data set and trained different machine learning models with it. With this approach, we end up being able to extract 96% of the relevant lines, so that they can be used in our subsequent contact recognition pipeline.
A take-away message for any machine learning project is the following. Before trying to build a machine learning model, it is important to clearly specify your objective (function) and explore the available data. This will simplify the training process, speed up the development and produce good results.
snapADDY is a technology start-up based in Würzburg, Germany, developing software that helps sales and marketing teams to keep their CRM systems clean and up-to-date.
The company offers two main products: snapADDY Grabber (supporting the in-house sales teams in CRM data maintenance), and snapADDY VisitReport (designed to digitize lead capturing in the field and at trade fairs). In addition, there is a scanner app that has been developed for capturing contact data from business cards and provides a direct CRM connection from the app.
The core of all three software products is an AI-powered contact and address recognition system, which is able to recognize and extract contact information from unstructured text in a wide variety of formats.