Categorization: Tagging emails

Published in MLFeatureEngineering

2 min read · Jan 17, 2018

Detailed Problem Statement: Assume you have a corpus of emails, along with the tags associated with those emails. Given a new email, how can one associate a tag with it?

There are many ways to tag emails, from simple approaches to more sophisticated ones. In this article, we will cover simple features that can help achieve this.

Email Preprocessing:

Tokenize the email corpus
Lowercase the email corpus
Remove stop words from the email corpus
Apply Stemming to the email corpus
Apply LDA (Latent Dirichlet Allocation) or LSI (Latent Semantic Indexing) to the normalized word corpus
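The first four normalization steps can be sketched in plain Python. This is a minimal illustration, not the article's exact pipeline: the stop-word list is deliberately tiny, and `crude_stem` is a hypothetical stand-in for a real stemmer such as Porter.

```python
import re

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "we"}

# Crude suffix stripping, used here as a stand-in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix if the remaining stem has at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize_email(text):
    """Lowercase, tokenize, drop stop words, and stem one email."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


tokens = normalize_email("The meetings are scheduled for Monday")
# tokens == ["meeting", "schedul", "monday"]
```

The resulting token lists are the "normalized word corpus" that the topic model is trained on.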

For every new incoming email, you can now apply the trained topic model to the bag of words belonging to that email. This gives you a vector representation of the email: its topic modeling vector.
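As a rough sketch of this step, the snippet below uses scikit-learn's `CountVectorizer` and `LatentDirichletAllocation`. The choice of library, the toy corpus, and the number of topics are all assumptions made for illustration; the article does not prescribe a specific implementation.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already-normalized training emails (tokens joined by spaces).
corpus = [
    "invoice payment due account",
    "payment receipt invoice bank",
    "meeting schedule agenda room",
    "schedule meeting calendar invite",
]

# Bag-of-words counts over the training corpus.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

N_TOPICS = 2  # assumed value; tune on real data
lda = LatentDirichletAllocation(n_components=N_TOPICS, random_state=0)
lda.fit(counts)


def topic_vector(normalized_text):
    """Map one normalized email to its topic vector using the fitted model."""
    bag = vectorizer.transform([normalized_text])
    return lda.transform(bag)[0]


new_email_vector = topic_vector("invoice payment overdue")
```

Words that never appeared in the training corpus (here, "overdue") are simply ignored by the fitted vectorizer.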

Tag Preprocessing:

Collect the historical emails belonging to the tag. The bag of words associated with the tag is made up of the normalized words from those emails. Apply LDA to get the topic modeling vector for the tag.
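One possible way to build the per-tag bag of words is to pool the normalized tokens of every email carrying that tag. The sketch below assumes emails arrive as (tokens, tags) pairs; that input shape and the `build_tag_bags` name are hypothetical.

```python
from collections import Counter, defaultdict


def build_tag_bags(tagged_emails):
    """Pool the normalized tokens of all emails that carry each tag.

    tagged_emails: iterable of (tokens, tags) pairs, where tokens is a list
    of normalized words and tags is the set of tags on that email.
    """
    bags = defaultdict(Counter)
    for tokens, tags in tagged_emails:
        for tag in tags:
            bags[tag].update(tokens)
    return dict(bags)


# Hypothetical history, for illustration only.
history = [
    (["invoice", "payment"], {"finance"}),
    (["payment", "bank"], {"finance", "urgent"}),
    (["meeting", "agenda"], {"calendar"}),
]
tag_bags = build_tag_bags(history)
```

Each pooled bag can then be passed through the same topic model used for emails, so that tag vectors and email vectors live in the same topic space.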

Features:

Cosine similarity between the email topic modeling vector and the tag topic modeling vector
Number of words in email subject
Length of email body
Categorical encoding of the From field
Boolean flag indicating whether the email was replied to
Boolean flag indicating whether the email was forwarded
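The feature list above can be assembled into one row per (email, tag) pair. The sketch below is an illustration under stated assumptions: the email is a dict with hypothetical keys, and the From field is encoded as a simple integer code per sender, which is only one of several possible categorical encodings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def email_tag_features(email, email_topics, tag_topics, sender_ids):
    """Build the feature row for one (email, tag) pair.

    email is a dict with hypothetical keys: subject, body, sender, replied, forwarded.
    sender_ids maps each known sender to an integer code; unknown senders get -1.
    """
    return [
        cosine_similarity(email_topics, tag_topics),  # topic similarity
        len(email["subject"].split()),                # words in subject
        len(email["body"]),                           # body length in characters
        sender_ids.get(email["sender"], -1),          # encoded From field
        int(email["replied"]),                        # replied flag
        int(email["forwarded"]),                      # forwarded flag
    ]


email = {
    "subject": "Invoice due Friday",
    "body": "Please pay.",
    "sender": "a@example.com",
    "replied": True,
    "forwarded": False,
}
row = email_tag_features(email, [1.0, 0.0], [1.0, 0.0], {"a@example.com": 0})
# row == [1.0, 3, 11, 0, 1, 0]
```

Only the first feature depends on the tag; the remaining five are properties of the email itself.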

Feature Set Preparation & Training

For every tag, prepare a data set. Every email that carries the tag gets a positive label, and every email that does not carry the tag gets a negative label. For each tag, train a random forest as a binary classifier. This yields N classifiers, one per tag.
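A minimal sketch of the one-classifier-per-tag setup, assuming scikit-learn's `RandomForestClassifier`. The data layout (`features_by_tag`, `email_tags`), the toy values, and the hyperparameters are hypothetical and chosen only to keep the example small.

```python
from sklearn.ensemble import RandomForestClassifier


def train_tag_classifiers(features_by_tag, email_tags, tags):
    """Train one binary random forest per tag.

    features_by_tag[tag] is a list of feature rows, one per email, built for that tag.
    email_tags is a list of tag sets, aligned with those rows.
    """
    classifiers = {}
    for tag in tags:
        # Positive label if the email carries the tag, negative otherwise.
        labels = [1 if tag in assigned else 0 for assigned in email_tags]
        model = RandomForestClassifier(n_estimators=50, random_state=0)
        model.fit(features_by_tag[tag], labels)
        classifiers[tag] = model
    return classifiers


# Hypothetical toy data: 4 emails, 2 features each.
email_tags = [{"finance"}, {"finance"}, {"calendar"}, {"calendar"}]
rows = [[0.9, 3], [0.8, 4], [0.1, 2], [0.2, 5]]
features_by_tag = {
    "finance": rows,
    "calendar": [[1 - r[0], r[1]] for r in rows],
}
classifiers = train_tag_classifiers(features_by_tag, email_tags, ["finance", "calendar"])
```

With real data, tags that appear on very few emails produce heavily imbalanced labels, so some form of class weighting or resampling may be worth considering.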

Learning to Rank & Pointwise Ranking

Now, given a new email, we apply all N classifiers to the email's features. We then sort the resulting scores in descending order and assign the 3 tags with the highest scores.
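The ranking step itself is a simple sort. The sketch below assumes each classifier has already produced one score per tag (for example, its positive-class probability); the `top_tags` helper and the score values are hypothetical.

```python
def top_tags(scores, k=3):
    """Return the k tags with the highest scores, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [tag for tag, _ in ranked[:k]]


# Hypothetical per-tag scores for one new email.
scores = {"finance": 0.82, "calendar": 0.10, "urgent": 0.55, "travel": 0.31}
best = top_tags(scores)
# best == ["finance", "urgent", "travel"]
```

Because each tag is scored independently and the scores are then sorted, this is a pointwise approach to ranking.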

Written by karanjude