Categorization: Tagging emails

karanjude
MLFeatureEngineering
2 min read · Jan 17, 2018

Detailed Problem Statement: Assume you have a corpus of emails, along with the tags associated with each of them. Given a new email, how can you associate a tag with it?

There are many ways to tag emails, from simple approaches to more sophisticated ones. This article covers simple features that can help achieve this.

Email Preprocessing:

  • Tokenize the email corpus
  • Lower case the email corpus
  • Remove stop words from the email corpus
  • Apply Stemming to the email corpus
  • Apply LDA (Latent Dirichlet Allocation) or LSI (Latent Semantic Indexing) to the normalized word corpus to learn a topic model
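The normalization steps above can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: the stop-word list is a tiny hand-picked sample, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "for", "on", "we", "you"}

# Crude suffix stripper standing in for a real stemmer (e.g. Porter).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lower-case, tokenize, drop stop words, and stem a single email."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


corpus = [
    "The meeting is scheduled for Monday",
    "Invoices are attached to the email",
]
normalized_corpus = [normalize(email) for email in corpus]
# normalized_corpus[0] == ["meet", "schedul", "monday"]
```

The resulting token lists are what the topic model is then trained on.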

For every new incoming email, you can now apply the trained topic model to the email's bag of words. This gives you a vector representation of the email: its topic modeling vector.
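As a rough sketch of this step, here is one way to fit LDA and map a new email to a topic vector using scikit-learn. The library choice, the toy corpus, the helper name `email_topic_vector`, and the number of topics are all assumptions made for illustration; the article does not prescribe them.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus of already-normalized emails (tokens joined by spaces).
normalized_corpus = [
    "meet schedul monday agenda",
    "invoic attach payment due",
    "meet agenda room book",
    "payment invoic overdue remind",
]

# Bag-of-words counts over the normalized corpus.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(normalized_corpus)

# n_components (the number of topics) is an arbitrary choice for this toy data.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)


def email_topic_vector(normalized_email):
    """Map one normalized email to its topic-distribution vector."""
    bow = vectorizer.transform([normalized_email])
    return lda.transform(bow)[0]


new_vector = email_topic_vector("invoic payment remind")
```

`new_vector` has one entry per topic, and the entries sum to 1, so it can be compared directly with other vectors produced by the same model.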

Tag Preprocessing:

  • Collect the historical emails belonging to each tag. The bag of words associated with the tag consists of the normalized words from those emails. Apply LDA to get the topic modeling vector for the tag.
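Pooling the words per tag could look like the sketch below. The function name `build_tag_bags` and the sample data are hypothetical, and it assumes the emails have already been normalized into token lists.

```python
from collections import Counter, defaultdict


def build_tag_bags(normalized_emails, email_tags):
    """Pool the normalized words of all emails carrying each tag.

    normalized_emails: list of token lists, one per historical email.
    email_tags: list of tag collections, aligned with normalized_emails.
    """
    bags = defaultdict(Counter)
    for tokens, tags in zip(normalized_emails, email_tags):
        for tag in tags:
            bags[tag].update(tokens)
    return dict(bags)


# Hypothetical historical data.
normalized_emails = [
    ["meet", "agenda", "monday"],
    ["invoic", "payment"],
    ["meet", "room"],
]
email_tags = [["meetings"], ["finance"], ["meetings", "facilities"]]

tag_bags = build_tag_bags(normalized_emails, email_tags)
# Each pooled bag would then be passed through the fitted topic model
# to obtain that tag's topic vector.
```

Using the same fitted topic model for tags and emails is an assumption here, but it keeps both vectors in the same topic space, which the similarity feature needs.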

Features:

  • Cosine similarity between the email topic modeling vector and the tag topic modeling vector
  • Number of words in email subject
  • Length of email body
  • Categorical encoding of the From field
  • Boolean flag indicating if the email was replied to
  • Boolean flag indicating if the email was forwarded
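The feature list above can be assembled for a single (email, tag) pair roughly as follows. The dict keys, the `sender_index` lookup, and the choice of integer codes for the From field are assumptions for the sake of the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_features(email, email_vec, tag_vec, sender_index):
    """Assemble the feature vector for one (email, tag) pair.

    email is a dict with hypothetical keys: subject, body, from, replied, forwarded.
    sender_index maps known senders to integer codes; unknown senders get -1.
    """
    return [
        cosine_similarity(email_vec, tag_vec),  # topic similarity to the tag
        len(email["subject"].split()),          # number of words in the subject
        len(email["body"]),                     # body length in characters
        sender_index.get(email["from"], -1),    # categorical code for From
        int(email["replied"]),                  # 1 if replied to, else 0
        int(email["forwarded"]),                # 1 if forwarded, else 0
    ]


email = {
    "subject": "Q3 budget review",
    "body": "Please find the numbers attached.",
    "from": "alice@example.com",
    "replied": True,
    "forwarded": False,
}
features = build_features(email, [0.9, 0.1], [0.8, 0.2], {"alice@example.com": 0})
```

Integer codes are the simplest encoding of the sender; one-hot encoding is a common alternative when the number of distinct senders is small.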

Feature Set Preparation & Training

For every tag, prepare a data set: every email that has the tag is a positive example, and every email that does not is a negative example. For each tag, train a random forest as a binary classifier. This yields N classifiers, one per tag.
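A minimal sketch of the one-classifier-per-tag setup, assuming scikit-learn's `RandomForestClassifier`. The function name, the data layout, and the hyperparameters are illustrative only. Features are stored per tag because the similarity feature depends on which tag the email is compared against.

```python
from sklearn.ensemble import RandomForestClassifier


def train_tag_classifiers(features_by_tag, email_tags, all_tags):
    """Train one binary random forest per tag.

    features_by_tag[tag][i] is the feature vector of email i for that tag.
    email_tags[i] is the set of tags attached to email i.
    """
    classifiers = {}
    for tag in all_tags:
        X = features_by_tag[tag]
        # Positive label if the email carries the tag, negative otherwise.
        y = [1 if tag in tags else 0 for tags in email_tags]
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        clf.fit(X, y)
        classifiers[tag] = clf
    return classifiers


# Toy data: two features per email (similarity to the tag, subject word count).
email_tags = [{"finance"}, {"meetings"}, {"finance"}, {"meetings"}]
features_by_tag = {
    "finance": [[0.9, 3], [0.1, 5], [0.8, 2], [0.2, 4]],
    "meetings": [[0.1, 3], [0.9, 5], [0.2, 2], [0.8, 4]],
}
classifiers = train_tag_classifiers(features_by_tag, email_tags, ["finance", "meetings"])
```

In practice most emails will be negatives for any given tag, so class imbalance is worth checking before trusting the scores.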

Learning to rank & Pointwise ranking

Given a new email, compute its features and apply all N classifiers to them. Then sort the resulting scores in descending order and assign the tags corresponding to the top 3 scores.
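The scoring and sorting step might be sketched like this. The function name `top_tags` is hypothetical, and it assumes each classifier exposes `predict_proba` and `classes_` the way scikit-learn classifiers do, with label 1 marking the positive class.

```python
def top_tags(classifiers, features_by_tag, k=3):
    """Score one email with every per-tag classifier and return the k best tags.

    classifiers: dict mapping tag -> fitted binary classifier exposing
        predict_proba and classes_ (as scikit-learn classifiers do).
    features_by_tag: dict mapping tag -> feature vector of the new email
        computed against that tag.
    """
    scores = {}
    for tag, clf in classifiers.items():
        proba = clf.predict_proba([features_by_tag[tag]])[0]
        # Use the probability column that corresponds to the positive label (1).
        positive_col = list(clf.classes_).index(1)
        scores[tag] = float(proba[positive_col])
    # Sort tags by score, highest first, and keep the top k.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

Each tag is scored independently of the others, which is what makes this a pointwise approach. Scores from separately trained classifiers are not guaranteed to be directly comparable, so treat the ordering as a heuristic.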
