Elhanan Mishraky
Jan 24, 2023

Cracking the code on email addresses

By Elhanan Mishraky, Data Scientist, Intuit

In this post, you will learn how to easily get more information from an email address, using Intuit’s open-source package Email Decomposer.

Suppose you received an email from johndoe2003@johndoepizza.com, and you have no other details about the sender. What can you tell about him?

I assume you will say: ‘This is easy, his name is John Doe, born in 2003, and he owns a pizza shop’.

Interpreting an email address is an easy task for a human. Doing the same thing programmatically, however, can be tricky.

Email Structure

Email addresses seem to have a well-defined structure. An email address is made up of a prefix, the symbol @, and a domain.
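As a minimal sketch of that structure, an address can be split at its last @ symbol. The helper name split_email is hypothetical and is not part of Email Decomposer.

```python
def split_email(address: str) -> tuple[str, str]:
    """Split an address into (prefix, domain) at the last '@'."""
    prefix, sep, domain = address.strip().rpartition("@")
    # rpartition returns an empty separator when no '@' is present.
    if not sep or not prefix or not domain:
        raise ValueError(f"not a valid email address: {address!r}")
    return prefix, domain


print(split_email("johndoe2003@johndoepizza.com"))
# ('johndoe2003', 'johndoepizza.com')
```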

However, when it comes to the email prefix, there are no rules on whether or how names should be encoded in it. This makes it harder to tell whether the prefix contains a name and, if so, how to extract it.

Email Prefix Common Patterns

Even though there are no enforced rules for name format in email prefixes, there are common patterns that are frequently used. For example, if the first name is John and the last name is Doe, we can expect the following patterns:

  1. {first name}.{last name} → john.doe
  2. {first name}_{last name} → john_doe
  3. {first name}{last name} → johndoe
  4. {first name} → john
  5. {last name} → doe
  6. {first letter of the first name}{last name} → jdoe
  7. {first letter of the last name}{first name} → djohn

In addition, patterns 1, 2, and 3 can also come in reversed first/last name order.
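To make the patterns concrete, here is a small sketch that generates every prefix they would produce for a given name, including the reversed variants of patterns 1, 2, and 3. The function candidate_prefixes is a hypothetical helper written for illustration, not the package's implementation.

```python
def candidate_prefixes(first: str, last: str) -> set[str]:
    """Return the prefixes the common patterns produce for one name.

    Assumes both names are non-empty.
    """
    first, last = first.lower(), last.lower()
    candidates = {
        f"{first}.{last}",    # pattern 1
        f"{first}_{last}",    # pattern 2
        f"{first}{last}",     # pattern 3
        first,                # pattern 4
        last,                 # pattern 5
        f"{first[0]}{last}",  # pattern 6
        f"{last[0]}{first}",  # pattern 7
    }
    # Patterns 1, 2, and 3 in reversed first/last name order.
    candidates |= {f"{last}.{first}", f"{last}_{first}", f"{last}{first}"}
    return candidates


prefixes = candidate_prefixes("John", "Doe")
print("jdoe" in prefixes)      # True
print("doe.john" in prefixes)  # True
print(len(prefixes))           # 10
```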

With these patterns at hand, it is easier to extract names from email addresses. However, there is another challenge: how do you know if the prefix includes a name? What if the email prefix is great_coffee@domain?

US Census Data Sets

Luckily, the US Census Bureau publishes lists of commonly used first and last names.

We used the following data sets:

  1. First names are taken from the publicly available Census 1990 Names Files. We used the male and female first-name files, which contain 1,219 male names and 4,275 female names.
  2. Last names are taken from the publicly available Frequently Occurring Surnames from the 2010 Census. The list includes 162,254 last names.

Now that we have common email prefix patterns and common names from the US Census Bureau, it seems that we are all set to extract names from email addresses. But wait, there is one more small issue: combining the patterns with the name data sets can produce funny results. What is the name of billing@domain? Is it Bill Ing? Not really…
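To see how the false match comes about, here is a minimal sketch of pattern 3 applied naively. The tiny name sets are illustrative stand-ins for the Census lists, and split_concatenated is a hypothetical helper, not the package's implementation.

```python
# Illustrative stand-ins for the Census first-name and last-name lists.
FIRST_NAMES = {"john", "bill"}
LAST_NAMES = {"doe", "ing"}


def split_concatenated(prefix: str) -> list[tuple[str, str]]:
    """Return every (first, last) split where both halves are known names."""
    prefix = prefix.lower()
    return [
        (prefix[:i], prefix[i:])
        for i in range(1, len(prefix))
        if prefix[:i] in FIRST_NAMES and prefix[i:] in LAST_NAMES
    ]


print(split_concatenated("johndoe"))  # [('john', 'doe')]
print(split_concatenated("billing"))  # [('bill', 'ing')]
```

Both prefixes match the pattern equally well, so the name lists alone cannot tell a real name from an ordinary word.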

Bill Ing or Billing?

To handle the case of billing@domain, we need to first understand that billing is a word and shouldn’t be matched against our name patterns. This is an easy task for NLTK (Natural Language Toolkit).

With NLTK, we can use 2 simple steps to decide if the prefix is a word:

  1. Lemmatize the email prefix. Lemmatization will tell us that the “billing” lemma is simply billing.
  2. Check if the lemma is a dictionary word. In our case, billing is indeed a word.

This way, we’ll just keep “billing” as is and avoid splitting it into Bill Ing.
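The two steps can be sketched roughly as follows. The post does not say which dictionary the package consults, so this sketch assumes that a word counts as a dictionary word when WordNet has at least one entry for it. Both function names are hypothetical, and the lemmatizer and dictionary check are passed in as arguments so that the logic can be tried without downloading any NLTK data.

```python
from typing import Callable


def is_dictionary_word(
    prefix: str,
    lemmatize: Callable[[str], str],
    is_known_word: Callable[[str], bool],
) -> bool:
    """Step 1: lemmatize the prefix. Step 2: look the lemma up in a dictionary."""
    lemma = lemmatize(prefix.lower())
    return is_known_word(lemma)


def load_nltk_checks() -> tuple[Callable[[str], str], Callable[[str], bool]]:
    """Build both callables from NLTK's WordNet data (imported lazily)."""
    import nltk
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    # One-time corpus download; some NLTK versions may also need "omw-1.4".
    nltk.download("wordnet", quiet=True)
    lemmatizer = WordNetLemmatizer()
    # Assumption: a word is "in the dictionary" if WordNet has a synset for it.
    return lemmatizer.lemmatize, lambda word: bool(wordnet.synsets(word))
```

With NLTK installed, calling is_dictionary_word("billing", *load_nltk_checks()) runs both steps against WordNet. A prefix that passes this check would be kept whole rather than matched against the name patterns.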

Intuit’s Email Decomposer Open Source Package

We open-sourced the process described above so that you can easily extract names from email addresses in Python. We call it the Email Decomposer.

Get started in just 3 steps:

  1. Install:

pip install email_decomposer

  2. Import:

from email_decomposer.EmailDecomposer import EmailDecomposer

  3. Decompose an email:

EmailDecomposer().decompose(data='johndue@intuit.com', get_host=True)

What you will get is:

{'first_name': 'John', 'last_name': 'Due', 'host': 'intuit.com'}

Happy Email Decomposing!