Extracting names, emails and phone numbers

Alexander Crosson
Apr 21, 2016

As part of my exploration into natural language processing (NLP), I wanted to put together a quick guide for extracting names, emails, phone numbers and other useful information from a corpus (body of text).

I’ll be using python for this example, but other languages may prove just as useful. The Natural Language Toolkit (NLTK) is great for data extraction. It can be used to pull out organizations, locations, persons and more, from a block of text. We can use regular expressions for extracting emails and phone numbers.

Example body of text:

Hey, This week has been crazy. Attached is my report on IBM. Can you give it a quick read and provide some feedback? Also, make sure you reach out to Claire (claire@xyz.com). You’re the best. Cheers, George W. 212-555-1234

We can use the techniques listed below to extract:

[‘2125551234’]
[‘George W.’]
[‘claire@xyz.com’]

The typical information extraction architecture works as follows:

  1. Segment the body — split the text into an array of sentences
  2. Tokenize — split each sentence into an array of words
  3. Part of Speech Tagging (POS) — tag each word with a grammatical label
  4. Chunking — group and label multi-token sequences

To get started, first let’s import NLTK and the stop words. Stop words, in NLP, are words that are filtered out before or after some transformation. NLTK provides a list of stop words in several languages.

import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')

Then let’s remove all stop words and segment the body into sentences.

document = ' '.join([i for i in document.split() if i not in stop])
sentences = nltk.sent_tokenize(document)

We can then tokenize each sentence. Tokenization simply breaks each sentence into an array of words. For example, the sentence “A red ball” tokenized would be [‘A’, ‘red’, ‘ball’].

sentences = [nltk.word_tokenize(sent) for sent in sentences]

Next, we can POS tag each word in each sentence. NLTK returns tagged words as a list of (word, tag) tuples (e.g. [(‘a’, ‘DT’), (‘red’, ‘JJ’), (‘ball’, ‘NN’)], where DT is a determiner, JJ is an adjective and NN is a noun). A list of tags and their corresponding values can be found here.

sentences = [nltk.pos_tag(sent) for sent in sentences]

Now that we have segmented, tokenized and tagged our corpus, we can apply a technique known as chunking, where several token-tag pairs are grouped together based on their relationship to one another. NLTK provides a classifier that has been trained to classify named entities. Of course we could have made our own classifier, but for simplicity, we’ll use this out-of-the-box solution.

The default chunking method will add labels to chunks it deems are named entities. Since we are only looking for persons, we’ll only return chunks labeled “PERSON”; other labels include ORGANIZATION, LOCATION, DATE and more.

Note: if a chunk has been classified as a named entity, it will be of type nltk.tree.Tree.
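Putting the segmenting, tokenizing, tagging and chunking steps together, a minimal sketch might look like the following. It assumes the standard NLTK models (punkt, averaged_perceptron_tagger, maxent_ne_chunker and words) have already been downloaded via nltk.download(); the helper name extract_names is mine, not part of NLTK:

```python
import nltk

def extract_names(document):
    """Return the PERSON named entities found in a body of text."""
    names = []
    for sentence in nltk.sent_tokenize(document):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        for chunk in nltk.ne_chunk(tagged):
            # Named-entity chunks come back as nltk.tree.Tree objects;
            # plain (word, tag) tuples are left untouched.
            if isinstance(chunk, nltk.tree.Tree) and chunk.label() == 'PERSON':
                names.append(' '.join(leaf[0] for leaf in chunk.leaves()))
    return names
```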

Putting these steps together, we can extract the persons mentioned within a specific body of text.

Extracting email addresses and phone numbers proves to be an easier task. A simple Google search will return some powerful regular expressions that can match the emails and phone numbers found within a string.
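As one illustration, here is a sketch using Python’s re module. The patterns and helper names are deliberately simple examples of my own, not exhaustive validators for every email or phone format:

```python
import re

# Simple illustrative patterns -- not RFC-complete validators.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
PHONE_RE = re.compile(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}')

def extract_emails(text):
    return EMAIL_RE.findall(text)

def extract_phone_numbers(text):
    # Strip separators so numbers come back in a normalized digit-only form.
    return [re.sub(r'\D', '', match) for match in PHONE_RE.findall(text)]
```

Run against the example body of text, these return [‘claire@xyz.com’] and [‘2125551234’].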

I recommend going through NLTK’s documentation to expand on this application and other areas of NLP.

Code used in this example can be found here.
