NLP: Python Data Extraction From Social Media, Emails, Documents, Webpages, RSS & Images

Clear Overview Of Python Libraries & Techniques To Fetch Textual Data From All Of The Common Sources

Farhad Malik


The first stage of an NLP project is to extract the required textual data. The data is usually unstructured and spread across a variety of sources.

This article illustrates how we can extract text-based data from the most common sources.

Textual data is fundamental to NLP-based models.

Photo by Raj Eiamworakul on Unsplash

Article Aim

This article will cover text extraction from the following sources:

  1. Table From HTML Webpage
  2. Tweets From Twitter
  3. Statuses From Facebook
  4. RSS Feeds
  5. Text From Images
  6. Text From PDF
  7. Text From Word Documents
  8. Text From CSV Files
  9. Text From Excel Files
  10. Text From Outlook Emails
  11. Text From HTML Webpages


