Practical Text Analysis using Deep Learning

Michael Fire
Feb 25, 2016


Deep Learning has become a household buzzword these days, and I have not stopped hearing about it. At first, I thought it was just another rebranding of neural network algorithms, or a fad that would fade away within a year. But then I read Piotr Teterwak’s blog post on how Deep Learning can be easily utilized for various image analysis tasks. A powerful algorithm that is easy to use? Sounds intriguing. So I decided to give it a closer look. Maybe it would become a new hammer in my toolbox that could later help me tackle new sets of interesting problems.

After getting up to speed on Deep Learning (see my recommended reading list at the end of this post), I decided to try Deep Learning on NLP problems. Several years ago, Professor Moshe Koppel gave a talk about how he and his colleagues succeeded in determining an author’s gender by analyzing his or her written texts. They also released a dataset containing 681,288 blog posts. I found it remarkable that one can infer various attributes about an author by analyzing the text, and I’ve been wanting to try it myself. Deep Learning sounded very versatile. So I decided to use it to infer a blogger’s personal attributes, such as age and gender, based on the blog posts.

Deep Learning Inspired Tools for Text Analysis

I needed a proper Deep Learning algorithm that could analyze text. In my readings, I came across two very interesting Deep Learning inspired tools for text analysis. The first was the Word2Vec algorithm, invented by Tomas Mikolov et al. at Google, and the second was GloVe, invented by Jeffrey Pennington et al. at Stanford University. Both of these representation learning algorithms seemed really useful for analyzing text. In the end, I chose Word2Vec: it takes a large text corpus as input and outputs a numeric vector representation for each word, and these vectors are supposed to capture the semantic similarity between words. I also picked it because of its outstanding Python implementation, written by Radim Řehůřek, and an excellent tutorial written by Angela Chapman during her internship at Kaggle.
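The core idea — that semantic similarity between words becomes geometric similarity between vectors — can be illustrated with a minimal sketch. Note that the vectors below are invented toy values for illustration only, not the output of a real Word2Vec model; a trained model would learn vectors like these from a large corpus. Similarity is typically measured as the cosine of the angle between two word vectors:

```python
import numpy as np

# Toy 4-dimensional "word vectors" (invented for illustration;
# a real Word2Vec model learns such vectors from a large corpus).
vectors = {
    "hehe": np.array([0.9, 0.8, 0.1, 0.0]),
    "lol":  np.array([0.8, 0.9, 0.2, 0.1]),
    "tax":  np.array([0.1, 0.0, 0.9, 0.8]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (1.0 = same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantically related words get similar vectors, so their cosine is high.
print(cosine_similarity(vectors["hehe"], vectors["lol"]))  # close to 1
print(cosine_similarity(vectors["hehe"], vectors["tax"]))  # much lower
```

This is the property the trained model exhibited with “hehe” and “LOL” below: nothing in the text tells the model they are related, yet their learned vectors end up close together because they appear in similar contexts.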

Classify and Regress with GraphLab Create

Now that I settled on a large dataset and a deep learning inspired tool, the next step was to glue all the parts together. I wanted to create an end-to-end machine learning solution that takes a blogger’s posts as input and predicts the blogger’s gender and age with high accuracy. To make all the parts work together, I used GraphLab Create in the following manner.

  1. First, I processed the blog posts using Beautiful Soup, a handy Python package for parsing HTML. I then loaded the text into an SFrame, a scalable dataframe object in GraphLab Create.
  2. Then, I trained a Word2Vec model on the blog posts. It worked! The trained model “knew” that the vector representation of “hehe” is similar to the vector representation of “LOL.” (I found this to be mind-blowing.)
  3. Next, for each blogger, I used Word2Vec to calculate the average vector representation of all the words that appeared in the blogger’s posts.
  4. Lastly, I used GraphLab Create’s classification toolkit to construct classifiers that take the average vector of each blogger as input features and predict the blogger’s gender and age with high accuracy.

The exciting part, I believe, is that the results obtained in this way are better than those of the known state-of-the-art algorithm for this problem. However, I still need to perform additional tests to verify this conclusion.

If you want to try Deep Learning on text, take a look at my IPython notebook that describes in detail how to create a text classifier using Word2Vec and GraphLab Create. You can use it to create your own blogger gender and age classifier, or construct your own deep text classifiers for other NLP tasks.

As always, feel free to leave a comment with any questions or suggestions.

Relevant reading/watch list:

My IPython notebook demonstrating this project.

Piotr Teterwak’s blog post about Deep Features includes a simple primer for deep learning.

Andrew Ng’s Deep Learning talk gives a nice overview of deep learning and what it can achieve.

There are a lot of great videos and tutorials on neural networks. I watched Stephen Welch’s excellent Neural Networks Demystified videos.

Originally published at blog.dato.com.
