
Working with Text

Text is one of the most prevalent forms of sequence data. For deep learning models to understand natural language, one must first prepare a statistical representation of the language. Deep learning for natural language processing is pattern recognition applied to words, sentences, and paragraphs. Raw text must be vectorized before it can be used; this can be done in multiple ways:

  1. Segment text into words and transform each word into a vector.
  2. Segment text into characters and transform each character into a vector.
  3. Extract n-grams of words or characters and then transform each n-gram into a vector.
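A minimal sketch of these three segmentation strategies, using a toy sentence and plain Python (the sentence and the choice of bigrams are purely illustrative):

```python
# Three ways to segment raw text before vectorization.
text = "the movie was great"

# 1. Word-level tokens
words = text.split()                      # ['the', 'movie', 'was', 'great']

# 2. Character-level tokens
chars = list(text.replace(" ", ""))      # ['t', 'h', 'e', 'm', 'o', ...]

# 3. Word bigrams (n-grams with n = 2)
bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
# ['the movie', 'movie was', 'was great']

print(words, chars[:5], bigrams, sep="\n")
```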

In the previous example of text vectorization I used one-hot encoding of tokens. In my second attempt at building a sentiment analysis engine for movie reviews, I'm going to use word embeddings of tokens. A word embedding is a dense word vector. While the vectors obtained through one-hot encoding are binary, sparse, and very high-dimensional, word embeddings are low-dimensional floating-point vectors. Simply put, word embedding is a way of representing text where each word in the vocabulary is represented by a real-valued vector in a continuous, much lower-dimensional space. …
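To make the contrast concrete, here is a small hedged sketch comparing the two representations in Keras; the vocabulary size of 10,000 and the embedding dimension of 8 are illustrative assumptions, not the article's exact settings:

```python
# One-hot vector vs. learned embedding for a single token.
import numpy as np
from tensorflow.keras.layers import Embedding

vocab_size = 10_000      # assumed vocabulary size
token_index = 42         # index of some word in the vocabulary

# One-hot: binary, sparse, 10,000-dimensional
one_hot = np.zeros(vocab_size)
one_hot[token_index] = 1.0

# Embedding: dense, floating point, here 8-dimensional, learned during training
embedding = Embedding(input_dim=vocab_size, output_dim=8)
dense_vector = embedding(np.array([token_index]))

print(one_hot.shape, dense_vector.shape)  # (10000,) vs. (1, 8)
```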


In the last article we completed the creation of the vocabulary for our dataset, which is going to be used for building the NLP model. In our first attempt we will develop a Multilayer Perceptron (MLP) model to classify encoded reviews as either positive or negative sentiment. The model will be a simple feedforward network with a fully connected layer, called Dense in the Keras deep learning library.
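As a hedged sketch of what such a model can look like in Keras (the vocabulary size and layer widths are illustrative assumptions, not the article's exact configuration):

```python
# A minimal MLP sentiment classifier built from Dense layers.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

vocab_size = 10_000  # assumed size of the review vocabulary

model = Sequential([
    # Input: one encoded review as a vector of length vocab_size
    Dense(50, activation="relu", input_shape=(vocab_size,)),
    # Output: probability that the review expresses positive sentiment
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```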

What is a fully connected dense layer? It is a linear operation in which every input is connected to every output by a weight (so there are n_inputs * n_outputs weights, which can be a lot!), …
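To make that weight count concrete, here is a small NumPy sketch of the linear operation a Dense layer performs; the dimensions are illustrative:

```python
# A Dense layer computes output = activation(dot(input, W) + b).
# The weight count grows as n_inputs * n_outputs, plus one bias per output.
import numpy as np

n_inputs, n_outputs = 10_000, 50
W = np.random.randn(n_inputs, n_outputs)   # 10,000 * 50 = 500,000 weights
b = np.zeros(n_outputs)                    # 50 biases

x = np.random.randn(n_inputs)              # one encoded review vector
output = np.maximum(0.0, x @ W + b)        # ReLU(x . W + b)

print(W.size + b.size)                     # 500,050 trainable parameters
```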
