7 types of Artificial Neural Networks for Natural Language Processing

Data Monsters
7 min read · Sep 26, 2017

by Olga Davydova

What is an artificial neural network? How does it work? What types of artificial neural networks exist? How are different types of artificial neural networks used in natural language processing? We will discuss all these questions in the following article.

An artificial neural network (ANN) is a computational nonlinear model based on the neural structure of the brain that is able to learn to perform tasks like classification, prediction, decision-making, visualization, and others just by considering examples.

An artificial neural network consists of artificial neurons, or processing elements, and is organized into three interconnected layers: input, hidden (which may comprise more than one layer), and output.

An artificial neural network https://en.wikipedia.org/wiki/Artificial_neural_network#/media/File:Colored_neural_network.svg

The input layer contains input neurons that send information to the hidden layer. The hidden layer sends data to the output layer. Every neuron has weighted inputs (synapses), an activation function (which defines the output given an input), and one output. The synapse weights are the adjustable parameters that turn a neural network into a parameterized system.

Artificial neuron with four inputs http://en.citizendium.org/wiki/File:Artificialneuron.png

The weighted sum of the inputs produces the activation signal that is passed to the activation function to obtain one output from the neuron. The commonly used activation functions are linear, step, sigmoid, tanh, and rectified linear unit (ReLu) functions.

Linear function

f(x) = ax

Step function

f(x) = 1 if x ≥ 0, otherwise 0

Logistic (sigmoid) function

f(x) = 1 / (1 + e^(−x))

Tanh function

f(x) = tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

Rectified linear unit (ReLU) function

f(x) = max(0, x)
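To make this concrete, here is a minimal NumPy sketch of the activation functions listed above, applied to the weighted sum of a single neuron's inputs; the input and weight values are made-up illustrative numbers.

```python
import numpy as np

# Common activation functions
def linear(x, a=1.0):
    return a * x                       # f(x) = ax

def step(x):
    return np.where(x >= 0, 1.0, 0.0)  # binary threshold at 0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # logistic function

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)          # rectified linear unit

# A single artificial neuron: weighted sum of inputs passed through an activation
inputs = np.array([0.5, -1.2, 3.0, 0.7])   # four inputs, as in the figure above
weights = np.array([0.4, 0.1, -0.6, 0.9])  # synapse weights (illustrative values)
bias = 0.1

activation_signal = np.dot(weights, inputs) + bias
output = sigmoid(activation_signal)
print(output)
```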

Training is the process of optimizing the weights so that the prediction error is minimized and the network reaches a specified level of accuracy. The method most commonly used to determine the error contribution of each neuron is backpropagation, which calculates the gradient of the loss function.
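As a small illustration of the gradient computation, the sketch below performs one gradient-descent update for a single sigmoid neuron under a squared-error loss; the data, initial weights, and learning rate are arbitrary example values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One training example (illustrative values)
x = np.array([0.5, -1.2, 3.0])   # inputs
t = 1.0                          # target output
w = np.array([0.1, 0.2, -0.1])   # initial weights
b = 0.0
lr = 0.5                         # learning rate

# Forward pass
z = np.dot(w, x) + b
y = sigmoid(z)
loss = 0.5 * (y - t) ** 2

# Backward pass (chain rule): dL/dw = (y - t) * y * (1 - y) * x
grad_z = (y - t) * y * (1 - y)
grad_w = grad_z * x
grad_b = grad_z

# Gradient-descent update
w -= lr * grad_w
b -= lr * grad_b
```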

It is possible to make the system more flexible and more powerful by using additional hidden layers. Artificial neural networks with multiple hidden layers between the input and output layers are called deep neural networks (DNNs), and they can model complex nonlinear relationships.

1. Multilayer perceptron (MLP)

A perceptron https://upload.wikimedia.org/wikipedia/ru/d/de/Neuro.PNG

A multilayer perceptron (MLP) has three or more layers. It uses a nonlinear activation function (mainly the hyperbolic tangent or logistic function), which lets it classify data that is not linearly separable. Every node in a layer connects to every node in the following layer, making the network fully connected. Typical multilayer perceptron applications in natural language processing (NLP) include speech recognition and machine translation.
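A minimal sketch of a fully connected MLP classifier, assuming the Keras API of TensorFlow 2.x; the number of input features, the layer sizes, and the three-class output are illustrative choices, not part of any particular paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

# A fully connected network: every node in one layer connects to every node in the next.
model = keras.Sequential([
    keras.Input(shape=(100,)),             # 100 input features (illustrative)
    layers.Dense(64, activation="tanh"),   # hidden layer with a nonlinear activation
    layers.Dense(64, activation="tanh"),   # second hidden layer
    layers.Dense(3, activation="softmax"), # e.g. a 3-class classification output
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```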

2. Convolutional neural network (CNN)

Typical CNN architecture https://en.wikipedia.org/wiki/Convolutional_neural_network#/media/File:Typical_cnn.png

A convolutional neural network (CNN) contains one or more convolutional layers, typically combined with pooling and fully connected layers, and uses a variation of the multilayer perceptron discussed above. Convolutional layers apply a convolution operation to the input and pass the result to the next layer. This operation allows the network to be deeper while having far fewer parameters.
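A sketch of a 1D convolutional network for text classification in Keras, in the same spirit as the sentence-classification work cited below; the vocabulary size, sequence length, filter counts, and binary output are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, seq_len, embed_dim = 20000, 100, 128  # illustrative hyperparameters

model = keras.Sequential([
    keras.Input(shape=(seq_len,)),                         # sequence of word indices
    layers.Embedding(vocab_size, embed_dim),               # word embeddings (could be word2vec-initialized)
    layers.Conv1D(128, kernel_size=5, activation="relu"),  # convolution over windows of words
    layers.GlobalMaxPooling1D(),                           # max-over-time pooling
    layers.Dense(64, activation="relu"),                   # fully connected layer
    layers.Dense(1, activation="sigmoid"),                 # e.g. binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```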

Convolutional neural networks show outstanding results in image and speech applications. In Convolutional Neural Networks for Sentence Classification, Yoon Kim describes the process and the results of text classification tasks using CNNs [1]. He presents a model built on top of word2vec, conducts a series of experiments with it, and tests it against several benchmarks, demonstrating that the model achieves excellent results.

In Text Understanding from Scratch, Xiang Zhang and Yann LeCun demonstrate that CNNs can achieve outstanding performance without knowledge of words, phrases, sentences, or any other syntactic or semantic structures of a human language [2]. Semantic parsing [3], paraphrase detection [4], and speech recognition [5] are also applications of CNNs.

3. Recursive neural network (RNN)

A simple recursive neural network architecture https://upload.wikimedia.org/wikipedia/commons/6/60/Simple_recursive_neural_network.svg

A recursive neural network (RNN) is a type of deep neural network formed by applying the same set of weights recursively over a structure, to make a structured prediction over variable-size input structures, or a scalar prediction on them, by traversing a given structure in topological order [6]. In the simplest architecture, nodes are combined into parents using a weight matrix that is shared across the whole network and a nonlinearity such as tanh.
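A minimal NumPy sketch of that parent composition step: two child vectors are combined with a shared weight matrix and a tanh nonlinearity, here over a toy parse tree. The dimensionality and the random values are illustrative.

```python
import numpy as np

d = 4                                  # dimensionality of node representations (illustrative)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, 2 * d)) * 0.1  # shared weight matrix, reused at every merge
b = np.zeros(d)

def compose(left, right):
    """Combine two child vectors into a parent vector: p = tanh(W [c1; c2] + b)."""
    children = np.concatenate([left, right])
    return np.tanh(W @ children + b)

# Traverse a toy parse tree bottom-up: ((w1 w2) w3)
w1, w2, w3 = rng.normal(size=(3, d))
parent_12 = compose(w1, w2)
root = compose(parent_12, w3)
```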

4. Recurrent neural network (RNN)

A recurrent neural network (RNN), unlike a feedforward neural network, is a variant of a recursive artificial neural network in which connections between neurons form a directed cycle. This means that the output depends not only on the present inputs but also on the neuron state from the previous step. This memory lets the network solve NLP problems like connected handwriting recognition or speech recognition. In the paper Natural Language Generation, Paraphrasing and Summarization of User Reviews with Recurrent Neural Networks, the authors demonstrate an RNN model that can generate novel sentences and document summaries [7].
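A minimal NumPy sketch of a vanilla recurrent step, showing how the hidden state carries information forward from previous time steps; all weights and the toy input sequence are illustrative.

```python
import numpy as np

input_dim, hidden_dim = 3, 5           # illustrative sizes
rng = np.random.default_rng(1)

W_x = rng.normal(size=(hidden_dim, input_dim)) * 0.1   # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                    # initial hidden state (the "memory")
sequence = rng.normal(size=(4, input_dim))  # a toy sequence of 4 time steps

for x_t in sequence:
    # The new state depends on the current input AND the previous state
    h = np.tanh(W_x @ x_t + W_h @ h + b)
```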

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao created a recurrent convolutional neural network for text classification without human-designed features and described it in Recurrent Convolutional Neural Networks for Text Classification. Their model was compared to existing text classification methods such as bag of words, bigrams + LR, SVM, LDA, tree kernels, recursive neural networks, and CNNs. It was shown that their model outperforms the traditional methods on all the data sets used [8].

5. Long short-term memory (LSTM)

A peephole LSTM block with input, output, and forget gates. https://upload.wikimedia.org/wikipedia/commons/5/53/Peephole_Long_Short-Term_Memory.svg

Long short-term memory (LSTM) is a specific recurrent neural network (RNN) architecture that was designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs [9]. An LSTM does not use an activation function within its recurrent components, so the stored values are not modified and the gradient does not tend to vanish during training. LSTM units are usually implemented in “blocks” containing several units. These blocks have three or four “gates” (for example, an input gate, forget gate, and output gate) that control information flow using the logistic function.
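A NumPy sketch of one step of a standard LSTM cell with input, forget, and output gates driven by the logistic function; this is the common formulation without the peephole connections shown in the figure, and all weights are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

input_dim, hidden_dim = 3, 5
rng = np.random.default_rng(2)

# One weight matrix per gate plus the candidate cell update,
# each acting on the concatenation [h_prev; x_t]
W_i, W_f, W_o, W_c = rng.normal(size=(4, hidden_dim, hidden_dim + input_dim)) * 0.1
b_i = b_f = b_o = b_c = np.zeros(hidden_dim)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    i = sigmoid(W_i @ z + b_i)          # input gate
    f = sigmoid(W_f @ z + b_f)          # forget gate
    o = sigmoid(W_o @ z + b_o)          # output gate
    c_tilde = np.tanh(W_c @ z + b_c)    # candidate cell state
    c = f * c_prev + i * c_tilde        # gated update of the stored cell state
    h = o * np.tanh(c)                  # hidden state passed to the next step
    return h, c

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
x_t = rng.normal(size=input_dim)
h, c = lstm_step(x_t, h, c)
```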

In Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling, Hasim Sak, Andrew Senior, and Françoise Beaufays showed that deep LSTM RNN architectures achieve state-of-the-art performance in large-scale acoustic modeling [9].

In Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network, Peilu Wang, Yao Qian, Frank K. Soong, Lei He, and Hai Zhao present a model for part-of-speech (POS) tagging [10]. The model achieved 97.40% tagging accuracy. Apple, Amazon, Google, Microsoft, and other companies have incorporated LSTM as a fundamental element of their products.

6. Sequence-to-sequence models

Usually, a sequence-to-sequence model consists of two recurrent neural networks: an encoder that processes the input and a decoder that produces the output. The encoder and decoder can use the same or different sets of parameters.
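A sketch of an LSTM encoder-decoder in Keras (TensorFlow 2.x), following the standard teacher-forcing setup: the encoder's final states initialize the decoder. The vocabulary sizes and latent dimension are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

latent_dim, src_vocab, tgt_vocab = 256, 10000, 10000  # illustrative sizes

# Encoder: reads the source sequence and keeps only its final states
encoder_inputs = keras.Input(shape=(None,))
enc_emb = layers.Embedding(src_vocab, latent_dim)(encoder_inputs)
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: generates the target sequence conditioned on the encoder states
decoder_inputs = keras.Input(shape=(None,))
dec_emb = layers.Embedding(tgt_vocab, latent_dim)(decoder_inputs)
dec_outputs, _, _ = layers.LSTM(latent_dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
outputs = layers.Dense(tgt_vocab, activation="softmax")(dec_outputs)

model = keras.Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```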

Sequence-to-sequence models are mainly used in question answering systems, chatbots, and machine translation. Multilayer LSTM cells were successfully used in sequence-to-sequence models for translation in the Sequence to Sequence Learning with Neural Networks study [11].

In Paraphrase Detection Using Recursive Autoencoder, a novel recursive autoencoder architecture is presented. The representations are vectors in an n-dimensional semantic space where phrases with similar meanings are close to each other [12].

7. Shallow neural networks

Besides deep neural networks, shallow models are also popular and useful tools. For example, word2vec is a group of shallow two-layer models used for producing word embeddings. Presented in Efficient Estimation of Word Representations in Vector Space, word2vec takes a large corpus of text as its input and produces a vector space [13]. Every word in the corpus is assigned a corresponding vector in this space. The distinctive feature is that words that share common contexts in the corpus are located close to one another in the vector space.
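A sketch of training such embeddings with the word2vec implementation in the gensim library, assuming gensim 4.x (where the dimensionality parameter is named vector_size); the toy corpus is purely illustrative.

```python
from gensim.models import Word2Vec

# A toy corpus: each sentence is a list of tokens
corpus = [
    ["neural", "networks", "learn", "from", "examples"],
    ["recurrent", "networks", "process", "sequences"],
    ["convolutional", "networks", "classify", "text"],
]

# Shallow two-layer model producing word embeddings
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)  # sg=1: skip-gram

vector = model.wv["networks"]                          # the embedding for one word
similar = model.wv.most_similar("networks", topn=3)    # nearby words in the vector space
```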

Summary

In this article, we described different variants of artificial neural networks: the deep multilayer perceptron (MLP), the convolutional neural network (CNN), the recursive neural network (RNN), the recurrent neural network (RNN), long short-term memory (LSTM), sequence-to-sequence models, and shallow neural networks, including word2vec for word embeddings. We showed how these networks function and how the different types are used in natural language processing tasks. We saw that convolutional neural networks are primarily used for text classification, while recurrent neural networks are commonly used for natural language generation and machine translation. In the next part of this series, we will review existing tools and libraries for the discussed neural network types.

Resources

1. http://www.aclweb.org/anthology/D14-1181

2. https://arxiv.org/pdf/1502.01710.pdf

3. http://www.aclweb.org/anthology/P15-1128

4. https://www.aclweb.org/anthology/K15-1013

5. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/CNN_ASLPTrans2-14.pdf

6. https://en.wikipedia.org/wiki/Recursive_neural_network

7. http://www.meanotek.ru/files/TarasovDS(2)2015-Dialogue.pdf

8. https://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9745/9552

9. https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenTerm1201415/sak2.pdf

10. https://arxiv.org/pdf/1510.06168.pdf

11. https://arxiv.org/pdf/1409.3215.pdf

12. https://nlp.stanford.edu/courses/cs224n/2011/reports/ehhuang.pdf

13. https://arxiv.org/pdf/1301.3781.pdf
