Language Model: A Survey of the State-of-the-Art Technology

Synced · Published in SyncedReview · Sep 10, 2017

Abstract

The goal of language modelling is to estimate the probability distribution of various linguistic units, e.g., words, sentences, etc. In recent years, the application of language models (LMs) in natural language processing has attracted growing research attention. In this paper, we present a survey of language models, covering count-based and continuous-space language models. Traditional count-based models are described briefly, and we then focus on continuous-space language models. Although continuous-space language models have been shown to perform well, they suffer from some important drawbacks, including very long training times and limitations on the number of context words. Recently, many extensions to language models have been designed to address exactly these problems. This article gives an overview of the most important extensions. Finally, we point out the limitations of current work and directions for future research.

1. Introduction

Language models (LMs) can be classified into two categories: count-based and continuous-space LMs. Count-based methods, such as traditional statistical models, usually involve making an n-th order Markov assumption and estimating n-gram probabilities via counting and subsequent smoothing. The LM literature abounds with successful approaches for learning count-based LMs: modified Kneser-Ney smoothing, Jelinek-Mercer smoothing [1, 2], etc. In recent years, continuous-space LMs such as feed-forward neural probabilistic language models (NPLMs) and recurrent neural network language models (RNNLMs) have been proposed. These neural language models (NLMs) address the data sparsity problem of the n-gram model by representing words as vectors (word embeddings) and using them as the inputs to the NLM; the parameters are learned as part of the training process. Word embeddings obtained through NLMs exhibit the property that semantically close words are likewise close in the induced vector space. Moreover, NLMs can also capture contextual information at the sentence level, corpus level and subword level.

In this article, our purpose is to give an overview of the most important LMs. Each technique is described, and its performance on LM tasks, as reported in the existing literature, is discussed. Our structured overview makes it possible to identify the most promising techniques in the field of LM.

2. Literature Review

In this section, we review the LM literature, covering both count-based and continuous-space LMs, as well as their merits and shortcomings.

2.1 Count-based language models

A statistical formulation of a LM constructs the joint probability distribution of a sequence of words. One example is the n-gram model, which is built on the Markov assumption. In an n-gram LM, the process of predicting a word sequence is broken up into predicting one word at a time. The LM probability p(w_1, w_2, …, w_n) is a product of word probabilities based on a history of preceding words, whereby the history is limited to m words:
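In symbols, with the history truncated to the m preceding words:

    p(w_1, w_2, …, w_n) = ∏_{t=1}^{n} p(w_t | w_1, …, w_(t-1)) ≈ ∏_{t=1}^{n} p(w_t | w_(t-m), …, w_(t-1))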

This is also called a Markov chain, where the number of previous states (words, here) is the order of the model. The basic idea of the n-gram LM is that we can predict the probability of w_(n+1) from its preceding context: dividing the number of occurrences of the pair w_n, w_(n+1) by the number of occurrences of w_n gives a bigram estimator, while basing the estimate only on the relative frequency of w_(n+1) itself gives a unigram estimator. The estimation of a trigram word prediction probability (the order most often used for LMs in practical NLP applications) is therefore straightforward, assuming maximum likelihood estimation:
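    p(w_(n+1) | w_(n-1), w_n) = count(w_(n-1), w_n, w_(n+1)) / count(w_(n-1), w_n)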

However, when modeling the joint distribution of a sentence, a simple n-gram model gives zero probability to every combination that was not encountered in the training corpus, i.e. it would most likely give zero probability to most out-of-sample test cases. Yet new combinations of n words that were not seen in the training set are likely to occur, and we do not want to assign such cases zero probability. This problem is called data sparsity. The traditional solution is to use various back-off [1] and smoothing techniques [2, 3], but no fully satisfactory solution exists.
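As a toy illustration of counting and smoothing, the sketch below estimates bigram probabilities with simple add-k smoothing. This is much cruder than the Kneser-Ney or Jelinek-Mercer methods cited above, and the corpus and names are purely illustrative, but it shows how smoothing keeps unseen word combinations from receiving zero probability.

    from collections import Counter

    # Toy corpus; in practice the counts come from a large training corpus.
    corpus = "the cat sat on the mat . the dog sat on the rug .".split()

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    vocab = set(corpus)

    def bigram_prob(w_prev, w, k=1.0):
        """Add-k smoothed estimate of p(w | w_prev).
        Unseen bigrams get a small but non-zero probability instead of zero."""
        return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * len(vocab))

    print(bigram_prob("the", "cat"))  # a bigram seen in the corpus
    print(bigram_prob("cat", "rug"))  # an unseen bigram, still non-zero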

Despite the smoothing techniques mentioned above, and the practical usability of n-gram LM, the curse of dimensionality (as for many other learning algorithms) is especially potent here, as there is a huge number of different combinations of values of the input variables that must be discriminated from each other. For LM, this is the huge number of possible sequences of words, e.g., with a sequence of 10 words taken from a vocabulary of 100,000, there are 10⁵⁰ possible sequences.

There are other drawbacks to n-gram models: they rely on exact pattern matching, i.e. string or word-sequence matching, and are therefore in no way linguistically informed. For example, one would wish a good LM to recognize that a sequence like “the cat is walking in the bedroom” is syntactically and semantically similar to “a dog was running in the room”, which an n-gram model cannot provide [4]. Another problem of Markovian LMs is that dependencies beyond the window are ignored: when estimating the parameters of the n-gram model, we only consider a context of n-1 words. In other words, under the Markov assumption we do not model the true conditional probability, which is in conflict with the fact that humans can exploit longer context with great success.

These reasons led to the idea of applying deep learning and neural networks (NNs) to the problem of LM, in the hope of automatically learning such syntactic and semantic features and of overcoming the curse of dimensionality through the better generalization that NNs provide.

2.2 Continuous-space language models

Continuous-space LMs are also known as neural language models (NLMs). There are two main kinds of NLM: feed-forward neural network based LMs, which were proposed to tackle the problem of data sparsity, and recurrent neural network based LMs, which were proposed to address the problem of limited context. Recently, recurrent neural network based approaches have achieved state-of-the-art performance. The early NLMs were proposed to solve the two main problems of n-gram models mentioned above. Subsequent work has turned to subword-level and corpus-level modelling based on the recurrent neural network and its variant, the long short-term memory (LSTM) network.

Feed-Forward Neural Network Based Models

The first neural approach to LM is the neural probabilistic language model [4], which learns the parameters of the conditional probability distribution of the next word, given the previous n-1 words, using a three-layer feed-forward neural network. An overview of the network architecture is given in Figure 1. In this architecture:

  1. Build a mapping C from each word i of the vocabulary V to a distributed, real-valued feature vector C(i) ∈ R^m, with m being the number of features. C is a |V| × m matrix whose row i is the feature vector C(i) for word i.
  2. A function g maps the input sequence of feature vectors for the words in the context, (C(w_(t-n+1)), ⋯, C(w_(t-1))), to a conditional probability distribution over words in V for the next word w_t.
  3. Finally, the word feature vectors and the parameters of that probability function are learned simultaneously through a composite function f, composed of the two mappings C and g:
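    f(w_t, …, w_(t-n+1)) = g(w_t, C(w_(t-1)), …, C(w_(t-n+1))),

which gives the model’s estimate of P(w_t | w_(t-1), …, w_(t-n+1)).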

Figure 1. An overview of the network architecture of the neural probabilistic language model; figure taken from [4].
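To make these three steps concrete, here is a minimal forward-pass sketch in numpy. It follows the shape of the model in [4] but is a simplification for illustration only: biases and the optional direct input-to-output connections of the original model are omitted, and all sizes and variable names are our own.

    import numpy as np

    rng = np.random.default_rng(0)

    V_size, m, h = 10_000, 60, 50   # vocabulary size, feature (embedding) dim, hidden units
    n_context = 3                   # number of preceding words (n - 1 in the n-gram analogy)

    C = rng.normal(0, 0.01, (V_size, m))         # word feature matrix, one row per word
    H = rng.normal(0, 0.01, (h, n_context * m))  # hidden-layer weights
    U = rng.normal(0, 0.01, (V_size, h))         # output-layer weights

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def next_word_distribution(context_ids):
        """Estimate p(w_t | w_(t-n+1), ..., w_(t-1)) for one context (forward pass only)."""
        x = np.concatenate([C[i] for i in context_ids])  # look up and concatenate feature vectors
        a = np.tanh(H @ x)                               # hidden layer
        return softmax(U @ a)                            # distribution over the whole vocabulary

    p = next_word_distribution([12, 7, 402])
    print(p.shape, round(p.sum(), 6))                    # (10000,) 1.0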

In this model, each word in the vocabulary is associated with a distributed word feature vector, and the joint probability function of a word sequence is expressed as a function of the feature vectors of the words in the sequence. The model learns the word feature vectors and the parameters of that probability function simultaneously. This neural network approach solves the sparseness problem and has also been shown to generalize well in comparison to n-gram models in terms of perplexity. However, a major weakness of the approach is its very long training and testing times. Several open questions for future work were therefore raised, mostly concerning speed-up techniques, more compact probability representations (trees), and the introduction of a-priori knowledge (e.g. semantic information used to cluster the highly discrete word forms).

Two models [5, 6] were subsequently proposed to address the training and testing speed of NLMs. The basic idea in these papers is to cluster similar words before computing their probability, so that only one computation per word cluster is needed at the output layer of the NN.

A hierarchical probabilistic NLM [5] was proposed to speed up training and prediction. A binary hierarchical tree over the words in the vocabulary was built using expert knowledge; the tree forms a hierarchical description of a word as a sequence of decisions. Instead of directly predicting each word's probability, the hierarchical LM learns to take these hierarchical decisions. This model was two orders of magnitude faster than the non-hierarchical model it was based on; however, it performed considerably worse than its non-hierarchical counterpart.

Another hierarchical LM is the hierarchical log-bilinear (HLBL) model [6], which uses a data-driven method, rather than expert knowledge, to construct the binary tree of words. The authors first trained a model using a random tree over the corpus, extracted the word representations from the trained model, and then performed hierarchical clustering on the extracted representations. In this model, the probability of the next word w is the probability of making the sequence of binary decisions specified by the word's encoding, given its context. Since the only thing that matters for generating a probability at each node is the predicted feature vector, which is determined by the context, the probability of the current word can be expressed as a product of probabilities of the binary decisions:
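    P(w | context) = ∏_i P(d_i | q_i, context)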

where d_i is the i-th binary decision (digit) in the encoding of word w, and q_i is the feature vector for the i-th node on the path specified by that encoding. The above probability definition can be extended to multiple encodings per word, with a summation over all encodings, which allows better prediction of words that have multiple senses in multiple contexts. The best HLBL model reported in [6] reduces perplexity by 11.1% compared to a baseline Kneser-Ney smoothed 5-gram LM, at only 32 minutes of training time per epoch.

These continuous-space models share some common characteristics, in that they are mainly based on feed-forward neural networks and word feature vectors. This approach does not suffer from the data sparsity problem, since it can be seen as automatically applying smoothing. With a hierarchical tree of words, the models can be trained and tested much more quickly, and the best of them are competitive with non-hierarchical neural models while outperforming the best n-gram models.

Recurrent Neural Network Based Models

As we have seen, feed-forward neural network based LMs use a fixed-length context. Recurrent neural networks, in contrast, do not use a limited context size: through recurrent connections, information can cycle inside these networks for an arbitrarily long time. The recurrent neural network based language model (RNNLM) [7] provides this further generalization: instead of considering just several preceding words, neurons whose input comes from recurrent connections are assumed to represent a short-term memory. The network architecture is given in Figure 2, and a minimal forward-pass sketch in code follows the list below. In this architecture:

  1. Input layer w and output layer y have the same dimensionality as the vocabulary (10K–200K).
  2. Hidden layer s is orders of magnitude smaller (50–1000 neurons).
  3. U is the matrix of weights between input and hidden layer, V is the matrix of weights between hidden and output layer.
  4. Without the recurrent weights W, this model would be a bigram neural network LM.
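The sketch below shows one forward step of such a network in numpy. It is a simplified illustration of the model in [7] (no training loop, no output-layer classes); sizes are illustrative, and the output matrix is named Vout here to avoid clashing with the vocabulary symbol V.

    import numpy as np

    rng = np.random.default_rng(0)

    vocab_size, hidden_size = 10_000, 200

    U = rng.normal(0, 0.01, (hidden_size, vocab_size))     # input  -> hidden weights
    W = rng.normal(0, 0.01, (hidden_size, hidden_size))    # hidden -> hidden (recurrent) weights
    Vout = rng.normal(0, 0.01, (vocab_size, hidden_size))  # hidden -> output weights

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def step(word_id, s_prev):
        """Consume the current word, update the hidden state s, and return
        the distribution over the next word."""
        w = np.zeros(vocab_size)
        w[word_id] = 1.0                                  # 1-of-V encoding of the input word
        s = 1.0 / (1.0 + np.exp(-(U @ w + W @ s_prev)))   # sigmoid hidden layer
        y = softmax(Vout @ s)                             # output distribution
        return y, s

    s = np.zeros(hidden_size)
    for wid in [5, 42, 7]:                                # a toy sequence of word ids
        y, s = step(wid, s)
    print(y.shape)                                        # (10000,)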

A variant [8] of the RNNLM was presented to reduce the computational complexity of the original model by factorizing the output layer into classes. Words are assigned to classes proportionally to their frequencies. The modified RNN model is thus smaller and faster, both during training and testing, while being more accurate than the basic one.
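Schematically, the class-based factorization first predicts the class of the next word and then the word within that class (here s_t denotes the hidden state and c(w_t) the class assigned to w_t; the notation is ours):

    P(w_t | s_t) = P(c(w_t) | s_t) · P(w_t | c(w_t), s_t)

Both factors range over much smaller sets than the full vocabulary, which is where the speed-up comes from.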

Figure 2. Recurrent NNLM, taken from [7]

We have now introduced the two main neural language models. Next, we provide a short overview of the main differences between FNN-based LMs and RNN-based LMs:

  1. When using a FNN, one is restricted to a fixed context size that has to be determined in advance. RNNs can in principle use the whole context, although practical applications indicate that the context size that is effectively used is rather limited. Still, RNNs at least have the advantage of not having to fix the context size, a parameter for which a suitable value is very difficult to determine.
  2. Because RNNs are dynamic systems, some issues that cannot arise in FNNs can be encountered. For example, the influence of a given input on the network output may blow up exponentially as subsequent training examples are presented (the exploding gradient problem), a highly undesirable artifact.
  3. Comparisons between RNNs and FNNs in statistical LM applications typically favor RNNs. The reasons will become clearer in the advanced models discussed below.

Advanced Models

Note that the NLMs discussed so far are mostly word-level language models. They are blind to subword information (e.g. morphemes): they do not know, a priori, that ‘eventful’, ‘eventfully’, ‘uneventful’ and ‘uneventfully’ should have structurally related embeddings in the vector space. More recently, attention has turned to subword-level LM, and a character-aware NLM [9] was proposed. This NLM relies on character-level inputs processed by a character-level convolutional neural network, whose output is used as the input to a recurrent NLM. The experiments demonstrate that the model outperforms word-level LSTM baselines, with fewer parameters, on languages with rich morphology (Arabic, Czech, French, German, Spanish, Russian).
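As a rough sketch of the input side of such a character-aware model: character embeddings for one word are convolved with filters of several widths, passed through a tanh, and max-pooled over time to yield a word representation, which would then feed the recurrent LM. The code below is an illustration only, not the reference implementation of [9]; the highway layers and the LSTM on top are omitted, and all sizes and names are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)

    n_chars, char_dim = 60, 15                        # toy character vocabulary and embedding size
    E_char = rng.normal(0, 0.1, (n_chars, char_dim))  # character embedding matrix
    filters = [rng.normal(0, 0.1, (w, char_dim, 25))  # one filter bank per width: 25 filters each
               for w in (2, 3, 4)]

    def char_cnn_word_vector(char_ids):
        """Build a word vector from its character ids: embed, convolve with each
        filter bank, apply tanh, then max-over-time pool and concatenate."""
        X = E_char[char_ids]                                     # (word_len, char_dim)
        pooled = []
        for Wf in filters:
            width = Wf.shape[0]
            responses = np.stack([
                np.tanh(np.tensordot(X[i:i + width], Wf, axes=([0, 1], [0, 1])))
                for i in range(len(char_ids) - width + 1)
            ])                                                   # (word_len - width + 1, 25)
            pooled.append(responses.max(axis=0))                 # max-over-time pooling
        return np.concatenate(pooled)                            # concatenated word representation

    vec = char_cnn_word_vector([3, 17, 5, 22, 9])                # a 5-character "word"
    print(vec.shape)                                             # (75,)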

At the same time, a gated word-character recurrent LM [10] was presented to address the same issue, namely that word-level LMs lose information about morphemes such as prefixes, roots and suffixes, and handle rare words poorly. Unlike the character-aware NLM, which depends only on character-level inputs, the gated word-character RNN LM utilizes both word-level and character-level inputs. In particular, the model has a gate that determines which of the two representations to use for each word, that is, whether to derive it from the character-level inputs or to use the word-level embedding itself. The gate is trained to make this decision based on the input word. The major contribution of this gating mechanism is that it effectively uses the character-level inputs to better represent rare and out-of-vocabulary words.
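Schematically, the gate mixes the two embeddings of the same word into a single input vector; the exact parameterization is given in [10], and the symbols below (a scalar gate g_t with parameters v and b) are only meant to convey the idea:

    g_t = σ(v^T x_t^word + b),   x_t = g_t · x_t^char + (1 − g_t) · x_t^word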

A recurrent LM captures contextual information (i.e. the preceding words) implicitly, but only across the preceding words within the same sentence. Moreover, the range of context that a vanilla RNN can model is limited, due to the vanishing gradient problem. To achieve a larger context, a method to incorporate corpus-level discourse information into the LM was proposed, called the larger-context LM [11]. Conventional recurrent LMs estimate sentence-level probabilities, assuming that all sentences in a document are independent of each other.

The larger-context LM drops this assumption of mutual independence between the sentences of a corpus. It models the influence of context by defining a conditional probability in terms of the words of the current sentence, where the context also comprises a number of preceding sentences of arbitrary length. That is to say, the context information is modeled explicitly by a context representation of a sequence of preceding sentences. Specifically, the authors build a bag-of-words representation of the preceding sentence and integrate it into a long short-term memory (LSTM) network, thereby exploring another aspect of context-dependent recurrent LM. The larger-context LM significantly reduces per-word perplexity compared to an LM without context information. Similarly, in order to incorporate document-level contextual information, a document-context LM [12] was presented; in this work, local and global information are combined in multi-level recurrent LM architectures.
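Schematically, the probability of the l-th sentence S_l is conditioned on its n preceding sentences as well as on the words already generated within it (the sentence and word indexing below is our notation):

    P(S_l | S_(l−n), …, S_(l−1)) = ∏_t P(w_t^l | w_1^l, …, w_(t−1)^l, S_(l−n), …, S_(l−1))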

3. Conclusion

In this article, we have summarized the current work in LM. Compared with count-based LMs, NLMs solve the problem of data sparseness and are able to capture contextual information ranging from the subword level to the corpus level. However, very long training times and the need for large amounts of labeled training data remain the main limitations. Subword modelling and larger-context LM are the current frontier of the field. Future work will investigate the possibility of learning from partially labeled training data, and the applicability of these models to downstream applications such as summarization and translation.

References

[1] R. Kneser and H. Ney. Improved backing-off for n-gram language modeling. In International Conference on Acoustics, Speech and Signal Processing, pages 181–184, 1995.
[2] S. F. Chen and J. T. Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech and Language, 13(4):359–393, 1999.
[3] J. Goodman. A bit of progress in language modeling. Technical Report MSR-TR-2001-72, Microsoft Research, 2001.
[4] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.
[5] F. Morin and Y. Bengio. Hierarchical probabilistic neural network language model. In R. G. Cowell and Z. Ghahramani, editors, AISTATS'05, pages 246–252, 2005.
[6] A. Mnih and G. Hinton. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems 21, MIT Press, 2009.
[7] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur. Recurrent neural network based language model. In Proceedings of Interspeech, pages 462–466, 2010.
[8] T. Mikolov, S. Kombrink, L. Burget, J. Cernocky, and S. Khudanpur. Extensions of recurrent neural network language model. In Proceedings of ICASSP, pages 253–258, 2011.
[9] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush. Character-aware neural language models. In Thirtieth AAAI Conference on Artificial Intelligence, pages 381–389, 2016.
[10] Y. Miyamoto and K. Cho. Gated word-character recurrent language model. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1992–1997, Austin, Texas, 2016.
[11] T. Wang and K. Cho. Larger-context language modelling with recurrent neural network. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1319–1329, Berlin, Germany, 2016.
[12] Y. Ji, T. Cohn, L. Kong, C. Dyer, and J. Eisenstein. Document context language models. In Proceedings of the International Conference on Learning Representations, pages 148–154, 2016.

Author: Kejin Jin | Editor: Joni Chung | Localized by Synced Global Team: Xiang Chen
