# Language Models in AI

# Introduction

A simple definition of a language model is an AI model that has been trained to predict the next word or words in a text based on the preceding words. It is part of the technology that predicts the next word you want to type on your mobile phone, letting you complete a message faster.

The task of predicting the next word(s) is a form of self-supervised learning: it does not need labels, it just needs lots of text, because the training process derives its own labels from the text itself.

A language model can be monolingual or multilingual. Wikipedia suggests that there should be separate language models for each document collection; however, Jeremy and Sebastian found that the Wikipedia sets have sufficient overlap that this is not necessary.

# Language Models

Language models broadly fit into two main groups:

**Statistical Language Models:** These models use traditional statistical techniques like N-grams, Hidden Markov Models (HMMs) and certain linguistic rules to learn the probability distribution of words.

**Neural Language Models:** These are the new players in the NLP town and have surpassed statistical language models in effectiveness. They use various kinds of neural networks to model language.

Wikipedia lists the following language models:

# The unigram language model

A unigram model can be treated as a combination of one-state finite automata. It splits the probability of a sequence of terms in a context into a product over the individual terms:

P_uni(t₁ t₂ t₃) = P(t₁) P(t₂) P(t₃)

In this model, the probability of each word depends only on that word's own distribution in the document, which means the probability distribution over the model's vocabulary sums to 1.

The probability of a specific query Q under the model M_d of a document is the product of its term probabilities:

P(Q | M_d) = ∏ₜ P(t | M_d), taken over the terms t in the query.

The probability of the same word will differ between the unigram models of different documents, since each document has its own word distribution.

The unigram model is smoothed to avoid P(term) = 0 instances, usually by generating a maximum-likelihood model for the entire collection and then linearly interpolating the collection model with the maximum-likelihood model for each document.
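As a rough illustration, here is a minimal unigram model with linear-interpolation smoothing over a toy document and collection (the corpus and the interpolation weight λ = 0.5 are assumptions, not prescribed values):

```python
from collections import Counter

def unigram_probs(tokens):
    """Maximum-likelihood unigram probabilities for one token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def smoothed_prob(word, doc_model, collection_model, lam=0.5):
    """Linearly interpolate the document model with the collection model,
    so a word unseen in the document still gets P > 0 as long as it
    occurs somewhere in the collection."""
    return lam * doc_model.get(word, 0.0) + (1 - lam) * collection_model.get(word, 0.0)

doc = "the cat sat on the mat".split()
collection = "the cat sat on the mat the dog ran in the park".split()

doc_model = unigram_probs(doc)
coll_model = unigram_probs(collection)

# "dog" never appears in the document, but smoothing keeps P > 0.
print(smoothed_prob("dog", doc_model, coll_model))
```

Note that each document model is itself a valid distribution (its probabilities sum to 1), which is what makes the interpolation well-behaved.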

# The n-gram language model

In this model, the probability P(w₁, …, wₘ) of observing the sentence w₁, …, wₘ is written as:

P(w₁, …, wₘ) = ∏ᵢ P(wᵢ | w₁, …, wᵢ₋₁) ≈ ∏ᵢ P(wᵢ | wᵢ₋₍ₙ₋₁₎, …, wᵢ₋₁)

The n-gram model assumes that the probability of encountering the *i*th word wᵢ given the context history of the preceding *i*−1 words can be approximated by its probability given the shortened context history of the preceding *n*−1 words.

The conditional probability can be calculated from n-gram frequency counts:

P(wᵢ | wᵢ₋₍ₙ₋₁₎, …, wᵢ₋₁) = count(wᵢ₋₍ₙ₋₁₎, …, wᵢ₋₁, wᵢ) / count(wᵢ₋₍ₙ₋₁₎, …, wᵢ₋₁)

The terms bigram and trigram refer to language models with n = 2 and n = 3, respectively.

Language models derived directly from frequency counts do not do well with previously unseen words, so n-gram model probabilities are not derived directly from counts; instead, various types of smoothing are used, from simple “add-one” smoothing to more sophisticated techniques such as back-off models or Good-Turing discounting.
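As an illustration of the simplest scheme, here is add-one (Laplace) smoothing applied to bigram counts over a toy corpus (the corpus is an assumption):

```python
from collections import Counter

def addone_bigram_prob(w, prev, bigrams, unigrams, vocab_size):
    """Add-one (Laplace) smoothed bigram probability:
    P(w | prev) = (count(prev, w) + 1) / (count(prev) + V)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)

tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)  # vocabulary size

# An unseen bigram such as ("cat", "ran") still gets non-zero probability.
print(addone_bigram_prob("ran", "cat", bigrams, unigrams, V))
# A seen bigram like ("the", "cat") stays higher than unseen ones.
print(addone_bigram_prob("cat", "the", bigrams, unigrams, V))
```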

Bidirectional representations, by contrast, condition on both pre- and post-context in all layers.

In the bigram (n = 2) language model, the sentence “I saw the red house” is approximated as:

P(I, saw, the, red, house) ≈ P(I | ‹s›) P(saw | I) P(the | saw) P(red | the) P(house | red) P(‹/s› | house)

In a trigram (n = 3) language model, the approximation is:

P(I, saw, the, red, house) ≈ P(I | ‹s›, ‹s›) P(saw | ‹s›, I) P(the | I, saw) P(red | saw, the) P(house | the, red) P(‹/s› | red, house)

Note: ‹s› and ‹/s› are markers denoting the beginning and end of the sentence, respectively.
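The factorization above can be sketched with maximum-likelihood bigram estimates from a toy two-sentence corpus (the corpus is an assumption; no smoothing is applied, so an unseen history would raise an error):

```python
from collections import Counter

def bigram_model(sentences):
    """MLE bigram probabilities with <s> / </s> sentence markers."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(tokens[:-1])              # history counts
        bigrams.update(zip(tokens, tokens[1:]))   # pair counts
    return lambda w, prev: bigrams[(prev, w)] / unigrams[prev]

corpus = ["I saw the red house", "I saw the dog"]
p = bigram_model(corpus)

# P(I saw the red house) ≈
#   P(I|<s>) P(saw|I) P(the|saw) P(red|the) P(house|red) P(</s>|house)
tokens = ["<s>"] + "I saw the red house".split() + ["</s>"]
prob = 1.0
for prev, w in zip(tokens, tokens[1:]):
    prob *= p(w, prev)
print(prob)
```

Here only P(red | the) is less than 1 (“the” is followed by “red” in one of its two occurrences), so the sentence probability comes out to 0.5.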

# Exponential language models

Also known as maximum entropy models, these encode the relationship between a word and its n-gram history using feature functions. The model has the form:

P(wₘ | w₁, …, wₘ₋₁) = (1 / Z(w₁, …, wₘ₋₁)) exp(aᵀ f(w₁, …, wₘ))

where Z(w₁, …, wₘ₋₁) is the partition function, *a* is the parameter vector and f(w₁, …, wₘ) is the feature function. In the simplest case, the feature function is just an indicator of the presence of a certain n-gram.
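A minimal sketch of this, with a toy vocabulary, a hand-picked parameter vector and indicator features for two specific bigrams (all of these are illustrative assumptions, not learned values):

```python
import math

vocab = ["house", "dog", "car"]

def features(word, history):
    """Simplest case: indicator features for the presence of certain bigrams."""
    return [1.0 if (history[-1], word) == ("red", "house") else 0.0,
            1.0 if (history[-1], word) == ("red", "car") else 0.0]

a = [2.0, 1.0]  # parameter vector (hand-picked for illustration)

def maxent_prob(word, history):
    """P(word | history) = exp(a . f(word, history)) / Z(history)."""
    score = lambda w: math.exp(sum(ai * fi for ai, fi in zip(a, features(w, history))))
    z = sum(score(w) for w in vocab)  # partition function Z(history)
    return score(word) / z

probs = {w: maxent_prob(w, ["the", "red"]) for w in vocab}
print(probs)
```

Dividing by the partition function is what makes the scores a proper probability distribution over the vocabulary.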

# Neural networks language models

Neural network language models use continuous representations, or embeddings, of words to make predictions. Because they are trained on ever larger bodies of text, the vocabulary grows, and with it the number of possible word combinations.

Since the data is so much larger, words are represented in a distributed manner, as non-linear combinations of weights in a neural network. In other words, neural network language models approximate the language function, and are usually trained as probabilistic classifiers that learn a probability distribution over the vocabulary.

The model can be trained using standard neural-network algorithms such as stochastic gradient descent with backpropagation. The context can be a fixed-size window of previous words, so the network predicts P(wₜ | context) from a feature vector representing the previous *k* words.

One could also use future words as well as past words as features, in which case the estimated probability is P(wₜ | wₜ₋ₖ, …, wₜ₋₁, wₜ₊₁, …, wₜ₊ₖ). This is referred to as a “bag-of-words” model. When the feature vectors of the context words are combined by a continuous operation, the model is called a “continuous bag-of-words” (CBOW) architecture. One can also invert the neural network and make it learn the context from the word, maximizing the average log-probability of the surrounding words, log P(wₜ₊ⱼ | wₜ) for −k ≤ j ≤ k, j ≠ 0, where k is the size of the training context and can be a function of the center word wₜ. This is the skip-gram language model. Bag-of-words and skip-gram are the basis of the word2vec program.
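As a minimal sketch (a toy sentence, no actual training), the following shows how training examples are laid out for the two architectures; the function names and window size are illustrative assumptions:

```python
def skipgram_pairs(tokens, k=2):
    """(center, context) training pairs for a skip-gram model with window k."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(max(0, t - k), min(len(tokens), t + k + 1)):
            if j != t:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, k=2):
    """(context window, target) pairs for a continuous bag-of-words model."""
    return [([tokens[j]
              for j in range(max(0, t - k), min(len(tokens), t + k + 1))
              if j != t], target)
            for t, target in enumerate(tokens)]

tokens = "I saw the red house".split()
print(skipgram_pairs(tokens, k=1))  # center word predicts each neighbor
print(cbow_pairs(tokens, k=1))      # neighbors jointly predict the target
```

The two functions make the inversion concrete: CBOW maps a context window to its center word, while skip-gram maps the center word to each context word.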

It is common to use the distributed representations encoded in the network's hidden layers as representations of words: each word is mapped onto an n-dimensional real vector called a “word embedding”, where n is the size of the layer just before the output layer.

Skip-gram models have the distinctive characteristic that they model semantic relations between words as linear combinations, capturing a form of compositionality. For example, if *v* is the function mapping a word *w* to its *n*-dimensional vector representation, then:

*v(king) − v(male) + v(female) ≈ v(queen)*

where *≈* is made precise by stipulating that its right-hand side must be the nearest neighbor of the value of the left-hand side.
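The analogy can be illustrated with hand-constructed toy vectors (an assumption for the sake of the demo; real word2vec embeddings are learned and high-dimensional):

```python
import math

# Toy 2-d "embeddings", chosen so the analogy holds by construction.
v = {
    "king":   [0.9, 0.9],
    "queen":  [0.9, 0.1],
    "male":   [0.1, 0.8],
    "female": [0.1, 0.0],
    "house":  [0.5, 0.5],
}

def nearest(target, vocab):
    """Word whose vector is the nearest (Euclidean) neighbor of `target`."""
    return min(vocab, key=lambda w: math.dist(vocab[w], target))

# v(king) - v(male) + v(female)
target = [k - m + f for k, m, f in zip(v["king"], v["male"], v["female"])]

# Exclude the query words themselves, as is conventional in analogy tests.
candidates = {w: vec for w, vec in v.items() if w not in ("king", "male", "female")}
print(nearest(target, candidates))
```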

# Uses for Language models

Once a language model has been selected, the natural next question is what it can be used for. Some language models are designed to predict or generate text, so applications that require those abilities can use them, for example:

1. Gmail Smart Compose, which suggests words as you type,

2. Siri-style question answering, which for many is one of the most sci-fi-like tasks,

3. Google Translate: translation is a very complex task; if you have used Google Translate on any technical document, you will have seen how it can fail spectacularly. It also struggles with colloquial translations.

NOTE: this website has some cool exercises for using language models to build the above projects yourself; click the link LINK

# Summary

“What is a language model?” can be answered as follows: it is the use of statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence.

Language models are used in NLP applications in general, and particularly in ones that generate text as output.

NLP is an exciting field at the cutting edge of ML, where practitioners strive to reduce errors and improve the abilities of NLP systems. Language models are the base on which this technology rests: the better the language model, the better the model trains and the more accurate the final result.

# Resources

https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-language-model-nlp-python-code/