Bird’s-Eye View Of Artificial Intelligence, Machine Learning, Neural Networks & Language Part 3



In the previous post we discussed the various neural network models and how powerful they can be compared to traditional ML models. In this post, we'll examine how language can be interpreted by a machine and how some models, especially neural networks, can be used to achieve this. Let's get started.

Part 1: https://taffydas.medium.com/birds-eye-view-of-artificial-intelligence-machine-learning-neural-networks-language-part-1-802b35cf1873

Part 2: https://taffydas.medium.com/birds-eye-view-of-artificial-intelligence-machine-learning-neural-networks-language-part-2-a53d93495de1

Natural Language Processing

Natural Language Processing is the branch of Artificial Intelligence that deals with human language in the form of text or speech. It involves two main branches: Natural Language Understanding (NLU) and Natural Language Generation (NLG). NLU involves parsing natural language in order to understand it. NLG is the generation of meaningful phrases or sentences in natural language.

Natural Language carries with it a lot of challenges like:

Lexical Ambiguity: Ambiguity at the level of individual words, which can carry multiple meanings or parts of speech. E.g.: the word ‘board’ as a noun or a verb.

Syntactic Ambiguity: Sentences whose structure allows multiple readings. E.g.: “I saw an elephant in my pajamas” (who is wearing the pajamas?)

Referential Ambiguity: Uncertainty about who or what is being referred to. E.g.: “Tom met Alex outside. He was surprised.” (who was surprised?)

Other challenges of NLP include identifying sarcasm, jokes, etc.

Word Representations

Word representations are ways of encoding words into machine-readable formats. From these representations, feature vectors can be created; these features may include word frequency, the stemmed form of the word, etc.

One-Hot Encoding

This is the most basic way of representing the words of a document in a matrix. Each unique word is indexed in the matrix, and words are marked as present or absent in a sentence/document with a count of 1 or more, or 0, respectively. This is also sometimes called a bag of words. The bag of words can be made even more compact by converting word inflections into their root form, e.g. ‘tables’ to ‘table’ or ‘ate’ to ‘eat’. This prevents the various inflections of the same word from repeating in the matrix. A challenge with one-hot encoding is that with large documents it tends to produce huge matrices that are mostly filled with 0’s, known as sparse vectors. This is not an efficient way of storing huge amounts of words.
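As a rough illustration, here is a minimal bag-of-words sketch using scikit-learn's CountVectorizer (the two toy sentences are made up for this example):

```python
# Minimal bag-of-words sketch with scikit-learn's CountVectorizer (toy corpus only).
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The cat sat on the table",
    "The cats ate at the tables",
]

vectorizer = CountVectorizer()            # pass binary=True for a strict 0/1 encoding
matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out()) # vocabulary (columns); older sklearn: get_feature_names()
print(matrix.toarray())                   # mostly zeros for large vocabularies (sparse)
```

Note that ‘table’ and ‘tables’ appear as separate columns here; stemming or lemmatizing the corpus first would collapse them into one.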

Word Embedding

Word embeddings are distributional representations of words in a vector space that place similar words, and words that often appear in the same context, close together. These embeddings are usually based on several features/dimensions learned by the model. Words that are close together in the vector space have stronger similarity. Word2Vec from Google and GloVe from Stanford are two commonly used word embedding models.

Word2Vec

Word2Vec embeddings have 2 distinct training approaches. They are:

CBOW (Continuous Bag Of Words): This method takes the context as input and predicts the centre word. For example, the sentence “The king rules over them.” is split into “The”, “king”, “over”, “them” as the input context and “rules” as the word to predict.

Skip-Gram: This is the inverse of CBOW: it takes the centre word as input and predicts the context words.

There are different parameters and hyperparameters that can be tweaked to improve the system, including the context window. In the above example the window size is 5: the centre word counts as one, plus two words on each side of it. One rule of thumb is that smaller windows (roughly 2 to 15) produce related words that are interchangeable in sentences, for example synonyms or even antonyms, whereas larger windows produce words that are related in a broader context without having to be interchangeable in sentences.

Both Skip-Gram and CBOW have their strengths depending on several factors. According to Mikolov, Skip-Gram seems to work better with long sentences, small samples/datasets and rare words, while CBOW seems to work better with short sentences, large samples/datasets and frequent words. Word2Vec is also able to capture linear/associative properties, for example the relationship between (man and boy) or (woman and girl), and it can solve analogies through vector addition and subtraction. A popular example is the formula King - Man + Woman ~= Queen: the resulting vector lies closest to Queen. These clusters of related words can be very beneficial in fields like document classification.
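As a hedged sketch of the analogy idea, the snippet below uses gensim's downloader to load a commonly distributed pre-trained word2vec model (the model name and its availability in your environment are assumptions):

```python
# Sketch: the King - Man + Woman ~= Queen analogy with pre-trained word2vec vectors.
# Requires gensim; the downloader model name assumes the Google News vectors are available.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")   # large download (~1.6 GB)

# most_similar computes (king - man + woman) and returns the nearest words
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' is typically among the top results
```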

GloVe

Unlike Word2Vec, which uses neural networks, GloVe uses a co-occurrence matrix to capture its semantic analogies. The idea is that ratios of word co-occurrence probabilities carry stronger meaning than the probabilities on their own. Word2Vec is a predictive model that learns by reducing its loss function, while GloVe is a count-based system that observes how frequently words appear around each other (probability scores). The resulting co-occurrence matrix is huge, and a way to scale it down (dimensionality reduction) is to factorize it: the lower-dimensional matrix is obtained by minimizing a reconstruction loss and is still able to explain the variance in the original data.

GloVe considers statistics over the entire corpus, as opposed to Word2Vec, which looks only at local groupings within the chosen window. GloVe is also able to show relationships like the probability of ice co-occurring with solid (high), steam co-occurring with gas (high), and water co-occurring with both ice and steam (high). The ratio of co-occurrence probabilities between ice and steam is large for solid (value > 1), small for gas (value < 1), and close to 1 for words related to both contexts, like water. GloVe trains quickly; the main challenge is the large amount of memory required to build the co-occurrence matrix in the first place. Its vectors outperform traditional word representation models on tasks like word analogies, word similarities and named entity recognition.
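To make the co-occurrence idea concrete, here is a toy sketch of counting co-occurrences with a symmetric window (the two-sentence corpus and window size are purely illustrative; actual GloVe then factorizes a weighted version of such a matrix):

```python
# Toy co-occurrence counts with a symmetric window of 2 (purely illustrative).
from collections import Counter, defaultdict

corpus = ["ice is a solid form of water", "steam is a gas form of water"]
window = 2
cooc = defaultdict(Counter)

for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooc[word][tokens[j]] += 1

print(dict(cooc["ice"]))     # words that co-occur with 'ice' within the window
print(dict(cooc["steam"]))   # words that co-occur with 'steam' within the window
```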

Language Model

A language model defines a probability distribution over a sequence of tokens. It is usually used to predict the next token, and order matters.

Note: Word embeddings are not language models as such; they are just a spatial representation of words. Order doesn’t matter in these embeddings, and they often form the first layer of a neural network, called the embedding layer. The training of word embeddings, however, uses language modelling objectives to perform word predictions: Skip-Gram and CBOW (Continuous Bag Of Words) are the objectives used to train Word2Vec embeddings. Unlike a classic language model, this training is not only interested in past tokens when predicting a future token; it also uses surrounding words as context to predict the centre word, or vice versa. In other words, it is bidirectional. Other ways to predict the next token in a sequence include N-grams and bidirectional models.

N-gram

An n-gram is a sequence of n consecutive words. The most used are bigrams (2 consecutive words) and trigrams (3 consecutive words). The higher the n, the harder the data is to model, because language is not static: there are many different ways of phrasing a sentence to mean the same thing. Looking at 5-grams may be too much for the model, which will then tend to overfit rather than generalize. A bag of words can be considered a unigram model, since only one word is indexed at each position.

Bigram: Two words in sequence, where the previous word is used to predict the next word

Trigram: Three words in sequence, where the previous 2 words are used to predict the next word

N-gram: N words in sequence, where the previous n-1 words are used to predict the next word
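A minimal counting sketch of a bigram model (the toy corpus is made up, and real models add smoothing for unseen pairs):

```python
# Minimal bigram language model: count word pairs, then estimate P(next | previous).
# Toy corpus only; real models need smoothing to handle unseen bigrams.
from collections import Counter, defaultdict

corpus = "the king rules over them . the queen rules over them .".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_probs(prev):
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_probs("rules"))   # {'over': 1.0}
print(next_word_probs("the"))     # {'king': 0.5, 'queen': 0.5}
```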

Limitations

  • Due to long-range dependencies between words in language, the n-gram approach is not ideal, since it only captures as much as it looks back. Going too far back may be overkill for the model, requiring large computation time, while using too few previous words may not give the model the information it needs to learn from.
  • It mostly works only if the test data looks like the training data, which is rarely the case in real life.
  • N-grams are a sparse representation of language, because the model is built from the probabilities of word sequences seen in training. It gives zero probability to any sequence not present in the training corpus, and storing all observed n-grams is a hefty price to pay.

Advanced Language Models

Attention

LSTMs and GRUs are improvements on RNNs and are able to hold information for much longer, although they can be computationally expensive. Attention is another neural network approach where the output is based only on the most relevant parts of the input rather than the entire sequence, thus paying attention to the parts of the input that are considered relevant. For example, consider a large attention weight a(3,2) in a language model: attention on the second input token is high in order to predict the third output token. In a nutshell, attention indicates which part of an input sequence requires more focus in order to make the next prediction. Attention weights are usually normalized so that they sum to 1 over the input, forming a distribution. Applications of attention include language translation, where the attention mechanism is applied to parts of the input at each step of the output sequence, and conversational models, where attention is applied to earlier parts of the conversation to produce a response. There are different variations of attention, including local and global attention. Global attention attends to all hidden states, containing all words, before producing an output, while local attention attends to a given window of words before producing an output. The downside of global attention is the amount of computation required, which can be very expensive. Hierarchical attention models attention in a hierarchical format, showing relevance based on words at a lower level and sentences at a higher level. Below are other types of attention:

Self Attention

Self attention is a mechanism for relating different tokens within the same sequence in order to calculate an internal representation of how related they are to each other. In self attention, the query comes from the same input sequence being attended to; this differs from generic attention, where the query is used to decide which parts of another set of data points are important. Self attention is very good for tasks like language translation or sentence parsing, where words are interdependent and relate to several other words in the same sequence. Example: “The boy with the artwork said he painted it.” Here, the word “he” should attend strongly to “the boy”.
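Here is a minimal numpy sketch of scaled dot-product self attention over a toy sequence (the random projection matrices stand in for learned weights):

```python
# Scaled dot-product self attention over a toy sequence (numpy sketch).
# The projection matrices W_q, W_k, W_v are random here; real models learn them.
import numpy as np

np.random.seed(0)
seq_len, d_model = 4, 8                     # e.g. 4 tokens, 8-dimensional embeddings
x = np.random.randn(seq_len, d_model)       # token embeddings (plus position info)

W_q, W_k, W_v = (np.random.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d_model)         # how similar every token is to every other
scores -= scores.max(axis=-1, keepdims=True)                            # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # rows sum to 1
output = weights @ V                        # each token becomes a weighted mix of all tokens

print(weights.round(2))                     # row i = how much token i attends to each token
```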

Soft Attention

Soft attention is deterministic, meaning that all the hidden states are used to generate the context vector, which always produces the same output for the same input. Another advantage is that soft attention is differentiable: a differentiable function is continuous and smooth, with no breaks, so it can be trained directly with backpropagation.

Hard Attention

Hard attention focuses on a single hidden state to generate the context vector. It is stochastic, meaning that the choice of hidden state is sampled rather than fixed, so the same input does not always produce the same output.

Transformer

Transformers, just like RNNs, handle sequential data; however, the model does not require the input to be processed in sequence. This allows for much more parallelization than RNNs, so transformers can be trained on larger datasets. Transformers are built on an attention mechanism through which the model can attend to the more important tokens for a given task. The attention mechanism addresses the vanishing gradient problem seen in RNNs: when all the inputs of an RNN are processed into a single context vector, the older input items tend to get ignored, which is why RNNs were initially combined with attention mechanisms. After several improvements it was observed that attention mechanisms on their own produce results similar to attention combined with RNNs, and hence the focus shifted to attention-only models.

Each transformer consists of encoders and decoders. Encoders process the input into encoded information that indicates which parts of the input are relevant to each other. The output of each encoder goes into the next encoder and then on to the decoder. The decoders do the opposite and decode the encoded information into an output sequence. Encoders and decoders have an attention mechanism to specify which parts of the inputs are relevant and weigh them. Decoders have an additional attention layer to find relevant information in the previous output sequence before producing the output for the current step. Transformers have multiple attention heads, each looking at a different kind of “relevance”, such as attending to the next word or to the main objects of verbs. This approach works well because different parts of the sentence are analyzed for possible connectedness, and relevant tokens are revealed better than a single attention head could manage. Outputs from the attention heads are concatenated and multiplied by an output weight matrix to produce the dimension size required for the next step. The resulting output matrix is passed to a feed forward neural network. The feed forward network consists of 2 layers: the first has hidden units 4 times the size of the input, which allows the model to accommodate a rich enough representation of the tokens, and the second scales the large hidden size back down to the original model dimension.
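As a sketch of the position-wise feed-forward block described above (dimensions follow the 4x expansion; the weights are random placeholders rather than trained values):

```python
# Position-wise feed-forward block: expand to 4x the model size, apply ReLU,
# then project back down to the model dimension. Weights are random placeholders.
import numpy as np

d_model, d_ff = 8, 32                       # d_ff = 4 * d_model, as described above
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

def feed_forward(x):
    hidden = np.maximum(0, x @ W1 + b1)     # first layer + ReLU
    return hidden @ W2 + b2                 # second layer scales back to d_model

tokens = np.random.randn(4, d_model)        # 4 token vectors coming out of attention
print(feed_forward(tokens).shape)           # (4, 8): same shape in, same shape out
```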

Encoder

Each encoder consists of a self attention mechanism and a feed forward network. It is meant to provide richer context for words than normal embeddings before predictions are made in the decoder phase. The self attention layer weighs the relevance among the encodings coming from the previous encoder and outputs a new encoding. This output encoding is passed to the feed forward network, which processes it and transfers it to the next encoder, and eventually the decoder.

The first encoder takes the positional information and word embeddings of the input, known as the input layer. A simple one-hot encoding could represent positional information, but a more advanced strategy is to use sine and cosine functions, which handle longer sentences smoothly with values in the range -1 to 1. The combination of the word embeddings and the positional vector is then passed on for processing.

Self attention as implemented in transformers uses queries and key-value pairs to calculate attention. The query is the token in the sequence being analyzed, the key is a matching token available in the data, and the value is mapped to the key; the value contains the representation of the token itself and the context it appears in. A dot product between the query and the keys produces a matrix with high scores for the most relevant tokens and low scores for all others; this is a similarity measure of how much each key matches the query. The scores are scaled down by dividing by the square root of the query/key dimension, which keeps gradients stable when the dimensions get large. A softmax is applied to the scaled scores, which provides the attention weights. The resulting matrix is then multiplied with the values to emphasize the most attended tokens. Keys and values are usually the same in self attention.

The attention output vector is added back to the original positional input embedding through residual connections. This ensures positional information about the words is not lost during backpropagation. The output of the residual connection is passed through normalization for faster convergence, and then fed into a feed forward network for further processing. Residual connections are especially useful for very deep networks, where some information may otherwise get lost along the way; they ensure that information is carried throughout the network. The feed forward networks are made of linear layers connected via ReLU activations. Another round of residual connection and normalization is performed after the feed forward network.

Encoders can be stacked on top of each other, each with its own weights. The final output of the encoder stack should contain rich context, which is then provided to the decoder for prediction. The original Transformer, for example, has six encoders, after which information is transferred to the decoder.
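A small sketch of the sine/cosine positional encoding mentioned above (following the formulation popularized by the original Transformer paper):

```python
# Sinusoidal positional encoding: even dimensions use sine, odd dimensions use cosine,
# and all values stay between -1 and 1, even for long sequences.
import numpy as np

def positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]          # (max_len, 1)
    dims = np.arange(d_model)[None, :]               # (1, d_model)
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((max_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])      # even indices
    encoding[:, 1::2] = np.cos(angles[:, 1::2])      # odd indices
    return encoding

# Added element-wise to the word embeddings before the first encoder.
print(positional_encoding(max_len=10, d_model=8).shape)   # (10, 8)
```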

In summary, Jay Alammar’s post (http://jalammar.github.io/illustrated-transformer/) illustrates these steps with the following figures:

[Image: Sub-layers of the encoder]

[Image: In-depth sub-layers of the encoder]

Other illustrations of the encoding layer (source: https://lionbridge.ai/articles/what-are-transformer-models-in-machine-learning/):

1. Input layer to attention layer using a concatenated multi-head

2. Normalized output from the attention layer to the feed-forward network

Decoder

The decoder has 3 parts: a self attention mechanism, a feed forward network, and an attention layer that attends to the encoder outputs. Final outputs from the decoder stack are produced one at a time and fed back into the decoder stack as input for the next prediction; this is known as auto-regression. Just like the encoder, the first decoder takes positional information and embeddings of the output sequence as its input.

Self attention in the decoder is slightly different. Since words are generated one at a time, the self attention mechanism cannot have access to future words when attending to other words. For example, when implementing self attention over the phrase “I am”, the model cannot have access to the word “fine”. This approach is known as masking (a look-ahead mask). The mask is a matrix made up of 0’s in the lower triangle and negative infinities in the upper triangle. When added to the scaled attention scores, the upper triangle is masked out. The mask is applied after scaling the scores and before the softmax; the softmax of the negative-infinity values produces 0. The masked attention also has multiple heads, and the values from each head are concatenated before being passed to the second decoder sub-layer.

The decoder also makes predictions based on the context generated by the encoder. The outputs of the encoder are used as keys and values for the second decoder sub-layer, while the outputs of the first decoder sub-layer are used as queries. The output of this multi-headed attention is processed and then passed to a feed forward network. The final decoder output goes through a linear transformation and a softmax layer to produce the prediction. Both encoder and decoder can be stacked multiple times; in Google’s research paper “Attention Is All You Need”, the researchers used 6 encoders and 6 decoders for the Transformer, although that number can be experimented with. The purpose of stacking is to capture the complexities of language and reveal hierarchical features.
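A minimal sketch of the look-ahead mask described above (toy scores, numpy only):

```python
# Look-ahead (causal) mask: 0 on and below the diagonal, -inf above it,
# so the softmax assigns zero attention to future tokens.
import numpy as np

def look_ahead_mask(seq_len):
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)   # 1s strictly above the diagonal
    return np.where(upper == 1, -np.inf, 0.0)

scores = np.random.randn(4, 4)              # scaled attention scores for 4 output tokens
masked = scores + look_ahead_mask(4)        # future positions become -inf
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))                     # upper-triangle weights are exactly 0
```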

From Jay Alammar’s post (http://jalammar.github.io/illustrated-transformer/):

[Image: Sub-layers of the encoder and decoder and their connections]

[Image: 6 stacks of encoder and decoder and their connections]

Another model, known as the Reformer, is claimed to be a more efficient version of the Transformer: it can handle more data thanks to improvements like locality-sensitive hashing, which reduce the computational overload the Transformer faces during the self attention phase.

The Google Brain team has made available a library, Tensor2Tensor, that implements the transformer model for various tasks like image captioning, machine translation, parsing, etc. The library can easily be used and tweaked to solve the specific types of problems mentioned previously.

TransformerXL

The vanilla transformer, although very powerful, still presented some challenges. It addressed the vanishing gradient problem; however, it still had issues with long dependencies, as the inputs only took 512 tokens at a time. For large documents, this means splitting the document into segments for training, which causes context fragmentation and limited context dependency, since very important information in earlier parts of the document cannot be accessed. The vanilla Transformer tries to mitigate this by moving the input segments forward by one step after each pass and re-running the whole process. The TransformerXL addresses context fragmentation and limited context dependency by introducing a recurrence mechanism. The recurrence mechanism works by transferring information from the output of the previous segment to the current segment of inputs. This information is concatenated, so attention dependencies over long ranges are not lost. This approach, however, does not allow for absolute positional encoding as in the vanilla transformer, because across input segments different tokens would share the same positions. Instead, relative positional encoding is introduced at each attention layer rather than only at the beginning of the input. TransformerXL achieves better metrics than vanilla transformers or RNNs.

Seq2Seq Model

A sequence-to-sequence model takes a sequence as input, such as a sentence, and outputs a corresponding sequence. It is most common in machine translation, text summarization, conversational modeling and image captioning. The model is made up of an encoder and a decoder, each containing RNN implementations. The encoder produces a context vector and sends it to the decoder to output the predicted sequence. At each time step, the RNN takes in the current word embedding and the previous hidden state. The challenge of vanishing gradients with long sentences spurred the addition of attention to seq2seq models: the decoder in this modified seq2seq model receives all hidden states from the encoder instead of only the last one, so more information flows into the decoder. Before producing each output, the decoder attends to all the encoder hidden states and predicts the next item in the sequence.

Semi Supervised Sequence Model

As NLP grew, the context of a sequence of words became even more important to the field. Word embeddings do not fully capture the meaning of words, and hence various models have attempted to capture the true contextual meaning of words. A semi-supervised sequence model is an attempt to train on sequence information by auto-encoding it, so that the model produces the same input sequence as its output. Auto-encoders are trained to compress data to a point where variants/features are still distinguishable and then decompress the values to an approximation of the original input. Non-linear auto-encoders produce more fine-tuned reconstructions than linear auto-encoders. This enables the model to be run on a huge text dataset to generate suitable context in a compressed format, even for longer sentences. On top of the unsupervised training, a specific supervised task like document classification can then be applied using models like LSTMs. This specific example is referred to as SA-LSTM (Sequence Autoencoder LSTM).

Pre-trained Models

BERT

BERT stands for Bidirectional Encoder Representations from Transformers. BERT builds on previous improvements from other language models, including the Semi-Supervised Sequence Model, ELMo, ULMFiT, the Transformer and the OpenAI Transformer. Applications for BERT vary, some of which are sentiment analysis, fact checking and sentence classification. These models are usually trained by minimally modifying the BERT model and mainly training the classifier on top. This process is known as fine tuning and has its roots in the Semi-Supervised Sequence Model and ULMFiT.

BERT is a trained Transformer encoder stack and comes in two sizes: BERT Base and BERT Large. The encoder layers are sometimes referred to as transformer blocks. Base has 12 encoders, twice as many as the previously discussed Transformer. Base also has a hidden size of 768 and 12 attention heads; the original Transformer has a hidden size of 512 and 8 attention heads. BERT Large has 24 encoders, a hidden size of 1024 and 16 attention heads.

Just like the transformer, BERT takes in word representations. The first token of the input for BERT is a special [CLS] token, which stands for classification. A self attention mechanism is applied in each encoder and then passed to the feed forward network, and the output of one encoder is passed to the next encoder. The output vector of the final encoder (for example, of size 768 in BERT Base) is passed to the classifier for training, where the vector at the [CLS] position is used to predict the classifier label. For multi-label classification, the classifier head on top of the [CLS] vector can be adjusted to output multiple labels.
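As a hedged sketch, the snippet below uses the Hugging Face transformers library to extract the [CLS] vector from a pre-trained BERT Base checkpoint ('bert-base-uncased' is one commonly used checkpoint name, assumed available):

```python
# Sketch: obtain the [CLS] vector from a pre-trained BERT with Hugging Face transformers.
# 'bert-base-uncased' is a commonly used BERT Base checkpoint (assumed available).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("This movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

cls_vector = outputs.last_hidden_state[:, 0, :]   # the [CLS] token, shape (1, 768)
print(cls_vector.shape)                           # this vector feeds a classification head
```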

BERT may not be considered a complete language model because of its architecture. It is not effective at predicting the next word, so it is not great at language generation, for example. The training in the encoder is meant to build a rich sense of context for each word, so it is great for other tasks like text classification. It is also good at next sentence prediction, because it is partly trained on that. As mentioned earlier, BERT borrowed some of its concepts from other innovative language models, because it was quickly realized that word embeddings alone were not capturing the entirety of context, semantics or coreference in a sentence. Some of these models are:

ELMo: ELMo factors in the context of a given word in a sentence, so a word has a different vector representation depending on its context. ELMo implements a bidirectional LSTM, trained on a huge corpus to predict both the next word and the previous word in a sequence. The final representation is formed by concatenating the LSTM hidden states, weighting them according to the task at hand, and summing the resulting vectors.

ULMFiT: ULMFiT advanced transfer learning by creating methods to effectively use what the model learns in pre-training, beyond just word embeddings. It provided ways to fine tune a pre-trained model for effective use in language-specific tasks.

OpenAI GPT (Generative Pre-Training Transformer): The introduction of the transformer was seen as a replacement for LSTMs, especially since transformers handle longer sentences better. The OpenAI transformer found a way of using the original transformer architecture while making the model ready to be fine tuned for various language-specific tasks. The OpenAI transformer uses only a 12-layer stack of transformer decoders to train the model on a prediction task; it does not need the encoder portion, as the decoder alone is well suited to predicting the next word. There is no encoder-decoder attention layer, but there is a self attention layer in each decoder. The output of the OpenAI transformer can then be used for downstream tasks like sentence generation and machine translation. The OpenAI transformer is, however, forward-looking (left-to-right) despite using attention mechanisms, whereas ELMo used bidirectional LSTMs.

BERT’s innovation stems from bidirectional transformer encoders that are masked for training purposes.

In the training process of BERT, 15% of words are masked. Of these masked words, 80% are replaced with the token [MASK], 10% are randomly replaced with wrong words so that the model must predict the right word, and the remaining 10% are left intact. Aside from masked language modeling, BERT is also trained on next sentence prediction in order to learn the relationship between sentences: 50% of the sentence pairs use the actual corresponding second sentence, while the other 50% are randomly assigned a different second sentence. The labels IsNext and NotNext are assigned to the former and latter respectively. BERT can either be fine tuned for specific tasks (e.g. by altering the input and adding a task-specific head), or used to create contextual vectors as features for supervised training.
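A toy sketch of the 15% / 80-10-10 masking scheme described above (the vocabulary and sentence are made up; real implementations work on token IDs):

```python
# Toy sketch of BERT-style masking: select ~15% of tokens, then replace 80% of the
# selected tokens with [MASK], 10% with a random word, and leave 10% unchanged.
import random

vocab = ["king", "queen", "rules", "over", "them", "the"]   # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    out = list(tokens)
    for i in range(len(out)):
        if random.random() < mask_prob:          # select roughly 15% of positions
            r = random.random()
            if r < 0.8:
                out[i] = "[MASK]"                # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = random.choice(vocab)    # 10%: replace with a random word
            # remaining 10%: keep the original token (still predicted by the model)
    return out

print(mask_tokens("the king rules over them".split()))
```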

GPT-2

GPT-2 is an improvement on the OpenAI GPT model; apart from minimal architectural differences, it is trained on a larger corpus and produces more accurate results. One of the main differences between BERT and GPT-2 is that the former uses an encoder stack while the latter uses a decoder stack. BERT uses its encoders to produce a rich context vector based on its bidirectional training. GPT-2 uses its decoder stack to sequentially predict its outputs in an autoregressive manner. Models like XLNet have found ways of using an autoregressive approach while still contextualizing the input information bidirectionally.

The model begins by taking an input that starts with the start-of-sentence tag. Each value in the input vector is used to predict the next token along the path of the current token as it goes through all the decoder layers. A parameter can be tweaked so that instead of always going with the highest-probability prediction, the model samples from the top-k predictions; this sometimes helps the model avoid being stuck in a narrative loop. GPT-2 does not reinterpret previous tokens when processing the current token: the keys and values calculated for previous words are cached and reused at each iteration of self attention when predicting the next token, to avoid recalculating them every time. GPT-2 has shown great promise in applications like machine translation, summarization and music generation.
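A hedged sketch of top-k sampling with the small publicly released GPT-2 checkpoint via the Hugging Face transformers library (the generated text will vary between runs):

```python
# Sketch: top-k sampling with GPT-2 using Hugging Face transformers.
# 'gpt2' is the small publicly released checkpoint; generated text varies per run.
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("The future of language models", return_tensors="pt")
output = model.generate(
    input_ids,
    max_length=40,
    do_sample=True,   # sample instead of always taking the highest-probability token
    top_k=40,         # restrict sampling to the top-k predictions, as described above
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```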

Inference

Natural language inference involves problem sets containing a pair of sentences: a premise and a hypothesis. Given the premise, the NLP model predicts whether the hypothesis is true (entailment), false (contradiction), or undetermined (neutral).

Source: http://nlpprogress.com/english/natural_language_inference.html
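A small illustration of the three labels with made-up premise/hypothesis pairs:

```python
# Made-up premise/hypothesis pairs illustrating the three inference labels.
examples = [
    ("A man is playing a guitar on stage.", "A man is performing music.", "entailment"),
    ("A man is playing a guitar on stage.", "The man is asleep at home.", "contradiction"),
    ("A man is playing a guitar on stage.", "The man is a professional.", "neutral"),
]

for premise, hypothesis, label in examples:
    print(f"{label:13s} | premise: {premise} | hypothesis: {hypothesis}")
```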

This concludes the three-part series on everything Machine Learning, Neural Networks and NLP (NLU/NLG). There is still so much more exciting stuff we haven't been able to cover in these articles, but I hope this at least gets you started and opens your eyes to the world of possibilities surrounding natural language. Looking forward to your thoughts and comments on this! Thank you.


Written by Taffy Das

Check out more exciting content on new AI updates and their intersection with our daily lives: https://www.youtube.com/channel/UCsZRCvdmMPES2b-wyFsDMiA
