How Does an AI Model Acquire Its Writing Capability? (2)

A Few Hidden Techniques

Eric S. Shi 舍予
Artificial Corner
10 min read · Apr 23, 2023


Photo by Markus Spiske on Unsplash

A few seemingly not-so-earth-shaking techniques have surprisingly produced the outstanding performance of large language models (LLMs), such as ChatGPT, over the previous generation of writing bots. For example:

  1. Large-Scale Pre-Training Technique: One of the key techniques behind LLM performance is large-scale pre-training. It involves training the language model on a large corpus of text data before fine-tuning it for a specific task. This allows the LLM to capture a wide range of language patterns and structures and to learn representations of language that are useful across many tasks. Large-scale pre-training is the standard recipe for the transformer-based models that most LLMs build on and has been essential to their outstanding performance.
  2. Masked Language Modeling Technique: Masked language modeling is another technique that has contributed to LLMs’ performance. It involves randomly masking words in the input sequence and then predicting them based on the context of the surrounding words (see the short sketch after this list). This technique helps the LLMs learn more robust language representations and capture relationships between words that may not be apparent from the immediate context.
  3. Transfer Learning Technique: Transfer learning is yet another key technique behind the LLMs’ performance. It involves transferring knowledge learned from one task to another, related task. For example, a generic LLM can be fine-tuned on a language translation task and then carry the knowledge learned there over to a text summarization task. This speeds up learning and yields better performance across a wide range of language tasks.
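
To make the masking idea concrete, here is a minimal sketch of how masked-language-modeling training pairs can be constructed. It assumes a toy whitespace tokenizer and a 15% masking ratio; real LLMs use subword tokenizers and vastly larger corpora.

```python
import random

# A minimal sketch of masked-language-modeling data preparation, assuming a toy
# whitespace tokenizer; real LLMs use subword tokenizers and mask at scale.
MASK_TOKEN = "[MASK]"
MASK_PROB = 0.15  # masking ratio typical of BERT-style pre-training

def make_mlm_example(sentence: str, seed: int = 0):
    """Randomly mask tokens and return (masked_tokens, labels)."""
    rng = random.Random(seed)
    tokens = sentence.split()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < MASK_PROB:
            masked.append(MASK_TOKEN)   # the model must predict this position
            labels.append(tok)          # the hidden original token is the target
        else:
            masked.append(tok)
            labels.append(None)         # no loss is computed at this position
    return masked, labels

masked, labels = make_mlm_example("the quick brown fox jumps over the lazy dog")
print(masked)
print(labels)
```

During pre-training, the model is penalized whenever its prediction at a [MASK] position differs from the hidden original token, which forces it to exploit the surrounding context.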

In addition, the Attention-Based Sequence Modeling Technique is key to many LLMs’ successes, although it operates while the model carries out the actual language tasks, i.e., after training is done. It helps to improve the LLMs’ performance and generate accurate and relevant language outputs. Attention-based sequence modeling involves attending to different parts of the input sequence with different weights, allowing the LLM to capture more complex relationships between words and generate more nuanced and sophisticated language outputs.
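
As an illustration, the sketch below computes scaled dot-product attention for a toy sequence of four tokens. The projection matrices stand in for parameters that a real LLM learns; here they are random, purely to show how the attention weights arise from the input itself.

```python
import numpy as np

# A minimal sketch of scaled dot-product attention over a toy input sequence.
# The projection matrices (W_q, W_k, W_v) stand in for parameters that a real
# LLM learns with backpropagation; here they are random for illustration.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                     # 4 tokens, 8-dimensional embeddings
x = rng.normal(size=(seq_len, d_model))     # token embeddings for the sequence

W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d_model)         # how strongly each token attends to the others
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
output = weights @ V                        # weighted mix of value vectors

print(weights.round(2))   # each row sums to 1: the attention distribution per token
```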

The parameters that produce these attention weights are learned during the LLM training process through backpropagation and gradient descent rather than being assigned by human trainers. During training, they are adjusted iteratively to reduce the model’s prediction error, with a validation set (or other forms of model selection) used to check that the model generalizes. This process allows the LLM to learn which parts of the input sequence matter most for generating accurate and relevant language outputs and to weight them accordingly.

For example, suppose an LLM is being trained on a language translation task and is given pairs of input and output sentences. In that case, the parameters behind its attention weights are updated through backpropagation and gradient descent according to how accurately the LLM translates the input sentences into the corresponding output sentences. Iterating these updates during training (while monitoring performance on a validation set) lets the LLM learn which parts of the input sequence are most important for generating accurate translations.
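
The following toy illustration shows the update rule at the heart of that process. A single linear layer and synthetic numeric targets stand in for the LLM and its translation data; the point is only how backpropagated gradients move the weights, not how a real translation model is built.

```python
import numpy as np

# A toy illustration of backpropagation and gradient descent. A single linear
# layer stands in for the LLM; the targets are synthetic numbers, purely to
# show the update rule, not a real NLP setup.
rng = np.random.default_rng(1)
X = rng.normal(size=(32, 8))          # 32 "source sentences" as 8-dim features
y = X @ rng.normal(size=(8,)) + 0.1 * rng.normal(size=32)  # synthetic targets

w = np.zeros(8)                       # model weights: learned, not hand-assigned
lr = 0.05                             # learning rate

for step in range(200):
    pred = X @ w                      # forward pass
    grad = 2 * X.T @ (pred - y) / len(y)   # gradient of the mean squared error
    w -= lr * grad                    # gradient-descent update

print("final training error:", float(np.mean((X @ w - y) ** 2)))
```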

Large-scale pre-training, masked language modeling, transfer learning, and attention-based sequence modeling, along with multi-head attention, position-wise feed-forward networks (FFNs), and layer normalization, all contributed to the LLMs’ outstanding performance and have helped to differentiate the LLMs from the previous generation of writing bots.

Three examples are listed below to provide you with a flavor of how transfer learning is achieved in LLMs:

Example-1 (Language Translation to Text Summarization): In this example, an LLM is first pre-trained on a large corpus of text data using language modeling. Then, the LLM is fine-tuned on a language translation task, where the LLM learns to translate text from one language to another. After this, the LLM is fine-tuned on a text summarization task, where the LLM learns to generate summaries of text.

In language translation, the LLM learns to generate a new sentence in a target language that conveys the same meaning as the input sentence in the source language. In order to do this, the LLM must learn to capture the most important information in the input sentence and express it in a way that makes sense in the target language.

Comparatively, in text summarization, the LLM must learn to capture the most important information in a longer piece of text and express it in a shorter summary that conveys the same meaning. This requires the ability to identify the most important information and to express it in a concise and coherent way.

The knowledge learned from the language translation task can be transferred to the text summarization task, as both tasks involve generating natural language outputs. The actual transfer can be achieved via a number of mechanisms, such as:

1.1 Via Attention Mechanisms, where the LLM learns to focus on the most important parts of the input sentence when generating the output sentence.

1.2 Via Representation Learning, where the LLM learns to represent the input sentence as a high-dimensional vector that captures its meaning.

1.3 Via Language Modeling, where the LLM learns to generate natural language outputs that are grammatically correct and make sense in the context of the input sentence.

So, if the knowledge learned from language translation is transferred to text summarization through attention mechanisms, representation learning, language modeling, or any combination of the three, the training needed to enable the LLM to perform text summarization tasks can be dramatically reduced. The resulting LLM is typically more efficient and effective at generating higher-quality summaries that capture the most important information.
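
As a rough sketch of how such a transfer might look in code, the snippet below reuses a pre-trained sequence-to-sequence checkpoint and takes one fine-tuning step on a summarization pair. The Hugging Face “t5-small” checkpoint and the tiny example pair are illustrative assumptions, not the setup of any particular production LLM.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# A minimal fine-tuning sketch, assuming the Hugging Face "t5-small" checkpoint
# as a stand-in for a pre-trained sequence-to-sequence LLM. The same weights
# that can serve translation are reused and nudged toward summarization.
model_name = "t5-small"                                   # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

article = ("summarize: The committee met for six hours, reviewed three draft "
           "proposals, and finally agreed on a reduced budget for next year.")
target_summary = "The committee approved a reduced budget."

inputs = tokenizer(article, return_tensors="pt", truncation=True)
labels = tokenizer(target_summary, return_tensors="pt", truncation=True).input_ids

# One gradient step: the summarization loss is backpropagated through the
# pre-trained encoder-decoder weights, transferring what they already know.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
optimizer.zero_grad()
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
print("summarization fine-tuning loss:", float(loss))
```

Because the reused weights already encode general language (and, in a translation-first curriculum, translation) knowledge, far fewer summarization examples are needed than if training started from scratch.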

Example-2 (Named Entity Recognition to Relation Extraction): In this example, the LLM is first pre-trained on a large corpus of text data using language modeling. Then, it is fine-tuned on a named entity recognition task, where the LLM learns to identify named entities (such as people, organizations, and locations) in text. After this, the LLM is fine-tuned on a relation extraction task, where the LLM learns to identify the relationships between the named entities.

In this training sequence, the knowledge learned from the named entity recognition task is transferable to the relation extraction task, as both tasks involve identifying information about named entities in text.

In practice, the knowledge learned from named entity recognition can be transferred to relation extraction in a number of ways, such as:

2.1 Via Named Entity Embeddings, where the LLM uses the named entity embeddings learned during named entity recognition to represent the named entities during relation extraction. This allows the LLM to identify the relationships between the named entities easily.

2.2 Via Attention Mechanisms, where the attention mechanisms are used in both named entity recognition and relation extraction to identify the most relevant parts of the input text. By focusing on the most relevant parts of the text during relation extraction, the LLM can easily identify the relationships between the named entities.

2.3 Via Fine-Tuning, where the LLM can fine-tune its pre-trained model on a relation extraction dataset after training on a named entity recognition dataset. This allows the LLM to transfer the knowledge learned during named entity recognition to relation extraction.
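
A minimal sketch of mechanism 2.3 is shown below. It assumes “bert-base-cased” as a stand-in for an encoder that has already been fine-tuned on named entity recognition; the same encoder is then paired with a small new head for relation extraction, and the relation labels are invented purely for illustration.

```python
from transformers import AutoModel, AutoTokenizer
import torch

# A minimal sketch of mechanism 2.3, assuming "bert-base-cased" as a stand-in
# for an encoder already fine-tuned on named entity recognition. The same
# encoder is reused, with a small new head, for relation extraction; the
# relation labels below are hypothetical.
encoder_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(encoder_name)
encoder = AutoModel.from_pretrained(encoder_name)

num_relations = 3                       # e.g. works_for, located_in, founded_by
relation_head = torch.nn.Linear(encoder.config.hidden_size, num_relations)

text = "Ada Lovelace worked with Charles Babbage in London."
batch = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state   # contextual vectors per token
sentence_vec = hidden[:, 0, :]                    # [CLS] vector summarizes the text
logits = relation_head(sentence_vec)              # one score per candidate relation
print(logits.softmax(dim=-1))
```

In a real pipeline, the encoder and the new head would then be trained jointly on a relation extraction dataset, so that the entity knowledge already in the encoder is carried over.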

Embeddings are a way of representing words or phrases as vectors in a high-dimensional space. These vectors capture the semantic and syntactic properties of the words or phrases and can be used as inputs to machine learning models.

Named entity embeddings are a type of embedding that represents named entities, such as people, organizations, and locations, as vectors. These embeddings can be learned during the training process or can be pre-trained on a large corpus of text data. Here are a few embedding models whose vectors are commonly used to represent named entities:

  • GloVe: GloVe is a popular set of pre-trained word embeddings. Because it was trained on a very large corpus, its vocabulary covers many named entities, whose vectors can be looked up directly.
  • ELMo: ELMo is a pre-trained embedding model that uses a deep neural network to generate context-sensitive embeddings, so the vector for a named entity reflects the sentence in which it appears.
  • BERT: BERT is a pre-trained language model trained on a large corpus of text data with a masked language modeling objective; its contextual representations can likewise serve as embeddings for named entities.

Named entity embeddings are a powerful tool for natural language processing and can be used in a wide range of applications, including information extraction and text classification.
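
For a feel of what looking up entity vectors is like, the short sketch below uses gensim’s downloader to load a pre-trained GloVe package (“glove-wiki-gigaword-100” is assumed here as one commonly available option) and compares a few entities; the package name and the entities chosen are purely illustrative.

```python
import gensim.downloader as api

# A small sketch of treating pre-trained GloVe vectors as entity embeddings,
# assuming the "glove-wiki-gigaword-100" package from gensim's downloader
# (downloads roughly 128 MB on first use). Many named entities appear in its
# vocabulary and can be looked up directly as vectors.
glove = api.load("glove-wiki-gigaword-100")

print(glove.similarity("paris", "france"))     # related entities score high
print(glove.similarity("paris", "banana"))     # unrelated terms score low
print(glove.most_similar("google", topn=3))    # nearest neighbours of an entity
```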

Example-3 (Sentiment Analysis to Text Classification): In this example, the LLM is first pre-trained on a large corpus of text data using language modeling. Then, it is fine-tuned on a sentiment analysis task, where the LLM learns to classify text as positive, negative, or neutral. After this, the LLM is fine-tuned on a text classification task, where the LLM learns to classify text into different categories (such as news articles or product reviews). The knowledge learned from the sentiment analysis task is transferable to the text classification task, as both tasks involve analyzing the content of the text to make predictions about its meaning.

In practice, the knowledge learned from sentiment analysis can be transferred to text classification in a few ways, e.g.,

3.1 Via Feature Extraction, where the LLM uses the features learned during sentiment analysis to represent the input text during text classification (a short sketch follows this list).

3.2 Via Fine-Tuning, where the LLM can fine-tune its pre-trained model on a text classification dataset after training on a sentiment analysis dataset.

3.3 Via Embeddings, where the LLM uses the embeddings learned during sentiment analysis to represent the input text during text classification.
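
Mechanism 3.1 above can be sketched as follows: a sentiment-fine-tuned encoder (the Hugging Face checkpoint “distilbert-base-uncased-finetuned-sst-2-english” is assumed purely as an example) supplies fixed sentence features, and a simple logistic-regression classifier is trained on top of them for a new, made-up category task.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

# A minimal sketch of mechanism 3.1, assuming a sentiment-tuned encoder as the
# feature extractor. Its hidden states, learned for sentiment analysis, are
# reused as fixed features for a separate, toy topic-classification task.
name = "distilbert-base-uncased-finetuned-sst-2-english"   # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

def embed(text: str):
    """Return the first-token hidden state as a fixed-length feature vector."""
    with torch.no_grad():
        out = encoder(**tokenizer(text, return_tensors="pt"))
    return out.last_hidden_state[:, 0, :].squeeze().numpy()

texts = ["The phone's battery lasts two days.", "The striker scored twice last night."]
labels = ["product_review", "sports_news"]                 # toy category labels

clf = LogisticRegression().fit([embed(t) for t in texts], labels)
print(clf.predict([embed("This laptop charges very quickly.")]))
```

Keeping the encoder frozen, as here, is the cheapest form of transfer; fine-tuning it as well (mechanism 3.2) usually costs more compute but yields higher accuracy.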

Sentiment embeddings are a type of embedding that represents the sentiment or emotional content of text as vectors. These embeddings can be learned during the training process or can be pre-trained on a large corpus of text data. Here are a few models and tools commonly associated with sentiment embeddings:

  • Sent2Vec: Sent2Vec is a sentence-embedding model that combines bag-of-words ideas with neural-network training; its sentence vectors capture enough of a text’s tone to be useful for sentiment tasks.
  • VADER: VADER is a pre-trained sentiment analyzer built on a lexicon plus a set of rules and heuristics. It produces sentiment scores for text rather than learned embeddings, but those scores can be used as sentiment features alongside embeddings (see the sketch after this list).
  • FastText: FastText is a pre-trained embedding model trained on a large corpus of text data with a skip-gram objective extended with subword information; its word vectors can be pooled or fine-tuned into features for positive-versus-negative classification.
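
Since VADER comes up above, here is a tiny usage sketch with the vaderSentiment package; the scores it returns are lexicon- and rule-based rather than learned embeddings, but they can serve as extra sentiment features for a downstream classifier.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# VADER assigns lexicon- and rule-based sentiment scores; no training is needed.
analyzer = SentimentIntensityAnalyzer()

for text in ["What a charm!", "It is so miserable."]:
    scores = analyzer.polarity_scores(text)   # keys: neg, neu, pos, compound
    print(text, "->", scores)
```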

In natural language processing, one common way of representing words as vectors is through word embeddings. These are dense, low-dimensional vectors learned through a neural network-based model, and they capture the semantic and syntactic relationships among words. Here are two hypothetical examples of word embeddings in the phrases “it is so miserable” and “what a charm,” respectively:

  • “It is so miserable”: In this example, the word “miserable” might have a vector representation that is close to vectors for other negative emotions, such as “sad” and “depressed.” The phrase as a whole might have a vector representation that is relatively far from vectors for positive emotions, such as “happy” and “excited.”
  • “What a charm”: In this example, the word “charm” might have a vector representation that is close to vectors for other positive qualities, such as “beauty” and “elegance.” The phrase as a whole might have a vector representation that is relatively far from vectors for negative qualities, such as “ugliness” and “awkwardness.”

These vector representations (i.e., the embeddings) can be used as inputs to machine learning models for various natural language processing tasks. E.g., in sentiment analysis, the vectors for the words in a sentence can be combined to produce a vector representation for the entire sentence, which can then be used to predict the sentiment of the sentence. In text classification, the vectors for the words in a document can be combined to produce a vector representation of the document, which can then be used to classify the document into one or more categories.
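
To tie the two hypothetical phrases above to actual numbers, the toy sketch below invents 3-dimensional vectors (real embeddings have hundreds of dimensions) and shows both the word-level similarities and the simple averaging step that turns word vectors into a sentence-level vector.

```python
import numpy as np

# Made-up 3-dimensional vectors for the hypothetical phrases above; real
# embeddings have hundreds of dimensions, but the geometry works the same way.
emb = {
    "miserable": np.array([-0.9, 0.1, 0.0]),
    "sad":       np.array([-0.8, 0.2, 0.1]),
    "charm":     np.array([ 0.9, 0.1, 0.0]),
    "beauty":    np.array([ 0.8, 0.2, 0.1]),
}

def cosine(a, b):
    """Cosine similarity: near 1 for similar directions, negative for opposites."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["miserable"], emb["sad"]))    # high: both negative in tone
print(cosine(emb["charm"], emb["beauty"]))     # high: both positive in tone
print(cosine(emb["miserable"], emb["charm"]))  # negative: opposite sentiment

# Averaging word vectors gives a simple sentence-level vector that a sentiment
# or topic classifier can take as input.
sentence_vec = np.mean([emb["miserable"], emb["sad"]], axis=0)
print(sentence_vec)
```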

Overall, the above-mentioned techniques are powerful tools for natural language processing and have a wide range of applications, including information extraction, sentiment analysis, and text classification. They allow the LLM to learn more efficiently and effectively and to achieve better performance on many language tasks.

Given that these techniques had already been deployed here and there before the recent quantum leap in AI model performance, why has incorporating them resulted in such a large step-improvement across a wide range of language tasks this time? The short answer is: we don’t know.

Is it crazy to move forward if we don’t fully understand how these techniques achieve what they achieve? Well, to be fair, do we fully understand why taking “Mathematics-101” can improve our examination score in “Physics-101” (or “Physics-201”), in terms of how the electrical pathways through the neural networks of our brains were altered by taking “Mathematics-101”? Or what the 3D neuron-firing sequence map looks like when we face a “Physics-101” (or “Physics-201”) question after having taken “Mathematics-101”? Ha ha, you see the point?

However, one thing is clear: further research and experiments are needed before any serious and meaningful discussion can be conducted.

Allow me to take this opportunity to thank you for being here! I would not be able to do what I do without people like you who follow along and take that leap of faith to read my postings.

If you like my content, please feel free to press the “Follow” button at the upper-right corner of your screen (below my photo). Once you have pressed the button, Medium will update you on a real-time basis. I can also be contacted on LinkedIn, Facebook, or Twitter.
