Personalized text generation - Part I

Sowmiya Jaganathan
6 min read · Sep 10, 2023


LLMs have made search better by understanding context and generating responses or summaries based on user queries. Fine-tuning LLMs for personalized text generation, in particular, holds tremendous potential across use cases like storytelling with personalized narratives, engaging interactions through dialogue agents, generating personalized notes, and much more.

Let’s see an example to understand the problem statement:

This is the original text by the author:

Hey there! So, I stumbled upon this amazing taco joint the other day. 
Let me tell you, those tacos were so good, they practically did the salsa
dance in my mouth!

Generic Text without Personalization:

I found a great taco restaurant the other day. The tacos were delicious, 
and I recommend trying them if you're in the area.

This text is more generic and formal; it does not capture the author’s writing style.

Text Generated with Personalization:

"Hey, folks! So, I discovered this fantastic taco spot recently, and 
let me tell you, those tacos were so scrumptious, that they practically
tangoed on my taste buds!"

Here the generated text maintains the author’s informal and humorous style, using similar expressions and tone.

The new paper “Teach LLMs to Personalize” proposes a multi-stage framework to achieve this: the outputs of a retriever, a ranker, a summarization step, and a synthesis step are used as inputs to fine-tune a model that generates personalized text.

Overview:

To generate personalized text, we will use the user’s past documents to understand their unique writing style, phrases, and key elements.

We collect these documents using a retrieval and ranking system. To get the relevant documents/candidates, the query is the immediate context, i.e., the title and a short beginning of the current document, where the current document represents the user’s ongoing writing activity. Once we have the ranked entries, we extract the important information with two techniques: summarization and synthesis.

Let’s understand each stage in detail.

Retrieval:

Retrieval is used to fetch similar documents written by the author/user in the past. Two strategies were experimented with: a sparse retriever (BM25) and a dense retriever (embedding similarity), applied at either the document level or the snippet level.

Ranking:

Once we have the relevant candidates, we rank them:

  • Sparse Retriever — Document Level: the documents are ranked by their BM25 scores against the immediate context.
  • Dense Retriever — Document Level: the documents are ranked by embedding-similarity scores with the immediate context.
  • Dense Retriever — Snippet Level: results retrieved at the snippet level may be relevant, but each snippet is truncated to 250 characters and so does not give the full picture. To recover meaningful information, the documents containing the top snippets are re-ranked. This improves the diversity of the retrieved documents and provides more relevant information for generation.

Finally, the ranked entries are concatenated into a single string and truncated to 2,500 characters, so every downstream module receives ranked results of the same bounded length.
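
To make the document-level sparse variant concrete, here is a minimal sketch of BM25 ranking followed by the concatenate-and-truncate step. It assumes the rank_bm25 package; the 2,500-character cap is the value mentioned above.

from rank_bm25 import BM25Okapi

def rank_and_truncate(immediate_context: str, past_docs: list[str],
                      max_chars: int = 2500) -> str:
    # Score every past document of the user against the immediate context.
    bm25 = BM25Okapi([doc.lower().split() for doc in past_docs])
    scores = bm25.get_scores(immediate_context.lower().split())
    # Sort the documents by BM25 score, best first.
    ranked = [doc for _, doc in sorted(zip(scores, past_docs),
                                       key=lambda pair: pair[0],
                                       reverse=True)]
    # Concatenate the ranked entries and truncate to a fixed length.
    return " ".join(ranked)[:max_chars]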

Recall:

Now that we have the relevant past documents, we want to capture the high-level aspects, useful phrases, key elements, and topics from the user’s past documents to better understand the user’s writing style and preferences.

To get this information, we further process the retrieved candidates to obtain a summarized context and a set of keywords.

Summarization:

For summarization, two strategies were experimented with: context-independent and context-dependent summarization.

Context-independent summarization — fine-tune the model on a public summarization dataset.

Context-dependent summarization — considers the immediate context and the relevant documents to create weak labels for fine-tuning the model. To generate a weak label, we extract snippets from the ranked entries using the immediate context and then concatenate all the snippets into a single label.
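
As a rough sketch of the weak-label construction (the embed function below stands in for any text-embedding model and is an assumption, as is reusing the 250-character snippet length from retrieval):

import numpy as np

def weak_summary_label(immediate_context, ranked_entries, embed,
                       snippet_chars=250):
    ctx_vec = embed(immediate_context)
    snippets = []
    for entry in ranked_entries:
        # Split each ranked entry into fixed-size character snippets.
        chunks = [entry[i:i + snippet_chars]
                  for i in range(0, len(entry), snippet_chars)]
        # Keep the snippet most similar to the immediate context.
        sims = [float(np.dot(ctx_vec, embed(chunk))) for chunk in chunks]
        snippets.append(chunks[int(np.argmax(sims))])
    # The concatenated snippets form the weak label for fine-tuning.
    return " ".join(snippets)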

(Figure from the paper “Teach LLMs to Personalize”.)

How does context-dependent summarization help?

It helps the model focus on the important context information that is likely to be used in the current document. This makes the generated summaries more precise and relevant compared to general summarization.

Finally, we fine-tune the model on the immediate context and relevant documents, with the joined snippets as the target.

Synthesis:

What is the Synthesis step?

The synthesis step helps by identifying common key elements from the top retrieved entries. This allows for a comprehensive understanding of the current writing task. By extracting keywords as a method for synthesis, it becomes easier to identify important information and generate personalized text based on that understanding.

Similar to summarization, two strategies were experimented with: context-independent and context-dependent synthesis.

Context-independent — from the ranked entries, extract the frequent terms (unigrams) and sort them in descending order of frequency; remove stopwords, words with a frequency of one, and words with a small inverse document frequency.
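
A rough sketch of this counting-and-filtering step (the stopword list and the IDF threshold below are illustrative assumptions):

import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def synthesis_keywords(ranked_entries, min_idf=1.0):
    docs = [entry.lower().split() for entry in ranked_entries]
    counts = Counter(word for doc in docs for word in doc)
    n_docs = len(docs)
    def idf(word):
        # Words that appear in nearly every document get a small IDF.
        return math.log(n_docs / sum(1 for doc in docs if word in doc))
    # Keep non-stopwords that occur more than once and are not too
    # common, sorted by frequency in descending order.
    return sorted((w for w, c in counts.items()
                   if w not in STOPWORDS and c > 1 and idf(w) >= min_idf),
                  key=lambda w: counts[w], reverse=True)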

Context-dependent — identify the important keywords by calculating the similarity between the words in the current document and the words in the ranked entries. Two words are considered similar if they satisfy one of the following conditions (a sketch follows the list):

  • the words are identical,
  • the words are close in embedding space, or
  • the words are synonyms in WordNet.
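
A hedged sketch of this matching rule (embed and the 0.8 cosine threshold are assumptions; WordNet here comes from NLTK):

import numpy as np
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def words_match(w1, w2, embed, sim_threshold=0.8):
    if w1 == w2:                  # condition 1: identical words
        return True
    v1, v2 = embed(w1), embed(w2)
    cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    if cosine >= sim_threshold:   # condition 2: close in embedding space
        return True
    # Condition 3: the words share at least one WordNet synset.
    return bool(set(wn.synsets(w1)) & set(wn.synsets(w2)))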

Once we have the keyword list, the words are sorted by frequency and then by IDF.

In the end, the list of words is concatenated for use in training examples.

How does context-dependent synthesis help?

Here, the keywords are extracted according to the given context (the current document), so they can be used to generate the current document instead of blindly passing along all the keywords.

Finally, we fine-tune the model on the immediate context and relevant documents, with the joined target word list as the label.

Personalized Generation:

To train the personalized generation model, the immediate context, the ranked entries from past documents, the summary, and the synthesized keywords are used as input. These multiple inputs help the model shape the output, and the ground-truth current document is used as the label (target output).

To differentiate between the different information sources, a prefix is prepended to each input: “passage start” for the immediate context, “summary” for the summary, “important words” for the synthesis output, and “past passages” for the ranked results.
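
Putting it together, the model input could be assembled like this (the prefix strings follow the description above; the field order and separators are assumptions):

def build_generation_input(immediate_context, summary, keywords,
                           ranked_passages):
    # Prepend a source-identifying prefix to each piece of information.
    return " ".join([
        "passage start:", immediate_context,
        "summary:", summary,
        "important words:", " ".join(keywords),
        "past passages:", ranked_passages,
    ])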

Multitask Learning:

As mentioned in the paper, writing and reading skills are highly correlated. In addition to the personalized text generation task, a reading comprehension task was also added to enhance the model’s ability to comprehend an author’s style.

In this reading comprehension task, the model is presented with a pair of documents and is tasked with determining whether they were written by the same author or not.

The training process includes both positive and negative examples so the model can better understand the task.

To differentiate between the tasks, the instruction “Finish the passage in the user’s voice” is used for the personalized generation task, while the instruction for the author distinction task is “Predict whether two passages are from the same author.”
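
A minimal sketch of how the two tasks might be serialized into training examples, using the instructions quoted above (the example format itself is an assumption):

def generation_example(model_input, current_document):
    return {
        "input": "Finish the passage in the user's voice. " + model_input,
        "target": current_document,  # ground-truth current document
    }

def author_distinction_example(passage_a, passage_b, same_author):
    return {
        "input": ("Predict whether two passages are from the same author. "
                  "Passage 1: " + passage_a + " Passage 2: " + passage_b),
        "target": "same" if same_author else "different",
    }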

In the next part, we’ll look at the model used for training and the experiment details. Stay tuned!

References:

“Teach LLMs to Personalize — An Approach inspired by Writing Education” (paper)
