How to Summarize Long Texts Using OpenAI: Improving Coherence and Structure

Tanguyvans
3 min read · Aug 8, 2023


Summarizing lengthy texts has become a straightforward task with the OpenAI API. However, a challenge arises when the text exceeds the maximum number of tokens allowed. In this article, we will explore two methods to address this issue and improve the coherence and structure of the generated summary.

But first, let’s set up access to OpenAI. We need to import the openai library and to have created an OpenAI API key.

import openai

API_KEY = "INSERT_YOUR_API_KEY"
openai.api_key = API_KEY

The next thing to do is to create a reading function to load our text and a saving function for saving the summarized text:

def readfile(filename):
    with open(filename, 'r') as f:
        text = f.read()
    return text

def savefile(filename, text):
    with open(filename, 'w') as f:
        f.write(text)

On top of that, we also need a function to interact with OpenAI’s model. We call that function “generate”. In it, we only have to define which OpenAI model we want to use and the message we want to send. From the response, we keep just the content of the returned message:

def generate(message):
    model = "gpt-3.5-turbo"
    response = openai.ChatCompletion.create(
        model=model,
        messages=[message]
    )
    return response["choices"][0]["message"]["content"]
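In practice, API calls like this can fail transiently (rate limits, timeouts). As a minimal sketch, a generic retry wrapper with exponential backoff can make the helper more robust; the name with_retries and its parameters are assumptions for illustration, not part of the OpenAI library:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    # Call fn(); if it raises, wait and retry with exponential backoff.
    # On the last attempt, re-raise the exception to the caller.
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

You would then call, for example, with_retries(lambda: generate(message)) instead of generate(message) directly.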

Algorithms

We can now start thinking about the different algorithms. We are going to go through two main solutions:

First method: summarize each part independently.

The first solution is to split the text into multiple chunks, ask the API to summarize each chunk, and then join all the sub-summaries together.

# Get the text
text = readfile('file.txt')

# Define the max length of the chunks
chunk_size = 5000
text_chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

subsummaries = []
for chunk in text_chunks:
    message = {
        "role": "user",
        "content": f"summarize this text: {chunk}"
    }
    subsummary = generate(message)
    subsummaries.append(subsummary)

summary = '\n'.join(subsummaries)

savefile('output.txt', summary)

Now we have a summary of the whole text. However, there are a few drawbacks to using this method.

If you have tried it yourself, you may have noticed that the transitions between the sub-summaries are not very satisfying. A solution would be to ask OpenAI to improve the coherence of the joined text, but that can easily become a tedious extra step.
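One way to sketch that extra coherence pass: wrap the joined sub-summaries in a rewrite prompt and send it through the generate helper defined earlier. The function name build_coherence_message and the prompt wording are assumptions for illustration:

```python
def build_coherence_message(summary):
    # Hypothetical helper: wraps the joined sub-summaries in a prompt
    # asking the model to smooth transitions without adding new facts.
    return {
        "role": "user",
        "content": (
            "Rewrite the following summary so the transitions between "
            "its parts flow naturally. Do not add new information:\n\n"
            + summary
        ),
    }
```

You would then run summary = generate(build_coherence_message(summary)) after joining the sub-summaries, at the cost of one extra API call.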

Another problem is that by summarizing one chunk at a time, we lose the overall context and the key ideas of the text.

Despite those problems, this solution is easy to implement and keeps the number of requests to OpenAI low.

Second method: summarize the text incrementally.

For this second solution, our main goal is to solve the problems encountered with our first solution. We want to have a more coherent and structured summary.

Our solution is to build our summary progressively. Instead of creating multiple sub-summaries and then combining them into one big summary, for each prompt we are going to provide a chunk of text to summarize together with the tail of our summary so far (roughly the last 1,000 tokens). Then we will ask OpenAI to summarize the chunk of text and add it organically to the current summary.

text = readfile('file.txt')

chunk_size = 5000
text_chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

summary = ''
for chunk in text_chunks:
    # The last 3000 characters are a rough proxy for ~1000 tokens
    prompt = f'''
You are currently writing the summary of a text.
Here you have the last 1000 tokens of your summary: {summary[-3000:]}
Summarize this chunk so it can be added to your summary: {chunk}
'''
    message = {
        "role": "user",
        "content": prompt,
    }
    summary += generate(message)

savefile('output.txt', summary)

This approach offers a more cohesive and coherent summary as the text is summarized incrementally with a contextual reference to the existing summary. By providing relevant context, we improve the flow of the final summary, making it easier to read and understand.

Conclusion

When summarizing long texts using OpenAI, employing an incremental approach (Method 2) provides better results, ensuring coherence and preserving key ideas. By integrating contextual information, the summary becomes more comprehensive and easier to read. While there may still be room for improvement, this method offers a significant enhancement over the traditional chunk-wise approach (Method 1). Remember to experiment with different chunk sizes and prompt formats to find the optimal results for your specific use case. Happy summarizing!
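As one example of experimenting with chunking: the snippets above split by raw character offsets, which can cut a word in half at a chunk boundary. A minimal sketch of splitting on word boundaries instead, assuming the very rough rule of thumb that ~750 English words is on the order of 1,000 tokens:

```python
def chunk_by_words(text, max_words=750):
    # Split on whitespace and group up to max_words words per chunk,
    # so no word is ever cut in half at a chunk boundary.
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]
```

This drops in as a replacement for the character-based list comprehension; for exact token counts you would need a real tokenizer rather than this approximation.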
