Multi-threading in Python while using a deep learning model to summarize article content
What’s fast, faster, and fastest?
Let’s do a simple experiment. We want to write a program that takes five articles and summarizes them. There are many, many ways to do it.
For demonstration, I created a list called articles containing five long strings: I simply copy-pasted the content of the five articles below into triple-quoted (''' ''') strings.
- https://engineering.atspotify.com/2023/04/humans-machines-a-look-behind-spotifys-algotorial-playlists/
- https://netflixtechblog.com/improved-alerting-with-atlas-streaming-eval-e691c60dc61e
- https://engineering.razorpay.com/razorpay-cost-optimization-journey-part-1-the-spot-instance-road-cb8d312c9710
- https://www.confluent.io/blog/understanding-and-optimizing-your-kafka-costs-part-2-development-and-operations/
- https://www.reddit.com/r/RedditEng/comments/12xph52/development_environments_at_reddit/
All these articles are from my latest TechFlix curation at the time of writing this post.
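For reference, here is a minimal sketch of what that list looks like; the actual article text (omitted here) goes inside each triple-quoted string.
# A plain Python list of five long, triple-quoted strings
# (the real article text is pasted in place of the placeholders)
articles = [
    '''...full text of the Spotify article...''',
    '''...full text of the Netflix article...''',
    '''...full text of the Razorpay article...''',
    '''...full text of the Confluent article...''',
    '''...full text of the Reddit article...''',
]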
To summarize these articles, we can use a simple pre-trained model from Hugging Face. I went with facebook/bart-large-cnn, a BART model fine-tuned on the CNN/DailyMail dataset, which makes it well suited for summarizing articles. Here’s some simple code we can write. I am using a Jupyter Notebook to run these code blocks.
I am assuming you have already installed the necessary libraries like torch, transformers, etc.
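If not, a quick notebook cell like the one below covers the basics (adjust to your environment, e.g. a CUDA-enabled torch build):
# Install the core dependencies from inside the notebook
!pip install torch transformers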
(Note that articles are already defined separately)
# Loading the summarizer model
from transformers import pipeline
device = 0
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=f"cuda:{device}")
In the above code, we just loaded the required model and set it up to use the available GPU (in my case, an RTX 3080 with 8 GB of memory). Without a GPU, it can take an insane amount of time to execute.
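If you are not sure whether a GPU is present, here is a small variation of the setup, just a sketch, that falls back to the CPU instead of failing:
import torch
from transformers import pipeline

# Use the first GPU if one is available, otherwise fall back to the CPU (device=-1)
device = 0 if torch.cuda.is_available() else -1
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=device)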
Next, let us write a code block that measures the execution time of summarizing the articles using a simple for loop; a module like timeit gives us accurate timings. In the code below, we split each long string into chunks to stay within BART’s input limit (1024 tokens), summarize each chunk, and then join the chunk summaries into a complete summary.
import timeit

# Define the chunk size
chunk_size = 700

# Define the code block to be timed
start_time = timeit.default_timer()

# Iterate over the articles
for item in articles:
    # Divide the item into chunks of size 700
    chunks = [item[i:i+chunk_size] for i in range(0, len(item), chunk_size)]
    # Initialize an empty list to store the summaries for each chunk
    summaries = []
    # Iterate over the chunks and pass each chunk to the summarizer
    for chunk in chunks:
        summary = summarizer(chunk, max_length=48, min_length=8, do_sample=False)[0]['summary_text']
        summaries.append(summary)
    # Print the final summary for the item
    final_summary = ' '.join(summaries)
    print(final_summary)

end_time = timeit.default_timer()
print(f"Execution time: {end_time - start_time:.2f} seconds")
Here’s the final output I got.
Execution time: 63.70 seconds
I ran it multiple times to confirm that the execution time stayed within the same range.
Let’s see how we can rewrite the same functionality using concurrent.futures and a semaphore. Together they give us a form of parallel processing: the thread pool runs summarization tasks concurrently, while the semaphore ensures that only a fixed number of threads access the GPU memory simultaneously, preventing out-of-memory errors. Note that we can parallelize here because summarizing each chunk is independent of summarizing any other chunk.
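To make the semaphore part concrete, here is a tiny standalone sketch, unrelated to the summarizer (the task function and the numbers are made up), showing a BoundedSemaphore capping how many threads do work at once:
import threading
import concurrent.futures
import time

# Allow at most 2 threads inside the critical section at any moment
semaphore = threading.BoundedSemaphore(2)

def task(i):
    with semaphore:       # blocks if 2 tasks are already running
        time.sleep(1)     # stand-in for a GPU-bound call
        return f"task {i} done"

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(task, i) for i in range(8)]
    results = [f.result() for f in futures]

# Takes ~4 seconds instead of ~1, because only 2 tasks overlap at a time
print(results)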
import timeit
import concurrent.futures
import threading

# Define the chunk size
chunk_size = 700

# Define the code block to be timed
start_time = timeit.default_timer()

# Create a semaphore to limit the number of parallel tasks
semaphore = threading.BoundedSemaphore(5)

# Define a function to summarize an article chunk
def summarize_chunk(chunk):
    with semaphore:
        return summarizer(chunk, max_length=48, min_length=8, do_sample=False)[0]['summary_text']

# Iterate over the articles
for item in articles:
    # Divide the item into chunks of size 700
    chunks = [item[i:i+chunk_size] for i in range(0, len(item), chunk_size)]
    # Initialize a thread pool executor
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Submit each chunk for summarization
        future_summaries = [executor.submit(summarize_chunk, chunk) for chunk in chunks]
        # Retrieve the results in submission order so the summary keeps the chunk order
        summaries = [future.result() for future in future_summaries]
    # Print the final summary for the item
    final_summary = ' '.join(summaries)
    print(final_summary)

end_time = timeit.default_timer()
print(f"Execution time: {end_time - start_time:.2f} seconds")
Here’s the final output I got for the above code. Again, I ran it multiple times to confirm that the execution time stayed within the same range.
Execution time: 47.89 seconds
We can see that the threading approach with a semaphore takes roughly 25% less time than the plain for loop on the same GPU. At a larger scale, that could translate to hours of difference.
This is a raw and simple example of how you can use concurrent.futures with a deep learning model on a GPU. There are likely more optimizations you can make to improve efficiency further.
A few more approaches to explore further:
- Parallelize at a higher level, e.g. across whole articles instead of chunks (see the sketch after this list).
- When you have large amounts of data, experiment with the number of threads, the semaphore limit, and the chunk size to find the parameters that maximize efficiency without running into errors.
- Try using the multiprocessing module instead of concurrent.futures.
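As a rough, untested sketch of the first idea (the summarize_article helper and the semaphore limit of 2 are my own illustrative choices, not taken from the code above), article-level parallelization could look something like this:
import concurrent.futures
import threading

chunk_size = 700
semaphore = threading.BoundedSemaphore(2)  # still cap concurrent GPU calls

def summarize_article(item):
    # Chunk one article, summarize its chunks, and join them in order
    chunks = [item[i:i+chunk_size] for i in range(0, len(item), chunk_size)]
    summaries = []
    for chunk in chunks:
        with semaphore:
            summaries.append(summarizer(chunk, max_length=48, min_length=8, do_sample=False)[0]['summary_text'])
    return ' '.join(summaries)

# One thread per article; chunks inside an article run sequentially
with concurrent.futures.ThreadPoolExecutor(max_workers=len(articles)) as executor:
    for final_summary in executor.map(summarize_article, articles):
        print(final_summary)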
I’ll save a deeper exploration of these for a future article! #StayTuned