Multi-Threading in Python: A Fun Experiment

Exploring efficient code

XQ
Published in The Research Nest
5 min read · Apr 16



Let’s create a fun real-world problem statement to explore.

You have a list of URLs, and you want to download the contents of all these URLs using Python.

Normally, you could use the requests library to download content from a URL. You can write a pretty straightforward for loop that iterates over the list of URLs and downloads each one.

The question: Can we implement multithreading to make this process faster? If yes, how and how fast?

Experiment Design:

  1. Let’s create a list of 20 URLs. We can use any websites or web pages we like for this experiment. I personally went with anime pages in MyAnimeList.
  2. Write a Python function to download the contents of a single URL. We can use the requests library for this purpose. The function should take a URL as input and return the contents of that URL as a string.
  3. Write functions for different methods to download data from all URLs.
  4. Measure the time each function takes to download the contents of all URLs. We can use the time module for this.
  5. Compare them. Analyze the results and conclude the efficiency of multithreading in Python.

Here’s some simple code to conduct this experiment.

import requests
import threading

# Create an array of URLs (each number at the end refers to an anime page)
urls = ['https://myanimelist.net/anime/{}'.format(i) for i in range(1, 21)]

# Create the helper functions
def download_url(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        return None

# Normal method: loop through all URLs
def download_all_urls(urls):
    contents = []
    for url in urls:
        content = download_url(url)
        contents.append(content)
    return contents

# Multithreading method
def download_all_urls_multithreading(urls):
    contents = []
    threads = []

    # Define a function to download the contents of a URL in a thread
    def download_thread(url):
        content = download_url(url)
        contents.append(content)

    # Create a thread for each URL and start it
    for url in urls:
        thread = threading.Thread(target=download_thread, args=(url,))
        threads.append(thread)
        thread.start()

    # Wait for all threads to finish
    for thread in threads:
        thread.join()

    return contents

Let’s try to understand the multithreading method a bit.

A thread is a unit of a process that can run independently and concurrently with other threads. In other words, a process can have multiple threads, each of which can perform its own set of operations independently of the other threads.

In the above code, the download_all_urls_multithreading() function creates a separate thread for each URL using the Thread class from the threading module. These threads then run concurrently and execute the download_url() function to download the contents of each URL simultaneously.

In Python, threads run concurrently through a mechanism called “time-slicing.” This technique allows the operating system to switch rapidly between threads, giving the illusion that they run in parallel; it is sometimes called “multitasking” or “context switching.” Note that in CPython, the Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time, so threads interleave rather than truly run in parallel on multiple cores.

When a program starts a new thread, the operating system assigns a small amount of processing time to each thread, in turn, allowing each thread to execute a small portion of its code. This happens repeatedly, and each thread gets a “slice” of processing time. If there are more threads than processing time available, the operating system will switch between threads more frequently, giving each thread a smaller slice of processing time.

This mechanism of time-slicing makes it possible for threads to run concurrently on a single processor. When one thread is waiting for an input/output operation, such as downloading the contents of a URL, the processor can switch to another thread, allowing it to execute its code. Once the input/output operation is complete, the processor can switch back to the original thread and continue executing its code.
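To see this I/O overlap in isolation, here’s a small sketch that simulates downloads with `time.sleep` instead of real network calls. The `fake_download` helper is a hypothetical stand-in, not part of the experiment code:

```python
import threading
import time

def fake_download(url, results):
    """Stand-in for download_url(): sleeps to simulate network latency."""
    time.sleep(0.2)
    results.append(url)

urls = ['url-{}'.format(i) for i in range(5)]
results = []

start = time.time()
threads = [threading.Thread(target=fake_download, args=(u, results)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start

# The five 0.2 s sleeps overlap, so the total is close to 0.2 s, not 1 s
print('Simulated {} downloads in {:.2f} s'.format(len(results), elapsed))
```

Because every thread spends its time waiting rather than computing, the scheduler can overlap all five waits, and the total elapsed time stays close to a single download’s latency instead of the sum of all five.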

With this understanding, let’s see if the multithreading approach actually gives the desired output.

Here’s the code to measure the time taken to execute both functions and plot the graph of the same.

Do note that sending many requests to a website at once can get you flagged, ultimately blocking you from accessing the data. So keep the number of requests small, or add a time delay between them if you want to request many more URLs.
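One way to stay polite is to cap how many requests are in flight at once. Here’s a minimal sketch using `threading.BoundedSemaphore`; the `MAX_CONCURRENT` value and the `polite_download` helper are illustrative assumptions, and the sleep stands in for the actual `requests.get` call:

```python
import threading
import time

MAX_CONCURRENT = 5  # assumed cap; tune it for the site you are querying
semaphore = threading.BoundedSemaphore(MAX_CONCURRENT)

def polite_download(url, contents):
    """Download while holding a semaphore slot, so at most
    MAX_CONCURRENT requests run at any given moment."""
    with semaphore:
        time.sleep(0.05)  # placeholder for requests.get(url)
        contents.append(url)

urls = ['url-{}'.format(i) for i in range(20)]
contents = []
threads = [threading.Thread(target=polite_download, args=(u, contents)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()

print('Downloaded {} URLs'.format(len(contents)))
```

The semaphore lets the 20 threads proceed in batches of at most five, which spreads the load on the server while still keeping the downloads concurrent.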

import time
import matplotlib.pyplot as plt

# Measure the time taken by the normal method function
start_time = time.time()
contents_normal = download_all_urls(urls)
end_time = time.time()
time_normal = end_time - start_time

# Print the time taken by the normal function
print('Time taken by normal method: {:.2f} seconds'.format(time_normal))

# Measure the time taken by the multithreading function
start_time = time.time()
contents_multithreading = download_all_urls_multithreading(urls)
end_time = time.time()
time_multithreading = end_time - start_time

# Print the time taken by the multithreading function
print('Time taken by multithreading method: {:.2f} seconds'.format(time_multithreading))

# Set plot size
plt.figure(figsize=(8, 6))

# Create a bar chart
labels = ['Normal Method', 'Multithreading Method']
times = [time_normal, time_multithreading]
plt.bar(labels, times)

# Add annotations to the chart
plt.title('Time Taken to Download Contents of All URLs')
plt.xlabel('Method')
plt.ylabel('Time (Seconds)')
plt.ylim(top=max(times) + 5)
for i, v in enumerate(times):
    plt.text(i, v + 1, '{:.2f} s'.format(v), ha='center', va='bottom')

# Display the chart
plt.show()

The multithreading method was ~8X faster!

To summarize: whenever you have a set of independent tasks, especially ones involving downloads or other I/O operations that can run concurrently, you can use the threading approach described above.

At the same time, note that multithreading does not always give the best results. The benefit depends on the context: what you are trying to do, the size of the data and number of iterations you are handling, the nature of your tasks, and how you implement your threads. Hence, it is important to first verify that you are actually getting the speedup you want over the more direct method.
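For instance, threads buy you little for CPU-bound work, because the GIL lets only one thread execute Python bytecode at a time. A quick sketch (the `count_down` helper is just an illustrative busy loop) compares running a computation sequentially versus split across two threads:

```python
import threading
import time

def count_down(n):
    """A purely CPU-bound busy loop."""
    while n > 0:
        n -= 1

N = 5_000_000

# Sequential: the two chunks run one after the other
start = time.time()
count_down(N)
count_down(N)
sequential = time.time() - start

# Threaded: the same two chunks in two threads
start = time.time()
t1 = threading.Thread(target=count_down, args=(N,))
t2 = threading.Thread(target=count_down, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
threaded = time.time() - start

# On standard CPython, the threaded version is typically no faster
# than the sequential one, since the GIL serializes the bytecode
print('sequential: {:.2f} s, threaded: {:.2f} s'.format(sequential, threaded))
```

Unlike the download experiment, there is no waiting to overlap here, so the context switching adds overhead without adding any parallelism.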

Also, note that other methods and optimizations can achieve a similar parallel-processing effect. One example is the ThreadPoolExecutor provided by the concurrent.futures module. This approach is generally more efficient than creating one thread per task, as it reuses a fixed pool of worker threads and limits how many threads the program has to create and manage.

I performed the same experiment as above by defining another function using a thread pool.

# ThreadPool Executor method
import concurrent.futures

def download_all_urls_threadpool(urls):
    contents = []

    # Define a function to download the contents of a URL using the thread pool
    def download_thread(url):
        content = download_url(url)
        return content

    # Use a thread pool to download the contents of all URLs
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(download_thread, url) for url in urls]

        # Wait for all futures to complete
        for future in concurrent.futures.as_completed(futures):
            content = future.result()
            contents.append(content)
    return contents
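As a side note, `ThreadPoolExecutor.map` is a more compact alternative to `submit`/`as_completed`, and it returns results in input order (whereas `as_completed` yields them in completion order). A minimal sketch with a hypothetical `fake_download` stand-in, and the pool size capped explicitly via `max_workers`:

```python
import concurrent.futures

def fake_download(url):
    """Stand-in for download_url() so the sketch runs without a network."""
    return 'contents of ' + url

urls = ['url-{}'.format(i) for i in range(1, 6)]

# executor.map preserves the order of the input URLs in its results
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    contents = list(executor.map(fake_download, urls))

print(contents[0])  # -> 'contents of url-1'
```

Order preservation matters if you later need to match each downloaded page back to its URL without tracking futures yourself.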

I computed the execution times for all three approaches in the same way.

In my runs, the thread pool method was a little faster than the plain threading method.

What if I told you that there’s another method that could help us gain even more speed?

#StayTuned for a future article to explore further!
