Leveraging Python — Part I : The concurrent way

Aamir Syed
Published in Analytics Vidhya
5 min read · Jul 21, 2020

Python is the top language of choice for tasks relating to Machine Learning or Data Science in general. But beneath the top tier user friendly and intuitive design it boasts, there lies an ugly facet — execution speed.

Concurrency — and no, it is not the same as parallelism

Python is slow.

Python is a high-level language, sitting much further from the hardware than C. It is also an interpreted language: instructions are executed directly, without first being compiled into machine code. It abstracts details of the computer away from the user (memory management, pointers, etc.). Put simply, it is closer to how humans think than to how machines work.

The trade-off for slow execution becomes clear once we take development time into account. While C executes faster than Python, developing an application or program in C typically takes far longer than writing the same thing in Python.

That said, Python still has interesting paradigms that can be made use of to leverage better execution speeds. In particular — concurrency.

What is Concurrency?

In a nutshell, concurrency means making progress on multiple tasks in overlapping time periods, switching between them as needed. Note that this is different from parallelism, in which multiple tasks literally execute at the same instant on separate cores.

Concurrency vs Parallelism
Concurrency is starting multiple tasks and running/completing them in overlapping time periods

Python is a multi-paradigm programming language. Oddly enough, its reference implementation has a Global Interpreter Lock (GIL) that ensures only one thread executes Python bytecode at any given time. This was done primarily to keep memory management thread-safe and to prevent race conditions.

The GIL may seem to defeat the purpose of concurrency, but we can still gain significant speed-ups with threads during heavy I/O bound operations: network requests, slow website responses, disk reads and writes, and so on. While a thread waits on I/O, the GIL is released and another thread can run. For CPU bound, computation-heavy workloads, threads do not help; that is where true parallelism via multiprocessing comes in.

In this post, we will see how to develop a simple concurrency based program to scrape data from a list of websites. And as a bonus — how to modularize it the object oriented way. After all, Python is an OOP language.

Concurrency in Python

Python has a library dedicated to executing calls asynchronously: concurrent.futures. Asynchronous executions can be performed using ThreadPoolExecutor or ProcessPoolExecutor, for threads and processes respectively. Both are implementations of the abstract Executor class. You can read more about them in the official concurrent.futures documentation.

Let us build a simple program to scrape a link, without using any execution speed up.

We first write a function to download content from a website using the requests module in Python. It can be installed easily via pip:
pip3 install requests

Note that we make use of requests’ Session object, which allows certain parameters to persist across requests. It is useful when several requests are made to the same host, since the same underlying TCP connection is reused. We call get() on a Session instance to fetch data from a particular web link, and then save this data to a file, which is an I/O operation.
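The embedded code does not survive here, so the following is a minimal sketch of such a baseline downloader (the output filename pattern is an illustrative choice):

```python
import requests

def download_site(url, session):
    # The shared Session reuses the underlying TCP connection per host
    with session.get(url) as response:
        return response.content

def save_to_file(content, filename):
    # Writing the downloaded data to disk is the I/O operation mentioned above
    with open(filename, "wb") as f:
        f.write(content)

def download_all_sites(sites):
    # One Session shared across all sequential downloads
    with requests.Session() as session:
        for i, url in enumerate(sites):
            content = download_site(url, session)
            save_to_file(content, f"page_{i}.html")
```

Each link is downloaded one after the other; the program spends most of its time idle, waiting on the network.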

Now that we have the baseline (without concurrency), let us write the same with concurrency. We need to make a few changes to the code.

First, we need the threading and concurrent.futures modules.

  1. When a thread is blocked, for example on a heavy I/O operation, the OS switches to a different thread. Data shared between threads needs to be protected. Since requests.Session() is not thread-safe, we need strategies such as thread-local storage or thread-safe data structures.
  2. Because threads share memory, we need per-thread storage. For this, we make use of threading.local(). It creates an object whose attributes are specific to each individual thread, and it needs to be initialized only once. The object takes care of separating each thread's accesses into its own data.
  3. Each thread needs its own requests.Session() object, so that it can keep persistent parameters (and a reused TCP connection) for the hosts it talks to. Combined with the thread-local storage, this lets our concurrent downloads execute without problems.
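To see point 2 in isolation, here is a small standalone sketch of how threading.local() keeps each thread's data separate, even though all threads touch the same object:

```python
import threading

local_data = threading.local()

def worker(name, results):
    # Each thread sees its own 'value' attribute on the shared local_data object
    local_data.value = name
    results[name] = local_data.value

results = {}
threads = [threading.Thread(target=worker, args=(f"t{i}", results)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Every thread reads back exactly the value it wrote, with no interference
```

The same pattern lets each worker thread stash its own Session without clobbering the others.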

With these in mind, a concurrent version can be written like this:
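A sketch of such a concurrent version, combining ThreadPoolExecutor with a thread-local Session as described above (the worker count of 5 is an illustrative choice):

```python
import concurrent.futures
import threading

import requests

# Per-thread storage: each worker thread gets its own Session,
# since requests.Session is not thread-safe
thread_local = threading.local()

def get_session():
    # Lazily create one Session per thread, initialized only once
    if not hasattr(thread_local, "session"):
        thread_local.session = requests.Session()
    return thread_local.session

def download_site(url):
    session = get_session()
    with session.get(url) as response:
        return response.content

def download_all_sites(sites):
    # Worker threads overlap the network waits of these I/O-bound requests
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        return list(executor.map(download_site, sites))
```

executor.map() dispatches each URL to a worker thread and collects the results in input order.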

Since we deal with I/O operations here, we can see a significant improvement in execution time by making the process concurrent.

Bonus — OOP format of Modularization

Since Python boasts an OOP paradigm as well as a concurrent one, we can use them in tandem to write a simple, clean code module for future use. Cleaner code makes it easier to see how the pieces are grouped and how execution flows; it also helps with reusability and debugging. Always try to clean up your code once the basic functions are working.

We make a class and set its initial attributes. Python class methods receive the instance as their first parameter, conventionally named self.
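One plausible arrangement of the scraper as a class (the class and method names are illustrative, not from the original code):

```python
import concurrent.futures
import threading

import requests

class SiteScraper:
    """Bundles the thread-local Session logic behind a reusable interface."""

    def __init__(self, sites, max_workers=5):
        self.sites = sites
        self.max_workers = max_workers
        self._local = threading.local()

    def _session(self):
        # Lazily create one Session per worker thread
        if not hasattr(self._local, "session"):
            self._local.session = requests.Session()
        return self._local.session

    def _download(self, url):
        with self._session().get(url) as response:
            return response.content

    def run(self):
        # Dispatch every site to the thread pool and gather the results
        with concurrent.futures.ThreadPoolExecutor(self.max_workers) as pool:
            return list(pool.map(self._download, self.sites))
```

Usage would then be a single call, e.g. SiteScraper(list_of_urls).run(), keeping the threading details hidden from the caller.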

But … Why concurrency?

A system has limited resources for computation. If concurrency is not leveraged, the system sits virtually idle during I/O bound operations, waiting on responses instead of doing useful work. It is beneficial to make the most of the available resources, and concurrency is one such method: it squeezes more out of even a single core and maximizes performance in many I/O-heavy cases. In other cases, such as CPU bound workloads, we may need to opt for true parallelism instead, using multiprocessing.

Regardless, in practical terms, a large share of Machine Learning and Data Science programming requires data to be scraped and crawled from multiple sources. This means dealing with the latency of individual sites, operations on databases, querying, etc. All of these operations are I/O bound, meaning system resources can be utilized better with concurrency. Collating data on a massive scale may involve many sources, each adding its own overhead to a program's running time.

Incorporating concurrent programming paradigms into Python code is straightforward and alleviates some of its execution speed problems. Concurrency also ties in naturally with functional and data-driven programming, making it exceptionally useful for Machine Learning and Data Mining based programs.
