Parallel Processing using the Virtru Python SDK

“If you ain’t first, you’re last.” — Ricky Bobby

Chad Sigler
Virtru Technology Blog
4 min read · Jan 21, 2020


Photo by Tim Trad on Unsplash

I have been working with the Virtru Data Protection Platform to solve various real-world data security problems using the Virtru SDK. In my effort to secure all the data, I neglected application scaling and performance. Encryption and decryption take time, because…MATH!!! This is my first post on parallel and concurrent processing with the Virtru SDK, and it focuses on the Virtru Python SDK.

Projects

The projects used in this article are:
* Base Project — Single Threaded
* Resulting Project — Multiprocess

Virtru Python SDK Background

The Virtru Python SDK shares the same base SDK as the Virtru C++ SDK. The base SDK is written in C++, which helps consolidate the codebase. This is fantastic, as the C++ and Python SDKs have the same calls and the same approaches to encryption, decryption, and policy management. It also means that calls into the SDK from Python run into the dreaded GIL (Global Interpreter Lock).

There are 2 approaches that I will explore:
* Multithreading
* Multiprocessing

Rather than repeat how threads and processes work, I am going to reference this article.

Multithreading

Multithreading, in a Python context, means a single Python process runs multiple concurrent threads. In certain areas of Python (IO-bound calls), one thread can wait on IO while another thread performs CPU-bound work. This means a program will probably not scale linearly with every thread added, because the interpreter lets only one thread execute Python bytecode at a time. Let’s look at my example for multithreading the base project.

Code
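What follows is a minimal sketch of the threaded approach rather than the project code itself. It assumes a hypothetical `fake_encrypt` function, a pure-Python busy loop standing in for the SDK's encrypt call, and made-up file names. It is only meant to show the shape of a thread pool working through a list of files.

```python
import time
from concurrent.futures import ThreadPoolExecutor

ROUNDS = 50_000  # arbitrary amount of busy work per "file"


def fake_encrypt(name):
    """Hypothetical stand-in for an SDK encrypt call (NOT real encryption)."""
    checksum = 0
    for i in range(ROUNDS):
        checksum = (checksum * 31 + i + len(name)) % 1_000_003
    return name, checksum


def run_threaded(names, workers):
    """Process every name on a pool of `workers` threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_encrypt, names))


if __name__ == "__main__":
    names = [f"file_{i}.txt" for i in range(40)]  # made-up file names
    for workers in (1, 2, 4, 8):
        start = time.perf_counter()
        run_threaded(names, workers)
        elapsed = time.perf_counter() - start
        print(f"{workers} threads: {elapsed:.2f}s")
```

Because the stand-in work is pure Python, on a standard CPython build the run time should stay roughly flat as threads are added.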

Output
What I found is that even as I increased the number of threads, performance did not improve. Test output for 40 files comparing run time to the number of threads:

Watching the output and computation time, the threads looked like they were executing serially, similar to this image:

Multithreaded Bulk Processing

Multiprocessing

Multiprocessing means running multiple Python processes. It uses more resources, but it is the only way to accomplish parallel processing using the Virtru Python SDK. After some reading, I figured I would want some control over the concurrency of the application. If, for example, I had to encrypt 1 million files, I would probably not be able to create a million Python processes on a single host to encrypt them all concurrently. To ensure I didn’t bring my computer to its knees, I decided to use a multiprocessing Pool. By using a pool, I can declare the maximum number of processes I think my computer can support. Other than putting the proper controls in place to ensure there is no contention on files (no file should be processed more than once), I was ready to go.

Code
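As a rough sketch of the pool-based approach (again, not the project code), the example below caps concurrency with `multiprocessing.Pool`. The `fake_encrypt` function is a hypothetical stand-in for the SDK's encrypt call, the file names are made up, and using `os.cpu_count()` as the upper bound is my assumption about a sensible limit.

```python
import os
import time
from multiprocessing import Pool

ROUNDS = 50_000  # arbitrary amount of busy work per "file"


def fake_encrypt(name):
    """Hypothetical stand-in for an SDK encrypt call (NOT real encryption)."""
    checksum = 0
    for i in range(ROUNDS):
        checksum = (checksum * 31 + i + len(name)) % 1_000_003
    return name, checksum


def run_pool(names, processes):
    """Process every name using at most `processes` worker processes."""
    with Pool(processes=processes) as pool:
        # map() hands each name to exactly one worker and keeps input order.
        return pool.map(fake_encrypt, names)


if __name__ == "__main__":
    names = [f"file_{i}.txt" for i in range(40)]  # made-up file names
    max_procs = os.cpu_count() or 1
    for processes in sorted({1, 2, max_procs}):
        start = time.perf_counter()
        results = run_pool(names, processes)
        elapsed = time.perf_counter() - start
        print(f"{processes} processes: {len(results)} files in {elapsed:.2f}s")
```

Since `pool.map` dispatches each element of the list to a single worker, no file is handled twice. The `if __name__ == "__main__":` guard matters on platforms that start workers by re-importing the main module.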

Output
Bingo! As soon as I started to step up the processes, the performance increased as expected. Test output for 40 files comparing run time to the number of processes:

Watching the output and computation time, the processes looked like they executed in parallel, similar to this image:

Multiprocess Bulk Process

Global Interpreter Lock

The GIL (Global Interpreter Lock) is never invited to the party but tends to show up anyway. As I was chasing my tail trying to encrypt more files faster using threading and async/await, I began to doubt my Python chops. After talking to one of the Virtru engineers, we came to the conclusion that it was indeed the GIL.
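One way to see the GIL in isolation, independent of any SDK, is to run two kinds of work on the same thread pool. In this sketch, `cpu_task` and `blocking_task` are illustrative names of my own: the first is pure-Python arithmetic that holds the GIL, and the second uses `time.sleep`, which releases the GIL much like a blocking IO call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_task(n):
    """Pure-Python arithmetic: the thread holds the GIL while it computes."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def blocking_task(seconds):
    """time.sleep releases the GIL, so other threads can run meanwhile."""
    time.sleep(seconds)
    return seconds


def timed(task, args, workers):
    """Run `task` over `args` on a thread pool and return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(task, args))
    return time.perf_counter() - start


if __name__ == "__main__":
    for workers in (1, 4):
        cpu = timed(cpu_task, [300_000] * 4, workers)
        blocked = timed(blocking_task, [0.2] * 4, workers)
        print(f"{workers} threads: cpu-bound {cpu:.2f}s, blocking {blocked:.2f}s")
```

On a standard CPython build, the blocking tasks should overlap when threads are added, while the CPU-bound time should barely move. That is the same pattern I saw with the threaded encryption runs.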

Conclusion

Little did I know when I started how hard it would be to figure these issues out. I thought I was just wrong all over the place, until I finally put my GoogleFu to use by asking the right question. I came to the same conclusion as this post, but “if you don’t know, now you know”. I also got some additional help from the Virtru engineering team. After I gave them my code to reproduce the issue, what I had thought was my complete lack of understanding of threading, multiprocessing, and programming in general turned into confirmation that my hunches were indeed correct: the GIL was to blame.
