BigQuery fetching + multiprocessing
Does multiprocessing improve the fetching speed of BigQuery API requests?
The BigQuery Storage Read API can be used to fetch data from BigQuery tables. However, at the time of writing, no benchmark has been published that combines this API with multiprocessing to speed up fetching.
In this article, I will share some of the research and benchmarks I did to find "the most performant way to fetch data from BigQuery".
Fetching + multiprocessing
The most common and easiest way to fetch data from BigQuery is to fetch it sequentially, using only one core on the machine/instance.
However, if you are using a computer or a Google Compute Engine (GCE) instance that has multiple cores, you might wonder whether using them to process more data in parallel would help. There is a key concept to understand first:
Fetching will not necessarily be faster with more cores. The time needed to fetch data over the internet depends heavily on the network bandwidth available to your machine.
Thus, having 200 cores on your device does not mean that your fetching will be 200 times faster, but 200 processes can be created and…
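To make the bandwidth point concrete, here is a minimal back-of-the-envelope sketch. It is a simplified model, not a measurement: the function name `estimated_fetch_seconds` and every number in it (10,000 MB of data, 25 MB/s per worker, a 125 MB/s link) are hypothetical values chosen for illustration. It also ignores process start-up cost, API quotas, and server-side limits.

```python
def estimated_fetch_seconds(total_mb, workers, per_worker_mb_s, link_mb_s):
    """Estimate fetch time when total throughput is capped by the network link.

    All inputs are illustrative assumptions, not measured values.
    """
    if workers < 1:
        raise ValueError("workers must be >= 1")
    # Aggregate throughput grows with the number of workers
    # only until the network link is saturated.
    throughput_mb_s = min(workers * per_worker_mb_s, link_mb_s)
    return total_mb / throughput_mb_s


if __name__ == "__main__":
    # Hypothetical scenario: 10,000 MB to fetch, 25 MB/s per worker,
    # and a network link that tops out at 125 MB/s.
    for workers in (1, 4, 16, 200):
        seconds = estimated_fetch_seconds(10_000, workers, 25, 125)
        print(f"{workers:>3} workers: {seconds:6.1f} s")
```

Under these assumptions, going from 1 to 4 workers cuts the estimate from 400 s to 100 s, but 16 workers and 200 workers both land at 80 s, because the link is already saturated at 5 workers.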