Multiprocessing API Calls: A Simple Guide

Akash Gupta
Plumbers Of Data Science
3 min read · Apr 10, 2023

As data volumes grow day by day, processing large datasets quickly is becoming critical. Consider the case of making API calls to a third-party data provider. API calls are essential when we need to extract information from external sources, but when we need to fetch data for a large number of records, making those calls one after another quickly becomes a bottleneck.

Fortunately, Python’s multiprocessing module lets us spread work across multiple processes, so we can make several API calls at the same time instead of waiting for each one to finish.

In this blog, we will discuss how to make API calls concurrently using Python’s multiprocessing module. We will use the requests library to make API calls and the PySpark library to process data.

Step 1: Import Required Libraries

First, we need to import the required libraries. We will be using the following libraries:

  • requests: to make API calls
  • json: to handle JSON responses
  • pyspark: to process data
  • multiprocessing: to make API calls concurrently
  • itertools: to chain the results of API calls
import json
import requests
import pyspark
import multiprocessing as mp
from requests import get as getRequest
from pyspark.sql.functions import col, explode
from pyspark.sql import functions as f
from itertools import chain
from pyspark.sql.types import *
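
If you are running this in a Databricks notebook or the pyspark shell, a SparkSession named spark is already available. Otherwise, as a minimal sketch, you would create one yourself (the app name below is just an example):

from pyspark.sql import SparkSession

# Only needed outside notebooks/shells where `spark` is not predefined;
# the app name is arbitrary.
spark = SparkSession.builder.appName("api-multiprocessing-demo").getOrCreate()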

Step 2: Get IDs from a CSV File

We will read the IDs from a CSV file using PySpark’s read.csv() function, filter the rows on the Country column, and collect the ID column into a Python list.

# Reading input data: filter rows by Country and collect the ID column (the fourth column here) as a Python list
IDs = spark.read.csv(Source_File_Path, header=True) \
    .filter(col("Country").isin(Country)) \
    .rdd.map(lambda x: x[3]) \
    .collect()

Step 3: Generate API Token

Before making API calls, we need to generate an API token using an Access Identifier and a Private Key. We will use the requests library to make a POST request to the Authentication URL and read the token from the JSON response.

auth_url = Authentication_URL
param = {
    "accessIdentifier": Authentication_Access_Identifier,
    "privateKey": Authentication_PK
}
access_token_response = requests.post(
    auth_url,
    headers={'Content-type': 'application/json'},
    data=json.dumps(param)
)
access_token_text = json.loads(access_token_response.text)
auth_token = access_token_text['token']

Step 4: Make API Calls

Now, we will define a function that makes an API call for a given ID and returns the JSON response. We will use the requests library to make a GET request, appending the ID to the base URL and passing the API token in the Authorization header.

We will then define a wrapper function that calls this function for a single ID and returns the JSON response wrapped in a list; the multiprocessing pool in Step 5 will apply this wrapper to every ID in the ID list.

# `url` is assumed to be a module-level variable holding the data API's base endpoint
def getData(url, ID, token):
    base_url = url + ID
    headers = {
        "Content-Type": 'application/json',
        "Authorization": f"JWT {str(token)}"
    }
    try:
        response = getRequest(base_url, headers=headers)
    except Exception:
        return {}
    if response is not None and response.status_code == 200:
        return response.json()
    return {}

def getIDData(ID):
    global url
    token = auth_token
    _dataList = []
    jsonDataBatch = getData(url, ID, token)
    _dataList.append(jsonDataBatch)
    return _dataList

Step 5: Implement Multiprocessing

Finally, we will use Python’s multiprocessing module to make API calls concurrently. We will create a multiprocessing pool and use its map() function to call the function defined in Step 4 for each ID in the ID list. The results of the API calls will be returned as a list of lists, which we will chain together to get a final list.

# One worker process per CPU core; map() blocks until every API call has returned
pool = mp.Pool(mp.cpu_count())
results = pool.map(getIDData, IDs)
pool.close()
pool.join()
# Each worker returns a one-element list, so flatten the list of lists
_actualDataList = list(chain(*results))
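
One caveat if you run this as a standalone script rather than in a notebook: on platforms that start worker processes with the spawn method (Windows, and macOS on recent Python versions), the pool creation must sit under an if __name__ == "__main__" guard, otherwise each worker re-imports the script and tries to create its own pool. A minimal sketch of the same pool logic with the guard:

# Assumption: same pool logic as above, just wrapped in the standard
# entry-point guard required by the spawn start method.
if __name__ == "__main__":
    pool = mp.Pool(mp.cpu_count())
    results = pool.map(getIDData, IDs)
    pool.close()
    pool.join()
    _actualDataList = list(chain(*results))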

That’s it! We now have our results in _actualDataList and can use PySpark to transform and analyze the data as required.
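
For example, here is a minimal sketch (assuming spark is the active SparkSession, as in Step 2) that loads the collected responses into a DataFrame by letting Spark infer the schema from the JSON:

# Drop the empty dicts returned for failed calls, then let Spark
# infer a schema from the JSON strings.
valid_records = [record for record in _actualDataList if record]
json_rdd = spark.sparkContext.parallelize([json.dumps(record) for record in valid_records])
df = spark.read.json(json_rdd)
df.printSchema()
df.show(truncate=False)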

Conclusion

In this blog, we have discussed how to make API calls concurrently using Python’s multiprocessing module. By making multiple API calls in parallel instead of one after another, we can significantly reduce the time it takes to fetch data for a large number of IDs.
