Parallel processing of REST API in python/databricks

Sanajit Ghosh
2 min read · May 30, 2024

Processing thousands of API pages can be challenging in synchronous mode, because the for-loop waits for each request to complete before moving on to the next one. To avoid this serial dependency, parallel methods with a few tuned configuration parameters can keep tasks running at a faster rate and reduce the chance of transactional rollbacks. There is an article on how async methods speed up workloads where a large chunk of activities needs to be processed at a faster pace.
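For contrast, here is a minimal sketch of the sequential version of the same workload. The endpoint https://www.testapi.com/ and the page count are placeholders taken from the example further below; the point is simply that each iteration blocks until the previous one has finished.

import time

def url_with_pages(page):
    # build the request URL for a single page
    return 'https://www.testapi.com/' + str(page)

def main():
    start_time = time.time()
    # process pages one at a time; every iteration waits for the previous one
    for page in range(1, 501):
        result = url_with_pages(page)
        print(f'Got result: {result}')
    print("---Total time taken for completion %s seconds ---" % (time.time() - start_time))

if __name__ == '__main__':
    main()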

Below is the code and a use case for processing 1000+ pages of a REST API. The method used is imap_unordered from Python's multiprocessing pool, and the pool size is kept at 4.

For processing 500 pages with a chunksize of 10, 50 batches are created, each one covering 10 pages. As we increase the pool size from 4 to 16 or more, depending on the Databricks cluster, those batches get distributed across more worker processes, from 4 processes up to 16.

The functionality of imap_unordered is to split the input into batches of 10 pages (based on chunksize), distribute those batches to the worker processes (pool size), and yield results in whatever order they finish rather than in input order. The mapping is handled by Python's multiprocessing Pool API.

from multiprocessing.pool import Pool
import time

def url_with_pages(page):
    # build the request URL for a single page
    url = 'https://www.testapi.com/' + str(page)
    return url

def total_pages(total_pages):
    # build the list of page numbers to process
    pages = [page for page in range(1, int(total_pages) + 1)]
    return pages

# entry point
def main():
    start_time = time.time()
    totalpages = 500
    pagestoprocess = total_pages(totalpages)
    print(f'there are total {len(pagestoprocess)} pages in the list {pagestoprocess} \n\n')
    # 4 worker processes; imap_unordered hands each worker a chunk of 10 pages
    # and yields results as soon as they are ready
    with Pool(4) as pool:
        for count, result in enumerate(pool.imap_unordered(url_with_pages, pagestoprocess, chunksize=10)):
            print(f'Got result {count}: {result}', flush=True)

    print("---Total time taken for completion %s seconds ---" % (time.time() - start_time))

if __name__ == '__main__':
    main()
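The worker above only builds URL strings; in practice each worker would also issue the HTTP call. Here is a minimal sketch of that variation, assuming the requests library is available on the cluster; the endpoint, the fetch_page helper, and the JSON response shape are placeholders rather than a real API.

from multiprocessing.pool import Pool
import time

import requests  # assumed to be installed on the cluster

def fetch_page(page):
    # hypothetical endpoint; replace with the real API URL and any auth headers
    url = 'https://www.testapi.com/' + str(page)
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # return the page number together with its payload
    return page, response.json()

def main():
    start_time = time.time()
    pages = list(range(1, 501))
    with Pool(4) as pool:
        for page, payload in pool.imap_unordered(fetch_page, pages, chunksize=10):
            # len(payload) assumes the API returns a JSON list of records
            print(f'Fetched page {page} with {len(payload)} records', flush=True)
    print("---Total time taken for completion %s seconds ---" % (time.time() - start_time))

if __name__ == '__main__':
    main()

One thing to keep in mind is that Pool runs the worker in separate processes, so the function passed to imap_unordered must be picklable, i.e. defined at the top level of a module or notebook cell rather than nested inside another function.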
