Parallel processing pandas dataframes

Introduction

From python 3.2 and forward an excellent module called concurrent.futures is available. It makes it very easy to do multi-threading or multi-processing:

The concurrent.futures module provides a high-level interface for asynchronously executing callables. The asynchronous execution can be performed with threads, using ThreadPoolExecutor, or separate processes, using ProcessPoolExecutor.

Parallel execution of pandas dataframe with a progress bar

In a concrete problem I recently faced I had a large dataframe and a heavy function to execute on each row using a subset of columns from the dataframe. Usually, I would have used the apply method to work through the rows, but apply only uses 1 core of the available cores.

In the made up example below, I am using concurrent.futures to process the dataframe on all available cores. Of course this doesn’t make sense for simple operations as summation below, but for heavy calculations it can make a large impact to use all the available computing power. Also note that I am sending the rows in chunks of 10 to the executor – this reduces the overhead of returning the results.

import tqdm                                                                                                   
import numpy as np
import pandas as pd
import concurrent.futures
import multiprocessing
num_processes = multiprocessing.cpu_count()
 
# Create a dataframe with 1000 rows
df = pd.DataFrame({i: np.random.randint(1,100,size=1000) for i in ['a', 'b', 'c']})
 
# Define a function on the numbers
def func(a, b):
    return a+b
 
# Process the rows in chunks in parallel
with concurrent.futures.ProcessPoolExecutor(num_processes) as pool:
    #df['result'] = list(pool.map(func, df['a'], df['b'], chunksize=10)) # Without a progressbar
    df['result'] = list(tqdm.tqdm(pool.map(func, df['a'], df['b'], chunksize=10), total=df.shape[0])) # With a progressbar

The results are in the same order as the original dataframe and the result of the calculation saved in a separate column.

>>>print(df.head(10))
    a   b   c  result
0  44   7  72      51
1  49  77  30     126
2  74  69  38     143
3  70  25  69      95
4  44  98  20     142
5  10  89  85      99
6  85  47  92     132
7  58  19  16      77
8  46  47  52      93
9  65  87  79     152

Leave a Reply