*ฅ^•ﻌ•^ฅ* ✨✨  HWisnu's blog  ✨✨ *ฅ^•ﻌ•^ฅ*

Concatenating inside a For Loop in Pandas/Python

Introduction

The past week has been extremely busy due to the turmoil in the equity market, triggered by the sharp decline in the Nikkei 225, Japan's stock market index. As a result, I had to put my programming activities on hold and focus on researching the stock market and looking for bargains.

However, I now have some spare time to resume writing and dive back into programming. Hallelujah, praise the Lord!

A Common Workflow in Pandas

Recently, I worked on a data processing task using Pandas in Python, and I encountered a common workflow that I'm sure many Data Scientists, Data Analysts, and even casual Pandas users can relate to.

Check out the first Pandas code below:

import os
import pandas as pd

# List every data file in the folder and wrap the names in a DataFrame
datafiles = os.listdir(r'C:\Users\aabbcc\Desktop\Programming\Data')
datafiles = pd.DataFrame(datafiles)
datafiles.columns = ['Stock']

# Empty DataFrame that will accumulate the results
ALL_management = pd.DataFrame(columns = [
    "Stock",
    "Name",
])

for filename in datafiles.Stock:

    df = pd.read_csv(fr'C:\Users\aabbcc\Desktop\Programming\Data\{filename}')
    # Use the first 4 characters of the file name as the stock symbol
    y = slice(4)
    short_symbol = filename[y]

    # Add the symbol and move it to the front as the first column
    df['Stock'] = short_symbol
    col = df.pop('Stock')
    df.insert(0, 'Stock', col)

    # Concatenate inside the loop: a new, larger DataFrame is built every iteration
    ALL_management = pd.concat([ALL_management, df], axis = 0, ignore_index = True)

In fact, I'm confident that many Pandas users have performed this process countless times, as it's a very common workflow:

  1. Processing data stored in files within a folder
  2. Performing the necessary data processing steps
  3. Combining or concatenating the results into a single new dataset.

Now compare it to the second Pandas code:

import os
import pandas as pd

datafiles = os.listdir(r'C:\Users\xxyyzz\Desktop\Programming\Data')
datafiles = pd.DataFrame(datafiles)
datafiles.columns = ['Stock']

# Plain Python lists that are filled inside the loop
stocks = []
names = []

for filename in datafiles.Stock:

    df = pd.read_csv(fr'C:\Users\xxyyzz\Desktop\Programming\Data\{filename}')
    names.extend(df["Name"].tolist())

    mgt_len = len(df["Name"])

    # Use the first 4 characters of the file name as the stock symbol
    y = slice(4)
    stock = filename[y]
    # Repeat the symbol once per row of this file
    for i in range(mgt_len):
        stocks.append(stock)

# A single concat call outside the loop builds the final DataFrame
ALL_management = pd.concat(
    [pd.DataFrame(stocks), pd.DataFrame(names)], 
    axis = 1, ignore_index = True
    )

ALL_management.columns = ["Stock", "Name"]

Comparing the Two Codes

The second code snippet takes a slightly different approach, even though the overall workflow is the same:

  1. Initializes empty lists before the for loop
  2. Extends or appends the lists with the data
  3. Calls pd.concat outside the loop

Benchmarking Results

Which of the two snippets is more efficient? One might assume that the second one is slower due to the nested for loop. Surprisingly, however, the second snippet is actually faster, with an execution time of 1.4 seconds compared to 2 seconds for the first.

This is because calling pd.concat inside a for loop, as the first snippet does, is inefficient: every iteration allocates a brand-new DataFrame and copies all of the rows accumulated so far, so the amount of copying grows with each file processed. The second snippet only appends to plain Python lists inside the loop, which is cheap, and pays for a single concat at the end. Additionally, the first snippet concatenates the data row-wise (axis = 0), whereas the second concatenates just two columns column-wise (axis = 1).
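
For reference, the approach the pandas documentation itself suggests for this kind of iterative build-up is a hybrid of the two snippets: collect one DataFrame per file in a plain Python list and call pd.concat exactly once after the loop. A minimal sketch along those lines, assuming the same folder layout and a Name column in every CSV:

import os
import pandas as pd

DATA_DIR = r'C:\Users\aabbcc\Desktop\Programming\Data'

frames = []  # one DataFrame per file, collected in a plain list

for filename in os.listdir(DATA_DIR):
    df = pd.read_csv(os.path.join(DATA_DIR, filename))
    # First 4 characters of the file name serve as the stock symbol
    df.insert(0, 'Stock', filename[:4])
    frames.append(df)

# A single concat at the end copies the data only once
ALL_management = pd.concat(frames, axis = 0, ignore_index = True)

This keeps the cheap work (appending to a list) inside the loop and defers the expensive copying to a single pd.concat call at the end.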


Benchmarking Configuration

Note: When conducting these benchmarks, I use Hyperfine with the following configuration:

hyperfine --warmup 2 --min-runs 10 --max-runs 10 'program_name'

For low-level languages such as C, Zig, and Rust, I typically set the minimum and maximum runs to 100 instead of 10.
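
To compare the two snippets above head-to-head, one could save them as separate scripts and pass both commands to Hyperfine in a single invocation (the script names here are my own placeholders):

hyperfine --warmup 2 --min-runs 10 --max-runs 10 'python concat_in_loop.py' 'python concat_once.py'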

#benchmark #data #high-level #pandas #programming #python