Multi-threading in C: pthread vs OMP

14 Aug, 2024

Introduction

A few weeks ago, I published a post on n-body simulation programs written in various programming languages, including C, Zig, Rust, and Python. You can read the original post here.

The Limitations of OpenMP

In the C program, I initially used the OpenMP (OMP) library to implement multithreading, which provides a convenient and high-level approach. However, I soon realized that this abstraction comes with a cost. As the saying goes:

there's no such thing as zero cost abstraction!

This experience prompted me to explore alternative approaches to multithreading in C.

Exploring Alternative Approaches

Inspired by an article shared by the administrator of the C community on Xtwitter here, I decided to test the pthread library, which offers a lower-level implementation of multithreading. This approach allows for more fine-grained control and optimization of the code.

OpenMP Implementation

Check the OMP code, note the 1-liner #pragma omp parallel

void bodyForce(Body *p, float dt, int n) {
  #pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < n; i++) { 
    float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f;

    for (int j = 0; j < n; j++) {
      float dx = p[j].x - p[i].x;
      float dy = p[j].y - p[i].y;
      float dz = p[j].z - p[i].z;
      float distSqr = dx*dx + dy*dy + dz*dz + SOFTENING;
      float invDist = 1.0f / sqrtf(distSqr);
      float invDist3 = invDist * invDist * invDist;

      Fx += dx * invDist3; Fy += dy * invDist3; Fz += dz * invDist3;
    }

    p[i].vx += dt*Fx; p[i].vy += dt*Fy; p[i].vz += dt*Fz;
  }
}

pthread Implementation

pthread implementation, more complex and requires longer lines of code:

the bodyForceThread Function

Calculates the gravitational force between each pair of bodies in a given range. It takes a ThreadData struct as an argument, which contains the array of bodies, the time step, and the start and end indices of the range.

void* bodyForceThread(void* arg) {
    ThreadData* data = (ThreadData*) arg;
    Body* p = data->p;
    float dt = data->dt;
    int start = data->start;
    int end = data->end;

    for (int i = start; i < end; i++) {
        float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f;

        for (int j = 0; j < end; j++) {
            float dx = p[j].x - p[i].x;
            float dy = p[j].y - p[i].y;
            float dz = p[j].z - p[i].z;
            float distSqr = dx*dx + dy*dy + dz*dz + SOFTENING;
            float invDist = 1.0f / sqrtf(distSqr);
            float invDist3 = invDist * invDist * invDist;

            Fx += dx * invDist3; Fy += dy * invDist3; Fz += dz * invDist3;
        }

        p[i].vx += dt*Fx; p[i].vy += dt*Fy; p[i].vz += dt*Fz;
    }

    return NULL;
}

the bodyForce Function

Calculates the gravitational force between all pairs of bodies. It takes an array of bodies, the time step, and the number of bodies as arguments. It creates multiple threads, each of which calculates the force for a chunk of the bodies. The threads are then joined to ensure that all calculations are complete before the function returns.

void bodyForce(Body *p, float dt, int n) {
    pthread_t threads[NUM_THREADS];
    ThreadData threadData[NUM_THREADS];

    int chunkSize = n / NUM_THREADS;
    for (int i = 0; i < NUM_THREADS; i++) {
        threadData[i].p = p;
        threadData[i].dt = dt;
        threadData[i].start = i * chunkSize;
        threadData[i].end = (i == NUM_THREADS - 1) ? n : (i + 1) * chunkSize;
        pthread_create(&threads[i], NULL, bodyForceThread, &threadData[i]);
    }

    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
}

Performance Comparison

Below is an updated table from the original post, which now includes the results of using pthreads for multithreading in C:

MULTI-THREADED

Language	steps/sec	Billion inter/sec	Notes
Python	0.00383	0.00345	Baseline performance
C - Omp	2.017	1.815	525x ; 525x speedup
C - pthread	3.007	2.706	784x ; 784x speedup
Rust -std	2.256	2.031	588x ; 588x speedup
Rust -Rayon	2.012	1.811	524x ; 524x speedup
Zig	3.146	2.831	820x ; 820x speedup

Conclusion

The results are striking: the C implementation using pthread brings +50% more performance than the original OMP implementation. This brings the performance of the C program within range of the Zig implementation, which is a notable achievement.

Note: compiler flags I'm using:

-O -pthread -lpthread

Please note that I did not use the O3 optimization flag in this test, as it consistently resulted in slower performance.

#C #low level #multi threading #n-body simulation #open mp #programming #pthread