Multi-threading in C: pthread vs OMP
Introduction
A few weeks ago, I published a post on n-body simulation programs written in various programming languages, including C, Zig, Rust, and Python. You can read the original post here.
The Limitations of OpenMP
In the C program, I initially used the OpenMP (OMP) library to implement multithreading, which provides a convenient and high-level approach. However, I soon realized that this abstraction comes with a cost. As the saying goes:
there's no such thing as zero cost abstraction!
This experience prompted me to explore alternative approaches to multithreading in C.
Exploring Alternative Approaches
Inspired by an article shared by the administrator of the C community on Xtwitter here, I decided to test the pthread library, which offers a lower-level implementation of multithreading. This approach allows for more fine-grained control and optimization of the code.
OpenMP Implementation
Check the OMP code, note the 1-liner #pragma omp parallel
void bodyForce(Body *p, float dt, int n) {
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < n; i++) {
float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f;
for (int j = 0; j < n; j++) {
float dx = p[j].x - p[i].x;
float dy = p[j].y - p[i].y;
float dz = p[j].z - p[i].z;
float distSqr = dx*dx + dy*dy + dz*dz + SOFTENING;
float invDist = 1.0f / sqrtf(distSqr);
float invDist3 = invDist * invDist * invDist;
Fx += dx * invDist3; Fy += dy * invDist3; Fz += dz * invDist3;
}
p[i].vx += dt*Fx; p[i].vy += dt*Fy; p[i].vz += dt*Fz;
}
}
pthread Implementation
pthread implementation, more complex and requires longer lines of code:
the bodyForceThread Function
Calculates the gravitational force between each pair of bodies in a given range. It takes a ThreadData struct as an argument, which contains the array of bodies, the time step, and the start and end indices of the range.
void* bodyForceThread(void* arg) {
ThreadData* data = (ThreadData*) arg;
Body* p = data->p;
float dt = data->dt;
int start = data->start;
int end = data->end;
for (int i = start; i < end; i++) {
float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f;
for (int j = 0; j < end; j++) {
float dx = p[j].x - p[i].x;
float dy = p[j].y - p[i].y;
float dz = p[j].z - p[i].z;
float distSqr = dx*dx + dy*dy + dz*dz + SOFTENING;
float invDist = 1.0f / sqrtf(distSqr);
float invDist3 = invDist * invDist * invDist;
Fx += dx * invDist3; Fy += dy * invDist3; Fz += dz * invDist3;
}
p[i].vx += dt*Fx; p[i].vy += dt*Fy; p[i].vz += dt*Fz;
}
return NULL;
}
the bodyForce Function
Calculates the gravitational force between all pairs of bodies. It takes an array of bodies, the time step, and the number of bodies as arguments. It creates multiple threads, each of which calculates the force for a chunk of the bodies. The threads are then joined to ensure that all calculations are complete before the function returns.
void bodyForce(Body *p, float dt, int n) {
pthread_t threads[NUM_THREADS];
ThreadData threadData[NUM_THREADS];
int chunkSize = n / NUM_THREADS;
for (int i = 0; i < NUM_THREADS; i++) {
threadData[i].p = p;
threadData[i].dt = dt;
threadData[i].start = i * chunkSize;
threadData[i].end = (i == NUM_THREADS - 1) ? n : (i + 1) * chunkSize;
pthread_create(&threads[i], NULL, bodyForceThread, &threadData[i]);
}
for (int i = 0; i < NUM_THREADS; i++) {
pthread_join(threads[i], NULL);
}
}
Performance Comparison
Below is an updated table from the original post, which now includes the results of using pthreads for multithreading in C:
MULTI-THREADED
Language | steps/sec | Billion inter/sec | Notes |
---|---|---|---|
Python | 0.00383 | 0.00345 | Baseline performance |
C - Omp | 2.017 | 1.815 | 525x ; 525x speedup |
C - pthread | 3.007 | 2.706 | 784x ; 784x speedup |
Rust -std | 2.256 | 2.031 | 588x ; 588x speedup |
Rust -Rayon | 2.012 | 1.811 | 524x ; 524x speedup |
Zig | 3.146 | 2.831 | 820x ; 820x speedup |
Conclusion
The results are striking: the C implementation using pthread brings +50% more performance than the original OMP implementation. This brings the performance of the C program within range of the Zig implementation, which is a notable achievement.
Note: compiler flags I'm using:
-O -pthread -lpthread
Please note that I did not use the O3 optimization flag in this test, as it consistently resulted in slower performance.