Multi-threading in Rust: std::thread vs Rayon
Introduction
This article is the second follow-up on n-body simulation programs written in C, Zig, Rust, and Python. You can read the original post here.
Achievement!
Initially, Rust's performance in the multi-threaded test was not impressive, lagging behind C and Zig. However, after some optimizations, I was able to bring Rust's performance to a similar level. I'm pleased with the outcome, but I must note that achieving this level of performance required a significant amount of effort.
The "Rust way"
While I acknowledge it must be a skill-issue in my part on writing Rust, but there lies the problem: rewriting C code in Rust is more challenging than rewriting it in Zig (which got 1:1 structure with C). To achieve the same results, I often need to deviate from the established path and adopt "the Rust way" of doing things.
Here are some of the optimizations I made compared to the original multi-threaded Rust code:
- Made Body struct to derive Copy in addition to Clone for more efficient passing of Body instances.
- Used Arc to share the bodies data between threads efficiently.
- Adjusted the thread creation logic to ensure all bodies are processed even when n_bodies is not evenly divisible by num_threads.
- Combined the force calculation and position update into a single function (body_force) to reduce the number of iterations over the bodies.
- Simplified the main loop and thread result collection.
- Removed unnecessary vector allocations in the main loop.
std::thread vs Rayon
Okay let's get down to business, here goes the code using plain Rust standard library:
std::thread
Defining and Implementing the Body Struct
use rand::Rng;
use std::thread;
use std::sync::Arc;
use std::time::{Duration, Instant};
const SOFTENING: f32 = 1e-9;
const NBODIES: usize = 30000;
#[derive(Clone, Copy)]
struct Body {
x: f32,
y: f32,
z: f32,
vx: f32,
vy: f32,
vz: f32,
}
impl Body {
fn new() -> Self {
Body {
x: 0.0,
y: 0.0,
z: 0.0,
vx: 0.0,
vy: 0.0,
vz: 0.0,
}
}
}
randomize_bodies Function
Randomizes the positions and velocities of the bodies in the simulation.
fn randomize_bodies(bodies: &mut [Body]) {
let mut rng = rand::thread_rng();
for body in bodies.iter_mut() {
body.x = 2.0 * rng.gen::<f32>() - 1.0;
body.y = 2.0 * rng.gen::<f32>() - 1.0;
body.z = 2.0 * rng.gen::<f32>() - 1.0;
body.vx = 2.0 * rng.gen::<f32>() - 1.0;
body.vy = 2.0 * rng.gen::<f32>() - 1.0;
body.vz = 2.0 * rng.gen::<f32>() - 1.0;
}
}
body_force Function
Calculates the gravitational force between each pair of bodies in the simulation.
fn body_force(bodies: &[Body], dt: f32, start: usize, end: usize) -> Vec<Body> {
let mut new_bodies = bodies[start..end].to_vec();
for i in 0..new_bodies.len() {
let mut fx = 0.0;
let mut fy = 0.0;
let mut fz = 0.0;
for j in 0..bodies.len() {
let dx = bodies[j].x - new_bodies[i].x;
let dy = bodies[j].y - new_bodies[i].y;
let dz = bodies[j].z - new_bodies[i].z;
let dist_sqr = dx * dx + dy * dy + dz * dz + SOFTENING;
let inv_dist = 1.0 / dist_sqr.sqrt();
let inv_dist3 = inv_dist * inv_dist * inv_dist;
fx += dx * inv_dist3;
fy += dy * inv_dist3;
fz += dz * inv_dist3;
}
new_bodies[i].vx += dt * fx;
new_bodies[i].vy += dt * fy;
new_bodies[i].vz += dt * fz;
new_bodies[i].x += new_bodies[i].vx * dt;
new_bodies[i].y += new_bodies[i].vy * dt;
new_bodies[i].z += new_bodies[i].vz * dt;
}
new_bodies
}
main function
- Defining the simulation parameters, initializing the bodies, and running the simulation. It parses the first command-line argument as an integer and assigns it to the n_bodies variable, defaulting to the value of NBODIES if not provided.
- The simulation parameters are defined, including the time step dt, the number of simulation iterations n_iters, and the number of threads to use for parallelization num_threads. A vector of Body instances is initialized with random positions and velocities, and the simulation is run for the specified number of iterations.
- The simulation is parallelized by dividing the bodies into chunks and assigning each chunk to a separate thread, and the results are collected and used to update the bodies vector.
- Finally, the average time per iteration and the average number of interactions per second are calculated and printed to the console.
fn main() {
let n_bodies = std::env::args().nth(1)
.and_then(|arg| arg.parse().ok())
.unwrap_or(NBODIES);
let dt = 0.01; // time step
let n_iters = 10; // simulation iterations
let num_threads = num_cpus::get();
println!("Number of threads: {}\n", num_threads);
let mut bodies: Vec<Body> = (0..n_bodies).map(|_| Body::new()).collect();
randomize_bodies(&mut bodies);
let mut total_time = Duration::from_secs(0);
for iter in 1..=n_iters {
let start_time = Instant::now();
let bodies_arc = Arc::new(bodies.clone());
let chunk_size = n_bodies / num_threads;
let mut handles = vec![];
for i in 0..num_threads {
let bodies_clone = Arc::clone(&bodies_arc);
let start = i * chunk_size;
let end = if i == num_threads - 1 { n_bodies } else { (i + 1) * chunk_size };
let handle = thread::spawn(move || {
body_force(&bodies_clone, dt, start, end)
});
handles.push(handle);
}
bodies = handles.into_iter()
.flat_map(|h| h.join().unwrap())
.collect();
let end_time = Instant::now();
let t_elapsed = end_time - start_time;
if iter > 1 {
total_time += t_elapsed;
}
println!("Iteration {}: {:.3} seconds", iter, t_elapsed.as_secs_f64());
}
let avg_time = total_time.as_secs_f64() / (n_iters as f64 - 1.0);
println!(
"Average rate for iterations 2 through {}: {:.3} steps per second",
n_iters,
1.0 / avg_time
);
let interaction = 1e-9 * (n_bodies as f64) * (n_bodies as f64) / avg_time;
println!(
"{} Bodies: average {:.3} Billion Interactions / second",
n_bodies, interaction
);
}
Below is the code using Rayon:
use rand::Rng;
use std::time::{Duration, Instant};
use rayon::prelude::*;
const SOFTENING: f32 = 1e-9;
const NBODIES: usize = 30000;
#[derive(Clone, Copy)]
struct Body {
x: f32,
y: f32,
z: f32,
vx: f32,
vy: f32,
vz: f32,
}
impl Body {
fn new() -> Self {
Body {
x: 0.0,
y: 0.0,
z: 0.0,
vx: 0.0,
vy: 0.0,
vz: 0.0,
}
}
}
fn randomize_bodies(bodies: &mut [Body]) {
let mut rng = rand::thread_rng();
bodies.iter_mut().for_each(|body| {
body.x = 2.0 * rng.gen::<f32>() - 1.0;
body.y = 2.0 * rng.gen::<f32>() - 1.0;
body.z = 2.0 * rng.gen::<f32>() - 1.0;
body.vx = 2.0 * rng.gen::<f32>() - 1.0;
body.vy = 2.0 * rng.gen::<f32>() - 1.0;
body.vz = 2.0 * rng.gen::<f32>() - 1.0;
});
}
fn body_force(bodies: &[Body], dt: f32) -> Vec<Body> {
bodies.par_iter().map(|bi| {
let (mut fx, mut fy, mut fz) = (0.0, 0.0, 0.0);
for bj in bodies {
let dx = bj.x - bi.x;
let dy = bj.y - bi.y;
let dz = bj.z - bi.z;
let dist_sqr = dx * dx + dy * dy + dz * dz + SOFTENING;
let inv_dist = 1.0 / dist_sqr.sqrt();
let inv_dist3 = inv_dist * inv_dist * inv_dist;
fx += dx * inv_dist3;
fy += dy * inv_dist3;
fz += dz * inv_dist3;
}
let mut new_body = *bi;
new_body.vx += dt * fx;
new_body.vy += dt * fy;
new_body.vz += dt * fz;
new_body.x += new_body.vx * dt;
new_body.y += new_body.vy * dt;
new_body.z += new_body.vz * dt;
new_body
}).collect()
}
fn main() {
let n_bodies = std::env::args().nth(1)
.and_then(|arg| arg.parse().ok())
.unwrap_or(NBODIES);
let dt = 0.01; // time step
let n_iters = 10; // simulation iterations
let mut bodies: Vec<Body> = (0..n_bodies).map(|_| Body::new()).collect();
randomize_bodies(&mut bodies);
println!("Number of threads: {}\n", rayon::current_num_threads());
let mut total_time = Duration::from_secs(0);
for iter in 1..=n_iters {
let start_time = Instant::now();
bodies = body_force(&bodies, dt);
let end_time = Instant::now();
let t_elapsed = end_time - start_time;
if iter > 1 {
total_time += t_elapsed;
}
println!("Iteration {}: {:.3} seconds", iter, t_elapsed.as_secs_f64());
}
let avg_time = total_time.as_secs_f64() / (n_iters as f64 - 1.0);
println!(
"Average rate for iterations 2 through {}: {:.3} steps per second",
n_iters,
1.0 / avg_time
);
let interaction = 1e-9 * (n_bodies as f64) * (n_bodies as f64) / avg_time;
println!(
"{} Bodies: average {:.3} Billion Interactions / second",
n_bodies, interaction
);
}
Comparative Analysis
As we can see from the two codes above, Rayon as a high-level API for parallelizing computations has significantly fewer lines of code compared to std::thread. Rayon automatically manages threads and synchronization, making it much easier to use.
On the other hand, std::thread is a low-level API that requires manual thread management, including thread creation, synchronization, and communication. This approach is more complex, error-prone, and harder to use.
Results
MULTI-THREADED
Language | steps/sec | Billion inter/sec | Notes |
---|---|---|---|
Python | 0.00383 | 0.00345 | Baseline performance |
C - Omp | 2.017 | 1.815 | 525x ; 525x speedup |
C - pthread | 3.007 | 2.706 | 784x ; 784x speedup |
Rust -std | 2.256 | 2.031 | 588x ; 588x speedup |
Rust -Rayon | 2.012 | 1.811 | 524x ; 524x speedup |
Zig | 3.146 | 2.831 | 820x ; 820x speedup |
Looking at the benchmark results, std::thread has slightly higher performance compared to Rayon, but the difference is only around 10%. Compare this to C multithreading using OMP (high-level) vs pthread (low-level) --> pthread got > +50% more performance vs its OMP counterpart.
This begs question to the practicality of using Rust's std::thread with the low performance gain vs the effort and complexity needed by the low-level API. Unless there are other methods to squeeze more performance out of std::thread, I believe that using Rayon is the best (and much easier) option for implementing parallelism in general.