Multi-threading in Rust: std::thread vs Rayon

23 Aug, 2024

Introduction

This article is the second follow-up on n-body simulation programs written in C, Zig, Rust, and Python. You can read the original post here.

Achievement!

Initially, Rust's performance in the multi-threaded test was not impressive, lagging behind C and Zig. However, after some optimizations, I was able to bring Rust's performance to a similar level. I'm pleased with the outcome, but I must note that achieving this level of performance required a significant amount of effort.

The "Rust way"

While I acknowledge it must be a skill-issue in my part on writing Rust, but there lies the problem: rewriting C code in Rust is more challenging than rewriting it in Zig (which got 1:1 structure with C). To achieve the same results, I often need to deviate from the established path and adopt "the Rust way" of doing things.

Here are some of the optimizations I made compared to the original multi-threaded Rust code:

Made Body struct to derive Copy in addition to Clone for more efficient passing of Body instances.
Used Arc to share the bodies data between threads efficiently.
Adjusted the thread creation logic to ensure all bodies are processed even when n_bodies is not evenly divisible by num_threads.
Combined the force calculation and position update into a single function (body_force) to reduce the number of iterations over the bodies.
Simplified the main loop and thread result collection.
Removed unnecessary vector allocations in the main loop.

std::thread vs Rayon

Okay let's get down to business, here goes the code using plain Rust standard library:

std::thread

Defining and Implementing the Body Struct

use rand::Rng;
use std::thread;
use std::sync::Arc;
use std::time::{Duration, Instant};

const SOFTENING: f32 = 1e-9;
const NBODIES: usize = 30000;

#[derive(Clone, Copy)]
struct Body {
    x: f32,
    y: f32,
    z: f32,
    vx: f32,
    vy: f32,
    vz: f32,
}

impl Body {
    fn new() -> Self {
        Body {
            x: 0.0,
            y: 0.0,
            z: 0.0,
            vx: 0.0,
            vy: 0.0,
            vz: 0.0,
        }
    }
}

randomize_bodies Function

Randomizes the positions and velocities of the bodies in the simulation.

fn randomize_bodies(bodies: &mut [Body]) {
    let mut rng = rand::thread_rng();
    for body in bodies.iter_mut() {
        body.x = 2.0 * rng.gen::<f32>() - 1.0;
        body.y = 2.0 * rng.gen::<f32>() - 1.0;
        body.z = 2.0 * rng.gen::<f32>() - 1.0;
        body.vx = 2.0 * rng.gen::<f32>() - 1.0;
        body.vy = 2.0 * rng.gen::<f32>() - 1.0;
        body.vz = 2.0 * rng.gen::<f32>() - 1.0;
    }
}

body_force Function

Calculates the gravitational force between each pair of bodies in the simulation.

fn body_force(bodies: &[Body], dt: f32, start: usize, end: usize) -> Vec<Body> {
    let mut new_bodies = bodies[start..end].to_vec();
    for i in 0..new_bodies.len() {
        let mut fx = 0.0;
        let mut fy = 0.0;
        let mut fz = 0.0;
        for j in 0..bodies.len() {
            let dx = bodies[j].x - new_bodies[i].x;
            let dy = bodies[j].y - new_bodies[i].y;
            let dz = bodies[j].z - new_bodies[i].z;
            let dist_sqr = dx * dx + dy * dy + dz * dz + SOFTENING;
            let inv_dist = 1.0 / dist_sqr.sqrt();
            let inv_dist3 = inv_dist * inv_dist * inv_dist;
            fx += dx * inv_dist3;
            fy += dy * inv_dist3;
            fz += dz * inv_dist3;
        }
        new_bodies[i].vx += dt * fx;
        new_bodies[i].vy += dt * fy;
        new_bodies[i].vz += dt * fz;
        new_bodies[i].x += new_bodies[i].vx * dt;
        new_bodies[i].y += new_bodies[i].vy * dt;
        new_bodies[i].z += new_bodies[i].vz * dt;
    }
    new_bodies
}

main function

Defining the simulation parameters, initializing the bodies, and running the simulation. It parses the first command-line argument as an integer and assigns it to the n_bodies variable, defaulting to the value of NBODIES if not provided.
The simulation parameters are defined, including the time step dt, the number of simulation iterations n_iters, and the number of threads to use for parallelization num_threads. A vector of Body instances is initialized with random positions and velocities, and the simulation is run for the specified number of iterations.
The simulation is parallelized by dividing the bodies into chunks and assigning each chunk to a separate thread, and the results are collected and used to update the bodies vector.
Finally, the average time per iteration and the average number of interactions per second are calculated and printed to the console.

fn main() {
    let n_bodies = std::env::args().nth(1)
        .and_then(|arg| arg.parse().ok())
        .unwrap_or(NBODIES);

    let dt = 0.01; // time step
    let n_iters = 10; // simulation iterations
    let num_threads = num_cpus::get();
    println!("Number of threads: {}\n", num_threads);

    let mut bodies: Vec<Body> = (0..n_bodies).map(|_| Body::new()).collect();
    randomize_bodies(&mut bodies);

    let mut total_time = Duration::from_secs(0);
    for iter in 1..=n_iters {
        let start_time = Instant::now();

        let bodies_arc = Arc::new(bodies.clone());
        let chunk_size = n_bodies / num_threads;
        let mut handles = vec![];

        for i in 0..num_threads {
            let bodies_clone = Arc::clone(&bodies_arc);
            let start = i * chunk_size;
            let end = if i == num_threads - 1 { n_bodies } else { (i + 1) * chunk_size };
            
            let handle = thread::spawn(move || {
                body_force(&bodies_clone, dt, start, end)
            });
            handles.push(handle);
        }

        bodies = handles.into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect();

        let end_time = Instant::now();
        let t_elapsed = end_time - start_time;
        if iter > 1 {
            total_time += t_elapsed;
        }

        println!("Iteration {}: {:.3} seconds", iter, t_elapsed.as_secs_f64());
    }

    let avg_time = total_time.as_secs_f64() / (n_iters as f64 - 1.0);
    println!(
        "Average rate for iterations 2 through {}: {:.3} steps per second",
        n_iters,
        1.0 / avg_time
    );

    let interaction = 1e-9 * (n_bodies as f64) * (n_bodies as f64) / avg_time;
    println!(
        "{} Bodies: average {:.3} Billion Interactions / second",
        n_bodies, interaction
    );
}

Below is the code using Rayon:

use rand::Rng;
use std::time::{Duration, Instant};
use rayon::prelude::*;

const SOFTENING: f32 = 1e-9;
const NBODIES: usize = 30000;

#[derive(Clone, Copy)]
struct Body {
    x: f32,
    y: f32,
    z: f32,
    vx: f32,
    vy: f32,
    vz: f32,
}

impl Body {
    fn new() -> Self {
        Body {
            x: 0.0,
            y: 0.0,
            z: 0.0,
            vx: 0.0,
            vy: 0.0,
            vz: 0.0,
        }
    }
}

fn randomize_bodies(bodies: &mut [Body]) {
    let mut rng = rand::thread_rng();
    bodies.iter_mut().for_each(|body| {
        body.x = 2.0 * rng.gen::<f32>() - 1.0;
        body.y = 2.0 * rng.gen::<f32>() - 1.0;
        body.z = 2.0 * rng.gen::<f32>() - 1.0;
        body.vx = 2.0 * rng.gen::<f32>() - 1.0;
        body.vy = 2.0 * rng.gen::<f32>() - 1.0;
        body.vz = 2.0 * rng.gen::<f32>() - 1.0;
    });
}

fn body_force(bodies: &[Body], dt: f32) -> Vec<Body> {
    bodies.par_iter().map(|bi| {
        let (mut fx, mut fy, mut fz) = (0.0, 0.0, 0.0);
        for bj in bodies {
            let dx = bj.x - bi.x;
            let dy = bj.y - bi.y;
            let dz = bj.z - bi.z;
            let dist_sqr = dx * dx + dy * dy + dz * dz + SOFTENING;
            let inv_dist = 1.0 / dist_sqr.sqrt();
            let inv_dist3 = inv_dist * inv_dist * inv_dist;

            fx += dx * inv_dist3;
            fy += dy * inv_dist3;
            fz += dz * inv_dist3;
        }

        let mut new_body = *bi;
        new_body.vx += dt * fx;
        new_body.vy += dt * fy;
        new_body.vz += dt * fz;
        new_body.x += new_body.vx * dt;
        new_body.y += new_body.vy * dt;
        new_body.z += new_body.vz * dt;

        new_body
    }).collect()
}

fn main() {
    let n_bodies = std::env::args().nth(1)
        .and_then(|arg| arg.parse().ok())
        .unwrap_or(NBODIES);

    let dt = 0.01; // time step
    let n_iters = 10; // simulation iterations

    let mut bodies: Vec<Body> = (0..n_bodies).map(|_| Body::new()).collect();
    randomize_bodies(&mut bodies);

    println!("Number of threads: {}\n", rayon::current_num_threads());

    let mut total_time = Duration::from_secs(0);
    for iter in 1..=n_iters {
        let start_time = Instant::now();

        bodies = body_force(&bodies, dt);

        let end_time = Instant::now();
        let t_elapsed = end_time - start_time;
        if iter > 1 {
            total_time += t_elapsed;
        }

        println!("Iteration {}: {:.3} seconds", iter, t_elapsed.as_secs_f64());
    }

    let avg_time = total_time.as_secs_f64() / (n_iters as f64 - 1.0);
    println!(
        "Average rate for iterations 2 through {}: {:.3} steps per second",
        n_iters,
        1.0 / avg_time
    );

    let interaction = 1e-9 * (n_bodies as f64) * (n_bodies as f64) / avg_time;
    println!(
        "{} Bodies: average {:.3} Billion Interactions / second",
        n_bodies, interaction
    );
}

Comparative Analysis

As we can see from the two codes above, Rayon as a high-level API for parallelizing computations has significantly fewer lines of code compared to std::thread. Rayon automatically manages threads and synchronization, making it much easier to use.

On the other hand, std::thread is a low-level API that requires manual thread management, including thread creation, synchronization, and communication. This approach is more complex, error-prone, and harder to use.

Results

MULTI-THREADED

Language	steps/sec	Billion inter/sec	Notes
Python	0.00383	0.00345	Baseline performance
C - Omp	2.017	1.815	525x ; 525x speedup
C - pthread	3.007	2.706	784x ; 784x speedup
Rust -std	2.256	2.031	588x ; 588x speedup
Rust -Rayon	2.012	1.811	524x ; 524x speedup
Zig	3.146	2.831	820x ; 820x speedup

Looking at the benchmark results, std::thread has slightly higher performance compared to Rayon, but the difference is only around 10%. Compare this to C multithreading using OMP (high-level) vs pthread (low-level) --> pthread got > +50% more performance vs its OMP counterpart.

This begs question to the practicality of using Rust's std::thread with the low performance gain vs the effort and complexity needed by the low-level API. Unless there are other methods to squeeze more performance out of std::thread, I believe that using Rayon is the best (and much easier) option for implementing parallelism in general.

Note: I'm not using any unsafe blocks in my Rust code, which we often find in some Rust codes for example in this benchmark more specifically in the 4-i.rs code which is the fastest version of Rust n-body program.

#Rayon #blazingly fast #low level #multithreading #programming #rust