*ฅ^•ﻌ•^ฅ* ✨✨  HWisnu's blog  ✨✨ о ฅ^•ﻌ•^ฅ

Multi-threading in Rust: std::thread vs Rayon

Introduction

This article is the second follow-up on n-body simulation programs written in C, Zig, Rust, and Python. You can read the original post here.

Achievement!

Initially, Rust's performance in the multi-threaded test was not impressive, lagging behind C and Zig. However, after some optimizations, I was able to bring Rust's performance to a similar level. I'm pleased with the outcome, but I must note that achieving this level of performance required a significant amount of effort.

The "Rust way"

While I acknowledge it must be a skill-issue in my part on writing Rust, but there lies the problem: rewriting C code in Rust is more challenging than rewriting it in Zig (which got 1:1 structure with C). To achieve the same results, I often need to deviate from the established path and adopt "the Rust way" of doing things.

Here are some of the optimizations I made compared to the original multi-threaded Rust code:

std::thread vs Rayon

Okay let's get down to business, here goes the code using plain Rust standard library:

std::thread

Defining and Implementing the Body Struct

use rand::Rng;
use std::thread;
use std::sync::Arc;
use std::time::{Duration, Instant};

const SOFTENING: f32 = 1e-9;
const NBODIES: usize = 30000;

#[derive(Clone, Copy)]
struct Body {
    x: f32,
    y: f32,
    z: f32,
    vx: f32,
    vy: f32,
    vz: f32,
}

impl Body {
    fn new() -> Self {
        Body {
            x: 0.0,
            y: 0.0,
            z: 0.0,
            vx: 0.0,
            vy: 0.0,
            vz: 0.0,
        }
    }
}

randomize_bodies Function

Randomizes the positions and velocities of the bodies in the simulation.

fn randomize_bodies(bodies: &mut [Body]) {
    let mut rng = rand::thread_rng();
    for body in bodies.iter_mut() {
        body.x = 2.0 * rng.gen::<f32>() - 1.0;
        body.y = 2.0 * rng.gen::<f32>() - 1.0;
        body.z = 2.0 * rng.gen::<f32>() - 1.0;
        body.vx = 2.0 * rng.gen::<f32>() - 1.0;
        body.vy = 2.0 * rng.gen::<f32>() - 1.0;
        body.vz = 2.0 * rng.gen::<f32>() - 1.0;
    }
}

body_force Function

Calculates the gravitational force between each pair of bodies in the simulation.

fn body_force(bodies: &[Body], dt: f32, start: usize, end: usize) -> Vec<Body> {
    let mut new_bodies = bodies[start..end].to_vec();
    for i in 0..new_bodies.len() {
        let mut fx = 0.0;
        let mut fy = 0.0;
        let mut fz = 0.0;
        for j in 0..bodies.len() {
            let dx = bodies[j].x - new_bodies[i].x;
            let dy = bodies[j].y - new_bodies[i].y;
            let dz = bodies[j].z - new_bodies[i].z;
            let dist_sqr = dx * dx + dy * dy + dz * dz + SOFTENING;
            let inv_dist = 1.0 / dist_sqr.sqrt();
            let inv_dist3 = inv_dist * inv_dist * inv_dist;
            fx += dx * inv_dist3;
            fy += dy * inv_dist3;
            fz += dz * inv_dist3;
        }
        new_bodies[i].vx += dt * fx;
        new_bodies[i].vy += dt * fy;
        new_bodies[i].vz += dt * fz;
        new_bodies[i].x += new_bodies[i].vx * dt;
        new_bodies[i].y += new_bodies[i].vy * dt;
        new_bodies[i].z += new_bodies[i].vz * dt;
    }
    new_bodies
}

main function

fn main() {
    let n_bodies = std::env::args().nth(1)
        .and_then(|arg| arg.parse().ok())
        .unwrap_or(NBODIES);

    let dt = 0.01; // time step
    let n_iters = 10; // simulation iterations
    let num_threads = num_cpus::get();
    println!("Number of threads: {}\n", num_threads);

    let mut bodies: Vec<Body> = (0..n_bodies).map(|_| Body::new()).collect();
    randomize_bodies(&mut bodies);

    let mut total_time = Duration::from_secs(0);
    for iter in 1..=n_iters {
        let start_time = Instant::now();

        let bodies_arc = Arc::new(bodies.clone());
        let chunk_size = n_bodies / num_threads;
        let mut handles = vec![];

        for i in 0..num_threads {
            let bodies_clone = Arc::clone(&bodies_arc);
            let start = i * chunk_size;
            let end = if i == num_threads - 1 { n_bodies } else { (i + 1) * chunk_size };
            
            let handle = thread::spawn(move || {
                body_force(&bodies_clone, dt, start, end)
            });
            handles.push(handle);
        }

        bodies = handles.into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect();

        let end_time = Instant::now();
        let t_elapsed = end_time - start_time;
        if iter > 1 {
            total_time += t_elapsed;
        }

        println!("Iteration {}: {:.3} seconds", iter, t_elapsed.as_secs_f64());
    }

    let avg_time = total_time.as_secs_f64() / (n_iters as f64 - 1.0);
    println!(
        "Average rate for iterations 2 through {}: {:.3} steps per second",
        n_iters,
        1.0 / avg_time
    );

    let interaction = 1e-9 * (n_bodies as f64) * (n_bodies as f64) / avg_time;
    println!(
        "{} Bodies: average {:.3} Billion Interactions / second",
        n_bodies, interaction
    );
}

Below is the code using Rayon:

use rand::Rng;
use std::time::{Duration, Instant};
use rayon::prelude::*;

const SOFTENING: f32 = 1e-9;
const NBODIES: usize = 30000;

#[derive(Clone, Copy)]
struct Body {
    x: f32,
    y: f32,
    z: f32,
    vx: f32,
    vy: f32,
    vz: f32,
}

impl Body {
    fn new() -> Self {
        Body {
            x: 0.0,
            y: 0.0,
            z: 0.0,
            vx: 0.0,
            vy: 0.0,
            vz: 0.0,
        }
    }
}

fn randomize_bodies(bodies: &mut [Body]) {
    let mut rng = rand::thread_rng();
    bodies.iter_mut().for_each(|body| {
        body.x = 2.0 * rng.gen::<f32>() - 1.0;
        body.y = 2.0 * rng.gen::<f32>() - 1.0;
        body.z = 2.0 * rng.gen::<f32>() - 1.0;
        body.vx = 2.0 * rng.gen::<f32>() - 1.0;
        body.vy = 2.0 * rng.gen::<f32>() - 1.0;
        body.vz = 2.0 * rng.gen::<f32>() - 1.0;
    });
}

fn body_force(bodies: &[Body], dt: f32) -> Vec<Body> {
    bodies.par_iter().map(|bi| {
        let (mut fx, mut fy, mut fz) = (0.0, 0.0, 0.0);
        for bj in bodies {
            let dx = bj.x - bi.x;
            let dy = bj.y - bi.y;
            let dz = bj.z - bi.z;
            let dist_sqr = dx * dx + dy * dy + dz * dz + SOFTENING;
            let inv_dist = 1.0 / dist_sqr.sqrt();
            let inv_dist3 = inv_dist * inv_dist * inv_dist;

            fx += dx * inv_dist3;
            fy += dy * inv_dist3;
            fz += dz * inv_dist3;
        }

        let mut new_body = *bi;
        new_body.vx += dt * fx;
        new_body.vy += dt * fy;
        new_body.vz += dt * fz;
        new_body.x += new_body.vx * dt;
        new_body.y += new_body.vy * dt;
        new_body.z += new_body.vz * dt;

        new_body
    }).collect()
}

fn main() {
    let n_bodies = std::env::args().nth(1)
        .and_then(|arg| arg.parse().ok())
        .unwrap_or(NBODIES);

    let dt = 0.01; // time step
    let n_iters = 10; // simulation iterations

    let mut bodies: Vec<Body> = (0..n_bodies).map(|_| Body::new()).collect();
    randomize_bodies(&mut bodies);

    println!("Number of threads: {}\n", rayon::current_num_threads());

    let mut total_time = Duration::from_secs(0);
    for iter in 1..=n_iters {
        let start_time = Instant::now();

        bodies = body_force(&bodies, dt);

        let end_time = Instant::now();
        let t_elapsed = end_time - start_time;
        if iter > 1 {
            total_time += t_elapsed;
        }

        println!("Iteration {}: {:.3} seconds", iter, t_elapsed.as_secs_f64());
    }

    let avg_time = total_time.as_secs_f64() / (n_iters as f64 - 1.0);
    println!(
        "Average rate for iterations 2 through {}: {:.3} steps per second",
        n_iters,
        1.0 / avg_time
    );

    let interaction = 1e-9 * (n_bodies as f64) * (n_bodies as f64) / avg_time;
    println!(
        "{} Bodies: average {:.3} Billion Interactions / second",
        n_bodies, interaction
    );
}

Comparative Analysis

As we can see from the two codes above, Rayon as a high-level API for parallelizing computations has significantly fewer lines of code compared to std::thread. Rayon automatically manages threads and synchronization, making it much easier to use.

On the other hand, std::thread is a low-level API that requires manual thread management, including thread creation, synchronization, and communication. This approach is more complex, error-prone, and harder to use.

Results

MULTI-THREADED

Language steps/sec Billion inter/sec Notes
Python 0.00383 0.00345 Baseline performance
C - Omp 2.017 1.815 525x ; 525x speedup
C - pthread 3.007 2.706 784x ; 784x speedup
Rust -std 2.256 2.031 588x ; 588x speedup
Rust -Rayon 2.012 1.811 524x ; 524x speedup
Zig 3.146 2.831 820x ; 820x speedup

Looking at the benchmark results, std::thread has slightly higher performance compared to Rayon, but the difference is only around 10%. Compare this to C multithreading using OMP (high-level) vs pthread (low-level) --> pthread got > +50% more performance vs its OMP counterpart.

This begs question to the practicality of using Rust's std::thread with the low performance gain vs the effort and complexity needed by the low-level API. Unless there are other methods to squeeze more performance out of std::thread, I believe that using Rayon is the best (and much easier) option for implementing parallelism in general.


Note: I'm not using any unsafe blocks in my Rust code, which we often find in some Rust codes for example in this benchmark more specifically in the 4-i.rs code which is the fastest version of Rust n-body program.

#Rayon #blazingly fast #low level #multithreading #programming #rust