SIMD vs Scalar: Unlocking the Power of Parallel Processing
Introduction
This is a follow-up to my previous post, "Struct of Arrays", which is part of a series on Data Oriented Design implementation.
SIMD - Single Instruction Multiple Data
SIMD is a parallel processing technique that executes a single instruction on multiple data elements simultaneously, increasing computational speed and efficiency.
Why SIMD? It enables data-level parallelism, whereas multithreading (as discussed for C and Rust) provides thread-level parallelism.
Data-level parallelism is considered lower-level than thread-level parallelism. It is also commonly called vectorization. The two terms are not exactly the same, but they share the same primary characteristic: executing an operation on multiple data elements in parallel.
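To make "one instruction, multiple data elements" concrete, here is a minimal sketch in C. It assumes GCC or Clang, because it relies on their vector extension rather than on CPU-specific intrinsics. The names f32x4, add_scalar and add_simd are made up for this illustration and are not taken from the programs in this post.

```c
#include <stddef.h>
#include <string.h>

/* GCC/Clang vector extension: four 32-bit floats handled as one value.
   (Assumption: this type is illustrative, not the post's original code.) */
typedef float f32x4 __attribute__((vector_size(16)));

/* Scalar: one addition per loop iteration. */
void add_scalar(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

/* SIMD: four additions per loop iteration, then a scalar tail
   for the elements that do not fill a whole vector. */
void add_simd(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        f32x4 va, vb;
        memcpy(&va, a + i, sizeof va);   /* load that is safe for unaligned data */
        memcpy(&vb, b + i, sizeof vb);
        f32x4 vr = va + vb;              /* one operation, four lanes */
        memcpy(out + i, &vr, sizeof vr); /* store that is safe for unaligned data */
    }
    for (; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

Both functions produce the same result; the difference is how many elements each loop iteration handles.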
Languages: C - Zig - Rust
As usual, the implementation is done in my three favorite languages. In this post I want to do something different: I'm going to show the diff between C and Zig in Neovim, which hopefully shows you how similar the two languages are. If you know C, you already know Zig!
[Update]: to zoom in, use "Open image in new tab" and then double-click the image.
For easy replication, the full code is available in the Pastebin repo below.
Particle Simulation
Part 1 - consts, structs and allocators
Note that the ParticleData struct uses Struct of Arrays (SoA) instead of Array of Structs (AoS), as mentioned in the Introduction above.
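For readers who have not seen the previous post, the two layouts can be sketched in C roughly like this. The field names (x, y, vx, vy, count) are assumptions for illustration; the actual ParticleData fields are the ones in the Pastebin code.

```c
#include <stddef.h>

/* Array of Structs (AoS): the fields of one particle sit together,
   so the x values of neighbouring particles are far apart in memory. */
typedef struct {
    float x, y, vx, vy;   /* hypothetical fields */
} ParticleAoS;

/* Struct of Arrays (SoA): one contiguous array per field,
   so consecutive x values can be loaded as one SIMD vector. */
typedef struct {
    float *x;
    float *y;
    float *vx;
    float *vy;
    size_t count;
} ParticleData;

/* Distance in bytes between the x values of two neighbouring particles. */
size_t aos_x_stride(void) { return sizeof(ParticleAoS); } /* 4 * sizeof(float) */
size_t soa_x_stride(void) { return sizeof(float); }
```

The smaller stride is what matters for SIMD: a vector load wants its elements next to each other.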
Part 2 - initializers and free memory
Note how concise the freeParticleData function is in Zig: it uses std.meta.fields, which lets it iterate over the fields of the relevant struct (ParticleData) and apply the same process to each one.
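C has no equivalent of std.meta.fields, so every field has to be handled by hand. The sketch below shows what that typically looks like. Only the name freeParticleData comes from this post; initParticleData and the field names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    float *x, *y, *vx, *vy;   /* hypothetical fields */
    size_t count;
} ParticleData;

/* Free every array explicitly; forgetting one line here is a leak. */
void freeParticleData(ParticleData *p) {
    free(p->x);
    free(p->y);
    free(p->vx);
    free(p->vy);
    p->x = p->y = p->vx = p->vy = NULL;
    p->count = 0;
}

/* Allocate one zero-initialised array per field (hypothetical helper). */
bool initParticleData(ParticleData *p, size_t count) {
    p->count = count;
    p->x  = calloc(count, sizeof(float));
    p->y  = calloc(count, sizeof(float));
    p->vx = calloc(count, sizeof(float));
    p->vy = calloc(count, sizeof(float));
    if (!p->x || !p->y || !p->vx || !p->vy) {
        freeParticleData(p);   /* free(NULL) is a no-op, so this is safe */
        return false;
    }
    return true;
}
```

Adding a field to the struct means touching both functions in C, which is exactly the repetition the Zig version avoids.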
Part 3 - update particles using SIMD
Note that I added two helper functions in the Zig version: loadVector and storeVector. They are there to reduce verbosity. Verbosity is one thing I dislike about Zig, but as you can see, we can "hack it" with helper functions.
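The same load/compute/store pattern can be sketched in C. This is not the code from the screenshots: it assumes GCC or Clang vector extensions instead of intrinsics, and load_vector, store_vector, update_positions and the dt parameter are illustrative names.

```c
#include <stddef.h>
#include <string.h>

/* Four floats per vector (GCC/Clang vector extension, an assumption). */
typedef float f32x4 __attribute__((vector_size(16)));

/* Helpers in the spirit of loadVector and storeVector. */
static inline f32x4 load_vector(const float *src) {
    f32x4 v;
    memcpy(&v, src, sizeof v);
    return v;
}

static inline void store_vector(float *dst, f32x4 v) {
    memcpy(dst, &v, sizeof v);
}

/* position += velocity * dt, four particles at a time. */
void update_positions(float *x, float *y, const float *vx, const float *vy,
                      size_t count, float dt) {
    const f32x4 vdt = {dt, dt, dt, dt};   /* dt broadcast to all lanes */
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        store_vector(x + i, load_vector(x + i) + load_vector(vx + i) * vdt);
        store_vector(y + i, load_vector(y + i) + load_vector(vy + i) * vdt);
    }
    for (; i < count; i++) {              /* scalar tail */
        x[i] += vx[i] * dt;
        y[i] += vy[i] * dt;
    }
}
```

With the helpers in place, the loop body reads almost like the scalar formula, which is the point of adding them.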
Part 4 - main function
Nothing particularly interesting here. The print statement is there to make sure the compiler does not optimize out the whole computation.
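One common way to do this is to fold the results into a single number and print it. A minimal sketch in C, with hypothetical names (checksum and print_checksum are not from the original programs):

```c
#include <stddef.h>
#include <stdio.h>

/* Fold every result into one number so that all of them are "used". */
float checksum(const float *values, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += values[i];
    }
    return sum;
}

/* Printing the checksum makes the results observable, so the compiler
   cannot discard the work that produced them as dead code. */
void print_checksum(const float *values, size_t n) {
    printf("checksum: %f\n", checksum(values, n));
}
```

Without something like this, an optimizing build may remove the simulation loop entirely, and the benchmark ends up timing nothing.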
Matrix Multiplication
Part 1 - consts, initializers, matrix multiply using SIMD
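As a rough idea of what a SIMD matrix multiply can look like, here is a sketch in C. It assumes GCC or Clang vector extensions, square row-major matrices, and a size N that is a multiple of 4. The value of N and the name matmul_simd are made up for this example and may differ from the Pastebin code.

```c
#include <stddef.h>
#include <string.h>

#define N 8   /* hypothetical size; must be a multiple of 4 in this sketch */

typedef float f32x4 __attribute__((vector_size(16)));

/* c = a * b for N x N row-major matrices.
   Each step of the k loop updates four neighbouring columns of c at once. */
void matmul_simd(const float *a, const float *b, float *c) {
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j += 4) {
            f32x4 acc = {0.0f, 0.0f, 0.0f, 0.0f};
            for (size_t k = 0; k < N; k++) {
                const float s = a[i * N + k];
                const f32x4 a_ik = {s, s, s, s};   /* broadcast a[i][k] */
                f32x4 b_row;
                memcpy(&b_row, &b[k * N + j], sizeof b_row); /* b[k][j..j+3] */
                acc += a_ik * b_row;
            }
            memcpy(&c[i * N + j], &acc, sizeof acc);         /* c[i][j..j+3] */
        }
    }
}
```

Broadcasting one element of a and multiplying it by a row segment of b keeps every load contiguous, so no transpose of b is needed.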
Part 2 - main function, snippet of the scalar function
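For comparison, a scalar matrix multiply is usually the classic triple loop. The sketch below is a generic version in C, not the snippet from the screenshot; matmul_scalar and its parameters are illustrative.

```c
#include <stddef.h>

/* c = a * b for n x n row-major matrices, one multiply-add at a time. */
void matmul_scalar(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

A function like this is also handy as a reference to check that a SIMD version produces the same matrix.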
Benchmark
We have arrived at the benchmark section, get ready for some numbers!
[Update]: as a reader pointed out on my X/Twitter account, please note that in the Rust Matrix Multiplication results the columns are in seconds, while the others are in milliseconds.
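If you want to reproduce the timings and avoid this kind of unit mix-up, one option is to convert everything to milliseconds in a single place. A minimal sketch in C, assuming timestamps taken with the standard C11 timespec_get; elapsed_ms is a hypothetical helper and not necessarily how the numbers above were measured.

```c
#include <time.h>

/* Elapsed time between two timestamps, always reported in milliseconds. */
double elapsed_ms(struct timespec start, struct timespec end) {
    return (double)(end.tv_sec - start.tv_sec) * 1000.0
         + (double)(end.tv_nsec - start.tv_nsec) / 1.0e6;
}

/* Typical use (names are illustrative):
 *
 *   struct timespec t0, t1;
 *   timespec_get(&t0, TIME_UTC);
 *   run_simulation();
 *   timespec_get(&t1, TIME_UTC);
 *   printf("%.3f ms\n", elapsed_ms(t0, t1));
 */
```

Note that TIME_UTC is wall-clock time, so a monotonic clock is the safer choice where one is available.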
Conclusion
As we can see from the benchmark numbers, the SIMD versions significantly outperform their scalar counterparts, no matter which of the three programming languages we use.
Pastebin repo
- Particle Simulation:
- Matrix Multiplication:
How I compile the different programs
- C:
gcc-14 -o c_particle_sim_simd -O3 c_particle_sim_simd.c -march=native -std=c23
- Zig:
zig build-exe -O ReleaseFast zig_particle_sim_simd.zig
- Rust:
cargo rustc --release --bin rust_particle_sim_simd -- -C opt-level=3 -C target-cpu=znver2 -C target-feature=+avx2