SIMD vs Scalar: Unlocking the Power of Parallel Processing

08 Jan, 2025

Introduction

This is a follow-up to my previous post, "Struct of Arrays", which is part of a series on implementing Data Oriented Design.

SoA-1

SIMD - Single Instruction Multiple Data

SIMD is a parallel processing technique that executes a single instruction on multiple data elements simultaneously, increasing computational speed and efficiency.

Why SIMD? It enables us to achieve data-level parallelism, whereas multithreading, as discussed for C and Rust, provides thread-level parallelism.

Data-level parallelism is considered lower-level than thread-level parallelism. Nowadays it is often called vectorization. The two terms are not exactly the same, but they share the same primary characteristic: executing an operation on multiple data elements in parallel.
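To make the idea concrete, here is a minimal C sketch that adds two float arrays, once element by element and once four elements at a time. It assumes GCC or Clang, because it relies on their vector_size extension instead of CPU-specific intrinsics. The names f32x4, add_scalar and add_simd are made up for this illustration and are not taken from the programs in this post.

```c
#include <assert.h>
#include <string.h>  /* memcpy */

/* Four 32-bit floats packed into one 16-byte vector (GCC/Clang extension). */
typedef float f32x4 __attribute__((vector_size(16)));

/* Scalar: one addition per loop iteration. */
void add_scalar(const float *a, const float *b, float *out, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

/* SIMD: one vector addition handles four elements. A scalar loop
   finishes the leftover tail when n is not a multiple of 4. */
void add_simd(const float *a, const float *b, float *out, int n) {
    assert(n >= 0);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        f32x4 va, vb;
        memcpy(&va, a + i, sizeof va);   /* load that is safe for unaligned data */
        memcpy(&vb, b + i, sizeof vb);
        f32x4 vr = va + vb;              /* four additions in one operation */
        memcpy(out + i, &vr, sizeof vr); /* store that is safe for unaligned data */
    }
    for (; i < n; i++) {
        out[i] = a[i] + b[i];            /* scalar tail */
    }
}
```

Both functions produce the same result; the difference is only in how many elements each operation touches.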

Languages: C - Zig - Rust

As usual, the implementation is done in my three favorite languages. This time I want to do something different: I'm going to show the diff in Neovim between the C and Zig versions, which hopefully shows how similar the two languages are. If you know C, you already know Zig!

[Update]: to zoom in, use "Open image in new tab" and then double-click the image.

For easy replication, I provide the full code in the Pastebin repo section.

Particle Simulation

Part 1 - consts, structs and allocators

Note that the ParticleData struct uses a Struct of Arrays (SoA) layout instead of an Array of Structs (AoS), as mentioned in the Introduction above.
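For readers who cannot view the screenshots, the difference between the two layouts can be sketched roughly as follows in C. This is a simplified illustration, not the exact code from the post: the field names (x, y, vx, vy), the ParticleAoS type and the particle_data_alloc helper are assumptions.

```c
#include <assert.h>
#include <stdlib.h>

/* Array of Structs: all fields of one particle sit next to each other. */
typedef struct {
    float x, y, vx, vy;
} ParticleAoS;

/* Struct of Arrays: every field gets its own contiguous array, so
   x[i..i+3] can be loaded into one vector register in a single step. */
typedef struct {
    float *x, *y;    /* positions  */
    float *vx, *vy;  /* velocities */
    size_t count;
} ParticleData;

/* Allocate zero-initialised arrays. Returns 0 on success, -1 on failure. */
int particle_data_alloc(ParticleData *p, size_t count) {
    assert(p != NULL && count > 0);
    p->x  = calloc(count, sizeof *p->x);
    p->y  = calloc(count, sizeof *p->y);
    p->vx = calloc(count, sizeof *p->vx);
    p->vy = calloc(count, sizeof *p->vy);
    if (!p->x || !p->y || !p->vx || !p->vy) {
        /* Roll back partial allocations; free(NULL) is a no-op. */
        free(p->x); free(p->y); free(p->vx); free(p->vy);
        p->x = p->y = p->vx = p->vy = NULL;
        p->count = 0;
        return -1;
    }
    p->count = count;
    return 0;
}
```

With the AoS layout, the x values of neighbouring particles are 16 bytes apart, so gathering four of them takes extra work. With SoA they are adjacent in memory, which is what makes the SIMD loads cheap.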

Part 2 - initializers and free memory

Note how concise the freeParticleData function is in Zig, particularly because it uses std.meta.fields, which lets it apply the same cleanup to every field of the relevant struct (ParticleData) automatically.
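C has no compile-time reflection, so the field list has to be maintained by hand. The sketch below is one way to get a similar "loop over the fields" shape in C using offsetof. It is an illustration under assumed field names (x, y, vx, vy), not the cleanup code from the post, and PARTICLE_FIELD_OFFSETS and free_particle_data are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>  /* offsetof */
#include <stdlib.h>

typedef struct {
    float *x, *y, *vx, *vy;
    size_t count;
} ParticleData;

/* Byte offset of every pointer field. This table must be updated by hand
   whenever a field is added, which is the part Zig automates. */
static const size_t PARTICLE_FIELD_OFFSETS[] = {
    offsetof(ParticleData, x),
    offsetof(ParticleData, y),
    offsetof(ParticleData, vx),
    offsetof(ParticleData, vy),
};

/* Free every array listed in the table and reset the struct. */
void free_particle_data(ParticleData *p) {
    assert(p != NULL);
    const size_t n =
        sizeof PARTICLE_FIELD_OFFSETS / sizeof PARTICLE_FIELD_OFFSETS[0];
    for (size_t i = 0; i < n; i++) {
        float **field = (float **)((char *)p + PARTICLE_FIELD_OFFSETS[i]);
        free(*field);
        *field = NULL;  /* guard against accidental reuse */
    }
    p->count = 0;
}
```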

Part 3 - update particles using SIMD

Note that I added two helper functions in the Zig version: loadVector and storeVector. They are there to reduce verbosity. Verbosity is one thing I dislike about Zig, but as you can see, we can "hack it" with helper functions.
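The same load, compute, store pattern can be sketched in C as shown below. This is a simplified stand-in rather than the code from the screenshots: it uses the GCC/Clang vector_size extension instead of CPU-specific intrinsics, the helpers load_vector and store_vector only mirror the idea of the Zig helpers, and update_positions with its pos += vel * dt rule is an assumed, minimal update step.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>  /* memcpy */

/* Four floats per vector (GCC/Clang extension). */
typedef float f32x4 __attribute__((vector_size(16)));
enum { LANES = 4 };

/* Load LANES consecutive floats into a vector. */
static inline f32x4 load_vector(const float *src) {
    f32x4 v;
    memcpy(&v, src, sizeof v);
    return v;
}

/* Write a vector back to LANES consecutive floats. */
static inline void store_vector(float *dst, f32x4 v) {
    memcpy(dst, &v, sizeof v);
}

/* pos += vel * dt for every particle, LANES particles per iteration.
   count must be a multiple of LANES in this sketch. */
void update_positions(float *pos, const float *vel, size_t count, float dt) {
    assert(count % LANES == 0);
    const f32x4 vdt = {dt, dt, dt, dt};  /* dt copied into every lane */
    for (size_t i = 0; i < count; i += LANES) {
        const f32x4 p = load_vector(pos + i);
        const f32x4 v = load_vector(vel + i);
        store_vector(pos + i, p + v * vdt);
    }
}
```

Because the positions and velocities live in separate contiguous arrays, each helper call moves four particles' worth of one field at once.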

Part 4 - main function

Nothing particularly interesting here. The print statement is there to make sure the compiler does not optimize away the whole computation.
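One common way to do this is to reduce the final state to a single number and print it. The sketch below shows the idea; checksum and report are hypothetical helpers, and the programs in this post may print something different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Sum every element so the final state feeds one observable value. */
double checksum(const float *data, size_t count) {
    assert(data != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < count; i++) {
        sum += data[i];
    }
    return sum;
}

/* Printing is a side effect the optimizer has to preserve, so the work
   that produced `data` cannot be discarded as dead code. */
void report(const char *label, const float *data, size_t count) {
    printf("%s checksum: %f\n", label, checksum(data, count));
}
```

A side benefit is that the scalar and SIMD versions can be compared by their checksums to confirm they compute the same thing.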

Matrix Multiplication

Part 1 - consts, initializers, matrix multiply using SIMD

matmul-1
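For readers who cannot view the screenshot, here is a rough C sketch of one common way to vectorize matrix multiplication: broadcast one element of A into every lane, multiply it by four adjacent elements of a row of B, and accumulate. It assumes square row-major matrices and the GCC/Clang vector_size extension. The name matmul_simd and the loop order are assumptions, and the actual code in the post may be organized differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>  /* memcpy */

/* Four floats per vector (GCC/Clang extension). */
typedef float f32x4 __attribute__((vector_size(16)));
enum { LANES = 4 };

/* c = a * b for row-major n x n matrices.
   n must be a multiple of LANES in this sketch. */
void matmul_simd(const float *a, const float *b, float *c, size_t n) {
    assert(n % LANES == 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j += LANES) {
            f32x4 acc = {0.0f, 0.0f, 0.0f, 0.0f};
            for (size_t k = 0; k < n; k++) {
                const float s = a[i * n + k];
                const f32x4 va = {s, s, s, s};          /* broadcast a[i][k] */
                f32x4 vb;
                memcpy(&vb, &b[k * n + j], sizeof vb);  /* b[k][j..j+3] */
                acc += va * vb;                         /* 4 multiply-adds */
            }
            memcpy(&c[i * n + j], &acc, sizeof acc);    /* c[i][j..j+3] */
        }
    }
}
```

Each pass of the inner loop computes four output columns at once, and both B and C are accessed along contiguous rows.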

Part 2 - Main function, snippet of the scalar function

matmul-2
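As a baseline for comparison, a scalar matrix multiply is typically the classic triple loop. The sketch below assumes square row-major matrices; matmul_scalar is an illustrative name and not necessarily the function shown in the screenshot.

```c
#include <assert.h>
#include <stddef.h>

/* c = a * b for row-major n x n matrices, one multiply-add at a time. */
void matmul_scalar(const float *a, const float *b, float *c, size_t n) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];  /* row of a, column of b */
            }
            c[i * n + j] = sum;
        }
    }
}
```

Note that the inner loop walks down a column of b, so consecutive reads are n floats apart. That strided access is one reason the scalar version is hard for the compiler to vectorize on its own.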

Benchmark

We have arrived at the benchmark section. Get ready for some numbers!

bench

[Update]: as a reader pointed out on my X/Twitter account, please note that the Rust Matrix Multiplication columns are in seconds, while all the others are in milliseconds.
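To avoid this kind of unit mix-up, it helps to convert every measurement to one unit in a single place. Below is a minimal C sketch of that idea using the standard clock() function. It is not the timing code used for the numbers above; ticks_to_ms, time_ms and busy_work are hypothetical names, and clock() measures CPU time rather than wall-clock time.

```c
#include <assert.h>
#include <time.h>

/* Convert a clock() tick delta to milliseconds. */
double ticks_to_ms(clock_t start, clock_t end) {
    assert(end >= start);
    return (double)(end - start) * 1000.0 / (double)CLOCKS_PER_SEC;
}

/* Run fn once and return the CPU time it consumed, in milliseconds. */
double time_ms(void (*fn)(void)) {
    const clock_t start = clock();
    fn();
    const clock_t end = clock();
    return ticks_to_ms(start, end);
}

/* Placeholder workload; volatile keeps the loop from being optimized out. */
void busy_work(void) {
    volatile double sink = 0.0;
    for (int i = 0; i < 100000; i++) {
        sink += (double)i * 0.5;
    }
}
```

Calling time_ms(busy_work) returns the elapsed CPU time in milliseconds, so the scalar and SIMD runs of every language can be reported in the same unit.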

Conclusion

As the benchmark numbers show, the SIMD versions significantly outperform their scalar counterparts in both programs, regardless of which of the three languages is used.

Pastebin repo

Particle Simulation:

Rust: SIMD
Zig: SIMD
C: SIMD

Matrix Multiplication:

Rust: SIMD
Zig: SIMD
C: SIMD

How I compile the different programs

C: gcc-14 -o c_particle_sim_simd -O3 c_particle_sim_simd.c -march=native -std=c23
Zig: zig build-exe -O ReleaseFast zig_particle_sim_simd.zig
Rust: cargo rustc --release --bin rust_particle_sim_simd -- -C opt-level=3 -C target-cpu=znver2 -C target-feature=+avx2