Autograd optimization v2
Introduction
This is a continuation of the first Autograd post. In v2 we bring some optimizations, and I will walk you through what I did to massively speed up the program.
Random number
In v2 we are going to use a random number generator to feed inputs into the program. Previously we used fixed floats (1.0 and 2.0) to check that each program outputs the correct solution. However, the compilers for each language apply different levels of optimization, even when we use comparable modes: -O3 for C, -O ReleaseFast for Zig and --release for Rust. With constant inputs, a compiler may fold part of the work away at compile time, which skews the benchmark.
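As a minimal sketch of the idea, a helper like the one below (the name `random_input` is illustrative, not the post's actual code) produces a value the compiler cannot know at compile time, so each language has to do the full computation at runtime:

```c
#include <stdlib.h>

/* Illustrative helper: a random double in [0, 1).
 * Feeding random inputs instead of the fixed 1.0 and 2.0
 * prevents the optimizer from constant-folding the whole
 * computation away at -O3 / ReleaseFast / --release. */
static double random_input(void) {
    return (double)rand() / ((double)RAND_MAX + 1.0);
}
```

Seeding with a fixed value (e.g. `srand(42)`) keeps the benchmark reproducible across runs while still defeating constant folding.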
Previous v1 benchmark - normalized
To normalize the benchmarks, I modified v1 to use the random number generator as well, especially since I already know all the programs output the correct solution.
v2 benchmark - normalized
Let's switch things up and start from the finish: below are the optimized v2 bench numbers, also normalized using the random number generator.
In this format it's easier to compare the v1 and v2 bench numbers. Also, since King C is on top this time, we will use the C code for the discussion.
Note that compiled Python (PyCodon) is faster than Rust and almost as fast as Zig. This shows that Python code, once properly optimized and compiled, can reach the same playing field as these highly performant languages. Why? Buried deep inside Python is a powerful C engine; we just need to know how to unlock it!
Optimization levels
- Level v0 : no optimization, just make the program run correctly. (Done)
- Level v1 : first iteration of optimization. (Done)
- Level v2 : second iteration, discussed in this post. (Done)
- Level v3 : SIMD. (WIP)
- Level v4 : Multithreading. (Future WIP)
Performance Improvements
1. Store the results of operations and reuse them
--> Reduces redundant work and computational overhead.
--Note--: on a PC you need to "Open image in new tab" and then double-click the image to zoom in. On a mobile phone you can simply pinch-zoom to enlarge the image.
In v1 (left split), the program does not store the results of operations. Instead, it recalculates values during the forward pass every time they are needed. For example:
- The value of sum_xy is recalculated when it is used in prod_sum_xy.
- The value of prod_sum_xy is recalculated when it is used in softplus_prod.
In v2 (right split), the program stores the results of operations in the NaiveTape structure and reuses them whenever needed. This eliminates the need for recalculation:
- The value of sum_xy is stored in tape.vars[sum_xy_id] and reused in prod_sum_xy.
- The value of prod_sum_xy is stored in tape.vars[prod_sum_xy_id] and reused in softplus_prod.
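The v1-vs-v2 difference can be sketched with plain doubles, independent of the tape machinery. The names below (`add`, `v1_style`, `v2_style`) are illustrative, not the post's actual code; the counter just makes the redundant work visible:

```c
/* Count how many times the operation actually runs. */
static int add_calls = 0;

static double add(double a, double b) { add_calls++; return a + b; }

/* v1 style: the sum is recomputed every time it is needed. */
static double v1_style(double x, double y) {
    return add(x, y) * add(x, y);   /* two calls */
}

/* v2 style: compute once, store in a local, reuse. */
static double v2_style(double x, double y) {
    double sum_xy = add(x, y);      /* one call */
    return sum_xy * sum_xy;
}
```

Both return the same result, but the v2 style halves the number of `add` evaluations; across millions of iterations that difference dominates.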
--Improvements--
- Results are calculated once and stored for reuse, reducing computational overhead.
- Faster performance due to elimination of redundant calculations.
2.1 Storing results in an array improves cache efficiency.
In v2 (right split), the NaiveTape structure stores all variables (vars) and operations (ops) in contiguous arrays:
- Each variable (NaiveVar) stores its value (val) and gradient (grad).
- Each operation (MyOperation) stores its type, input ids, and output ids.
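A minimal sketch of that layout, assuming fixed capacities (the MAX_VARS / MAX_OPS names and the exact field types are my assumptions, not the post's actual definitions):

```c
#include <stddef.h>

#define MAX_VARS 64
#define MAX_OPS  64

/* Each variable stores its value and gradient, side by side. */
typedef struct { double val; double grad; } NaiveVar;

typedef enum { OP_SUM, OP_PROD, OP_SOFTPLUS } OpType;

/* Each operation stores its type plus integer ids (indices
 * into the vars array) instead of pointers. */
typedef struct {
    OpType type;
    size_t in0, in1;   /* input variable ids */
    size_t out;        /* output variable id */
} MyOperation;

/* Everything lives in two contiguous arrays inside the tape. */
typedef struct {
    NaiveVar    vars[MAX_VARS];
    MyOperation ops[MAX_OPS];
    size_t      n_vars, n_ops;
} NaiveTape;
```

Because `vars` and `ops` are plain arrays inside one struct, walking the tape touches memory sequentially, which is exactly what the cache-locality point below is about.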
Deep dive:
- Reduced memory footprint:
v1: large and inefficient data structures; pointers consume additional memory.
v2: compact data structures; integer indices remove the need for pointers.
- Better cache locality:
v1: NaiveVar and MyOperation objects are scattered in memory due to dynamic allocation.
v2: all data is stored in contiguous arrays within the NaiveTape struct.
- Elimination of the memory pool:
v1: fixed-size memory pool, introducing out-of-bounds risk.
v2: no memory pool; objects are stored directly in the arrays within the NaiveTape struct.
I wrote a post on why using indices is better than pointers in an ArrayList. TLDR: pointers use more memory and are more prone to memory errors; array indices are the better fit for this use case.
2.2 Reusing Results.
When an operation is performed, its result is stored in the vars array and can be accessed later using its index. For example:
- Sum Operation:
- The result of tape_sum is stored in tape.vars[sum_xy_id].
- This result is reused in the tape_prod operation.
- Product Operation:
- The result of tape_prod is stored in tape.vars[prod_sum_xy_id].
- This result is reused in the tape_softplus operation.
Reference code:
2.3 Faster execution.
The elimination of redundant calculations and improved cache efficiency lead to faster execution in v2. This is particularly noticeable in the main loop, where the program performs millions of iterations.
3 Simplified Backward Pass.
v1: pointer-based access to NaiveVar objects -> memory indirection, slow.
v2: index-based access -> eliminates pointer indirection, fast.
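As a sketch of the index-based style (the function name and signature are illustrative), here is a backward step for the sum op. Since d(a+b)/da = d(a+b)/db = 1, the output's gradient flows unchanged to both inputs, and everything is plain indexing into one contiguous array:

```c
#include <stddef.h>

typedef struct { double val; double grad; } NaiveVar;

/* Backward rule for out = a + b: accumulate the output's
 * gradient into both inputs via array indices, no pointer
 * chasing through scattered heap objects. */
static void backward_sum(NaiveVar *vars,
                         size_t in0, size_t in1, size_t out) {
    vars[in0].grad += vars[out].grad;
    vars[in1].grad += vars[out].grad;
}
```

The backward pass then just walks the ops array in reverse, applying a rule like this per op type.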
Conclusion
As you can see, massive speedups are achievable by applying these optimizations, especially storing and reusing results, which eliminates redundant processing.
Note that we have yet to touch the optimization levels where we go deep (yes, this goes deep!) into SIMD and then multithreading. But that needs to wait for the next weekend.
Pastebin - code repo
You can find the code for each language here: