*ฅ^•ﻌ•^ฅ* ✨✨  HWisnu's blog  ✨✨ ฅ^•ﻌ•^ฅ

Autograd optimization v2

Introduction

This is a continuation of the first Autograd post. In v2 we are bringing some optimizations, and I will walk you through what I did to massively speed up the program.

Random numbers

In this v2 we are going to use a random number generator to feed inputs into the program. Previously we used fixed floats (1.0 and 2.0) to check that the program outputs the correct solution. However, the compilers for each language apply different levels of optimization, even when we use similar modes: -O3 for C, -O ReleaseFast for Zig and --release for Rust. With fixed constants, some compilers can fold much of the computation away at compile time, which skews the comparison.

Previous v1 benchmark - normalized

In order to have the benches normalized, I've modified v1 to use the random number generator as well, which is safe since I already know all the programs output the correct solution.

opt1-rand

v2 benchmark - normalized

Let's switch things up and start from the finish: below are the optimized v2 bench numbers, also normalized using the random number generator.

opt2-rand

In this format it's easier to compare the v1 and v2 bench numbers. Also, since King C is on top this time, we are going to use the C code for the discussion.

Note that compiled Python (PyCodon) is faster than Rust and almost as fast as Zig. This shows how Python code, once properly optimized and compiled, is able to reach the same playing field as these super performant programming languages... why? Deep inside Python is buried a powerful C engine; we just need to know how to unlock it!

Optimization levels

Performance Improvements

1. Store the results of operations and reuse them
--> reduces redundant work and computational overhead.

--Note--: on PC you need to "Open image in new tab" and then double-click the image to zoom in. On a mobile phone you can just pinch-zoom to enlarge the image.

v2-00

In v1 (left split), the program does not store the results of operations. Instead, it recalculates values during the forward pass every time they are needed. For example:

In v2 (right split), the program stores the results of operations in the NaiveTape structure and reuses them whenever needed. This eliminates the need for recalculation:

--Improvements--

2.1 Storing results in an array improves cache efficiency.
v2-01

In v2 (right split), the NaiveTape structure stores all variables (vars) and operations (ops) in contiguous arrays:

Deep dive:

  1. Reduced memory footprint:
    v1: large and inefficient data structures. Using pointers consumes additional memory.
v2: compact data structures. Uses integer indices, eliminating the need for pointers.
  2. Better cache locality:
    v1: NaiveVar and MyOperation objects are scattered in memory due to dynamic allocation.
    v2: all data is stored in contiguous arrays within the NaiveTape struct.
  3. Elimination of Memory Pool:
    v1: fixed size memory pool, introducing out-of-bounds risk.
    v2: no mem pool, objects are stored directly in the array within the NaiveTape struct.

I wrote a post on why using an index is better than pointers in an ArrayList. TLDR: pointers use more memory and are more prone to memory errors. An array index is better in this use case.

2.2 Reusing Results.
When an operation is performed, its result is stored in the vars array and can be accessed later using its index. For example:

  1. Sum Operation:
  2. Product Operation:

Reference code: v2-02

2.3 Faster execution.
The elimination of redundant calculations and improved cache efficiency lead to faster execution in v2. This is particularly noticeable in the main loop, where the program performs millions of iterations.

3 Simplified Backward Pass.
v1: Pointer based to access NaiveVar objects -> memory indirection, slow.
v2: Index based -> eliminates pointer indirection, fast.

v2-03

Conclusion

As you can see, massive speedups are achievable by applying these optimizations, especially storing and reusing the results, which eliminates redundant processing.

Note we have yet to even touch the optimization levels where we go deep (yes, this goes so deep!) into SIMD and then multithreading. But that needs to wait for the next weekend.



Pastebin - code repo

You can find the code for each language here:

  1. Python (compiled using Codon).
  2. C
  3. Zig
  4. Rust