Autograd - in C, Zig, Rust and PyCodon
Introduction
So... I'm supposed to be on my DevOps arc, but here I am writing a piece on low-level programming! The pull of this path is too strong, I can't fight it... I'll get back to DevOps by the end of the week, I promise!
Autograd
There are great references on Autograd; this one is taken from PyTorch's website:
PyTorch’s Autograd feature is part of what make PyTorch flexible and fast for building machine learning projects. It allows for the rapid and easy computation of multiple partial derivatives (also referred to as gradients) over a complex computation. This operation is central to backpropagation-based neural network learning.
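To make this concrete, here is the gradient that the toy benchmark later in this post has to reproduce (my own hand-worked sketch, not something from the PyTorch docs). The expression built on the tape is y = softplus(s * s) with s = x1 + x2, where softplus(x) = ln(1 + e^x):

dy/ds = sigmoid(s * s) * 2s, with sigmoid(x) = 1 / (1 + e^(-x))
dy/dx1 = dy/dx2 = dy/ds * 1 = 2s * sigmoid(s * s)

An autograd engine records the forward operations on a "tape" and then replays them in reverse, applying exactly these local derivative rules; that is what the backward method further below does.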
Machine Learning in general use cases
Machine Learning has been proven to be highly useful in a variety of sectors:
Healthcare: it accelerates the development of new and more advanced drugs. In an article I read (but failed to locate again), ML enables rapid drug discovery: R&D that used to take decades can now be done in about a year.
Logistics: warehouse management, route optimization, supply-demand prediction, etc.
Finance: credit risk assessment, fraud detection, macroeconomic prediction, etc.
...
Machine Learning in stock portfolio management
I want to point out how ML is used in my main domain: fund management. Let me just show you the historical performance of ML-based portfolios, also widely known as Quant portfolios:
[Chart: YTD performance up to October 2024; one of the lowest performances.]
[Chart: 5 Year performance compared to other strategies; the 5Yr CAR (compound annual rate) is far behind the HF Composite and Equities.]
[Chart: 10 Year performance comparison; in the long run (5-10 Yr), very poor performance.]
So yeah, you can draw your own conclusion, and the next time someone tries to hype up or sell you a sophisticated Quant fund portfolio, you can make a better judgement instead of being misled.
The point I want to make: up to now, conventional fund management still has the upper hand over Machine Learning-based Quant funds. You don't have to believe me; you can download the PDF report here.
I could write more on this below-par performance of Quant funds, and it's interesting to see how they performed when the Covid-19 pandemic hit (spoiler: it was one of the worst stretches!).
Back to code!
Alright, enough with the finance stuff, we are here to discuss the Autograd benchmark! As is customary for me, I will only provide the code from the most performant language, since including the code for all 4 languages would make the post too lengthy.
I will post all the code in a code repository, but in the meantime... the most performant is Zig.
[Update!] You can find the code on Pastebin:
#1 Consts
As usual when writing Zig, I dislike Zig's verbosity, so the consts here are my effort to reduce it.
const std = @import("std");
const testing = std.testing;
const print = std.debug.print;
const time = std.time.milliTimestamp;
const Allocator = std.mem.Allocator;
const ArenaAlloc = std.heap.ArenaAllocator;
const pageAlloc = std.heap.page_allocator;
#2 Enums and Structs
- OpType enum and NaiveVar struct are self-explanatory.
- Operation struct: represents an operation with an OpType, two input variables, and an output variable.
const OpType = enum(u2) {
Sum,
Prod,
Softplus,
};
const NaiveVar = struct {
val: f64,
grad: f64,
pub fn init(val: f64) NaiveVar {
return .{ .val = val, .grad = 0.0 };
}
};
const Operation = struct {
op_type: OpType,
inputs: [2]*NaiveVar,
output: NaiveVar,
};
#3 NaiveTape struct
- init: creates a new instance.
- deinit: frees the allocated memory.
const NaiveTape = struct {
ops: []Operation,
allocator: Allocator,
pub fn init(allocator: Allocator) NaiveTape {
return .{
.ops = &[_]Operation{},
.allocator = allocator,
};
}
pub fn deinit(self: *NaiveTape) void {
self.allocator.free(self.ops);
}
// ---Place the various methods here--- //
};
Below are the methods on the NaiveTape struct.
#3a Sum method
Adds two input variables and returns the result as a new variable.
pub fn sum(self: *NaiveTape, input1: *NaiveVar, input2: *NaiveVar) !*NaiveVar {
const sum_val = input1.val + input2.val;
const sum_var = NaiveVar.init(sum_val);
const op = Operation{
.op_type = .Sum,
.inputs = [_]*NaiveVar{ input1, input2 },
.output = sum_var,
};
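// Append the op to the tape; realloc may move the slice, so the output pointer is taken only after this.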
self.ops = try self.allocator.realloc(self.ops, self.ops.len + 1);
self.ops[self.ops.len - 1] = op;
return &self.ops[self.ops.len - 1].output;
}
#3b Prod method
Multiplies two input variables and returns the result as a new variable.
pub fn prod(self: *NaiveTape, var1: *NaiveVar, var2: *NaiveVar) !*NaiveVar {
const prod_val = var1.val * var2.val;
const prod_var = NaiveVar.init(prod_val);
const op = Operation{
.op_type = .Prod,
.inputs = [_]*NaiveVar{ var1, var2 },
.output = prod_var,
};
self.ops = try self.allocator.realloc(self.ops, self.ops.len + 1);
self.ops[self.ops.len - 1] = op;
return &self.ops[self.ops.len - 1].output;
}
#3c Softplus method
Applies the softplus activation function to a single input variable and returns the result as a new variable.
pub fn softplus(self: *NaiveTape, nvar: *NaiveVar) !*NaiveVar {
const softplus_val = std.math.log1p(std.math.exp(nvar.val));
const softplus_var = NaiveVar.init(softplus_val);
const op = Operation{
.op_type = .Softplus,
.inputs = [_]*NaiveVar{nvar, undefined}, // Only one input needed, second is unused
.output = softplus_var,
};
self.ops = try self.allocator.realloc(self.ops, self.ops.len + 1);
self.ops[self.ops.len - 1] = op;
return &self.ops[self.ops.len - 1].output;
}
#3d Backward method
- The method iterates through the operations in the reverse order of their execution (i.e., from the output to the inputs).
pub fn backward(self: *NaiveTape, nvar: *NaiveVar) void {
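// Seed the gradient: the derivative of the output with respect to itself is 1.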
nvar.grad += 1.0;
var i = self.ops.len;
while (i > 0) {
i -= 1;
const op = self.ops[i];
switch (op.op_type) {
.Sum => {
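// d(a+b)/da = d(a+b)/db = 1, so the output gradient flows to both inputs unchanged.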
const output_grad = op.output.grad;
op.inputs[0].grad += output_grad;
op.inputs[1].grad += output_grad;
},
.Prod => {
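// d(a*b)/da = b and d(a*b)/db = a.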
const output_grad = op.output.grad;
const input1_val = op.inputs[0].val;
const input2_val = op.inputs[1].val;
op.inputs[0].grad += input2_val * output_grad;
op.inputs[1].grad += input1_val * output_grad;
},
.Softplus => {
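// d/dx softplus(x) = sigmoid(x) = 1 / (1 + e^(-x)).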
const output_grad = op.output.grad;
const input_val = op.inputs[0].val;
op.inputs[0].grad += output_grad / (1.0 + std.math.exp(-input_val));
},
}
}
}
#4 Main
The main function:
- Loops the process 1 million times.
- Computes the sum, product, and softplus of the input variables.
- Computes the gradients of the output variable with respect to the input variables.
- Prints the values, the gradients, and the elapsed time of the loop.
pub fn main() !void {
var arena = ArenaAlloc.init(pageAlloc);
defer arena.deinit();
const allocator = arena.allocator();
const iterations: usize = 1000000;
const start_time = time();
var i: usize = 0;
while (i < iterations) : (i += 1) {
var var1 = NaiveVar.init(1.0);
var var2 = NaiveVar.init(2.0);
var tape = NaiveTape.init(allocator);
defer tape.deinit();
const sum_var = try tape.sum(&var1, &var2);
const prod_var = try tape.prod(sum_var, sum_var);
const softplus_var = try tape.softplus(prod_var);
tape.backward(softplus_var);
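// Print only on the final iteration so console output runs just once inside the timed loop.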
if (i == iterations - 1) {
print("sum_var val: {d:}\n", .{sum_var.val});
print("prod_var val: {d:}\n", .{prod_var.val});
print("softplus_var val: {d:}\n", .{softplus_var.val});
print("sum_var grad: {d:}\n", .{sum_var.grad});
print("var1 grad: {d:}\n", .{var1.grad});
print("var2 grad: {d:}\n", .{var2.grad});
}
}
const end_time = time();
const elapsed_time = @as(f64, @floatFromInt(end_time - start_time));
print("\nElapsed time: {d:.3} ms\n", .{elapsed_time});
}
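As a quick correctness check, here is a small test I would drop into the same file (my own sketch, not part of the benchmarked code; it assumes all the snippets above live in a single file and reuses the testing alias from #1). It compares the gradients against the hand-derived value 2 * s * sigmoid(s * s) = 6 * sigmoid(9) ≈ 5.9993 for the inputs used in main, and runs with zig test <file>.zig:
test "gradients match the hand-derived chain rule" {
var arena = ArenaAlloc.init(pageAlloc);
defer arena.deinit();
const allocator = arena.allocator();
var x1 = NaiveVar.init(1.0);
var x2 = NaiveVar.init(2.0);
var tape = NaiveTape.init(allocator);
defer tape.deinit();
const s = try tape.sum(&x1, &x2); // s = 3
const p = try tape.prod(s, s); // p = s * s = 9
const y = try tape.softplus(p); // y = ln(1 + e^9)
tape.backward(y);
// dy/dx1 = dy/dx2 = 2 * s * sigmoid(p) = 6 / (1 + e^(-9))
const p_val: f64 = 9.0;
const expected: f64 = 6.0 / (1.0 + std.math.exp(-p_val));
try testing.expectApproxEqAbs(expected, x1.grad, 1e-6);
try testing.expectApproxEqAbs(expected, x2.grad, 1e-6);
}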
Benchmark result
Note: I applied the same level of code optimization across all the languages. Currently I have managed to apply the 01 optimization iteration to all 4 languages.
I have actually done the 02 optimization, but only for C, and the result was staggering: elapsed time down to 0.3 ms. However, I haven't had the time to implement the 02 version for the rest; I will update the post once it's done.
What's next?
I will keep updating the post as I finish the next iterations of optimization. I might also write a series on these optimizations, especially the ones in C, Zig and Rust.
A sneak peek of the iterations I did:
- Level 00: no optimization, just make the program run correctly. (Done)
- Level 01: first iteration of optimization. (Done)
- Level 02: second iteration. (Done in C; WIP for Zig and Rust.)
- Level 0x: x-th iteration... you get the gist.