*ฅ^•ﻌ•^ฅ* ✨✨  HWisnu's blog  ✨✨ *ฅ^•ﻌ•^ฅ*

Memory scope and lifetimes

Introduction: benchmark projects

Over the past few weeks, I've been working on some of the most interesting benchmark projects I've undertaken to date: n-body simulation and fast matrix multiplication. To make them even more challenging, I decided to implement them using multi-threading.

Background and Code Rewrite

The matrix multiplication program was originally written by Snow, a member of the Zig community on Xwitter (see Snow's thread). I took his code, which was written in Zig, and rewrote it in C. I enjoy rewriting code from one language to another, as it helps me understand the strengths and weaknesses of each language. Typically, I work with languages like C, Zig, Rust, and Python.

Optimizing the Code

After successfully rewriting the code in C, I began optimizing Snow's original code. To my surprise, I was able to achieve an additional 10% performance gain with minimal changes to the code. Since the codebase is quite large, with over 250 lines of code, I'll only be sharing the most relevant parts.

Hint: it all has to do with the title of this article: memory scope and lifetimes!

Below is the code from Snow:

const std = @import("std");

pub fn calculateGflops(allocator: std.mem.Allocator, M: usize, N: usize, K: usize, iterations: usize) !f64 {
    const A = try allocator.alloc(f32, M * K);
    defer allocator.free(A);
    const B = try allocator.alloc(f32, K * N);
    defer allocator.free(B);
    const C = try allocator.alloc(f32, M * N);
    defer allocator.free(C);

    // Initialize matrices
    for (A, 0..) |*val, i| {
        val.* = @floatFromInt(i % 10);
    }
    for (B, 0..) |*val, i| {
        val.* = @floatFromInt((i + 1) % 10);
    }

    // Warmup run
    try tiledMatMul(allocator, A, B, C, M, N, K);

    // Timed runs
    var timer = try std.time.Timer.start();
    for (0..iterations) |_| {
        try tiledMatMul(allocator, A, B, C, M, N, K);
    }
    const elapsed_ns = timer.read();

    // Calculate GFLOPS
    const ops = 2 * M * N * K * iterations; // multiply-add is 2 operations
    const seconds = @as(f64, @floatFromInt(elapsed_ns)) / 1e9;
    const gflops = @as(f64, @floatFromInt(ops)) / seconds / 1e9;

    return gflops;
}


pub fn main() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();
    const allocator = arena.allocator();
    const sizes = [_][3]usize{
        .{ 256, 256, 256 },
        .{ 512, 512, 512 },
        .{ 1024, 1024, 1024 },
        .{ 1024, 2048, 1024 },
        .{ 2048, 2048, 2048 },
    };

    const iterations = 10;

    // T and V are tiling/vector constants defined elsewhere in the full source (not shown here)
    std.debug.print("T = {} \n V = {} \n", .{ T, V });

    for (sizes) |size| {
        const M = size[0];
        const N = size[1];
        const K = size[2];

        const gflops = try calculateGflops(allocator, M, N, K, iterations);

        std.debug.print("Matrix size: {}x{}x{}, GFLOPS: {d:.2}\n", .{ M, N, K, gflops });
    }
}

My version of the code with subtle changes:

calculateGflops Function

This function calculates the GFLOPS (billions of floating-point operations per second) achieved by a matrix multiplication. It takes four parameters: the matrix dimensions M, N, and K, and the number of timed iterations.

The function initializes two matrices A and B with deterministic values (i % 10 and (i + 1) % 10), and then performs a warmup run of the matrix multiplication. It then performs the timed runs of the matrix multiplication, and calculates the GFLOPS based on the elapsed time and the number of operations performed.

const std        = @import("std");
const print      = std.debug.print;
const tMilli     = std.time.milliTimestamp;
const Prng       = std.rand.DefaultPrng;
const Allocator  = std.mem.Allocator;
const PageAlloc  = std.heap.page_allocator;
const ArenaAlloc = std.heap.ArenaAllocator;

pub fn calculateGflops(M: usize, N: usize, K: usize, iterations: usize) !f64 {
    var arena = ArenaAlloc.init(PageAlloc);
    defer arena.deinit();
    const allocator = arena.allocator();

    const A = try allocator.alloc(f32, M * K);
    defer allocator.free(A);
    const B = try allocator.alloc(f32, K * N);
    defer allocator.free(B);
    const C = try allocator.alloc(f32, M * N);
    defer allocator.free(C);

    // Initialize matrices
    for (A, 0..) |*val, i| {
        val.* = @floatFromInt(i % 10);
    }
    for (B, 0..) |*val, i| {
        val.* = @floatFromInt((i + 1) % 10);
    }

    // Warmup run
    try tiledMatMul(A, B, C, M, N, K);

    // Timed runs
    var timer = try std.time.Timer.start();
    for (0..iterations) |_| {
        try tiledMatMul(A, B, C, M, N, K);
    }
    const elapsed_ns = timer.read();

    // Calculate GFLOPS
    const ops = 2 * M * N * K * iterations; // multiply-add is 2 operations
    const seconds = @as(f64, @floatFromInt(elapsed_ns)) / 1e9;
    const gflops = @as(f64, @floatFromInt(ops)) / seconds / 1e9;

    return gflops;
}

main function

Defines a list of matrix sizes and performs the matrix multiplication for each size. It then prints the GFLOPS for each size.

Note: The !void return type of the main function indicates that it may return an error. The try keyword is used to propagate errors, and the defer keyword is used to ensure that resources are released when the enclosing scope exits.

Error Handling: any error from calculateGflops propagates up through try and terminates main with an error trace.

Resource Management: each call to calculateGflops now creates and destroys its own arena, so main no longer owns any allocator.

pub fn main() !void {
    const sizes = [_][3]usize{
        .{ 256, 256, 256 },
        .{ 512, 512, 512 },
        .{ 1024, 1024, 1024 },
        .{ 1024, 2048, 1024 },
        .{ 2048, 2048, 2048 },
    };

    print("Sizes length: {d}\n", .{sizes.len});
    const iterations = 10;

    // T and V are tiling/vector constants defined elsewhere in the full source (not shown here)
    print("T = {}\nV = {} \n", .{ T, V });

    for (sizes) |size| {
        const M = size[0];
        const N = size[1];
        const K = size[2];

        const gflops = try calculateGflops(M, N, K, iterations);

        print("Matrix size: {}x{}x{}, GFLOPS: {d:.2}\n", .{ M, N, K, gflops });
    }
}

Analyzing the Changes

Let's take a closer look at the changes I made to the code. One key difference is that I removed the allocator parameter from calculateGflops and instead created an arena allocator inside the function itself. How did this change impact performance? By scoping the arena to the function, its entire backing memory is created on entry and released in one shot by defer arena.deinit() on return, which simplifies memory management and reduces overhead.

This simple change alone resulted in a 10% performance gain. It works because the allocated memory is never used outside the calculateGflops function. The function allocates memory, performs its calculations, and returns a result that does not depend on the allocated memory. In other words, the memory is not needed once the function returns, so it is safe to deallocate it there.

A Cautionary Tale: When Not Passing Allocator Can Backfire

Now let's look at another case study, one where not passing the allocator around would have shot me in the foot.

Passing allocator via getString function parameter:

const std        = @import("std");
const print      = std.debug.print;
const stdin      = std.io.getStdIn().reader();
const ArenaAlloc = std.heap.ArenaAllocator;
const PageAlloc  = std.heap.page_allocator;

fn getString(allocator: std.mem.Allocator) ![]u8 {

    var buffer: [100]u8 = undefined;
    std.debug.print("\nEnter a location: ", .{});

    const input = try stdin.readUntilDelimiterOrEof(buffer[0..], '\n');
    if (input) |str| {
        return try allocator.dupe(u8, str);
    } else {
        return try allocator.dupe(u8, "");
    }
}

pub fn main() !void {

    var arena = ArenaAlloc.init(PageAlloc);
    defer arena.deinit();
    const allocator = arena.allocator();

    const text = try getString(allocator);
    print("Text: {s}\n", .{text});
}

Not passing allocator:

fn getString() ![]u8 {

    var arena = ArenaAlloc.init(PageAlloc);
    defer arena.deinit();
    const allocator = arena.allocator();

    var buffer: [100]u8 = undefined;
    std.debug.print("\nEnter a location: ", .{});

    const input = try stdin.readUntilDelimiterOrEof(buffer[0..], '\n');
    if (input) |str| {
        return try allocator.dupe(u8, str);
    } else {
        return try allocator.dupe(u8, "");
    }
}

pub fn main() !void {

    const text = try getString();
    print("Text: {s}\n", .{text});
}

Outcome:

The first program runs fine, while the second crashes with:

thread panic: reached unreachable code

Explanation: not passing the allocator as a getString parameter fails because the allocated memory is returned from the function and used outside of it. The memory is still needed after the function returns, but the arena that owns it is destroyed by defer arena.deinit() when the function exits, so the returned slice points at freed memory.

Conclusion:

In summary, the fast matrix multiplication program works because the allocated memory is not used outside of the calculateGflops function, so it is safe to deallocate it when the function returns. This is in contrast to the second getString example, where the allocated memory was returned from the function and used afterward, causing a crash once the allocator was destroyed.

By understanding the scope of the memory allocator and the lifetime of the allocated memory, we can write more efficient and effective code that avoids common pitfalls like this one.

#lifetimes #low level #memory #programming #scope #zig