*ฅ^•ﻌ•^ฅ* ✨✨  HWisnu's blog  ✨✨ ฅ^•ﻌ•^ฅ

Building cgrep using safe_c.h custom header

This article contains a lot of images; if your browser fails to load them, try a different browser. Firefox-based browsers seem to have issues loading the images.

Introduction

This is a continuation of the first article in this series on using a custom C header file (safe_c.h) to build a small but sizeable program (2-3k LoC), as an experiment in both DX and the quality of the final program.

In the first article I focused on the safe_c.h header file itself; in this second part I want to talk in depth about the final program: cgrep. The thing is, there are plenty of angles I want to cover, especially because this was a long chain of investigations, and to get the complete picture I will need to refer you to my past blog posts. Those investigations ultimately inspired me to build a (rip)grep clone.

Fasta benchmarks game

I love doing benchmarks of all kinds; I have a knack for performance and would fight for an extra millisecond. Now imagine my shock when I found out some of the most popular benchmark websites (un)knowingly mislead people with their summary tables. Read the article for the whole story, but the essence can be summarized:

The summary tables on both websites show Rust at #1 in the rankings ~ as the saying goes: "Blazingly fast!". The punchline: the Fasta Rust programs are both multi-threaded, while the C and Zig programs are single-threaded.

"What a blazingly ass comparison!" was my first thought and from there I went on a "witch hunt" for these kinds of misleading practices.

Levelized cost of resources in benchmark

In this article I pointed out the need to levelize, or normalize, resource cost and usage. It's great to see that program XY is almost 3x faster than program AB, but how do you feel if that speed comes at the cost of 8x more resource usage? Is it worth it?

I also highlighted ripgrep's GitHub benchmarks, which unfortunately compare a multi-threaded program against single-threaded ones. ripgrep has the -j1 option (to make it single-threaded), but as you can check yourself, the benchmarks do not use the -j1 flag. Do you think that's a fair comparison?

Another important point I made at the end of that article: benchmarking on a quiet vs. a noisy system. That point of view inspired me, later in this article, to split the cgrep vs. ripgrep benchmark across both setups (quiet & noisy), as I think a noisy system better simulates a real usage environment.

The point I want to make:

When using a program (such as [rip]grep), do you close all other running processes first? Or do you just run it ALONGSIDE everything else? Exactly my point, and hence the noisy-system test matters as well, because some programs show a deep performance regression when run on a noisy system.

Compiler optimization options

This article serves as a complement, since it discusses a lot of the context around ripgrep.

cgrep : high performance grep

Written in C23, the program is my experimental project for using my custom safe_c.h header file, with the goal: "write C, get C++ and Rust features". Someone posted the first article about safe_c.h on Hacker News and it ranked #3 for a short while; what was interesting was the discussion, and I loved reading those many different PoVs.

For my future C projects I will keep using safe_c.h, as it has become an invaluable tool for me and has elevated my DX, even though the C I'm writing might look foreign to some.

Check out one of cgrep's functions below: safe_c23 Does it look unfamiliar? i32 instead of int32_t, u8* instead of char*. Yeah, it might look strange, but this is actually how we do it in Zig and Rust (two of my fave langs).

Main Workflow: a team of supercharged workers

cgrep uses a Producer-Consumer multi-threading model: one Producer scans directories while applying "smart filters" (ignoring dirs such as node_modules and .git) and puts valid file paths on a queue for the Consumers (worker threads) to process. It also applies back-pressure for the situation where the Producer finds files faster than the Consumers can read them and the queue fills up.

In the benchmark section you'll see I always pass the -j4 flag to ripgrep; this is because I hard-coded the number of threads to 4 in cgrep. When writing a program I prefer to find solutions other than the dreaded "just add more threads!". I also prefer cohesion: the balance of the overall system is more important than having all cores (mine has 16) dedicated to one program.

This is because my normal daily workflow is a fairly noisy system (Bloomberg non-stop, a trading workstation running my algo, 20+ tabs for market research, and sometimes 4-5 Python instances doing data scraping). This is why I built cgrep to be more durable in these kinds of environments (noisy systems), and you'll see from the benchmarks that its performance regression is not as magnified as ripgrep's.

Producer thread:

static void thread_pool_add_task(ThreadPool* pool, const char* path)
{
    pthread_mutex_lock(&pool->queue.mutex);
    
    // back-pressure: wait if the queue is full.
    // prevents the main thread from eating all RAM if the disk scan 
    // is faster than the worker threads.
    while (pool->queue.pending_count == pool->queue.capacity && !pool->queue.shutdown) {
        pthread_cond_wait(&pool->queue.not_full, &pool->queue.mutex);
    }
    
    // ..add path to queue..
    
    pthread_mutex_unlock(&pool->queue.mutex);
}

Consumer thread:

static void* worker_thread_func(void* arg) {
    ThreadPool* pool = (ThreadPool*)arg;
    
    // each worker gets its own huge-page buffer to reuse
    u8* read_buffer = allocate_huge_buffer(CHUNK_READ_SIZE + PADDING_SIZE);

    while (true) {
        pthread_mutex_lock(&pool->queue.mutex);
        
        // wait until the queue has files
        while (pool->queue.pending_count == 0 && !pool->queue.shutdown) {
            pthread_cond_wait(&pool->queue.not_empty, &pool->queue.mutex);
        }
        
        // exit condition
        if (pool->queue.shutdown && pool->queue.pending_count == 0) {
            pthread_mutex_unlock(&pool->queue.mutex);
            break;
        }
        
        // pop a file path from the ring buffer
        FileTask task = pool->queue.buffer[pool->queue.tail];
        pool->queue.tail = (pool->queue.tail + 1) % pool->queue.capacity;
        pool->queue.pending_count--;
        
        // let the Producer know that a slot opened up (reducing back-pressure)
        pthread_cond_signal(&pool->queue.not_full);
        pthread_mutex_unlock(&pool->queue.mutex);
        
        // the heavy work: read the file & run the SIMD search
        // (tid ~ this worker's index ~ comes from setup code elided in this excerpt)
        process_file_wrapper(pool, task.path, tid, read_buffer);
        free(task.path);
    }
    
    free(read_buffer);
    return NULL;
}

The hard-coded 4 threads:

// configuration: deliberately limits to 4 workers to play nice 
// with a noisy system environment.
#define DEFAULT_NUM_WORKER_THREADS 4

// inside run_recursive_mode() function:
for (size_t i = 0; i < DEFAULT_NUM_WORKER_THREADS; i++) {
    pthread_create(&pool.threads[i], NULL, worker_thread_func, &pool);
}

Engine Room: processing bulk data with SIMD

At startup cgrep checks your system's SIMD capabilities and automatically selects the best implementation (AVX-512, AVX2, SSE). Using SIMD is like going from trying to extinguish a fire with a straw to using a proper fire hose ~ instead of loading one character at a time, SIMD loads 64 at once.

I added an extra 64 bytes of padding for SIMD safety; this also removes the need for constant bounds checking, which is one of the causes of slowdowns.

// core of the AVX-512 search engine
// instead of checking one char at a time, we load 64 bytes into a register.

__attribute__((target("avx512f,avx512bw,avx512vl")))
static const u8* simd_search_avx512(const u8* haystack, size_t haystack_len, ...)
{
    // ..setup..
    const __m512i fc = _mm512_set1_epi8((char)needle[0]); 

    for (size_t i = 0; i < max; i += 64) {
        __mmask64 m = _mm512_cmpeq_epi8_mask(_mm512_loadu_si512(haystack+i), fc);
        
        while (m) {
            // find the index of first match using hardware instruction (count trailing zeros)
            const size_t pos = (size_t)__builtin_ctzll(m);
            
            // verify full string match..
            // clear the bit and keep searching this chunk
            m &= m - 1;
        }
    }
    return NULL;
}

Memory Strategy: think big, think in arenas and pages.

cgrep uses Arenas. For temporary data, a thread asks for one giant block of memory (an "arena") and then carves out the small pieces it needs from within that block. When it's done, it can discard the entire arena at once. This is vastly faster and keeps memory tidy.

typedef struct {
    char* buffer;
    size_t size;
    size_t offset;
    size_t peak_usage;
    pthread_mutex_t mutex;
} Arena;

// allocation is just pointer arithmetic: O(1) speed
static inline void* arena_alloc(Arena* a, size_t s)
{
    // align size to 8 bytes
    s = (s + 7) & ~(size_t)7;

    // out of space? (error handling elided in this excerpt)
    if (a->offset + s > a->size) return NULL;

    // take the next chunk of memory
    void* ptr = a->buffer + a->offset;
    a->offset += s;

    return ptr;
}

This will be explained more in the benchmark section, where cgrep shows far more cache references but a much lower cache-miss ratio.

// allocation: create a buffer backed by huge pages
// minimizes TLB misses during scanning.
static void* allocate_huge_buffer(size_t size) {
    void* ptr = NULL;
    // align to 2MB boundary
    if (posix_memalign(&ptr, 2 * 1024 * 1024, size) != 0) return NULL;
    // tell the OS to back this with huge pages (advisory: failure is harmless)
    madvise(ptr, size, MADV_HUGEPAGE);
    return ptr;
}

I/O strategy ~ chunked streaming instead of mmap!

In one of cgrep's earlier iterations I actually wrote the mmap functionality, but then I ripped it out because I didn't like the massive memory usage on large files. On paper mmap sounds great: map the file into memory to reduce copying overhead. In practice, however, when I benchmarked mmap against chunked streaming, mmap provided no lift in performance.

// streaming loop
while (true) {
    // read a massive 2MB chunk
    const ssize_t bytes = read(ss->fd, buf, CHUNK_READ_SIZE);
    
    if (bytes <= 0) break;

    // safety padding
    // manually zero-out the memory immediately after the file data.
    // allows AVX-512 to safely load vectors without segfaulting.
    memset((u8*)buf + bytes, 0, PADDING_SIZE); 
    
    // pass the safe buffer to the engine
    process_chunk(buf, bytes, ...);
}

Memory Safety Net : safe_c.h comes into play!

With my custom header file in play ~ memory leaks, buffer over/underflows, and dangling pointers become non-issues: smart pointers via AUTO_UNIQUE_PTR, bounds checking via StringView and Vector, RAII for memory, files, and mutexes, and many other features discussed in the first article of this series.

Write high performance low level code without sacrificing safety.

// inside the worker thread that reads files:
static void* worker_literal_single(void* arg)
{
    // ..context setup..

    // AUTO_MEMORY == a smart pointer for raw memory.
    // allocates the buffer and registers a cleanup function.
    AUTO_MEMORY(buf, CHUNK_READ_SIZE + PADDING_SIZE);
    
    if (!buf) return NULL;

    while (true) {
        // ..file reading logic..
        const ssize_t bytes = read(ss->fd, buf, CHUNK_READ_SIZE);
        if (bytes <= 0) { 
            // case study: early exit
            // break the loop here because the file is done or an error occurred.
            // in legacy C, you would likely forget to free(buf) here, causing a leak.
            // with safe_c.h, we just break.
            break; 
        }

        // ..processing logic..
    }
    
    return NULL; 
    // 'buf' is automatically freed here when the function scope ends.
}

Regex Engine: PCRE2

cgrep uses the PCRE2 JIT compiler with a specific design strategy to make it extra speedy: avoidance. The thing is, the fastest way to run a regex is to not run it at all (~ avoiding work for the regex engine). Let's say I want to search for User ID: \d+ :

  1. cgrep analyzes the pattern and extracts a fast needle: the string "User ID: ". It knows that no match can possibly exist unless that specific literal string is present.
  2. Handshake: cgrep unleashes the SIMD engine to scan for "User ID: " literal string while the regex engine sits idle, doing nothing (this is by design ~ avoiding unnecessary work).
  3. Only when the SIMD engine finds the needle does it wake up the PCRE2 engine to verify that the digits \d+ actually follow.

This makes the expensive part effectively sub-linear: if I'm scanning a 200MB log file and "User ID: " appears only twice, the regex engine runs only twice ~ its cost is proportional to the number of candidates, not the file size.

Another optimization: thread-local scratchpads. PCRE2 requires "match data" blocks to store information about where a match starts and ends. Allocating and freeing these blocks inside a loop would be catastrophic for performance!

Instead, cgrep allocates these scratchpads once per thread at startup. Each worker thread carries its own pcre2_match_data context. This allows the threads to run fully in parallel without fighting over memory locks or waiting for the operating system to allocate space for the results.

By combining the raw speed of native assembly (JIT) with the "Lazy" evaluation of fast needle skipping, cgrep ensures that the heavy cost of regex is paid only when necessary.

Benchmark - recursive directories

Okay, let's get to my favorite section: benchmarking! Let's start with the quiet test system: quiet In the quiet system there are minimal running processes; consider it as if the machine has just rebooted and is idle. Result: cgrep is a bit faster than ripgrep (around 25% faster). Note the test was recursively grepping my entire Documents directory (and its subdirectories), which is quite massive, with hundreds of subdirectories and thousands to tens of thousands of files. I didn't include the original GNU grep because it took > 70 seconds to finish, and since each of these tests was repeated 20-30x, including GNU grep would have taken too much time.

Note: for the next investigation you might need to zoom in on the image above to check the cache-miss ratios of cgrep and ripgrep:

Below is the result for the noisy test system.

Here's how I did the test on the noisy system:

  1. Running 2 instances each of the zoop, hyperfine, and mitata benchmarks in parallel at the same time.
  2. Background processes: Trader Workstation app and a browser with 5 tabs open.

Result: the gap widens; cgrep is now 40-60% faster than ripgrep.

Now, why bother differentiating between quiet and noisy test environments? This goes back to the Levelized Cost of Resources article mentioned in the opening. The argument:

I also want to encourage benchmarkers to start including a noisy-system test in their workflow, because this exercise captures a very important quality that is often forgotten: the durability of a program facing non-ideal conditions. And the bench shows cgrep has better durability ~ it regresses too, but not as badly as ripgrep does.

Do you know the VW scandal from 2008-2015? Volkswagen was caught manipulating its emissions tests. Open-air tests showed up to 35x more emissions than the lab tests (the lab environment being the ideal condition). It turned out VW had programmed the cars' engine software to recognize when it was in emissions-test mode, and then produce great bench output.

This is a totally wrong and misleading practice! You can read about it here, here and here.

Note: each of these grep tests was repeated 20-30x, and while there were slight variations in the results, the aggregated outcome paints the same picture.

Additional info from the utime stats: the number of found matches is consistent for the test command. utime_stats

Benchmark - single large file

Ran on a 200MB config file with 10 million lines of data. singleLF Note: by default ripgrep's memory usage can be quite high on large files, but you can use the --no-mmap flag and it will keep a small memory footprint. Additional note: I had set ripgrep to use 4 threads via the -j4 flag, but it only used 99% CPU ~ not my fault!

Another specialty of cgrep is its burst-write capability -- I'm gonna let some screenshots do the talking: grep_write Comparison: cgrep finished writing the 832 thousand lines of output in 2.7s, with zero inputs and very few minor page faults compared to ripgrep.

Conclusion

The biggest takeaway from this experimental project is that I'm going to keep using safe_c.h in all my C projects going forward, because it's so damn good and convenient! It's so good I think I'm going to cut back on writing Zig programs (at least until Zig's async I/O is released).

As for cgrep, I'm actually not surprised that it surpasses ripgrep in performance. If you've been following along and carefully read the articles mentioned at the start of this post, you'd see why I'm not at all surprised ~ the breadcrumbs were clear; the writing was on the wall.

From my observations running the benchmarks, cgrep is faster in both quiet and noisy environments and uses much less memory.

Yeah, for sure, feature-wise cgrep currently has less than 30% of what's available in the original grep and ripgrep, but those features are the ones I use often, and as mentioned before: I built cgrep for my personal use.

I've added a "Summary Table", a feature I felt was missing from grep and ripgrep: table_summary

What's next?

So far I'm satisfied with cgrep in terms of features and performance, and I've already got several other projects in the pipeline:

Comments section here

If you enjoyed this post, click the little up-arrow chevron on the bottom left of the page to help it rank in Bear's Discovery feed, and if you have any questions or anything else, please use the comments section.