
As @j-kadditz mentioned, I should probably parallelize the process using SIMD or other parallelization utilities, which is what I will eventually end up doing.

But in the meantime I implemented a very fast and efficient algorithm, all thanks to @weather-vane, who suggested the idea (back when this post was on the Staging Ground).

You write the first pixel (4 bytes), then double the filled region on every iteration: memcpy the 4 filled bytes right after themselves to get 8 bytes, memcpy those 8 bytes to get 16, and so on. Before each copy, check whether doubling the block would exceed the total image size; if it would, copy only the remaining bytes. This fills the buffer in roughly log2(pixel count) memcpy calls instead of one write per pixel.

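void pxlImageClearColor(PXLimage* image, PXLcolor color)
{
    /* Total buffer size in bytes (4 bytes per RGBA pixel).
       size_t avoids 32-bit overflow on very large images. */
    size_t nbytes = (size_t)image->width * image->height * 4;

    if (nbytes == 0)
    {
        return; /* empty image: writing the seed pixel would overflow the buffer */
    }

    /* Seed the buffer with a single pixel. */
    memcpy(image->data, color.rgba, 4);

    size_t bytes_filled = 4;

    /* Each pass copies the already-filled prefix right after itself,
       doubling the filled region. Source and destination never overlap. */
    while (bytes_filled < nbytes)
    {
        size_t next_fill = bytes_filled << 1;

        if (next_fill > nbytes)
        {
            next_fill = nbytes; /* final pass: copy only the remaining bytes */
        }

        memcpy(image->data + bytes_filled, image->data, next_fill - bytes_filled);
        bytes_filled = next_fill;
    }
}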
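
For anyone who wants to try the doubling fill outside the library, here is a minimal standalone sketch. It assumes a plain byte buffer rather than a PXLimage, and the names fill_pattern and all_pixels_equal are hypothetical helpers I made up for illustration, not part of the code above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fills `size` bytes at `dst` with repeated copies of
 * the `pattern_size`-byte `pattern`. `dst` and `pattern` must not overlap. */
static void fill_pattern(uint8_t *dst, size_t size,
                         const uint8_t *pattern, size_t pattern_size)
{
    assert(pattern_size > 0);
    if (size == 0) {
        return;
    }

    /* Seed: one copy of the pattern (truncated if the buffer is smaller). */
    size_t filled = pattern_size < size ? pattern_size : size;
    memcpy(dst, pattern, filled);

    /* Double the filled prefix until the buffer is full. */
    while (filled < size) {
        size_t chunk = filled;
        if (chunk > size - filled) {
            chunk = size - filled; /* last pass: only the remainder */
        }
        memcpy(dst + filled, dst, chunk);
        filled += chunk;
    }
}

/* Hypothetical check: returns 1 if every 4-byte pixel equals `rgba`. */
static int all_pixels_equal(const uint8_t *buf, size_t npixels,
                            const uint8_t rgba[4])
{
    for (size_t i = 0; i < npixels; i++) {
        if (memcmp(buf + i * 4, rgba, 4) != 0) {
            return 0;
        }
    }
    return 1;
}
```

Calling fill_pattern(buf, 28, red, 4) on a 7-pixel buffer goes 4 -> 8 -> 16 -> 28 bytes, so the last memcpy copies only 12 bytes. Because the filled size is always a multiple of the pattern size, the copied prefix always starts on a pixel boundary.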

I profiled with gprof, and check this out:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
 47.67     19.58    19.58                             _mcount_private
 22.25     28.72     9.14 1137265408     0.01     0.01  pxlImageSetPixelColor
 19.04     36.54     7.82                             __fentry__
  9.30     40.36     3.82   200000    19.10   179.64  pxlRendererDrawTriangle
  1.44     40.95     0.59   200000     2.95    57.60  pxlRendererDrawRect
  0.17     41.02     0.07   200000     0.35     3.45  pxlRendererDrawLine
  0.05     41.04     0.02   200000     0.10     0.10  pxlImageClearColor
  0.05     41.06     0.02   200000     0.10     0.10  pxlWindowPresent
  0.02     41.07     0.01                             main
  0.00     41.07     0.00  2000000     0.00     0.00  pxlRendererSetDrawColor
  0.00     41.07     0.00   200000     0.00     0.00  pxlGetKey
  0.00     41.07     0.00   200000     0.00     0.10  pxlRendererClearColor
  0.00     41.07     0.00   100000     0.00     0.00  pxlWindowPollEvents
  0.00     41.07     0.00       10     0.00     0.00  _pxlFree
  0.00     41.07     0.00       10     0.00     0.00  _pxlMalloc

pxlImageClearColor went from 200 us/call to 0.10 us/call.

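The top entries are now _mcount_private and __fentry__, the hooks that the profiling instrumentation calls on every function entry so gprof can record calls. In other words, most of the measured time is now profiling overhead rather than the renderer itself.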
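
Since the instrumentation dominates the profile, one way to cross-check the per-call cost is to time the fill directly in a build without profiling enabled. This is only a rough sketch under my own assumptions: fill_rgba is a hypothetical stand-in for pxlImageClearColor working on a raw buffer, time_fill_us is a made-up helper, and clock() measures CPU time with a resolution that varies by platform, so small results should be read with caution.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for pxlImageClearColor on a raw RGBA buffer.
 * Assumes nbytes is a multiple of 4. */
static void fill_rgba(uint8_t *data, size_t nbytes, const uint8_t rgba[4])
{
    if (nbytes < 4) {
        return;
    }
    memcpy(data, rgba, 4);

    size_t filled = 4;
    while (filled < nbytes) {
        size_t chunk = filled;
        if (chunk > nbytes - filled) {
            chunk = nbytes - filled;
        }
        memcpy(data + filled, data, chunk);
        filled += chunk;
    }
}

/* Hypothetical helper: average CPU microseconds per fill_rgba call. */
static double time_fill_us(uint8_t *data, size_t nbytes,
                           const uint8_t rgba[4], int iterations)
{
    assert(iterations > 0);

    clock_t start = clock();
    for (int i = 0; i < iterations; i++) {
        fill_rgba(data, nbytes, rgba);
    }
    clock_t end = clock();

    return (double)(end - start) * 1e6 / CLOCKS_PER_SEC / iterations;
}
```

Averaging over many iterations matters here, because a single call can easily be shorter than one clock() tick.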
