As @j-kadditz mentioned, I should probably parallelize the process using SIMD or other parallelization techniques, which is what I will eventually end up doing.
For now, though, I implemented a very fast and efficient algorithm, all thanks to @weather-vane, who suggested the idea (back when this post was in the Staging Ground).
You basically write the first pixel (4 bytes), then on each iteration you double the amount of memory that has been written. So you write 4 bytes; the next iteration uses memcpy to copy that data, and now 8 bytes are filled; the next iteration uses memcpy to copy everything again, and now 16 bytes are filled. This repeats while checking that doubling the block doesn't exceed the total image size. If it would, just write the remaining pixels/bytes.
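    #include <stdint.h>
    #include <string.h>

    void pxlImageClearColor(PXLimage* image, PXLcolor color)
    {
        uint32_t nbytes = image->width * image->height * 4;
        if (nbytes == 0)
        {
            return; // empty image: nothing to fill
        }

        // Seed the buffer with the first pixel (4 bytes).
        memcpy(image->data, color.rgba, 4);
        uint32_t bytes_filled = 4;
        uint32_t next_fill;

        while (bytes_filled < nbytes)
        {
            // Double the filled region, clamped to the total image size.
            next_fill = bytes_filled << 1;
            if (next_fill > nbytes)
            {
                next_fill = nbytes;
            }
            // We copy at most bytes_filled bytes, so source and destination never overlap.
            memcpy(image->data + bytes_filled, image->data, next_fill - bytes_filled);
            bytes_filled = next_fill;
        }
    }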
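To check the idea in isolation, here is a minimal standalone sketch of the same doubling fill on a plain byte buffer. The names `fill_pattern_doubling` and `all_pixels_equal` are hypothetical helpers for illustration only and are not part of my library; the sketch assumes a tightly packed buffer with 4 bytes per pixel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill `total` bytes at `dst` with repeated copies of a
 * `pattern_size`-byte pattern, doubling the filled region on each pass. */
static void fill_pattern_doubling(uint8_t *dst, size_t total,
                                  const uint8_t *pattern, size_t pattern_size)
{
    assert(pattern_size > 0);
    if (total == 0)
        return;

    /* Seed with one copy of the pattern (truncated if the buffer is tiny). */
    size_t filled = pattern_size < total ? pattern_size : total;
    memcpy(dst, pattern, filled);

    while (filled < total) {
        /* Copy everything written so far, clamped to the space remaining. */
        size_t chunk = filled;
        if (chunk > total - filled)
            chunk = total - filled;
        /* Source is [0, chunk) and destination starts at `filled` >= chunk,
         * so the two regions never overlap and memcpy is safe. */
        memcpy(dst + filled, dst, chunk);
        filled += chunk;
    }
}

/* Hypothetical helper: returns 1 if every 4-byte pixel in `buf` equals `rgba`. */
static int all_pixels_equal(const uint8_t *buf, size_t npixels,
                            const uint8_t rgba[4])
{
    for (size_t i = 0; i < npixels; i++) {
        if (memcmp(buf + 4 * i, rgba, 4) != 0)
            return 0;
    }
    return 1;
}
```

For a buffer of 37 pixels (148 bytes), the filled size goes 4, 8, 16, 32, 64, 128, 148, so the number of memcpy calls grows roughly with the logarithm of the image size rather than with the pixel count.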
I profiled using gprof, and check this out:
    Each sample counts as 0.01 seconds.
         % cumulative     self                 self    total
      time    seconds  seconds       calls  us/call  us/call  name
     47.67      19.58    19.58                                _mcount_private
     22.25      28.72     9.14  1137265408     0.01     0.01  pxlImageSetPixelColor
     19.04      36.54     7.82                                __fentry__
      9.30      40.36     3.82      200000    19.10   179.64  pxlRendererDrawTriangle
      1.44      40.95     0.59      200000     2.95    57.60  pxlRendererDrawRect
      0.17      41.02     0.07      200000     0.35     3.45  pxlRendererDrawLine
      0.05      41.04     0.02      200000     0.10     0.10  pxlImageClearColor
      0.05      41.06     0.02      200000     0.10     0.10  pxlWindowPresent
      0.02      41.07     0.01                                main
      0.00      41.07     0.00     2000000     0.00     0.00  pxlRendererSetDrawColor
      0.00      41.07     0.00      200000     0.00     0.00  pxlGetKey
      0.00      41.07     0.00      200000     0.00     0.10  pxlRendererClearColor
      0.00      41.07     0.00      100000     0.00     0.00  pxlWindowPollEvents
      0.00      41.07     0.00          10     0.00     0.00  _pxlFree
      0.00      41.07     0.00          10     0.00     0.00  _pxlMalloc
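pxlImageClearColor went from 200 us/call to 0.1 us/call.
The bottleneck now is _mcount_private, which is the instrumentation hook gprof inserts into every function to record calls, so that time is profiling overhead rather than my own code.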
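Since the profiler's own hooks now dominate the report, one way to cross-check a single function is to time it directly in a build without profiling instrumentation. Below is a rough sketch using the standard `clock()` function; `average_call_time_us` and `example_workload` are hypothetical names made up for this illustration, and the workload is just a stand-in buffer clear, not my actual renderer code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in workload: fills a 64x64 RGBA buffer with one byte value. */
static unsigned char example_buffer[64 * 64 * 4];

static void example_workload(void *arg)
{
    (void)arg; /* unused in this stand-in */
    memset(example_buffer, 0x7f, sizeof example_buffer);
}

/* Hypothetical helper: average CPU time per call, in microseconds,
 * measured with clock() over `iterations` calls to `fn(arg)`. */
static double average_call_time_us(void (*fn)(void *), void *arg, long iterations)
{
    assert(fn != NULL && iterations > 0);

    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        fn(arg);
    clock_t end = clock();

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    return seconds * 1e6 / (double)iterations;
}
```

The resolution of `clock()` is coarse, so this only gives a meaningful average over many iterations, and an optimizing compiler may shrink or remove a trivial workload. Treat the number as a sanity check, not a precise measurement.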