
Figured it out: it was related to the work sizes.

With local_work_size set to NULL, I think it was iterating through the seed_ranges in a single process. If you set global_work_size to 28 (the number of cores) and local_work_size to 1, it fully utilises the CPU.

I didn't change the work_dim though.

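// clEnqueueNDRangeKernel takes const size_t * for the work sizes, so use size_t rather than uint64_t
size_t global = num_seed_ranges; // 28 in my case
size_t local = 1;
error = clEnqueueNDRangeKernel(
    commands,      // command queue
    ko_part_b,     // kernel
    1, NULL,       // work_dim = 1, no global work offset
    &global,       // global work size (28, matching the number of cores)
    &local,        // local work size (1 work-item per work-group)
    0, NULL, NULL  // no event wait list, no completion event
);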
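
For anyone picking their own sizes, here is a rough sketch of how the two values relate in the 1-D case. The helper names num_work_groups and round_up_global are made up for illustration; they are not part of the OpenCL API. The sketch assumes the OpenCL 1.x rule that the global size must be an exact multiple of the local size (OpenCL 2.0 and later can relax this with non-uniform work-groups).

```c
#include <stddef.h>

/* Hypothetical helper (not an OpenCL call): number of work-groups a 1-D
 * NDRange is split into. Returns 0 when the sizes are invalid under the
 * OpenCL 1.x rule, i.e. local is 0 or does not divide global evenly. */
size_t num_work_groups(size_t global, size_t local)
{
    if (local == 0 || global % local != 0) {
        return 0;
    }
    return global / local;
}

/* Hypothetical helper (not an OpenCL call): round n up to the next
 * multiple of local, so the padded value is a valid global size.
 * The kernel then has to ignore work-items with an id >= n. */
size_t round_up_global(size_t n, size_t local)
{
    if (local == 0) {
        return 0;
    }
    return ((n + local - 1) / local) * local;
}
```

With global = 28 and local = 1 as above, num_work_groups gives 28 groups of one work-item each. A local size of 8 would not divide 28, so under that rule you would have to pad the global size to 32.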

Final Results:

C single-threaded - 4 mins
C OpenMP - 23 seconds
C OpenCL - 9 seconds
Rust single-threaded - 1.5 mins
Rust rayon (multithreaded) - 7 seconds
CUDA, 3072 cores (2000 series) - 9 seconds

79807088