Figured it out: it was related to the work sizes.
With local_work_size set to NULL, I think it was working through the seed_ranges one at a time instead of in parallel. If you set global_work_size to 28 (the number of cores) and local_work_size to 1, it fully utilises the CPU.
I didn't change work_dim, though.
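// clEnqueueNDRangeKernel takes const size_t * for the work sizes, so use size_t rather than uint64_t
size_t global = num_seed_ranges; // 28 in my case
size_t local = 1;
error = clEnqueueNDRangeKernel(
    commands,      // command queue
    ko_part_b,     // kernel
    1,             // work_dim (unchanged)
    NULL,          // global work offset
    &global,       // global work size (one work-item per seed range)
    &local,        // local work size (1)
    0, NULL, NULL  // event wait list and output event (unused)
);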
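If you try a local size bigger than 1, keep in mind that OpenCL 1.x requires the global work size to be an exact multiple of the local work size, otherwise clEnqueueNDRangeKernel fails with CL_INVALID_WORK_GROUP_SIZE (OpenCL 2.0 and later can relax this). A minimal sketch of the usual rounding follows. The helper name round_up_global is hypothetical and not part of the code above, and the kernel would then need its own bounds check so the padding work-items do nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original code): round the number of
 * work-items up to the next multiple of the chosen local work size. */
static size_t round_up_global(size_t items, size_t local)
{
    assert(local > 0);               /* a local size of 0 is never valid */
    size_t rem = items % local;      /* items left over after full groups */
    return rem == 0 ? items : items + (local - rem);
}
```

With 28 seed ranges and a local size of 1 the result stays 28, which is why the call above works unchanged. A local size of 8 would give a global size of 32, with the last 4 work-items being padding.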
Final Results:
C Single thread - 4 mins
C OpenMP - 23 seconds
C OpenCL - 9 seconds
Rust single threaded - 1.5 mins
Rust rayon (multithreaded) - 7 seconds
CUDA, 3072 cores (2000 series) - 9 seconds