After further investigation, I've found why the kernels are running serially. Since CUDA 12.2, the CUDA_MODULE_LOADING
setting defaults to LAZY, so a kernel is only loaded the first time it is launched, and that load may require a context synchronization.
The CUDA C++ Programming Guide describes how lazy loading interacts with concurrent kernel execution:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#concurrent-execution
The guide describes relying on concurrent execution of kernels as an anti-pattern, but a workaround is to set the environment variable CUDA_MODULE_LOADING=EAGER so that all kernels are loaded up front.
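
For anyone who launches the CUDA program from a script, here is a minimal sketch of applying the workaround from a Python wrapper. The helper names `eager_env` and `run_eager` and the `./my_cuda_app` path are made up for illustration; the only thing taken from the guide is the variable CUDA_MODULE_LOADING and its value EAGER. The sketch sets the variable in the child process's environment, so it is in place before the CUDA program starts.

```python
import os
import subprocess


def eager_env(base=None):
    """Return a copy of an environment with eager CUDA module loading enabled.

    `base` defaults to the current process environment; the original mapping
    is never modified.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_MODULE_LOADING"] = "EAGER"
    return env


def run_eager(cmd):
    """Run `cmd` (a list of arguments) with CUDA_MODULE_LOADING=EAGER set.

    Returns the exit code of the child process.
    """
    return subprocess.run(cmd, env=eager_env()).returncode


# Hypothetical usage; replace the path with your own binary:
# exit_code = run_eager(["./my_cuda_app"])
```

From a shell the equivalent is to prefix the command with the variable assignment. Either way the effect is the same: the setting is visible to the program from its first CUDA call, at the cost of a longer startup since every kernel is loaded up front.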