
After further investigation, I've found why the kernels are being run serially. Since CUDA 12.2, the CUDA_MODULE_LOADING setting defaults to LAZY. The CUDA C++ Programming Guide outlines issues with lazy module loading with respect to concurrent execution of kernels:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#concurrent-execution

The guide describes relying on concurrent execution of kernels as an anti-pattern, but a workaround is to set the environment variable CUDA_MODULE_LOADING=EAGER.
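If exporting the variable in the shell is not convenient, one option is to set it from inside the program. The following is only a minimal sketch: `force_eager_module_loading` is a hypothetical helper name, not part of the CUDA API, and it assumes the call happens before the first CUDA runtime or driver call in the process, since a value set after CUDA has initialized would likely be ignored.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper (not part of the CUDA API): requests eager module
// loading for this process by setting CUDA_MODULE_LOADING=EAGER.
// Assumption: it is called before the first CUDA runtime or driver call.
bool force_eager_module_loading() {
#ifdef _WIN32
    if (_putenv_s("CUDA_MODULE_LOADING", "EAGER") != 0) return false;
#else
    // Third argument 1 means: overwrite any value already in the environment.
    if (setenv("CUDA_MODULE_LOADING", "EAGER", 1) != 0) return false;
#endif
    // Read the variable back to confirm the process environment was updated.
    const char* value = std::getenv("CUDA_MODULE_LOADING");
    return value != nullptr && std::strcmp(value, "EAGER") == 0;
}
```

Calling this as the first statement of `main`, before any stream creation or kernel launch, should have the same effect as launching the program with CUDA_MODULE_LOADING=EAGER already set in the environment.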
