Additionally, if I replace the memcpyAsync (Device to Host) operation with memcpy2DAsync (Device to Host), I can confirm that it does run in parallel. This confuses me even more.
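For concreteness, here is a minimal sketch of the substitution described above. It is not the original code. It assumes that memcpyAsync and memcpy2DAsync refer to the CUDA runtime calls `cudaMemcpyAsync` and `cudaMemcpy2DAsync`, that the copy is meant to overlap with a kernel running on another stream, and that the host buffer is pinned. The kernel `busyKernel`, the buffer names, and the sizes are all hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                 \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            std::fprintf(stderr, "%s failed: %s\n", #call,          \
                         cudaGetErrorString(err_));                 \
            std::exit(EXIT_FAILURE);                                \
        }                                                           \
    } while (0)

// Hypothetical kernel that only exists to keep the GPU busy for a while.
__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

int main() {
    const int n = 1 << 20;                 // hypothetical element count
    const size_t bytes = n * sizeof(float);

    float *dWork = nullptr, *dSrc = nullptr, *hDst = nullptr;
    CHECK(cudaMalloc(&dWork, bytes));
    CHECK(cudaMalloc(&dSrc, bytes));
    CHECK(cudaMallocHost(&hDst, bytes));   // pinned host memory
    CHECK(cudaMemset(dWork, 0, bytes));
    CHECK(cudaMemset(dSrc, 0, bytes));

    cudaStream_t computeStream, copyStream;
    CHECK(cudaStreamCreate(&computeStream));
    CHECK(cudaStreamCreate(&copyStream));

    // Kernel on one stream.
    busyKernel<<<(n + 255) / 256, 256, 0, computeStream>>>(dWork, n);
    CHECK(cudaGetLastError());

    // Variant A (the original form): 1D device-to-host copy on another stream.
    // CHECK(cudaMemcpyAsync(hDst, dSrc, bytes,
    //                       cudaMemcpyDeviceToHost, copyStream));

    // Variant B (the replacement): the same contiguous buffer described as
    // a single row, so both pitches and the width equal the total byte count.
    CHECK(cudaMemcpy2DAsync(hDst, bytes, dSrc, bytes,
                            bytes, 1,
                            cudaMemcpyDeviceToHost, copyStream));

    CHECK(cudaDeviceSynchronize());
    std::printf("copy and kernel finished\n");

    CHECK(cudaStreamDestroy(computeStream));
    CHECK(cudaStreamDestroy(copyStream));
    CHECK(cudaFreeHost(hDst));
    CHECK(cudaFree(dSrc));
    CHECK(cudaFree(dWork));
    return 0;
}
```

Switching between Variant A and Variant B and comparing the two timelines in a profiler such as Nsight Systems should show whether the copy overlaps with the kernel in each case. The sketch itself does not measure or assert any overlap.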