79155404

Date: 2024-11-04 12:15:13
Score: 1.5
Natty:

As the NVIDIA documentation and a 2018 forum discussion explain, the context memory overhead depends on the number of streaming multiprocessors (SMs) on your CUDA device, and there is unfortunately no documented way to query it in advance. But that is only part of the answer. The actual overhead can also depend heavily on the host OS, as was reported for Windows in this answer. That answer is fairly recent (2021), so the issue may still be present in your setup. From what I can see here, you are also likely hitting the odd threading-model issue described (but unfortunately not solved!) here.
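
You can at least measure the overhead empirically on your own device. Below is a minimal sketch (my own, not from the linked posts) that forces context creation and then reports how much device memory is in use; on an otherwise idle GPU this figure is dominated by the context overhead:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // cudaFree(0) is a common idiom to force CUDA context creation
    // without allocating anything.
    cudaFree(0);

    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed\n");
        return 1;
    }

    // On an idle device, (total - free) approximates the per-process
    // context overhead; other processes' allocations would inflate it.
    printf("Used after context init: %.1f MiB of %.1f MiB total\n",
           (total_bytes - free_bytes) / (1024.0 * 1024.0),
           total_bytes / (1024.0 * 1024.0));
    return 0;
}
```

Compile with `nvcc` and run it once per device; the number will vary with driver version and SM count, which is consistent with the behaviour described above.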

As described here, one solution may be to run everything in a single host process, so that only one context is ever created. If that is not an option, the best way forward is probably NVIDIA MPS (Multi-Process Service), and here is an excellent answer about it: https://stackoverflow.com/a/34711344/9560245
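
For reference, starting MPS typically looks like the following sketch (a Linux host is assumed; the directory paths are examples you should adapt to your setup):

```shell
# Pin the daemon and its clients to the same device and pipe directory.
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log

# Start the MPS control daemon in the background. CUDA processes launched
# afterwards with the same CUDA_MPS_PIPE_DIRECTORY share one device
# context instead of each paying the full per-process overhead.
nvidia-cuda-mps-control -d

# ... run your CUDA processes here ...

# Shut the daemon down when finished.
echo quit | nvidia-cuda-mps-control
```
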

Reasons:
  • Blacklisted phrase (1): stackoverflow
  • Long answer (-0.5):
  • No code block (0.5):
  • Low reputation (0.5):
Posted by: Andrei Vukolov