In my case, none of the other solutions worked: the NaNs were caused by mixed precision training with PyTorch's AMP (Automatic Mixed Precision).
The only fix that worked for me was disabling AMP entirely. If you still want to keep AMP, one potential workaround is to adjust the gradient scaling, as discussed in this post; however, I tried several scaling values and the NaNs persisted in my setup.
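For reference, here is a minimal sketch of what I mean by "disabling AMP" and "adjusting the gradient scaling". The model, data, and `init_scale` value are just placeholders; the point is that both `autocast` and `GradScaler` accept an `enabled` flag, so you can toggle AMP on and off (and experiment with `init_scale`) without restructuring your training loop:

```python
import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast

USE_AMP = False  # setting this to False is what fixed the NaNs in my case

device = "cuda"  # torch.cuda.amp assumes a CUDA device
model = nn.Linear(128, 10).to(device)  # placeholder for your actual model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Lowering init_scale (default 65536.0) is the scaling tweak mentioned
# above; the value 2**10 is illustrative. It did not help in my setup,
# but it is cheap to try before giving up on AMP.
scaler = GradScaler(enabled=USE_AMP, init_scale=2.0 ** 10)

for step in range(100):
    # dummy batch; replace with your dataloader
    inputs = torch.randn(32, 128, device=device)
    targets = torch.randint(0, 10, (32,), device=device)

    optimizer.zero_grad()
    with autocast(enabled=USE_AMP):
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

    # scale() and update() are no-ops when the scaler is disabled,
    # so the same code path works with AMP on or off.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

With `USE_AMP = False`, this runs as a plain fp32 training loop, which is effectively what resolved the NaNs for me.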