79653228

Date: 2025-06-04 17:23:49
Score: 2
Natty:
Report link

Finally fixed the problem: it was a GPU issue on Kaggle. Using 2 GPUs caused this error; on 1 GPU it works fine.
The inf/nan loss appears to be an artifact of MirroredStrategy (multi-GPU) execution. The most likely cause: one of the GPUs computes an inf (in either the loss or a gradient) for its shard of data around batch 120, which then taints the CollectiveReduceV2 result.
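
To illustrate why a single bad replica is enough, here is a minimal pure-Python sketch of a sum all-reduce over per-replica gradients. It is only a simulation, not TensorFlow's CollectiveReduceV2 implementation; the helper names (all_reduce_sum, non_finite_replicas) and the gradient values are hypothetical and chosen for illustration.

```python
import math

def all_reduce_sum(per_replica_grads):
    """Element-wise sum across replicas, mimicking a sum all-reduce."""
    return [sum(values) for values in zip(*per_replica_grads)]

def non_finite_replicas(per_replica_grads):
    """Return the indices of replicas whose gradients contain inf or nan."""
    return [
        i for i, grads in enumerate(per_replica_grads)
        if not all(math.isfinite(g) for g in grads)
    ]

# Hypothetical gradients: replica 0 is healthy, replica 1 overflowed on its shard.
replica_grads = [
    [0.10, -0.20, 0.05],
    [0.12, math.inf, 0.04],
]

reduced = all_reduce_sum(replica_grads)
bad = non_finite_replicas(replica_grads)

print("reduced gradients:", reduced)
print("replicas with non-finite values:", bad)
```

The reduced gradient inherits the inf even though replica 0 was healthy, so every replica then applies a corrupted update. Checking values per replica before the reduction, as the second helper does, is one way to narrow down which device produced the bad number.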

Reasons:
  • No code block (0.5):
  • Self-answer (0.5):
  • Low reputation (1):
Posted by: Harsh Panwar