Finally fixed the problem: it was a GPU issue on Kaggle. Using 2 GPUs caused this error; on 1 GPU it works fine.
The inf/nan loss appears to be an artifact of MirroredStrategy (multi-GPU) execution. The most likely cause: one of the GPUs computes an inf (in either the loss or a gradient) for its shard of the data around batch 120, which then taints the CollectiveReduceV2 result.
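To illustrate why a single bad shard is enough, here is a minimal sketch in plain Python (not TensorFlow) of a sum-style all-reduce across two replicas. The helpers `all_reduce_sum` and `find_bad_replicas` and the gradient values are made up for illustration; they only mimic what a collective reduce does conceptually and are not the actual CollectiveReduceV2 implementation.

```python
import math

def all_reduce_sum(per_replica_grads):
    """Element-wise sum across replicas (hypothetical stand-in for a collective all-reduce)."""
    return [sum(vals) for vals in zip(*per_replica_grads)]

def find_bad_replicas(per_replica_grads):
    """Return the indices of replicas that hold any inf or nan value."""
    return [
        i for i, grads in enumerate(per_replica_grads)
        if not all(math.isfinite(v) for v in grads)
    ]

# Two replicas (GPUs), each holding a gradient for the same three weights.
# The numbers are invented for this example.
replica_grads = [
    [0.10, -0.20, 0.05],         # GPU 0: all finite
    [0.12, float("inf"), 0.04],  # GPU 1: one inf from its shard
]

reduced = all_reduce_sum(replica_grads)
bad = find_bad_replicas(replica_grads)

print("reduced gradient:", reduced)
print("replicas with non-finite values:", bad)
```

The finite values from GPU 0 cannot cancel the inf from GPU 1, so the reduced gradient is non-finite for that weight and every replica applies the same bad update. Checking values per replica before the reduce, as `find_bad_replicas` does here, is one way to narrow down which GPU and which batch produce the first inf.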