I solved this. The problem was that the GPU environment needs to be cleaned up every time the program runs. The code is:
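import torch
import torch.distributed as dist

def clear_nccl_environment():
    dist.barrier()  # wait until every rank reaches this point
    torch.cuda.empty_cache()  # release cached, unused GPU memory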
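
If it helps anyone, here is a slightly more defensive sketch of the same idea. The name `clear_nccl_environment_guarded` and the boolean return value are my own additions, not part of the original fix. The guards are there because `dist.barrier()` raises an error when no default process group has been initialized, and because a machine may have no CUDA device at all.

```python
import torch
import torch.distributed as dist


def clear_nccl_environment_guarded() -> bool:
    """Hypothetical guarded variant: sync ranks if possible, then free cached GPU memory.

    Returns True if a barrier was actually executed, False otherwise.
    """
    synced = False
    # Only call barrier() when a default process group exists;
    # otherwise dist.barrier() would raise.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
        synced = True
    # Release cached, unused blocks held by PyTorch's CUDA allocator.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return synced
```

In a distributed run I would expect this to be called after `dist.init_process_group(...)`, so that the barrier actually synchronizes the ranks. In a plain single-process run it skips the barrier and only clears the cache.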