Use `with torch.no_grad():` during inference so no autograd graph (and no intermediate activations for backward) is stored, and use mixed precision (`torch.cuda.amp`) to cut activation memory roughly in half.
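A minimal sketch combining the two, assuming a hypothetical `model` and input `x` already on CUDA:

```python
import torch

# Hypothetical model and input, for illustration only.
model = torch.nn.Linear(1024, 1024).cuda().eval()
x = torch.randn(32, 1024, device="cuda")

# no_grad() skips building the autograd graph, so activations
# kept only for the backward pass are never stored.
with torch.no_grad():
    # autocast runs eligible ops in float16 on CUDA,
    # shrinking activation memory.
    with torch.cuda.amp.autocast():
        out = model(x)
```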
`torch.cuda.empty_cache()` does not free memory that is still referenced by live objects; it only returns *cached* blocks that PyTorch's allocator holds for reuse. To truly free GPU memory: `del` unused variables, call `gc.collect()`, then call `torch.cuda.empty_cache()`.
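A sketch of that sequence, using a hypothetical large tensor `big` as the thing being freed:

```python
import gc
import torch

# Hypothetical large allocation, for illustration only.
big = torch.randn(4096, 4096, device="cuda")

# Drop the last Python reference so the tensor becomes collectible.
del big
gc.collect()  # reclaim unreferenced objects, including reference cycles

# The caching allocator still holds the freed blocks for reuse;
# empty_cache() returns them to the driver, so other processes
# (and nvidia-smi) see the memory as free again.
torch.cuda.empty_cache()
print(torch.cuda.memory_reserved())  # cached memory now released
```

Note that `del` alone already lets PyTorch reuse the memory within the same process; the `gc.collect()` + `empty_cache()` step mainly matters when other processes need the memory or when you are diagnosing usage with external tools.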