I believe the reason is that gradients are calculated for every weight on each epoch. When one epoch is over, the gradients found for those weights stay stored in their `.grad` buffers. If we start the next epoch, the gradients are calculated again, and if the previous gradient values are still sitting there, the new gradients get added on top of them, making the result wrong.
For example:

Epoch 1: gradient = 2
Epoch 2: gradient = 3

If the grads are not zeroed, the model adds both gradients and makes a move like: "okay, 2 + 3 is 5, so I should move the weight by -5 (times the learning rate) to reduce the loss." But the gradient actually needed is just 3.
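You can check this accumulation directly in PyTorch. Here is a minimal sketch with a made-up scalar weight (not anyone's real model), using the same 2 and 3 from the example above:

```python
import torch

# Toy scalar weight (hypothetical example)
w = torch.tensor(1.0, requires_grad=True)

# First backward pass: loss = 2 * w, so d(loss)/dw = 2
(2 * w).backward()
print(w.grad)  # tensor(2.)

# Second backward pass WITHOUT zeroing: loss = 3 * w, so d(loss)/dw = 3
(3 * w).backward()
print(w.grad)  # tensor(5.) -- the old 2 and the new 3 got accumulated

# After zeroing, the same backward pass gives the correct gradient
w.grad.zero_()
(3 * w).backward()
print(w.grad)  # tensor(3.)
```

This is why a real training loop calls `optimizer.zero_grad()` (or zeroes the `.grad` buffers some other way) before each backward pass.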
Hey, this is just my point of view. If I am wrong, kindly guide me.