Gradient clipping limits the size of the gradients during training so they do not grow too large and destabilise the updates. The gradients are clipped after backpropagation, just before the parameters are updated. There are two common variants: clipping by value, where with a clip range of (-5, 5) a gradient value of 6.4 is clamped to 5, and clipping by norm, where if the gradient norm is 6.4 and the maximum allowed norm is 5, all gradients are scaled by 5 / 6.4 ≈ 0.78 so the overall norm becomes 5. Gradient clipping is commonly used where backpropagation runs through long sequences of hidden states, such as in RNNs, LSTMs, and sometimes Transformers.
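A minimal PyTorch sketch of the two variants, reusing the (-5, 5) range and the 6.4 gradient from above; the single parameter w is made up purely for illustration.

```python
import torch

# Toy parameter with a single large gradient, matching the 6.4 example above.
w = torch.nn.Parameter(torch.tensor([1.0]))

# Clip by value: anything outside (-5, 5) is clamped to the boundary, so 6.4 -> 5.0.
w.grad = torch.tensor([6.4])
torch.nn.utils.clip_grad_value_([w], clip_value=5.0)
print(w.grad)  # tensor([5.])

# Clip by norm: if the total gradient norm exceeds 5, every gradient is scaled
# by 5 / norm (here 5 / 6.4 ≈ 0.78), which preserves the gradient direction.
w.grad = torch.tensor([6.4])
torch.nn.utils.clip_grad_norm_([w], max_norm=5.0)
print(w.grad)  # ≈ tensor([5.]) (the norm of a single value is just its magnitude)
```

In a real training loop, either call would sit between loss.backward() and optimizer.step(), so the clipped gradients are what the optimizer actually applies.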
BatchNorm is a trainable layer. During training, it normalises the output of a layer so that each channel has mean 0 and variance 1 across the batch, then applies a learnable per-channel scale and shift. This keeps all output elements on a similar scale: for example, it prevents a case where one output value is 20000 and another is 20, which could make the model over-rely on the larger value. BatchNorm is mostly used in models that perform a fixed computation over fixed-size inputs, such as CNNs and feed-forward neural networks.
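A short sketch of this normalisation using PyTorch's BatchNorm2d (the layer named in the conclusion below); the input tensor and its channel scales of 20000, 20, and 1 are invented to echo the example above.

```python
import torch

# A batch of 8 samples with 3 channels whose raw scales differ wildly
# (roughly 20000 vs 20 vs 1, echoing the example above).
x = torch.randn(8, 3, 4, 4) * torch.tensor([20000.0, 20.0, 1.0]).view(1, 3, 1, 1)

bn = torch.nn.BatchNorm2d(num_features=3)  # learnable scale/shift per channel
y = bn(x)

# In training mode BatchNorm uses the batch statistics, so each output channel
# has roughly zero mean and unit variance regardless of its input scale.
print(y.mean(dim=(0, 2, 3)))  # ≈ 0 for every channel
print(y.var(dim=(0, 2, 3)))   # ≈ 1 for every channel
```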
Conclusion: The two techniques solve different problems in different parts of the training process: gradient clipping handles exploding gradients in the backward pass, while BatchNorm stabilises activation scales in the forward pass.