Some of the answers above are missing a vital piece of information:
L1 or L2 regularization of a vector of parameters IS NOT always THE SAME AS the norm of that vector of the corresponding order: the L1 penalty is the L1 norm itself, but the L2 penalty is conventionally the squared L2 norm, i.e. a plain sum of squared weights with no square root.
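A quick numeric check of that distinction (a minimal sketch; w is just an example tensor):

import torch

w = torch.tensor([3.0, 4.0])
torch.norm(w, 1)        # tensor(7.)  -- L1 norm == the usual L1 penalty
torch.norm(w, 2)        # tensor(5.)  -- L2 norm (square-rooted)
torch.norm(w, 2)**2     # tensor(25.) -- squared L2 norm == the usual L2 penalty
(w**2).sum()            # tensor(25.) -- the same value, written as a sum of squares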
It is NOT WRONG to apply different regularization to different layers, or even to selected model parameters only; applying it uniformly to all model parameters is just the common default. For example, what if I want the first layer of my model to be sparser, and therefore more interpretable, than the second? (A combined per-layer sketch is given at the end of this answer.)

Given the above,
l1_regularization += torch.norm(param, 1)**2
should be modified to:
l1_regularization += torch.norm(param, 1)
because the order-1 norm is not square-rooted in the first place, so tacking on **2 turns the term into (sum_i |w_i|)^2, which is neither L1 nor L2 regularization :P
l2_regularization = lambda2 * torch.norm(all_linear2_params, 2)
should be modified to:
l2_regularization = lambda2 * torch.norm(all_linear2_params, 2)**2
because the order-2 norm is square-rooted, while L2 regularization is conventionally the sum of squared weights without the square root. (Keeping the square-rooted version is not catastrophic; it mostly changes how the gradient of the penalty is scaled.)
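Putting both corrected terms together, here is a hedged end-to-end sketch. The toy model, criterion, dummy batch, and the weights lambda1/lambda2 are all assumptions for illustration, not taken from the original code; only the two penalty lines carry the actual point.

import torch
import torch.nn as nn

class Net(nn.Module):                      # assumed toy model, just to give the layers names
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(10, 20)
        self.linear2 = nn.Linear(20, 2)
    def forward(self, x):
        return self.linear2(torch.relu(self.linear1(x)))

model = Net()
criterion = nn.CrossEntropyLoss()
lambda1, lambda2 = 0.5, 0.01               # assumed regularization strengths

x = torch.randn(8, 10)                     # dummy batch
target = torch.randint(0, 2, (8,))
output = model(x)

# L1 penalty on the first layer: plain L1 norm, no squaring
all_linear1_params = torch.cat([p.view(-1) for p in model.linear1.parameters()])
l1_regularization = lambda1 * torch.norm(all_linear1_params, 1)

# L2 penalty on the second layer: squared L2 norm, i.e. a sum of squared weights
all_linear2_params = torch.cat([p.view(-1) for p in model.linear2.parameters()])
l2_regularization = lambda2 * torch.norm(all_linear2_params, 2)**2

loss = criterion(output, target) + l1_regularization + l2_regularization
loss.backward()

Both penalties are part of the autograd graph, so the optimizer step sees their gradients exactly like any other term in the loss.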