Reports

79238393

Date: 2024-11-29 21:26:12

Score: 2

Natty:

I tracked down full details of layer normalization (LN) in BERT here.

The mean and variance are computed per token. But the weight and bias parameters learned in LN are not per token; they are per embedding dimension, so every token position shares the same weight and bias vectors.
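To make the per-token vs. per-dimension distinction concrete, here is a minimal pure-Python sketch. It is not BERT's actual code: the function name `layer_norm`, the toy inputs, the biased variance, and the `eps=1e-12` default are my assumptions for illustration.

```python
import math

def layer_norm(tokens, gamma, beta, eps=1e-12):
    """Normalize each token vector over its own embedding dimension.

    tokens: list of token embeddings, each a list of length hidden_size
    gamma, beta: learned weight and bias, each of length hidden_size
    eps: assumed value for illustration, added to the variance for stability
    """
    out = []
    for x in tokens:
        # Statistics are computed per token, across that token's embedding.
        mean = sum(x) / len(x)
        var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance
        inv_std = 1.0 / math.sqrt(var + eps)
        # gamma[j] and beta[j] depend only on the embedding dimension j,
        # so the same parameters are reused for every token.
        out.append([g * (v - mean) * inv_std + b
                    for v, g, b in zip(x, gamma, beta)])
    return out

# Two toy tokens with hidden_size = 4 (hypothetical values).
tokens = [[1.0, 2.0, 3.0, 4.0], [10.0, 0.0, -10.0, 0.0]]
gamma = [1.0, 1.0, 1.0, 1.0]
beta = [0.0, 0.0, 0.0, 0.0]
normed = layer_norm(tokens, gamma, beta)
```

With the identity `gamma` and zero `beta` above, each row of `normed` has mean close to 0 and variance close to 1, regardless of the other token. The parameter count is `2 * hidden_size` and does not grow with sequence length.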

Reasons: