79238393

Date: 2024-11-29 21:26:12
Score: 2

I tracked down full details of layer normalization (LN) in BERT here.

Mean and variance are computed per token, over that token's embedding dimensions. The weight and bias parameters learned in LN, however, are not per token: there is one weight and one bias per embedding dimension, shared across all tokens.
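To make the shapes concrete, here is a minimal NumPy sketch of that computation. It is not BERT's actual code: the function name `layer_norm`, the toy sizes (4 tokens, 8 embedding dimensions), and the `eps` value are assumptions chosen for illustration.

```python
import numpy as np

def layer_norm(x, weight, bias, eps=1e-12):
    """Illustrative layer norm. x: (seq_len, hidden); weight, bias: (hidden,)."""
    # Statistics are per token: reduce over the last (embedding) axis only.
    mean = x.mean(axis=-1, keepdims=True)   # shape (seq_len, 1)
    var = x.var(axis=-1, keepdims=True)     # biased variance, shape (seq_len, 1)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learned parameters are per embedding dimension and broadcast over tokens.
    return x_hat * weight + bias

rng = np.random.default_rng(0)
seq_len, hidden = 4, 8                      # toy sizes, assumed for the example
x = rng.normal(size=(seq_len, hidden))
weight = np.ones(hidden)                    # one scale per embedding dimension
bias = np.zeros(hidden)                     # one shift per embedding dimension
out = layer_norm(x, weight, bias)

print(out.shape)     # same shape as the input
print(weight.shape)  # depends only on hidden, not on seq_len
```

Changing `seq_len` leaves the parameter shapes untouched, which is the point: the number of learned values depends only on the embedding size.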

Reasons:
  • Low length (0.5):
  • No code block (0.5):
  • Self-answer (0.5):
  • Low reputation (0.5):
Posted by: Fijoy Vadakkumpadan