After quite a bit of research and code review, I was able to isolate the details of layer normalization (LN), an aspect of transformers that confuses many people.
TL;DR: The assumption in my original question that each (mean, std) pair has its own (weight, bias) pair is incorrect. In LN, the mean and std statistics are computed across the embedding dimensions of each token, i.e., there are as many (mean, std) pairs as there are tokens. The weight and bias values, however, are learned per embedding dimension, i.e., all tokens share the same weight and bias values during LN. For BERT base, this means there are at most 512 (mean, std) pairs (one per token position, up to the maximum sequence length of 512) and exactly 768 weight and 768 bias values (one per embedding dimension).
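As a sanity check, here's a minimal PyTorch sketch (my own illustration, using assumed BERT-base shapes) that verifies these parameter shapes and reproduces `nn.LayerNorm` by hand:

```python
import torch
import torch.nn as nn

# Assumed BERT-base shapes: up to 512 tokens, 768 embedding dimensions.
batch, seq_len, hidden = 1, 512, 768
x = torch.randn(batch, seq_len, hidden)

ln = nn.LayerNorm(hidden)
print(ln.weight.shape, ln.bias.shape)  # torch.Size([768]) torch.Size([768])
# -> only 768 weight and 768 bias values, shared by all 512 tokens.

# Stats are computed per token, across the embedding dimensions:
mean = x.mean(dim=-1, keepdim=True)                 # (1, 512, 1): one mean per token
var = x.var(dim=-1, unbiased=False, keepdim=True)   # (1, 512, 1): one variance per token
manual = (x - mean) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias

assert torch.allclose(manual, ln(x), atol=1e-5)
```

The assertion confirms that normalizing with one (mean, std) per token, then scaling and shifting with the single shared 768-dimensional weight and bias, matches `nn.LayerNorm` exactly.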
For complete details, see my answer to this question.