I tracked down the full details of layer normalization (LN) in BERT here.
The mean and variance are computed per token, over the embedding dimension. The learned weight and bias parameters in LN, however, are not per token: there is one weight and one bias per embedding dimension, shared across all tokens.
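A minimal numpy sketch of this, assuming a toy batch of 3 tokens and a hidden size of 4 (the names `gamma` and `beta` follow the usual LN convention for the weight and bias; they are not from the BERT source):

```python
import numpy as np

hidden = 4
x = np.random.randn(3, hidden)           # (tokens, embedding dim)

# Mean and variance: computed per token, over the embedding dimension.
mean = x.mean(axis=-1, keepdims=True)    # shape (3, 1): one value per token
var = x.var(axis=-1, keepdims=True)      # shape (3, 1)

# Learned parameters: one per embedding dimension, shared across tokens.
gamma = np.ones(hidden)                  # weight, shape (4,)
beta = np.zeros(hidden)                  # bias, shape (4,)

eps = 1e-12                              # BERT's default LN epsilon
out = gamma * (x - mean) / np.sqrt(var + eps) + beta
```

With `gamma = 1` and `beta = 0` each row of `out` has (approximately) zero mean and unit variance, which makes the per-token normalization easy to check directly.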