79570317

Date: 2025-04-12 10:35:31
Score: 0.5
Natty:
Report link

My reading of the paper is that the learned embeddings are tied for the source and target language, and that the same weight matrix is used to project the decoder representations into next-token logits.

Tokenizers such as byte-level BPE will encode essentially any language (e.g., anything expressible in UTF-8) into the same token vocabulary, so you only need one embedding matrix to embed these tokens. The embedding associates with each integer token a vector of size $d_\text{model}$, which serves as an internal representation of that token. At the end of the decoder stack, the decoder features are compared (via a dot product) with these same representations to obtain the next-token logits.
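
For concreteness, here is a minimal sketch (in PyTorch, with illustrative sizes and tensor names that are not taken from the paper) of how a single tied embedding matrix can both embed token ids and produce next-token logits via a dot product:

```python
# Minimal weight-tying sketch. vocab_size, d_model and the tensor names
# below are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 512

# One embedding matrix shared by source tokens, target tokens and the output layer.
embedding = nn.Embedding(vocab_size, d_model)

# Embedding a batch of integer token ids (source or target, same matrix).
token_ids = torch.randint(0, vocab_size, (2, 16))      # (batch, seq_len)
token_vectors = embedding(token_ids)                   # (batch, seq_len, d_model)

# At the top of the decoder stack, the logits are the dot products of the
# decoder features with every row of the same embedding matrix.
decoder_features = torch.randn(2, 16, d_model)         # stand-in for decoder output
logits = decoder_features @ embedding.weight.T         # (batch, seq_len, vocab_size)

# Equivalently, tie an output Linear layer to the embedding weights.
output_proj = nn.Linear(d_model, vocab_size, bias=False)
output_proj.weight = embedding.weight                  # weight tying
assert torch.allclose(logits, output_proj(decoder_features))
```

The last two lines show the usual way this is implemented in practice: instead of materialising a separate output matrix, the final linear layer simply reuses the embedding's parameter, so the same weights are updated from both roles during training.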

Reasons:
  • Long answer (-0.5):
  • No code block (0.5):
  • Low reputation (0.5):
Posted by: user27182