79384432

Date: 2025-01-24 13:04:32
Score: 1.5
Natty:
Report link

The visual and the description of how the decoder blocks work in this blog post the OP referred to are misleading.

The last encoder block does not produce K and V matrices.

Instead, the cross-attention layer L in each decoder block takes the output Z of the last encoder block and transforms it into K and V matrices using the Wk and Wv projection matrices that L has learned. Only the Q matrix in L is derived from the output of the previous (self-attention) layer. As the original paper states in section 3.2.3:

In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.

Also refer to Fig. 1 in the original paper.
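
To make this concrete, here is a minimal single-head sketch in NumPy (the names W_q, W_k, W_v, cross_attention and the toy shapes are illustrative, not taken from the paper or the blog post): Q is projected from the decoder's hidden states, while K and V are projected from the encoder output Z by the cross-attention layer's own learned weights.

```python
import numpy as np

# Toy single-head cross-attention sketch; d_model and all names are illustrative.
d_model = 8
rng = np.random.default_rng(0)

# Projection matrices learned by the cross-attention layer L itself.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_hidden, encoder_output_Z):
    # Q comes from the previous decoder (self-attention) layer ...
    Q = decoder_hidden @ W_q
    # ... while K and V are computed inside this layer from the encoder output Z.
    K = encoder_output_Z @ W_k
    V = encoder_output_Z @ W_v
    scores = Q @ K.T / np.sqrt(d_model)   # shape (target_len, source_len)
    return softmax(scores) @ V

decoder_hidden   = rng.normal(size=(3, d_model))  # 3 target tokens so far
encoder_output_Z = rng.normal(size=(5, d_model))  # 5 source tokens
out = cross_attention(decoder_hidden, encoder_output_Z)
print(out.shape)  # (3, 8): each target token attends over all 5 source tokens
```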

Thus, each token fed to the decoder stack attends to all tokens fed to the encoder stack via the cross-attention layers, and to all previous tokens fed to the decoder stack via the (masked) self-attention layers.
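
The two attention patterns can be visualized with their masks (a toy sketch with assumed lengths, not from the paper): the masked self-attention mask is lower-triangular over decoder positions, while cross-attention applies no mask over encoder positions.

```python
import numpy as np

target_len, source_len = 4, 6  # illustrative toy lengths

# Masked self-attention: decoder token i may attend only to decoder tokens 0..i.
causal_mask = np.tril(np.ones((target_len, target_len), dtype=bool))
print(causal_mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]

# Cross-attention: every decoder token may attend to every encoder token.
cross_mask = np.ones((target_len, source_len), dtype=bool)
print(cross_mask.astype(int))
```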

Reasons:
  • Blacklisted phrase (1): this blog
  • Long answer (-0.5):
  • No code block (0.5):
  • Low reputation (0.5):
Posted by: Fijoy Vadakkumpadan