79384432

Date: 2025-01-24 13:04:32
Score: 1.5
Natty:
Report link

The visual and the description of how the decoder blocks work in this blog post the OP referred to are misleading.

The last encoder block does not produce K and V matrices.

Instead, the cross-attention layer L in each decoder block takes the output Z of the last encoder block and transforms it into K and V matrices using the Wk and Wv projection matrices that L has learned. Only the Q matrix in L is derived from the output of the previous (self-attention) layer. As the original paper states in section 3.2.3:

In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.

Also refer to Fig. 1 in the original paper.
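
To make this concrete, here is a minimal single-head sketch in NumPy (the names W_q, W_k, W_v, cross_attention and the toy shapes are illustrative, not taken from the paper or the blog post): Q is projected from the decoder's hidden states, while K and V are projected from the encoder output Z by the cross-attention layer's own learned weights.

```python
import numpy as np

# Toy single-head cross-attention sketch; d_model and all names are illustrative.
d_model = 8
rng = np.random.default_rng(0)

# Projection matrices learned by the cross-attention layer L itself.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_hidden, encoder_output_Z):
    # Q comes from the previous decoder (self-attention) layer ...
    Q = decoder_hidden @ W_q
    # ... while K and V are computed inside this layer from the encoder output Z.
    K = encoder_output_Z @ W_k
    V = encoder_output_Z @ W_v
    scores = Q @ K.T / np.sqrt(d_model)   # shape (target_len, source_len)
    return softmax(scores) @ V

decoder_hidden   = rng.normal(size=(3, d_model))  # 3 target tokens so far
encoder_output_Z = rng.normal(size=(5, d_model))  # 5 source tokens
out = cross_attention(decoder_hidden, encoder_output_Z)
print(out.shape)  # (3, 8): each target token attends over all 5 source tokens
```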

Thus, each token fed to the decoder stack attends to all tokens fed to the encoder stack via the cross-attention layers, and to all previous tokens fed to the decoder stack via the (masked) self-attention layers.
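
The two attention patterns can be visualized with their masks (a toy sketch with assumed lengths, not from the paper): the masked self-attention mask is lower-triangular over decoder positions, while cross-attention applies no mask over encoder positions.

```python
import numpy as np

target_len, source_len = 4, 6  # illustrative toy lengths

# Masked self-attention: decoder token i may attend only to decoder tokens 0..i.
causal_mask = np.tril(np.ones((target_len, target_len), dtype=bool))
print(causal_mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]

# Cross-attention: every decoder token may attend to every encoder token.
cross_mask = np.ones((target_len, source_len), dtype=bool)
print(cross_mask.astype(int))
```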

Reasons:
  • Blacklisted phrase (1): this blog
  • Long answer (-0.5):
  • No code block (0.5):
  • Low reputation (0.5):
Posted by: Fijoy Vadakkumpadan