I understand the answer from @artoby, but doesn't the linear layer (i.e. the feed-forward or "thinking" layer) that comes after self-attention destroy this information flow, since it would pull information from other tokens into the earlier tokens' representations?
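
To make the question concrete, here is a minimal sketch of the block structure I'm referring to: masked self-attention followed by the feed-forward layer. This is just an illustrative PyTorch example (the class name `Block` and the dimensions are made up), not anyone's actual implementation; note that `nn.Linear` in the feed-forward part acts only on the last (feature) dimension.

```python
import torch
import torch.nn as nn

# Minimal decoder-style block: masked self-attention followed by the
# feed-forward ("thinking") layer the question refers to.
class Block(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Feed-forward layer: two Linear layers acting on the feature
        # dimension, applied to each token position separately.
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # Causal mask so position i only attends to positions <= i.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        # The Linear layers operate on the last dimension only, so this step
        # transforms each position's vector without looking at other positions.
        return self.ff(attn_out)
```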