If the model hasn't been trained enough, it may collapse to predicting a single high-frequency token over and over. During inference, if the decoder input isn't updated with the previously predicted token at each time step (for example, if you keep feeding the start token), it will keep producing the same output; a minimal decoding sketch is shown below. Finally, without attention, a vanilla encoder-decoder has to squeeze the whole source sequence into one fixed-size vector, which makes long-range dependencies harder to learn.
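To illustrate the second point, here is a minimal greedy-decoding sketch. The `encoder`/`decoder` call signatures and the `sos_id`/`eos_id` token ids are assumptions for illustration, not a specific library API; the point is only that the token fed to the decoder must be replaced by the new prediction on every step.

```python
import torch

@torch.no_grad()
def greedy_decode(encoder, decoder, src, sos_id, eos_id, max_len=50):
    # Encode the source sequence once (interfaces are placeholders).
    enc_out, hidden = encoder(src)
    token = torch.tensor([[sos_id]])      # decoder starts from <sos>
    output_ids = []
    for _ in range(max_len):
        logits, hidden = decoder(token, hidden, enc_out)
        # Feed the *new* prediction back in on the next step.
        # If `token` were left unchanged (e.g. always <sos>), the decoder
        # would see the same input every step and tend to repeat one token.
        token = logits.argmax(dim=-1)
        output_ids.append(token.item())
        if token.item() == eos_id:
            break
    return output_ids
```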