79486850

Date: 2025-03-05 14:55:54
Score: 1.5
Natty:
Report link

I did some research, and I think the point is that Bidirectional Language Models are not Causal Language Models: they do not learn or generate text in an autoregressive way.

They only learn to predict a single next token after the input sequence, plus the tokens of the input sequence itself.
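
To illustrate the difference, here is a minimal sketch (plain PyTorch, which the answer itself does not use): a causal model applies a lower-triangular attention mask, so a position only sees earlier positions, while a bidirectional model lets every position attend to the whole input, including the very token it would be asked to predict.

    import torch

    seq_len = 5

    # Causal LM: lower-triangular mask, position i attends only to positions <= i,
    # so predicting the next token is a non-trivial task.
    causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

    # Bidirectional model: every position attends to the full sequence,
    # including the very tokens it is asked to reproduce.
    bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

    print(causal_mask.int())
    print(bidirectional_mask.int())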

For the tokens of the input, though, such a model can simply copy what it already sees, working like an autoencoder. To avoid this and make it predict something it cannot just read off the input, we can use Masked Language Modeling (MLM), as was done in BERT.
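
As a rough sketch of that idea (plain PyTorch, not the actual BERT preprocessing code; the 15% rate and the [CLS]/[SEP]/[MASK] IDs follow BERT, the other token IDs are just illustrative), the training input hides a random subset of tokens and the loss is computed only at those hidden positions:

    import torch

    input_ids = torch.tensor([101, 2023, 2003, 1037, 7099, 6251, 102])  # illustrative BERT token IDs
    mask_token_id = 103   # [MASK] in the BERT vocabulary
    mask_prob = 0.15      # BERT masks roughly 15% of tokens

    # Choose which positions to hide from the model.
    masked_positions = torch.rand(input_ids.shape) < mask_prob

    # Labels keep the original token only at masked positions;
    # -100 elsewhere tells the loss function to ignore those positions.
    labels = torch.where(masked_positions, input_ids, torch.tensor(-100))

    # The model sees [MASK] at the hidden positions and must reconstruct them
    # from the surrounding (bidirectional) context, so copying is no longer enough.
    corrupted = torch.where(masked_positions, torch.tensor(mask_token_id), input_ids)

    print(corrupted)
    print(labels)

The real BERT recipe additionally splits the masked positions 80/10/10 between [MASK], a random token, and the unchanged token, and never masks the special [CLS]/[SEP] tokens, but the principle is the same: the model cannot solve the task by copying its input.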

Reasons:
  • Long answer (-0.5)
  • No code block (0.5)
  • Self-answer (0.5)
  • Low reputation (1)
Posted by: bikingSolo