Thanks a lot for sharing this. And it's not just this: the Hugging Face Trainer class has a lot of issues. Its evaluation loop first stores all the predictions and only then computes the metrics. For NLP, where the model outputs can be large, this can take up a lot of memory, and you may even hit out-of-memory errors. It is best to write your own PyTorch training loop.
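
To make that concrete, here is a minimal sketch of an evaluation loop that updates the metric batch by batch instead of storing every prediction. It assumes a plain classification setup with accuracy as the metric. `evaluate_streaming` and the toy model and data are hypothetical names I made up for illustration, not part of any library.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


@torch.no_grad()
def evaluate_streaming(model, loader, device="cpu"):
    """Return accuracy while keeping only two running counters in memory."""
    model.eval()
    correct, total = 0, 0
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        logits = model(inputs)
        # Reduce the logits to class ids right away; the logits for this
        # batch are never appended to a list, so they can be freed.
        preds = logits.argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / max(total, 1)


# Toy stand-ins (hypothetical): a linear classifier and random data.
torch.manual_seed(0)
model = nn.Linear(8, 3)
dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 3, (64,)))
loader = DataLoader(dataset, batch_size=16)
accuracy = evaluate_streaming(model, loader)
```

Memory use here is bounded by a single batch rather than the whole evaluation set. The same idea should carry over to any metric you can build from running counts, such as precision or recall from true/false positive totals.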