Although Sentence-BERT evaluates semantic similarity better than BLEU does, it is not sensitive enough to surface-level errors such as spelling mistakes and word-order problems. Based on the findings of this paper (Evaluation of Metrics Performance), I think the best evaluation is a combination of BLEU and Sentence-BERT.
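
One way such a combination could look is sketched below. This is only an illustration under several assumptions: the weighted sum and the weight `alpha` are my own choice (the paper is not quoted here as prescribing a formula), `simple_bleu` is a simplified, smoothed sentence-level BLEU rather than a reference implementation, and the sentence embeddings are passed in as plain vectors that would in practice come from a Sentence-BERT model. All function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU with add-one smoothing (illustrative only)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        # Add-one smoothing so a single empty n-gram order does not zero the score.
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # Brevity penalty for candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combined_score(candidate, reference, cand_emb, ref_emb, alpha=0.5):
    """Weighted sum of surface (BLEU) and semantic (embedding) similarity.

    alpha is an assumed weight, not a value taken from the paper.
    """
    surface = simple_bleu(candidate, reference)
    semantic = max(0.0, cosine(cand_emb, ref_emb))  # clamp negatives to 0
    return alpha * surface + (1 - alpha) * semantic
```

With this kind of score, a candidate whose words are shuffled is penalized by the BLEU term even if its embedding stays close to the reference, which is the sensitivity to surface-level errors that Sentence-BERT alone lacks.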