The original paper used popularity-sampled metrics, whereas RecBole most likely uses the non-sampled (full-ranking) versions. The two aren't really comparable (using non-sampled is the right choice).
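To make the difference concrete, here is a minimal sketch (my own illustration, not code from either repo) of the two protocols; the function name, its arguments, and the 100-negative convention are assumptions for illustration only:

```python
import numpy as np

def hit_at_k(scores, target, negatives=None, k=10):
    """scores: 1-D array of model scores over all items."""
    if negatives is None:
        # non-sampled / full ranking: target competes with the whole catalogue
        candidates = np.arange(len(scores))
    else:
        # sampled evaluation: target competes only with e.g. 100 sampled negatives
        candidates = np.concatenate(([target], negatives))
    ranked = candidates[np.argsort(-scores[candidates])]
    return int(target in ranked[:k])
```

With only ~100 candidates the target is far more likely to land in the top 10 than when ranked against all items, so sampled and full-ranking numbers live on different scales.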
20 epochs is too few to train a proper version of BERT4Rec on ML-1M. Try increasing it 10x.
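A rough sketch of how that could look in RecBole (parameter names as I remember them from its config; double-check against the docs of the version you're running):

```python
from recbole.quick_start import run_recbole

run_recbole(
    model='BERT4Rec',
    dataset='ml-1m',
    config_dict={
        'epochs': 200,        # roughly 10x the 20 epochs you used
        'stopping_step': 10,  # let early stopping end training once validation stalls
    },
)
```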
RecBole's implementation had a number of differences from the original BERT4Rec, which led to sub-optimal effectiveness. I think most of them have been fixed, so make sure you're using the latest version.
The original paper used the version of ML-1M from the SASRec repo, which had some pre-processing applied. Make sure you're using the same version.
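Roughly, that pre-processing looks like the sketch below (my own reconstruction of the usual SASRec-style pipeline; verify the exact rules and file names against the scripts in the SASRec repo):

```python
from collections import defaultdict

# Read the raw MovieLens-1M ratings and treat every rating as implicit feedback.
interactions = []
with open('ratings.dat') as f:
    for line in f:
        user, item, _rating, ts = line.rstrip().split('::')
        interactions.append((user, item, int(ts)))

# Drop users and items with fewer than 5 interactions.
user_count, item_count = defaultdict(int), defaultdict(int)
for u, i, _ in interactions:
    user_count[u] += 1
    item_count[i] += 1

# Remap to dense 1-based ids and order each user's history chronologically.
user_map, item_map = {}, {}
sequences = defaultdict(list)
for u, i, ts in sorted(interactions, key=lambda x: x[2]):
    if user_count[u] < 5 or item_count[i] < 5:
        continue
    uid = user_map.setdefault(u, len(user_map) + 1)
    iid = item_map.setdefault(i, len(item_map) + 1)
    sequences[uid].append(iid)

# Write one "user item" pair per line, as in the SASRec data format.
with open('ml-1m.txt', 'w') as out:
    for uid, items in sequences.items():
        for iid in items:
            out.write(f'{uid} {iid}\n')
```

If RecBole instead ingests the raw ratings file with its own filtering defaults, the resulting dataset (and therefore the metrics) won't match the one used in the paper.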
You can also look at our reproducibility paper, where we looked into some of the common reasons for discrepancies: https://arxiv.org/pdf/2207.07483.