This is because dim = 64 and head_num2 = 16 give a per-head dimension of 64 // 16 = 4, and 4 is not divisible by 8. PyTorch becomes inefficient in this case.
To avoid this, you also need to set dim = 128, since 128 // 16 = 8.
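
If it helps, here is a small sketch of the arithmetic above as a check you could run before building the model. The function names (`head_dim`, `is_aligned`, `smallest_aligned_dim`) are made up for illustration and are not part of PyTorch or of this repo; the multiple-of-8 rule is simply the one stated above.

```python
def head_dim(dim: int, num_heads: int) -> int:
    """Per-head dimension, i.e. dim // num_heads (hypothetical helper)."""
    if dim % num_heads != 0:
        raise ValueError(f"dim={dim} is not divisible by num_heads={num_heads}")
    return dim // num_heads


def is_aligned(dim: int, num_heads: int, multiple: int = 8) -> bool:
    """True if the per-head dimension is divisible by `multiple` (8 here)."""
    return head_dim(dim, num_heads) % multiple == 0


def smallest_aligned_dim(num_heads: int, multiple: int = 8) -> int:
    """Smallest dim whose per-head dimension equals `multiple`."""
    return num_heads * multiple


# dim = 64 with 16 heads gives 4 per head, which fails the check.
print(head_dim(64, 16), is_aligned(64, 16))
# dim = 128 with 16 heads gives 8 per head, which passes.
print(head_dim(128, 16), is_aligned(128, 16))
```

For 16 heads, `smallest_aligned_dim(16)` gives 128, which matches the value suggested above.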