The model predicts the same output for every input because it has overfit to a trivial solution on the small dataset. Before training, the outputs already show little diversity: the frozen base model supplies static features, so the randomly initialized final layers produce only a narrow range of outputs. Please refer to the gist for details.
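
One way to check for either symptom is to measure how much the predictions vary across inputs. The snippet below is a minimal sketch and is not taken from the gist: `prediction_spread`, the `tol` threshold, and the sample arrays are hypothetical, and it assumes the predictions can be converted to a NumPy array with one row per sample. The comparison against the target mean assumes a squared-error loss, where the best constant prediction is the mean of the targets.

```python
import numpy as np

def prediction_spread(preds, targets=None, tol=1e-3):
    """Summarize how much a model's outputs vary across samples.

    `tol` is an arbitrary illustrative threshold, not a recommended value.
    """
    preds = np.asarray(preds, dtype=float)
    preds = preds.reshape(len(preds), -1)  # one row per sample
    # Standard deviation of each output unit across samples;
    # a value near zero means that unit is effectively constant.
    std = preds.std(axis=0)
    report = {
        "min": float(preds.min()),
        "max": float(preds.max()),
        "max_std": float(std.max()),
        "collapsed": bool(std.max() < tol),
    }
    if targets is not None:
        targets = np.asarray(targets, dtype=float)
        targets = targets.reshape(len(targets), -1)
        # Gap between the mean prediction and the mean target. A collapsed
        # model with a small gap is predicting roughly the target mean.
        gap = np.abs(preds.mean(axis=0) - targets.mean(axis=0))
        report["gap_to_target_mean"] = float(gap.max())
    return report

# Hypothetical predictions: one constant set, one varied set.
constant = np.full((8, 1), 0.42)
varied = np.linspace(0.0, 1.0, 8).reshape(-1, 1)

print(prediction_spread(constant)["collapsed"])  # True
print(prediction_spread(varied)["collapsed"])    # False
```

Running the same check on the untrained model and again after training helps separate the two effects: a narrow range before training points to the frozen features and fresh final layers, while a collapse that appears only after training points to the trivial solution.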