The model definition looks right. You have presumably already tried smaller models because of the overfitting. Your model overfits the data. The first image shows this: the training loss is lower, and the validation/test loss is consistently higher. Your training and validation data may not be well distributed, or they may contain inconsistent labels, which confuses the model and leads to high variability in the training metrics. Your training ratio is 91% and your evaluation ratio is 9%. That difference is relatively large for such a small amount of data. Overall, the amount of data is quite small. You are right to use data augmentation to create more data. I suspect that the difference in quality and size between the training data and the evaluation data is too large.
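One way to reduce the gap between the two sets is to re-split the data so that every class appears in both sets in roughly the same proportion, with a larger evaluation share. The following is only a minimal sketch, assuming a classification task with one label per sample. The function name `stratified_split`, the 80/20 ratio, and the toy labels are illustrative assumptions, not taken from your setup.

```python
import random
from collections import defaultdict


def stratified_split(labels, eval_fraction=0.2, seed=0):
    """Return (train_idx, eval_idx) with similar class proportions in both sets."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, eval_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Aim for at least one evaluation sample per class,
        # but always keep at least one sample for training.
        n_eval = max(1, round(len(indices) * eval_fraction))
        n_eval = min(n_eval, len(indices) - 1)
        eval_idx.extend(indices[:n_eval])
        train_idx.extend(indices[n_eval:])

    return sorted(train_idx), sorted(eval_idx)


# Hypothetical toy labels: 60 samples of class "a" and 40 of class "b".
labels = ["a"] * 60 + ["b"] * 40
train_idx, eval_idx = stratified_split(labels, eval_fraction=0.2)
print(len(train_idx), len(eval_idx))
```

Apply augmentation only to the training indices after splitting, so that augmented copies of a sample do not leak into the evaluation set.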