I faced a similar issue in the past while working on an imbalanced classification problem. Focal loss can be helpful, but sometimes it’s not enough on its own. Here are a few additional strategies that worked well for me:
First, resample: take fewer examples from the majority class or more from the minority class. For binary classification the two amount to the same thing; it only changes what you want the word "epoch" to mean. When undersampling, don't forget to select the majority examples at random so you can still exploit the whole diversity of the majority class. This gives you a better class distribution, but you then have to be careful not to overfit the minority class. That is where the second point comes in.
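As a rough illustration of the random undersampling idea, here is a minimal sketch in plain Python. The function name `undersample_majority` and its `ratio` and `seed` parameters are made up for this example, and it assumes your data fits in simple lists; in a real pipeline you would probably do this inside your data loader.

```python
import random

def undersample_majority(samples, labels, majority_label, ratio=1.0, seed=None):
    """Keep every minority example and a random subset of the majority class.

    ratio is the target majority/minority size ratio (hypothetical parameter).
    """
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]

    # Never ask for more majority examples than actually exist.
    n_keep = min(len(majority), int(round(ratio * len(minority))))
    kept = rng.sample(majority, n_keep)  # random draw without replacement

    indices = minority + kept
    rng.shuffle(indices)  # avoid feeding the classes in blocks
    return [samples[i] for i in indices], [labels[i] for i in indices]
```

Calling it with a different `seed` at each epoch draws a different majority subset every time, which is one way to keep seeing the whole majority class over the course of training.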
Second, strong augmentation. Of course, how far you can push it depends a lot on your data, but as long as the augmented samples stay recognizable, it can work well together with the sampling trick.
Third, combining samples. Just like "strong augmentation", this depends a lot on the data you're working with, but I ended up combining samples to improve the classification. With some kind of morphing between samples, you could train on examples that are X% positive.
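To make the "X% positive" idea a bit more concrete, here is one possible reading of it, sketched as a linear blend of a positive and a negative feature vector with a matching soft label. This is an assumption on my side: it is similar in spirit to mixup-style interpolation, the name `blend_pair` is hypothetical, and the right kind of morphing for your data may look quite different.

```python
def blend_pair(positive, negative, positive_fraction):
    """Blend two equal-length feature vectors into one 'X% positive' example.

    Returns the blended vector and a soft label equal to positive_fraction.
    Assumption: a plain linear interpolation is a meaningful morph for the data.
    """
    if not 0.0 <= positive_fraction <= 1.0:
        raise ValueError("positive_fraction must be between 0 and 1")
    if len(positive) != len(negative):
        raise ValueError("both samples must have the same length")

    mixed = [
        positive_fraction * p + (1.0 - positive_fraction) * n
        for p, n in zip(positive, negative)
    ]
    # The soft label says how positive the blended example is.
    return mixed, positive_fraction
```

For example, `blend_pair(pos, neg, 0.7)` would give a sample that is 70% positive with a target of 0.7, so the loss has to accept soft targets rather than hard 0/1 labels.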