This is exactly what I've been thinking about, but I want to go a step further. Once we have the "raw data" indices of all the incorrectly labeled samples, what do we do with them? What can be done to analyze WHY a sample got labeled incorrectly?
There is LOTS of information on how to score model performance, but what is the next level of troubleshooting? How do we start to analyze WHY samples are being mislabeled?
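To make the question concrete, here is a rough sketch of the kind of first step I have in mind. It is only an illustration, not an established recipe: the function name `summarize_errors`, the toy labels, and the `confidences` list (assumed to be the model's probability for its predicted class) are all made up for this example. It collects the misclassified indices, counts which (true, predicted) pairs occur most often, and orders the mistakes by how confident the model was.

```python
from collections import Counter

def summarize_errors(y_true, y_pred, confidences):
    """Collect misclassified indices and group them for inspection."""
    # Indices where the prediction disagrees with the label
    wrong = [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]
    # How often each (true label, predicted label) confusion occurs
    pair_counts = Counter((y_true[i], y_pred[i]) for i in wrong)
    # Most confident mistakes first; these may be worth inspecting first
    by_confidence = sorted(wrong, key=lambda i: confidences[i], reverse=True)
    return wrong, pair_counts, by_confidence

# Hypothetical toy data, purely for illustration
y_true = ["cat", "dog", "cat", "bird", "dog", "cat"]
y_pred = ["cat", "cat", "dog", "bird", "cat", "cat"]
confidences = [0.9, 0.8, 0.55, 0.7, 0.95, 0.6]

wrong, pair_counts, by_confidence = summarize_errors(y_true, y_pred, confidences)
print(wrong)                      # indices of misclassified samples
print(pair_counts.most_common())  # most frequent confusions first
print(by_confidence)              # misclassified indices, most confident first
```

With the indices grouped like this, the raw samples behind the most frequent confusion, or behind the most confident mistakes, can be pulled up and inspected by hand. What to look for once they are on screen is the part I'm still asking about.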