Do not apply your mask to the frames. Instead, transform the mask into audio signal space, and then apply the mask there (to Y).
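
One way to read this, as a rough sketch only: I am assuming the mask is a boolean array with one entry per frame, that the frames were cut from `Y` with a fixed `frame_length` and `hop_length`, and that `Y` is the 1-D sample array. The helper name `frame_mask_to_sample_mask` and the toy values are made up for illustration. The idea is to expand the per-frame mask to one entry per sample and then index `Y` with it.

```python
import numpy as np

def frame_mask_to_sample_mask(frame_mask, n_samples, frame_length, hop_length):
    """Expand a per-frame boolean mask to a per-sample boolean mask.

    Assumption: frame i covers samples [i * hop_length, i * hop_length + frame_length).
    A sample is kept if at least one kept frame covers it.
    """
    sample_mask = np.zeros(n_samples, dtype=bool)
    for i, keep in enumerate(frame_mask):
        if keep:
            start = i * hop_length
            # Slices that run past the end of the array are clipped by NumPy.
            sample_mask[start:start + frame_length] = True
    return sample_mask

# Toy data (hypothetical): a 10-sample "signal" and 4 overlapping frames.
Y = np.arange(10, dtype=float)
frame_mask = np.array([True, False, False, True])
frame_length, hop_length = 4, 2

sample_mask = frame_mask_to_sample_mask(frame_mask, len(Y), frame_length, hop_length)
Y_masked = Y[sample_mask]  # apply the mask in sample space, not to the frames
```

With overlapping frames, a sample can belong to both a kept and a dropped frame. This sketch keeps it in that case. If you would rather drop it, start from an all-`True` mask and clear the ranges of the dropped frames instead.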