The issue appears to have been resolved by placing interleave on; in both application blocks. I assume this prevents audio and video data from arriving separately, which makes it much harder for a video player to misjudge the number of streams in the data. It feels a bit hacky, because I still don't understand the underlying cause, but it seems to have worked.
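
For anyone wanting to try the same thing, here is a rough sketch of where the directive goes, assuming the nginx-rtmp-module. The application names (app_one, app_two), the port, and the live on; lines are placeholders for illustration and are not taken from the original configuration; only the interleave on; lines reflect the change described above.

```nginx
rtmp {
    server {
        listen 1935;              # placeholder port (the RTMP default)

        # Hypothetical first application block
        application app_one {
            live on;              # assumed; keep whatever directives you already have
            interleave on;        # send audio and video on the same RTMP chunk stream
        }

        # Hypothetical second application block
        application app_two {
            live on;              # assumed
            interleave on;        # same change, repeated here
        }
    }
}
```

According to the module documentation, interleave defaults to off and is also accepted at the rtmp and server levels, so setting it once in the server block may have the same effect as repeating it per application. I have not tested that variant.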