For anyone who stumbles on this issue:
It happens because the video asset has more than one audio track. My guess is that the first is a plain stereo track, kept for compatibility with older players, while the rest are spatial audio tracks that contain more channels.
That makes things easier: just process the first audio track and ignore the rest.
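
A rough sketch of the "take the first audio track" idea, in Python. The post does not say which framework is in use, so this assumes the track list has already been read into plain dictionaries. The helper name `first_audio_track` and the field names `codec_type`, `channels`, and `index` are hypothetical (loosely modelled on what ffprobe reports per stream) and would need adapting to whatever your media API returns.

```python
from typing import Dict, List, Optional


def first_audio_track(tracks: List[Dict]) -> Optional[Dict]:
    """Return the first audio track in container order, or None if there is none."""
    for track in tracks:
        # "codec_type" is an assumed field name for the track's media type.
        if track.get("codec_type") == "audio":
            return track
    return None


# Hypothetical asset: one video track, a stereo track, then a multichannel track.
tracks = [
    {"index": 0, "codec_type": "video"},
    {"index": 1, "codec_type": "audio", "channels": 2},
    {"index": 2, "codec_type": "audio", "channels": 4},
]

selected = first_audio_track(tracks)
print(selected)
```

Selecting by position rather than by channel count matches the guess above that the stereo track comes first. If that ordering does not hold for your files, filtering on the channel count instead may be safer.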