Instead of trying to make the partial results render the way final results would, I started thinking: "OK, how could I render these like closed captions so that I can stop worrying about timestamps?" But my partial text was being processed ahead of the audio, because the buffers get thrown at the transcriber faster than the audio plays back, so as-is that angle didn't give me a workable solution.
I briefly considered calculating buffer timing information and carefully throwing buffers at the transcriber only as needed, and that got me thinking about the various examples you can find online that tap the mic and render the transcription text. Sure enough, I could tap the node I'm playing my audio stream on, and it works reasonably well. It lags a bit because the processing takes time, so I might still try the "time the buffers" approach to see if I can get the timing right, but even if that doesn't work out, tapping the stream is a reasonably decent solution.
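
For what it's worth, the arithmetic behind the "time the buffers" idea is small. Here is a minimal sketch, assuming the buffers are PCM with a known frame count and a fixed sample rate. `bufferStartTimes`, `delayBeforeSending`, and the `leadTime` parameter are hypothetical names for illustration, not part of any speech API, and I haven't verified that pacing this way actually lines the partial text up with the audio.

```swift
import Foundation

/// Hypothetical helper: playback-relative start time (in seconds) of each buffer,
/// given how many frames each buffer holds and the stream's sample rate.
func bufferStartTimes(frameCounts: [Int], sampleRate: Double) -> [TimeInterval] {
    precondition(sampleRate > 0, "sample rate must be positive")
    var startTimes: [TimeInterval] = []
    var framesSoFar = 0
    for count in frameCounts {
        // A buffer starts playing once every frame before it has played.
        startTimes.append(Double(framesSoFar) / sampleRate)
        framesSoFar += count
    }
    return startTimes
}

/// Hypothetical helper: how long to wait before handing a buffer to the
/// transcriber so that it goes in no earlier than `leadTime` seconds before
/// the buffer is due to play. Returns 0 when the buffer is already due.
func delayBeforeSending(bufferStart: TimeInterval,
                        elapsedPlayback: TimeInterval,
                        leadTime: TimeInterval) -> TimeInterval {
    return max(0, bufferStart - leadTime - elapsedPlayback)
}
```

The `leadTime` knob is there because the processing itself takes time: sending each buffer slightly early might offset some of the lag, but the right value would have to be found by experiment.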