I’m the developer of this Chrome extension that transcribes audio using Whisper AI. I’m currently working on an update focused specifically on real-time transcription.
After testing various approaches, I found that true streaming isn’t possible with Whisper, since the model operates on complete audio segments rather than a continuous stream. The most reliable solution I’ve implemented is processing 15-second audio blocks in near real time, which lets the app simulate streaming with acceptable latency and stable transcription quality.
I ran several experiments and found that:
• Shorter blocks (e.g., 5–10 sec) give the model too little context and lower accuracy.
• Longer blocks increase latency and make the app feel unresponsive.
• 15 seconds strikes the best balance between responsiveness and transcription quality.
So if you’re looking to simulate real-time transcription with Whisper’s API, slicing the input into 15-second segments and transcribing each one as it completes is currently the most practical method.
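For anyone who wants to try this, here’s a rough sketch of how the 15-second block loop could look in a browser/extension context. It assumes the standard OpenAI `/v1/audio/transcriptions` endpoint with the `whisper-1` model and an API key; function names like `transcribeBlock` are illustrative, not the extension’s actual code.

```typescript
// Sketch: near-real-time transcription via 15-second blocks (assumptions above).

const BLOCK_MS = 15_000; // 15-second blocks: the latency/accuracy sweet spot

async function transcribeBlock(blob: Blob, apiKey: string): Promise<string> {
  const form = new FormData();
  form.append("file", blob, "block.webm");
  form.append("model", "whisper-1");
  const res = await fetch("https://api.openai.com/v1/audio/transcriptions", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
    body: form,
  });
  if (!res.ok) throw new Error(`Whisper API error: ${res.status}`);
  return (await res.json()).text as string;
}

// Records one complete 15-second WebM file. Restarting the recorder per block
// (instead of using MediaRecorder's timeslice) matters: timeslice chunks after
// the first lack container headers, and the API can't decode them on their own.
function recordOneBlock(stream: MediaStream): Promise<Blob> {
  return new Promise((resolve) => {
    const recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
    const parts: Blob[] = [];
    recorder.ondataavailable = (e) => parts.push(e.data);
    recorder.onstop = () => resolve(new Blob(parts, { type: "audio/webm" }));
    recorder.start();
    setTimeout(() => {
      if (recorder.state !== "inactive") recorder.stop();
    }, BLOCK_MS);
  });
}

async function startNearRealTimeTranscription(apiKey: string) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  let stopped = false;

  (async () => {
    while (!stopped) {
      const blob = await recordOneBlock(stream);
      // Fire and forget: block N transcribes while block N+1 records,
      // which is what keeps the pipeline "near real time".
      transcribeBlock(blob, apiKey)
        .then((text) => console.log("Block:", text)) // append to your UI instead
        .catch(console.error);
    }
  })();

  // Returns a stop function for the caller.
  return () => {
    stopped = true;
    stream.getTracks().forEach((t) => t.stop());
  };
}
```

One known trade-off of the restart-per-block approach: a few milliseconds of audio are lost at each boundary while the recorder restarts, so a word spoken exactly at a cut can get clipped. Slightly overlapping consecutive blocks is one way to mitigate that if it matters for your use case.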