I've recently encountered an unexpected performance issue while running the Whisper Turbo V3 ASR model on NVIDIA GPUs. When running inference through Triton Inference Server, the model performs better on a V100 GPU than on an A100 GPU. This is surprising, since the A100 is significantly more powerful and is optimized for AI workloads.
Observations:
Latency and Throughput: Lower latency and higher throughput were observed on the V100 (a rough sketch of the kind of client-side measurement is included below).
Model and Environment: The Whisper Turbo V3 model is the same in both cases, and the Triton configurations are identical.
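For reference, here is a minimal sketch of how such latency/throughput numbers can be gathered from the client side. The model name (whisper_turbo_v3) and tensor names (AUDIO, TRANSCRIPT) are placeholders; the actual names depend on the model's config.pbtxt.

```python
# Placeholder sketch: whisper_turbo_v3, AUDIO, and TRANSCRIPT are hypothetical
# names -- substitute the ones from your model's config.pbtxt.
import time
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Dummy 30 s of 16 kHz audio; replace with real samples for a fair comparison.
audio = np.random.randn(1, 16000 * 30).astype(np.float32)

inp = grpcclient.InferInput("AUDIO", list(audio.shape), "FP32")
inp.set_data_from_numpy(audio)
out = grpcclient.InferRequestedOutput("TRANSCRIPT")

# Warm up once so one-time initialization does not skew the numbers.
client.infer(model_name="whisper_turbo_v3", inputs=[inp], outputs=[out])

n = 20
start = time.perf_counter()
for _ in range(n):
    client.infer(model_name="whisper_turbo_v3", inputs=[inp], outputs=[out])
elapsed = time.perf_counter() - start

print(f"avg latency: {elapsed / n * 1000:.1f} ms, throughput: {n / elapsed:.2f} req/s")
```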
Any suggestions as to why this might happen? Thanks in advance for any help!