Maybe I'm late to the party, but I suggest changing the following lines:
model = pipeline(model="facebook/wav2vec2-base-960h")
data = np.frombuffer(audio.get_raw_data())
to
model = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
data = np.frombuffer(audio.get_raw_data(), dtype=np.int16)
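Those are the only differences between my code and yours. Without an explicit dtype, np.frombuffer interprets the bytes as float64, so 16-bit PCM samples get decoded as the wrong values; passing the task name explicitly means the pipeline does not have to infer the task from the model.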
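
To see why the dtype argument matters, here is a minimal sketch that uses a small synthetic buffer in place of audio.get_raw_data(). It assumes the raw audio is 16-bit PCM, as dtype=np.int16 implies; the sample values are made up for illustration.

```python
import numpy as np

# Synthetic stand-in for audio.get_raw_data(): 8 samples of 16-bit PCM.
# (Assumption: the real buffer holds 16-bit samples; these values are made up.)
samples = np.array([0, 1000, -1000, 32767, -32768, 5, -5, 42], dtype=np.int16)
raw_bytes = samples.tobytes()  # 16 bytes in total

# With an explicit dtype, every 2 bytes are read as one int16 sample.
decoded = np.frombuffer(raw_bytes, dtype=np.int16)

# Without dtype, NumPy defaults to float64, so every 8 bytes are read
# as one value: the sample count and the values are both wrong.
wrong = np.frombuffer(raw_bytes)

print(decoded.shape, decoded.dtype)  # (8,) int16
print(wrong.shape, wrong.dtype)      # (2,) float64
```

If the buffer length is not a multiple of 8 bytes, the call without dtype raises a ValueError instead of silently returning wrong values.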