I realize this question is a year old, but I came across it while searching for a small TTS model, and wanted to share one that, in my experience, trains quite well on just a few hours of audio.

In my case I had to train a language with no existing pre-trained model, which was a challenge. Still, with only a 2-hour dataset of audio and transcripts, I got quite intelligible speech after less than 15 hours of training on my GPU (RTX 4080 with 16 GB of VRAM). The result was not production-ready, but it was quite decent, especially given that the language was completely new to the model.

Not sure if this is still of interest a year later, but take a look at Piper TTS (open source): https://github.com/rhasspy/piper. It is at least one model that can work pretty well on a small dataset. I would be cautious about expecting much if you only have 20 sentences of audio, but if your language has a pre-trained checkpoint and you fine-tune on top of it, I'm fairly sure you would get somewhat decent quality.

It is small enough to run on a Raspberry Pi and performs really well. Also take a look at this tutorial on training the model: https://www.youtube.com/watch?v=b_we_jma220
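
Since the amount of audio matters so much here, it can help to check how many hours you actually have before starting a training run. Below is a minimal sketch of that check, not anything taken from Piper itself. It assumes an LJSpeech-style layout (a `metadata.csv` with `clip_id|transcript` lines and a `wav/` folder next to it); the function name `dataset_summary` and the folder names are my own placeholders, so adjust them to whatever your training setup expects.

```python
import wave
from pathlib import Path


def dataset_summary(dataset_dir):
    """Return (clip_count, total_hours, missing_ids) for a dataset folder.

    Assumed layout (an assumption, not something Piper guarantees):
      dataset_dir/metadata.csv      lines of "clip_id|transcript"
      dataset_dir/wav/<clip_id>.wav uncompressed PCM WAV files
    """
    dataset_dir = Path(dataset_dir)
    total_seconds = 0.0
    clip_count = 0
    missing_ids = []

    with open(dataset_dir / "metadata.csv", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            clip_id = line.split("|", 1)[0]
            wav_path = dataset_dir / "wav" / f"{clip_id}.wav"
            if not wav_path.exists():
                # Transcript without audio: report it instead of failing
                missing_ids.append(clip_id)
                continue
            with wave.open(str(wav_path), "rb") as w:
                # Duration = number of frames / frames per second
                total_seconds += w.getnframes() / w.getframerate()
            clip_count += 1

    return clip_count, total_seconds / 3600.0, missing_ids
```

Calling `dataset_summary("my_dataset")` (with a hypothetical folder name) gives you the number of usable clips, the total duration in hours, and any transcript IDs that have no matching WAV file, which makes it easy to see whether you are closer to 20 sentences or to a couple of hours.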
