More technical information
Hi. I wondered if you could share more technical information about the design of Kokoro, including how it differs from StyleTTS 2. For example: how did you select the number of parameters and the topology of the network? How was the model initialized and trained? What data was in the training mix?
The creator's comments about the design have been:
Kokoro v0.19 was trained on relatively little data and transparently uses a StyleTTS 2 architecture
https://huggingface.co/hexgrad/Kokoro-82M/discussions/19
Kokoro quite transparently omits the style diffusion element of StyleTTS2, as I personally do not believe it is worth the ~25M additional parameters, but I could be wrong about that.
https://huggingface.co/hexgrad/Kokoro-82M/discussions/19#6781ae189a9941c184a86164
Looking at the StyleTTS 2 repo and demo website:
Section 8 "Ablation Study" lists this:
Baseline: Our proposed model, StyleTTS 2.
No Style Diffusion: This variant encodes style vectors from random references rather than sampling them from style diffusion, as in the original StyleTTS. The model is identical to the baseline model, except style diffusion is not used during inference. This modification impacts all aspects of the speech, including pauses, emotions, speaking rates, and sound quality, as these factors are highly correlated with the style vector, which in turn most significantly affects naturalness in our experiment.
No Prosodic Style Encoder:...
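To make the contrast in that ablation concrete, here is a minimal sketch of the two ways a style vector can be produced at inference time: sampled by a small diffusion model (the baseline) versus encoded from a reference utterance (the "No Style Diffusion" variant, and apparently Kokoro's approach). This is my own illustration in PyTorch; the module names, layer sizes, and style dimension are placeholders, not the actual StyleTTS 2 or Kokoro code.

```python
import torch
import torch.nn as nn

STYLE_DIM = 128  # illustrative style-vector size; the real dimension may differ

class StyleEncoder(nn.Module):
    """Encodes a reference mel-spectrogram into a fixed-size style vector."""
    def __init__(self, n_mels: int = 80, style_dim: int = STYLE_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels, 256), nn.ReLU(),
                                 nn.Linear(256, style_dim))

    def forward(self, ref_mel: torch.Tensor) -> torch.Tensor:
        # (batch, frames, n_mels) -> pool over time -> (batch, style_dim)
        return self.net(ref_mel.mean(dim=1))

def style_via_diffusion(denoiser, text_emb: torch.Tensor,
                        steps: int = 5) -> torch.Tensor:
    """Baseline StyleTTS 2 path: sample a style vector by iteratively
    denoising Gaussian noise, conditioned on the text -- the ~25M-parameter
    component Kokoro drops."""
    s = torch.randn(text_emb.size(0), STYLE_DIM)
    for t in reversed(range(steps)):
        s = denoiser(s, text_emb, t)  # one reverse-diffusion step
    return s

def style_via_reference(encoder: StyleEncoder,
                        ref_mel: torch.Tensor) -> torch.Tensor:
    """'No Style Diffusion' path: encode the style directly from a reference
    utterance; no sampling loop and no extra network at inference."""
    return encoder(ref_mel)

# Example: the reference-encoder path with a dummy 1-second reference clip.
enc = StyleEncoder()
style = style_via_reference(enc, torch.randn(1, 100, 80))
print(style.shape)  # torch.Size([1, 128])
```

The practical difference is that the second path is a single forward pass through a small encoder, while the first requires a dedicated denoising network run for several steps per utterance.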
It seems the Kokoro model is essentially a retraining of the StyleTTS 2 architecture on new, high-quality data, minus that final style-diffusion step, which the creator deems unnecessary given the parameter and inference cost it adds relative to the quality benefit it gives. Judging by the audio samples in Section 8 of the StyleTTS 2 website, I agree: it seems to make very little difference.
One of the interesting things about this, to me, is that it implies training data quality is of paramount importance. Kokoro sounds amazing and was trained on <100 hours of audio. StyleTTS 2 was trained on 245 hours (already a low amount), versus VALL-E (~60k hours), NaturalSpeech 2 (~44k hours), and others trained on far more data; VALL-E used roughly 600x more audio than Kokoro.