When will it be available to the open source community?
Hey, Nicole is pretty great! Really looking forward to having her read books to me eventually for sleep :)
Hi there, thanks for your interest and glad you like it! There isn't a release date currently scheduled, but if that were to happen, it would definitely be signposted in this HF Space.
Regarding Nicole, I had to temporarily take her down while upgrading due to an architecture change described in this blog post — https://huggingface.co/blog/hexgrad/kokoro-short-burst-upgrade — but I will add her back very shortly. The other voices will also be restored when I have the bandwidth to do so later; feel free to ping me if one of the voices you were using is not currently available. None of them have been removed from the model, it's just a bit of effort to identify/extract the numerical params for each voice as I upgrade the model.
Hi Hexgrad, I’m very impressed with your voice modules. We have some H100 and H200 GPUs available. Let’s connect to discuss potential collaboration on building TTS models that support over 100 languages, with enhanced emotion control and voice cloning capabilities.
It's an 80M param model that can read English books in ASMR or respond in a highly energetic voice if u ask it to... great as it is imo and wouldn't push it for 100 languages and built-in enhanced emotion controls xD
...just saying. Emotion control could always be part of the inference code instead of built-in on the weights if this evolves to something bigger. Very happy to see this is going well and gathering attention though!
I actually haven't used any of the voices yet because I would rather use an actual API instead of Gradio's. Wouldn't mind slow inference either, assuming I can leave it running and play back the audio files later on.
Keep up the great work! Might wanna be more expressive with your expectations/hopes about the project though -- opening up for potential collaborators or funding.
If you choose to open source an early (/earlier) version, though, it could be leading and not necessarily take away any funding you'd otherwise get.
Reopening this for visibility so people have another pathway to see the response up here, since it appears to be the most frequently asked question. Also wanted to address the following:
I actually haven't used any of the voices yet because I would rather use an actual API instead of Gradio's. Wouldn't mind slow inference either, assuming I can leave it running and play back the audio files later on.
Working on something along those lines for batched/long-form inference, but still sticking with Gradio for the time being. I think I might be able to pull some tricks to get reasonable CPU inference speed, which avoids the GPU usage limits entirely. It's unclear how bad the latency will be though — if there is simply too much latency on Gradio's end, there are no tricks that can solve that.
I likely do not have the bandwidth to roll my own API solution until at least EOY. I was not impressed with Replicate's speed benchmarks or pricing, and I saw a RapidAPI horror story (it has since been taken down) which makes me hesitant to give them a shot.
Might wanna be more expressive with your expectations/hopes about the project though -- opening up for potential collaborators or funding.
This space has been getting regular updates, but I've been wanting to add an "Updates" tab to be more explicit about past and future updates. Will sit down and write that, somewhere between feature updates and testing the newest checkpoints fresh off the GPU.
What is the potential for this to run as real-time local TTS? Seems pretty quick.
please please please, open source this. I love this model it is amazing. quality and prosody are awesome. it's super fast and efficient. please make it open source
I would also like to use this model. The best feature is being able to input IPA phonetics.
@bendangelo
This is slightly off-topic, but since Kokoro is a fine-tuned StyleTTS model, I'd like to mention that I cloned the first StyleTTS space to support ZeroGPU, enabled the API, and also allowed IPA symbols to be entered within [] brackets.
https://huggingface.co/spaces/Pendrokar/style-tts-2
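For anyone wondering what "IPA within [] brackets" could look like on the input side, here is a minimal sketch. It is not taken from that Space's code: the function name `split_ipa` and the segment labels are hypothetical, and it simply assumes that anything inside a non-nested pair of square brackets is meant as IPA while everything else is ordinary text.

```python
import re

# Assumption: IPA spans are written as [ ... ] with no nested brackets.
IPA_SPAN = re.compile(r"\[([^\[\]]*)\]")

def split_ipa(text):
    """Split input into ("text", ...) and ("ipa", ...) segments, in order."""
    segments = []
    pos = 0
    for match in IPA_SPAN.finditer(text):
        # Plain text before this bracketed span, if any.
        if match.start() > pos:
            segments.append(("text", text[pos:match.start()]))
        # The bracket contents, without the brackets themselves.
        segments.append(("ipa", match.group(1)))
        pos = match.end()
    # Trailing plain text after the last bracketed span.
    if pos < len(text):
        segments.append(("text", text[pos:]))
    return segments
```

A real front end would then send the "text" segments through its usual phonemizer and pass the "ipa" segments through unchanged; that step is left out here because it depends on the model's own pipeline.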
What is the potential for this to run as real-time local TTS? Seems pretty quick.
@Boosh
Within the Open TTS Tracker I noticed that StyleTTS streaming capability is mentioned. I don't think I added that.
https://huggingface.co/datasets/Pendrokar/open_tts_tracker
Probably refers to this fork of StyleTTS:
https://github.com/NeuralVox/StyleTTS2?tab=readme-ov-file#streaming-api
So maybe there is a chance of getting the same for Kokoro.
please please please, open source this. I love this model it is amazing. quality and prosody are awesome. it's super fast and efficient. please make it open source
It's great, BUT will it ever be open source??
Kokoro v0.19 has been open sourced at https://hf.co/hexgrad/Kokoro-82M
It is a limited release, decoder-only with two voicepacks, for leaderboard result reproducibility. There currently isn't a release date scheduled for the other voices.
The weights are Apache 2.0 licensed. Merry Christmas!
Edit: For transparency, a model SHA256 hash equality check has been added to assert that the open sourced v0.19 model is identical to the v0.19 model used in this Space.
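For readers who want to run a similar check on a downloaded copy, a hash equality check along these lines might look like the sketch below. This is not the Space's actual code: the function names are hypothetical, and the file path and expected digest are values you would have to supply yourself.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large model files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def assert_same_model(path, expected_hex):
    """Raise ValueError if the file's SHA256 differs from the expected digest."""
    actual = sha256_of_file(path)
    if actual != expected_hex.strip().lower():
        raise ValueError(f"SHA256 mismatch: expected {expected_hex}, got {actual}")
    return actual
```

Two files with the same SHA256 digest can be treated as byte-for-byte identical for practical purposes, which is what makes this a reasonable way to show that a published checkpoint matches the one being served.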
Thanks. We all love Kokoro, and we are thrilled about the newer version and, of course, more open-source voices. In the future, I wish someone could sponsor you to train more. The quality is fantastic, and the prosody is amazing.
Writing a paper on it would be amazing, too. You could explain how you achieved this quality superior to the original StyleTTS 2, what techniques you used, how the training data could affect the quality and prosody, and more.
There is a US Onyx voice in the demo that is really nice. Is it available somewhere? I don't see it in the voices folder.
I am currently using your project and it has been very helpful. However, I was wondering if you might consider supporting the latest version of the open-source project?