Nice ~90x real-time generation on 3090TI. Quickstart provided.

#20
by ubergarm - opened

I first tried an ONNX implementation, but the PyTorch implementation seems much faster for my homelab setup.

kokoro-tts pytorch quickstart

Here is how I got the PyTorch implementation running on CUDA for benchmarking and testing against this particular ONNX implementation repo.

# grab hf repo code but not large files (or use git lfs or `huggingface-cli` etc)
git clone https://huggingface.co/hexgrad/Kokoro-82M
cd Kokoro-82M

# put the model and at least one voice file into place manually overwriting LFS placeholders
wget -O kokoro-v0_19.pth 'https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/kokoro-v0_19.pth?download=true'
wget -O voices/af_sky.pt 'https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/voices/af_sky.pt?download=true'

# setup venv
python -m venv ./venv
source ./venv/bin/activate

# install deps (can use `uv pip` instead)
pip install phonemizer torch transformers scipy munch soundfile

# install OS level required binaries
# on debian / ubuntu flavors
sudo apt-get install espeak-ng
# or on ARCH btw...
sudo pacman -Sy extra/espeak-ng
# confirm it is working and in path
espeak-ng --version
eSpeak NG text-to-speech: 1.52.0  Data at: /usr/share/espeak-ng-data

# now run the main.py example like so and note the "real" time (wall-clock time)
time python main.py
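
The "~90x real-time" figure in the title is the ratio of generated audio duration to the wall-clock time spent generating it. Here is a minimal sketch of that arithmetic, assuming you time only the generation calls yourself; `real_time_factor` is a hypothetical helper name and not part of the Kokoro repo. Note that `time python main.py` also counts imports and model loading, so it will understate the ratio for short inputs.

```python
def real_time_factor(num_samples, sample_rate, elapsed_seconds):
    """Return seconds of audio produced per second of wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    audio_seconds = num_samples / sample_rate
    return audio_seconds / elapsed_seconds


# Example: 90 seconds of 24 kHz audio generated in 1 second of wall time.
print(real_time_factor(24000 * 90, 24000, 1.0))
```

To use it with main.py, record `time.perf_counter()` before and after the generation loop and pass `len(audio)` as `num_samples`. On CUDA, calling `torch.cuda.synchronize()` before reading the end time keeps pending GPU work from being left out of the measurement.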

Here are the contents of the main.py example file, including naive chunking of the input text on "." punctuation. A better chunking implementation is needed to avoid the "Truncated to 510 tokens" error.

from models import build_model
import torch
import soundfile as sf
from kokoro import generate

SAMPLE_RATE = 24000
OUTPUT_FILE = "output.wav"

TEXT = """
Input a long text here. As long as it has an occasional period.
Then it won't overflow and truncate.
You can do better chunking than this with a little effort.
But this is enough to see how fast it can go!

Are there parallel batching options if you have enough VRAM? Or max tokens options?
I haven't measured latency of time to first generation or tried keeping the model loaded.
"""

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on device: {device}")
MODEL = build_model("kokoro-v0_19.pth", device)
VOICE_NAME = "af_sky"
VOICEPACK = torch.load(f"voices/{VOICE_NAME}.pt", weights_only=True).to(device)
print(f"Loaded voice: {VOICE_NAME}")

audio = []
for chunk in TEXT.split("."):
    print(chunk)
    if len(chunk) < 2:
        # a try except block for non verbalizable text is probably better than this hack
        continue
    snippet, _ = generate(MODEL, chunk, VOICEPACK, lang=VOICE_NAME[0])
    audio.extend(snippet)

sf.write(OUTPUT_FILE, audio, SAMPLE_RATE)
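
For the "better chunking" mentioned above, here is one possible sketch: split on sentence-ending punctuation (".", "!" and "?"), then pack sentences into chunks under a character budget. The names `chunk_text` and `MAX_CHARS` are hypothetical, and the character budget is only a rough stand-in for the 510-token limit, which is counted in tokens rather than characters, so tune it for your text.

```python
import re

# Assumed character budget; a proxy for the 510-token limit, not an exact match.
MAX_CHARS = 300


def chunk_text(text, max_chars=MAX_CHARS):
    """Split text into sentence-aligned chunks of at most max_chars characters."""
    # Split after ., ! or ? when followed by whitespace; punctuation is kept.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        sentence = " ".join(sentence.split())  # collapse newlines and extra spaces
        if not sentence:
            continue
        # A single sentence over the budget is cut at the last space that fits.
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars)
            if cut <= 0:
                cut = max_chars  # no space found: hard cut
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:cut])
            sentence = sentence[cut:].lstrip()
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


for piece in chunk_text("First sentence. Second one! A third? " + "word " * 100, 60):
    print(len(piece), piece)
```

In main.py this would replace `TEXT.split(".")` with `chunk_text(TEXT)`. Because empty chunks are never emitted, the `len(chunk) < 2` guard would no longer be needed for whitespace-only pieces.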

@ubergarm Fantastic stuff.

That reminds me, my Reddit account https://www.reddit.com/user/rzvzn/ has been shadowbanned from r/LocalLLaMA for at least a month now. I have messaged the moderators, but it's been crickets so far. If I need more karma in order to post, then how am I supposed to obtain karma without being able to post?

Naturally, you'd assume that the shadowban is a result of doing really sus things or self-promoting, but 1) I do not think I've been doing such things and 2) All my posts and comments evaporate immediately, regardless of their content.

To the moderators of r/LocalLLaMA: If you could turn off friendly fire and take me out of the sunken place, I'd really appreciate it.

I have also seen posts & comments by others get shadowbanned at the mere mention of rzvzn, which is also my Discord handle. One guy even had a post, which was gaining a good number of upvotes and views, get instantly removed by moderation because he edited it to tag me. Is this moderation run by Reddit, LocalLLaMA, or both? I'm obviously biased, but I think whoever is running moderation (including auto-mod) needs to seriously reevaluate what's going on over there.

hey @hexgrad i saw this post of yours:

https://www.reddit.com/r/LocalLLaMA/comments/1hwf4jm/second_take_kokoro82m_is_an_apache_tts_model/

on reddit. thats how i noticed your model. So i would think you are not shadowbanned now?

anyway:
WOW kokoro is the best model i have seen so far with a permissive licence (Apache) BY FAR .. so BIG THANKS!!!!!!!!!!!!!!!!!!!!!!!!!!!

i am so happy to have found it! it sounds great and is ultra fast.
Do you plan to implement Spanish, German, Italian, Portuguese, to cover the main European languages?

would be great to see these!

thanks for the great work

@hexgrad i have a similar issue with localllama and i am really tired of it... for 7 months now i haven't been able to post anything, only comment. I've messaged the mods many times as well but never get an answer.

i am one of the users who have been there since the beginning of localllama and have regularly contributed good posts. i have no idea why i have been blocked there for over half a year now.
as i read more and more complaints like this, it looks to me like the mods are not able to manage this group properly. i am therefore seriously wondering if i should leave and boycott this club because these circumstances are too childish and too stressful for me.

Thanks @ubergarm , your way of saving wav files works! I tried both pydub and simpleaudio, and both the resulting files are full of noise (and honestly, I don't understand why). Only sf.write() seems to work for me 🤔

#!/usr/bin/env python3

# 2️⃣  Build the model and load the default voicepack

import sys

import torch
import numpy as np

from models import build_model
from kokoro import generate

import simpleaudio as sa

from pydub import AudioSegment
from pydub.playback import play

import soundfile as sf

device = 'cuda' if torch.cuda.is_available() else 'cpu'
#MODEL = build_model('kokoro-v0_19.pth', device)
MODEL = build_model('fp16/kokoro-v0_19-half.pth', device)
VOICE_NAMES = [
    'af', # Default voice is a 50-50 mix of Bella & Sarah
    'af_bella', 'af_sarah', 'am_adam', 'am_michael',
    'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
    'af_nicole', 'af_sky',
]

VOICE_NAME = VOICE_NAMES[9]
VOICEPACK  = torch.load(f'voices/{VOICE_NAME}.pt', weights_only=True).to(device)
print(f'Loaded voice: {VOICE_NAME}')

# 3️⃣  Call generate, which returns 24khz audio and the phonemes used
text = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"

# Language is determined by the first letter of the VOICE_NAME:
# 🇺🇸 'a' => American English => en-us
# 🇬🇧 'b' => British English  => en-gb

audio, out_ps = generate(MODEL, text, VOICEPACK, lang=VOICE_NAME[0])

print(f'{type(audio) = } - {len(audio)} samples - {audio.dtype = } - {audio.shape = }')
print(f'{type(out_ps)} - {len(out_ps)} chars')
print(f'{out_ps}')

framerate = 24000

audio_segment = AudioSegment(audio.tobytes(), frame_rate=framerate, sample_width=audio.dtype.itemsize, channels=1)
audio_segment.export('/tmp/audio.wav', format='wav')
print(f'Playing pydub audio...')
play(audio_segment)

wave_obj = sa.WaveObject.from_wave_file("/tmp/audio.wav")
print(f'Playing simpleaudio audio...')
play_obj = wave_obj.play()
play_obj.wait_done()

audio2 = wave_obj.audio_data
print(f'{type(audio2) = } - {len(audio2)} bytes')
AudioSegment(audio2, frame_rate=framerate, sample_width=4, channels=1).export('/tmp/audio2.wav', format='wav')

sf.write('/tmp/audio3.wav', audio, framerate)
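
A likely reason for the noise: the printed `audio.dtype` suggests `generate` returns floating-point samples, and if that is float32 then `sample_width=audio.dtype.itemsize` is 4, which pydub interprets as 32-bit integer PCM rather than floats. `sf.write()` accepts float arrays directly, which would explain why it works. A minimal sketch of a conversion to 16-bit PCM, assuming the samples lie in the range -1.0 to 1.0; `float_to_int16_pcm` is a hypothetical helper name:

```python
import numpy as np


def float_to_int16_pcm(audio):
    """Convert float samples in [-1.0, 1.0] to 16-bit signed PCM."""
    samples = np.asarray(audio, dtype=np.float32)
    # Clip first so out-of-range values saturate instead of wrapping around.
    clipped = np.clip(samples, -1.0, 1.0)
    return (clipped * 32767.0).astype(np.int16)


pcm = float_to_int16_pcm(np.array([0.0, 0.5, -1.0, 2.0], dtype=np.float32))
print(pcm.dtype, pcm.tolist())
```

The result could then be passed to pydub as `AudioSegment(pcm.tobytes(), frame_rate=framerate, sample_width=2, channels=1)`, with `sample_width=2` matching the two bytes per int16 sample.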