Raon-OpenTTS-1B
Technical Report | Code | Dataset | Raon-OpenTTS-0.3B
Raon-OpenTTS is an open-data, open-weight zero-shot TTS system that performs on par with state-of-the-art closed-data models. This is the 1B variant.
Key Features
- Fully Open: Both model weights and training data (615K hours, 11 English speech datasets) are publicly available for reproducible TTS research.
- Competitive with Closed-Data SOTA: Ranks 1st or 2nd in WER and SIM among recent zero-shot TTS models on Seed-TTS-Eval and CV3-Eval, matching systems trained on millions of hours of proprietary data.
- Robust Across Acoustic Conditions: Achieves the best average WER and SIM on Raon-OpenTTS-Eval across Clean, Noisy, Wild, and Expressive regimes.
- Large-Scale Curated Data: Trained on Raon-OpenTTS-Core (510K hours), quality-filtered from Raon-OpenTTS-Pool using combined DNSMOS, WER, and VAD rank-based filtering.
- DiT Architecture: Based on F5-TTS Diffusion Transformer with flow matching, enabling efficient zero-shot speech synthesis.
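The card does not spell out how the DNSMOS, WER, and VAD signals are combined, so the following is only a minimal sketch of one common form of rank-based filtering: rank every clip on each metric, sum the ranks, and keep the best fraction. The field names (`dnsmos`, `wer`, `speech_ratio`), the use of a VAD speech ratio, the equal weighting, and the toy clips are all hypothetical and are not taken from the Raon-OpenTTS pipeline.

```python
def ranks(values, higher_is_better):
    """Return the rank of each value (0 = best) under the given direction."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_is_better)
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def rank_filter(clips, keep_fraction):
    """Keep the clips with the lowest summed rank across the three metrics."""
    dnsmos = ranks([c["dnsmos"] for c in clips], higher_is_better=True)
    wer = ranks([c["wer"] for c in clips], higher_is_better=False)
    # Assumption: the VAD signal is summarized as the fraction of speech frames.
    vad = ranks([c["speech_ratio"] for c in clips], higher_is_better=True)
    combined = [d + w + v for d, w, v in zip(dnsmos, wer, vad)]
    n_keep = max(1, round(len(clips) * keep_fraction))
    best = sorted(range(len(clips)), key=lambda i: combined[i])[:n_keep]
    return [clips[i]["id"] for i in sorted(best)]


# Hypothetical clips, for illustration only.
toy_clips = [
    {"id": "a", "dnsmos": 3.9, "wer": 0.02, "speech_ratio": 0.95},
    {"id": "b", "dnsmos": 2.1, "wer": 0.40, "speech_ratio": 0.50},
    {"id": "c", "dnsmos": 3.5, "wer": 0.05, "speech_ratio": 0.90},
    {"id": "d", "dnsmos": 3.0, "wer": 0.10, "speech_ratio": 0.70},
]
kept = rank_filter(toy_clips, keep_fraction=0.75)
print(kept)
```

Ranks rather than raw scores are used so that metrics on different scales contribute comparably without per-metric thresholds.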
Model Details
| Item | Value |
|---|---|
| Parameters | 1048M |
| Architecture | DiT (Diffusion Transformer), based on F5-TTS |
| Config | dim=1408, depth=28, heads=24, ff_mult=4, text_dim=512, conv_layers=4 |
| Training Data | Raon-OpenTTS-Core (510.1K hours) |
| Steps | 520K updates |
| Hardware | 48 NVIDIA B200 GPUs |
| Batch Size | 2,688K frames (14K/GPU x 192 GPUs) |
| Optimizer | AdamW, peak LR 1e-4, 50K warmup, linear decay, grad norm 1.0 |
| Audio | 80-ch mel-spectrogram, 16kHz, hop=256 |
| Vocoder | HiFi-GAN (speechbrain/tts-hifigan-libritts-16kHz) |
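The Audio and Batch Size rows can be related with a little arithmetic. This sketch uses only values from the table above; it reads "14K" as 14,000 frames and takes the multiplier of 192 exactly as written in the Batch Size row. Because batches may contain padding, the resulting duration is best read as an upper bound on the audio seen per update.

```python
# Values copied from the Model Details table.
SAMPLE_RATE_HZ = 16_000
HOP_LENGTH = 256
FRAMES_PER_GPU = 14_000   # "14K/GPU", read here as 14,000 mel frames
BATCH_MULTIPLIER = 192    # "x 192 GPUs" as written in the Batch Size row

# One mel frame is produced every HOP_LENGTH samples.
frames_per_second = SAMPLE_RATE_HZ / HOP_LENGTH
frames_per_update = FRAMES_PER_GPU * BATCH_MULTIPLIER
seconds_per_update = frames_per_update / frames_per_second
hours_per_update = seconds_per_update / 3600

print(frames_per_second, frames_per_update, round(hours_per_update, 2))
```

Under these readings the frame rate is 62.5 frames per second, the global batch is 2,688,000 frames, and one update covers at most about 11.95 hours of audio.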
Benchmark Results
Bold marks the best result and the Raon-OpenTTS rows. All numbers are from the technical report.
Seed-TTS-Eval
WER measured via Whisper-large-v3; SIM via WavLM-large.
| Model | Params | WER (%) ↓ | SIM ↑ |
|---|---|---|---|
| Human | - | 2.14 | 0.734 |
| Seed-TTS | - | 2.25 | 0.762 |
| CosyVoice 3 | 1.5B | 2.21 | 0.720 |
| Index-TTS 2 | 1.5B | 2.18 | 0.709 |
| Llasa | 8B | 3.63 | 0.581 |
| VoxCPM | 0.5B | 1.98 | 0.730 |
| CosyVoice 2 | 0.5B | 2.61 | 0.659 |
| CosyVoice 3 | 0.5B | 2.50 | 0.698 |
| Qwen3-TTS | 1.7B | 1.46 | 0.715 |
| Voxtral TTS | 4B | 2.19 | 0.663 |
| MaskGCT | 0.6B | 2.57 | 0.713 |
| F5-TTS | 0.3B | 2.04 | 0.671 |
| Raon-OpenTTS-0.3B | 0.3B | 1.95 | 0.687 |
| Raon-OpenTTS-1B | 1.0B | 1.78 | 0.749 |
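The evaluation described above transcribes generated audio with Whisper-large-v3 and embeds speakers with WavLM-large. The sketch below illustrates only the two final scoring formulas, assuming WER is word-level edit distance divided by the reference length and SIM is cosine similarity between two speaker-embedding vectors. The text normalization here (lowercasing and whitespace splitting) is a simplification and not the benchmark's own.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# One substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Note that WER can exceed 100% when the hypothesis contains many inserted words, since the denominator is the reference length only.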
CV3-Eval
WER is reported on CV3-EN and CV3-Hard-EN. On CV3-Hard-EN, SIM is measured via ERes2Net and perceptual quality via DNSMOS.
| Model | CV3-EN WER (%) ↓ | CV3-Hard-EN WER (%) ↓ | CV3-Hard-EN SIM ↑ | CV3-Hard-EN DNSMOS ↑ |
|---|---|---|---|---|
| F5-TTS | 8.54 | - | - | - |
| MaskGCT | 7.73 | 41.09 | 0.624 | 3.48 |
| CosyVoice 2 | 6.27 | 10.28 | 0.710 | 3.95 |
| CosyVoice 3 | 4.96 | 10.77 | 0.740 | 3.98 |
| VoxCPM | 5.24 | 6.44 | 0.670 | 3.78 |
| Qwen3-TTS | 4.52 | 7.89 | 0.666 | 3.87 |
| Raon-OpenTTS-0.3B | 4.62 | 7.31 | 0.730 | 3.77 |
| Raon-OpenTTS-1B | 3.92 | 6.15 | 0.775 | 3.85 |
Raon-OpenTTS-Eval
The benchmark covers 4 acoustic regimes (Clean, Noisy, Wild, Expressive), 12 datasets, and 6K prompt-text pairs. Overall is computed over all evaluation samples, not as the mean of the four regime scores.
| Model | Clean WER ↓ | Clean SIM ↑ | Noisy WER ↓ | Noisy SIM ↑ | Wild WER ↓ | Wild SIM ↑ | Expr. WER ↓ | Expr. SIM ↑ | Overall WER ↓ | Overall SIM ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| F5-TTS | 2.17 | 0.613 | 3.82 | 0.640 | 136.03 | 0.324 | 3.46 | 0.503 | 25.08 | 0.542 |
| MaskGCT | 3.39 | 0.672 | 5.56 | 0.727 | 28.00 | 0.581 | 6.44 | 0.546 | 8.61 | 0.635 |
| CosyVoice 2 | 2.59 | 0.642 | 4.39 | 0.675 | 49.73 | 0.535 | 3.66 | 0.536 | 11.02 | 0.603 |
| CosyVoice 3 | 2.53 | 0.678 | 3.69 | 0.720 | 8.31 | 0.618 | 5.49 | 0.567 | 4.43 | 0.647 |
| VoxCPM | 2.24 | 0.686 | 3.42 | 0.738 | 43.83 | 0.553 | 2.66 | 0.565 | 9.48 | 0.642 |
| Qwen3-TTS | 3.38 | 0.684 | 4.60 | 0.726 | 79.14 | 0.528 | 5.81 | 0.527 | 17.59 | 0.626 |
| Raon-OpenTTS-0.3B | 1.57 | 0.645 | 4.03 | 0.700 | 5.83 | 0.571 | 2.53 | 0.570 | 2.93 | 0.623 |
| Raon-OpenTTS-1B | 1.44 | 0.718 | 3.51 | 0.769 | 5.61 | 0.656 | 2.77 | 0.633 | 2.81 | 0.695 |
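Because Overall is pooled over samples, it need not equal the unweighted mean of the four regime columns. The short check below uses only WER values copied from the table above and compares that unweighted mean against the reported Overall figure. The helper name `macro_average` is introduced here for illustration.

```python
def macro_average(values):
    """Unweighted mean of the per-regime scores."""
    return sum(values) / len(values)


# WER (%) per regime, in the order Clean, Noisy, Wild, Expressive (from the table).
regime_wer = {
    "F5-TTS": [2.17, 3.82, 136.03, 3.46],
    "Raon-OpenTTS-1B": [1.44, 3.51, 5.61, 2.77],
}
# Overall WER (%) as reported in the same table.
reported_overall = {"F5-TTS": 25.08, "Raon-OpenTTS-1B": 2.81}

for model, values in regime_wer.items():
    mean_of_regimes = macro_average(values)
    print(f"{model}: mean of regimes = {mean_of_regimes:.2f}, "
          f"reported Overall = {reported_overall[model]:.2f}")
```

For both models the unweighted mean is higher than the reported Overall, which is consistent with the regimes contributing different numbers of samples to the pooled score.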
Inference
For inference code and usage instructions, see krafton-ai/Raon-OpenTTS.
Training Details
Raon-OpenTTS-1B was trained for 520K update steps on 48 NVIDIA B200 GPUs using the Raon-OpenTTS-Core dataset (510.1K hours of English speech). The model uses AdamW optimization with a peak learning rate of 1e-4, 50K warmup steps, and linear decay. Gradient norm is clipped at 1.0. Waveform synthesis uses a HiFi-GAN vocoder pretrained on LibriTTS at 16kHz.
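The learning-rate schedule described above can be sketched as a plain function of the update step. The peak value, warmup length, and total step count come from this card. The exact shape is an assumption: warmup is taken to be linear from zero, and the linear decay is taken to reach zero at the final update, which may differ from the floor values used in the actual training code.

```python
PEAK_LR = 1e-4
WARMUP_STEPS = 50_000
TOTAL_STEPS = 520_000


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay (assumed to end at zero)."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < WARMUP_STEPS:
        # Assumption: warmup starts from a learning rate of zero.
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 25_000, 50_000, 285_000, 520_000):
    print(step, learning_rate(step))
```

Under these assumptions the rate is at half of its peak both midway through warmup (step 25,000) and midway through decay (step 285,000).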
Citation
@article{kim2026raonopentts,
title = {Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech},
author = {Kim, Semin and Chung, Seungjun and Moon, Taehong and Lee, Sangheon and Ahn, Minyoung and Lee, Keon and Kim, Nam Soo and Cho, Jaewoong and Schmidt, Ludwig and Lee, Kangwook and Park, Dongmin},
journal = {arXiv preprint arXiv:2605.20830},
year = {2026},
url = {https://arxiv.org/abs/2605.20830}
}
License
This repository is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.
© 2026 KRAFTON