Amharic XTTS-v2 (Adapted) by Spitch AI
Amharic TTS model adapted from XTTS-v2, trained on approximately 500 hours of data by Spitch AI (Lagos, Nigeria).
Repository Contents
- checkpoint_232000.pth (GPT checkpoint)
- dvae.pth
- mel_stats.pth
- config.json
- vocab.json
- requirements.txt
- amharic_financial_normalizer.py
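Before loading the model, it can help to confirm that the files listed above are present locally. The following is a minimal sketch, assuming all files sit in a single directory; `find_missing_files` and `EXPECTED_FILES` are hypothetical names introduced here for illustration and are not part of the repository.

```python
from pathlib import Path

# File names as listed under Repository Contents.
EXPECTED_FILES = [
    "checkpoint_232000.pth",
    "dvae.pth",
    "mel_stats.pth",
    "config.json",
    "vocab.json",
    "requirements.txt",
    "amharic_financial_normalizer.py",
]


def find_missing_files(model_dir: str, expected=EXPECTED_FILES) -> list[str]:
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Hypothetical usage: check the current working directory.
    missing = find_missing_files(".")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected files found.")
```

Running the check first turns a missing checkpoint or vocabulary file into a short message rather than an error partway through model loading.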
Quick Setup
pip install -r requirements.txt
Inference (Python)
import pandas as pd
# Older pandas versions lack DataFrame.map; alias it to applymap.
if not hasattr(pd.DataFrame, 'map'):
    pd.DataFrame.map = pd.DataFrame.applymap
import torch
import torchaudio
import epitran
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
from amharic_financial_normalizer import normalize_amharic_financial_text
CHECKPOINT_PATH = "checkpoint_232000.pth"
CONFIG_PATH = "config.json"
VOCAB_PATH = "vocab.json"
DVAE_PATH = "dvae.pth"
MEL_STATS_PATH = "mel_stats.pth"
config = XttsConfig()
config.load_json(CONFIG_PATH)
config.model_args.dvae_checkpoint = DVAE_PATH
config.model_args.mel_norm_file = MEL_STATS_PATH
config.model_args.vocab_file = VOCAB_PATH
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path=CHECKPOINT_PATH,
    vocab_path=VOCAB_PATH,
    use_deepspeed=False,
)
if torch.cuda.is_available():
    model.cuda()
epi_am = epitran.Epitran("amh-Ethi-pp")
def preprocess_text(text: str, lang: str = "am") -> str:
    if lang == "am":
        # 1) Normalize numbers and money expressions first.
        text = normalize_amharic_financial_text(text)
        # 2) Transliterate Ethiopic script for model robustness.
        text = epi_am.transliterate(text)
    return text
ref_audio = "reference.wav" # 3-10 seconds, clean speech
text_am = "ወደ 77777 ብር 250.50 ያስገቡ"
processed_text = preprocess_text(text_am, "am")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=[ref_audio])
out = model.inference(
    text=processed_text,
    language="am",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.5,
    length_penalty=1.0,
    repetition_penalty=2.0,
    top_k=50,
    top_p=0.8,
    enable_text_splitting=True,
)
torchaudio.save("output_amharic.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
print("Saved output_amharic.wav")
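The inference script expects a reference clip of roughly 3 to 10 seconds. A length check along these lines could be run before computing conditioning latents. This is a sketch only: it uses the standard-library `wave` module, so it assumes an uncompressed PCM WAV file, and the helper names `wav_duration_seconds` and `reference_length_ok` are hypothetical. It does not assess how clean the speech is.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def reference_length_ok(path: str, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """Check that the clip length falls within [min_s, max_s] seconds."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

For example, `reference_length_ok("reference.wav")` returns `False` for a one-second clip, which is a cue to record a longer sample before calling `model.get_conditioning_latents`.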
Notes
- normalize_amharic_financial_text(...) handles amounts, decimals, phone-like IDs, and short codes for TTS-friendly Amharic speech.
- Best results come from high-quality reference audio.
- Output sample rate is 24000 Hz.
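Because the output sample rate is fixed at 24000 Hz, clip length and sample count convert directly into each other. The helpers below are a small illustrative sketch; `duration_seconds` and `samples_for` are hypothetical names, not functions from the repository or from the TTS library.

```python
OUTPUT_SAMPLE_RATE = 24000  # Hz, the output sample rate stated in the notes


def duration_seconds(num_samples: int, sample_rate: int = OUTPUT_SAMPLE_RATE) -> float:
    """Convert a sample count into a duration in seconds."""
    return num_samples / sample_rate


def samples_for(seconds: float, sample_rate: int = OUTPUT_SAMPLE_RATE) -> int:
    """Convert a duration in seconds into the nearest whole sample count."""
    return round(seconds * sample_rate)
```

Assuming `out["wav"]` from the inference script is a one-dimensional array of samples, `duration_seconds(len(out["wav"]))` would give the length of the generated audio in seconds.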