Amharic XTTS-v2 (Adapted) by Spitch AI

An Amharic text-to-speech (TTS) model adapted from XTTS-v2 and trained by Spitch AI (Lagos, Nigeria) on approximately 500 hours of data.

Repository Contents

checkpoint_232000.pth (GPT checkpoint)
dvae.pth
mel_stats.pth
config.json
vocab.json
requirements.txt
amharic_financial_normalizer.py
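
Before loading the model, it can help to confirm that the files listed above are all present locally. The following is a minimal sketch, not part of the repository: `REQUIRED_FILES` and `missing_files` are hypothetical names, and it assumes the files sit together in a single directory.

```python
from pathlib import Path

# File names copied from the repository listing above.
REQUIRED_FILES = [
    "checkpoint_232000.pth",
    "dvae.pth",
    "mel_stats.pth",
    "config.json",
    "vocab.json",
    "requirements.txt",
    "amharic_financial_normalizer.py",
]


def missing_files(model_dir: str = ".") -> list[str]:
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All repository files are present.")
```

Running this from the directory that holds the downloaded files lists anything that still needs to be fetched before the inference script is run.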

Quick Setup

pip install -r requirements.txt

Inference (Python)

# Compatibility shim: pandas versions before 2.1 have no DataFrame.map, only applymap.
import pandas as pd
if not hasattr(pd.DataFrame, 'map'):
    pd.DataFrame.map = pd.DataFrame.applymap

import torch
import torchaudio
import epitran
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
from amharic_financial_normalizer import normalize_amharic_financial_text

CHECKPOINT_PATH = "checkpoint_232000.pth"
CONFIG_PATH = "config.json"
VOCAB_PATH = "vocab.json"
DVAE_PATH = "dvae.pth"
MEL_STATS_PATH = "mel_stats.pth"

config = XttsConfig()
config.load_json(CONFIG_PATH)
config.model_args.dvae_checkpoint = DVAE_PATH
config.model_args.mel_norm_file = MEL_STATS_PATH
config.model_args.vocab_file = VOCAB_PATH

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path=CHECKPOINT_PATH,
    vocab_path=VOCAB_PATH,
    use_deepspeed=False,
)

if torch.cuda.is_available():
    model.cuda()

epi_am = epitran.Epitran("amh-Ethi-pp")


def preprocess_text(text: str, lang: str = "am") -> str:
    if lang == "am":
        # 1) Normalize numbers and money expressions first.
        text = normalize_amharic_financial_text(text)
        # 2) Transliterate Ethiopic script to IPA with Epitran for model robustness.
        text = epi_am.transliterate(text)
    return text


ref_audio = "reference.wav"  # 3-10 seconds, clean speech
text_am = "ወደ 77777 ብር 250.50 ያስገቡ"
processed_text = preprocess_text(text_am, "am")

# Derive speaker conditioning from the reference clip, then synthesize speech.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=[ref_audio])
out = model.inference(
    text=processed_text,
    language="am",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.5,
    length_penalty=1.0,
    repetition_penalty=2.0,
    top_k=50,
    top_p=0.8,
    enable_text_splitting=True,
)

# out["wav"] is a mono waveform; add a channel dimension and save at 24 kHz.
torchaudio.save("output_amharic.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
print("Saved output_amharic.wav")

Notes

normalize_amharic_financial_text(...) converts amounts, decimals, phone-like IDs, and short codes into TTS-friendly Amharic text.
Best results come from high-quality reference audio: 3-10 seconds of clean speech.
Output sample rate is 24000 Hz.
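
The inference script asks for a reference clip of 3-10 seconds. A quick length check before synthesis could look like the sketch below. It is an illustration only: `wav_duration_seconds` and `reference_length_ok` are hypothetical helpers, it uses Python's standard `wave` module (so it assumes an uncompressed PCM WAV file), and it checks duration only, not how clean the recording is.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def reference_length_ok(path: str, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """Check that the clip length falls within the suggested 3-10 second range."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

For example, `reference_length_ok("reference.wav")` can be called before `model.get_conditioning_latents(...)` to catch clips that are too short or too long.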
