Amharic XTTS-v2 (Adapted) by Spitch AI

Amharic TTS model adapted from XTTS-v2, trained on approximately 500 hours of data by Spitch AI (Lagos, Nigeria).

Repository Contents

  • checkpoint_232000.pth (GPT checkpoint)
  • dvae.pth
  • mel_stats.pth
  • config.json
  • vocab.json
  • requirements.txt
  • amharic_financial_normalizer.py

Quick Setup

pip install -r requirements.txt

Inference (Python)

import pandas as pd

# Compatibility shim: DataFrame.map only exists in pandas >= 2.1; some
# TTS dependencies call it, so alias the older applymap when it is missing.
if not hasattr(pd.DataFrame, 'map'):
    pd.DataFrame.map = pd.DataFrame.applymap

import torch
import torchaudio
import epitran
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
from amharic_financial_normalizer import normalize_amharic_financial_text

CHECKPOINT_PATH = "checkpoint_232000.pth"
CONFIG_PATH = "config.json"
VOCAB_PATH = "vocab.json"
DVAE_PATH = "dvae.pth"
MEL_STATS_PATH = "mel_stats.pth"

config = XttsConfig()
config.load_json(CONFIG_PATH)
config.model_args.dvae_checkpoint = DVAE_PATH
config.model_args.mel_norm_file = MEL_STATS_PATH
config.model_args.vocab_file = VOCAB_PATH

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path=CHECKPOINT_PATH,
    vocab_path=VOCAB_PATH,
    use_deepspeed=False,
)

if torch.cuda.is_available():
    model.cuda()

# Grapheme-to-phoneme transliterator for Amharic written in Ethiopic script.
epi_am = epitran.Epitran("amh-Ethi-pp")


def preprocess_text(text: str, lang: str = "am") -> str:
    if lang == "am":
        # 1) Normalize numbers and money expressions first.
        text = normalize_amharic_financial_text(text)
        # 2) Transliterate Ethiopic script for model robustness.
        text = epi_am.transliterate(text)
    return text


ref_audio = "reference.wav"  # 3-10 seconds, clean speech
text_am = "ወደ 77777 ብር 250.50 ያስገቡ"  # sample text with an amount and a short code
processed_text = preprocess_text(text_am, "am")

# Extract speaker conditioning (GPT latents + speaker embedding) from the reference clip.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=[ref_audio])
out = model.inference(
    text=processed_text,
    language="am",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.5,
    length_penalty=1.0,
    repetition_penalty=2.0,
    top_k=50,
    top_p=0.8,
    enable_text_splitting=True,
)

# out["wav"] is a mono float waveform at 24 kHz.
torchaudio.save("output_amharic.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
print("Saved output_amharic.wav")
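Synthesized waveforms often peak well below full scale, so you may want to peak-normalize before saving. A minimal stdlib sketch, assuming out["wav"] is a flat sequence of floats in roughly [-1, 1]; the peak_normalize helper is illustrative, not part of this repository:

```python
def peak_normalize(samples, target=0.95):
    """Scale samples so the largest absolute value equals `target`."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Silence: nothing to scale.
        return list(samples)
    scale = target / peak
    return [s * scale for s in samples]
```

Apply it to the waveform before the torchaudio.save call if loudness varies between generations.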

Notes

  • normalize_amharic_financial_text(...) expands amounts, decimals, phone-like IDs, and short codes into TTS-friendly Amharic wording.
  • Best results come from high-quality reference audio.
  • Output sample rate is 24000 Hz.
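If torchaudio is not available at deployment time, the 24000 Hz mono output can also be written with Python's standard-library wave module. A sketch, assuming a flat list of floats in [-1, 1]; the save_wav helper name is illustrative:

```python
import struct
import wave

def save_wav(path, samples, sample_rate=24000):
    """Write mono float samples as a 16-bit PCM WAV file."""
    # Clamp each sample to [-1, 1], then scale to the signed 16-bit range.
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
        for s in samples
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)    # mono
        wf.setsampwidth(2)    # 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
```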