# LLM-Codec
LLM-Codec is a neural audio codec checkpoint trained to produce discrete audio tokens that are both reconstructable and easier for autoregressive language models to predict.
- Model: https://huggingface.co/voidful/llm-codec
- Code: https://github.com/voidful/llm-codec
- Usage reference: https://github.com/voidful/Codec-SUPERB
## Model Description
Most neural audio codecs are trained for waveform reconstruction. Spoken language models, however, consume codec tokens with a next-token prediction objective. This mismatch can make acoustically valid variation appear as token uncertainty to the language model.
LLM-Codec adapts a codec with language-model-facing objectives while keeping the deployed codec interface unchanged. The model is trained with:
- Future Token Prediction (FTP): Medusa-style heads predict future audio tokens from frozen-LLM hidden states.
- Semantic Alignment (SA): audio-induced hidden states are aligned with paired text hidden states inside a frozen LLM.
- Differentiable Gumbel bridge: hard Gumbel-Softmax keeps discrete forward tokens while enabling gradients to flow to the codec encoder.
- Reconstruction losses: mel, multi-scale mel, multi-resolution STFT, complex STFT, VQ, GAN, and feature matching losses.
The deployed codec does not require the auxiliary FTP heads.
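The straight-through behavior of the Gumbel bridge can be illustrated with a minimal PyTorch sketch (the shapes here are hypothetical; `torch.nn.functional.gumbel_softmax` with `hard=True` emits one-hot vectors in the forward pass while gradients flow through the soft distribution):

```python
import torch
import torch.nn.functional as F

# Hypothetical encoder logits over the audio vocabulary:
# a batch of 2 frames, 20,480 candidate tokens each.
logits = torch.randn(2, 20480, requires_grad=True)

# hard=True: the forward pass produces discrete one-hot tokens,
# while the backward pass uses the soft softmax gradients
# (the straight-through estimator).
one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)
token_ids = one_hot.argmax(dim=-1)  # discrete ids for the LLM side

# Gradients still reach the encoder-side logits.
one_hot.sum().backward()
assert logits.grad is not None
```

This is why the codec encoder can receive gradients from LLM-facing losses even though the tokens it emits stay discrete.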
## Intended Use
This model is intended for research and development in:
- audio tokenization for spoken language modeling
- codec reconstruction experiments
- token-level speech LM training
- Codec-SUPERB style codec evaluation
- speech token analysis and ablation studies
It is not a full text-to-speech system by itself. For speech generation, use the codec as the tokenizer/decoder inside a separate speech language modeling pipeline.
## Out-of-Scope Use
Do not use this model for:
- impersonation or unauthorized voice cloning
- surveillance or speaker tracking without consent
- high-stakes speaker, language, or identity decisions
- generating deceptive audio content
## Installation
The easiest inference path is through the Codec-SUPERB `SoundCodec` interface.

```bash
git clone https://github.com/voidful/Codec-SUPERB.git
cd Codec-SUPERB
pip install -r requirements.txt
export PYTHONPATH=$PWD:$PYTHONPATH
```
If your environment supports editable installs, this is also convenient:

```bash
pip install -e .
```
## Quick Start
Load LLM-Codec through the Codec-SUPERB codec registry:

```python
from SoundCodec import codec

# List the registered codecs, then load LLM-Codec by name.
print(codec.list_codec())
model = codec.load_codec("llmcodec")
```
Encode and reconstruct one audio file:

```python
from SoundCodec import codec
import torchaudio
import soundfile as sf

model = codec.load_codec("llmcodec")

# Load the audio; the first channel is taken, since the codec expects mono.
waveform, sample_rate = torchaudio.load("sample_audio.wav")
data_item = {
    "audio": {
        "array": waveform.numpy()[0],
        "sampling_rate": sample_rate,
    }
}

# Discrete audio tokens for the file.
units = model.extract_unit(data_item).unit
print("Unit shape:", units.shape)

# Full encode-decode round trip.
result = model.synth(data_item, local_save=False)
reconstructed = result["audio"]["array"]
reconstructed_sr = result["audio"].get("sampling_rate", sample_rate)
sf.write("reconstructed.wav", reconstructed, reconstructed_sr)
```
## Batch Usage
Codec-SUPERB also provides batch APIs:

```python
from SoundCodec import codec
import torchaudio

model = codec.load_codec("llmcodec")

audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
data_list = []
for path in audio_files:
    waveform, sample_rate = torchaudio.load(path)
    data_list.append({
        "id": path,
        "audio": {
            "array": waveform.numpy()[0],
            "sampling_rate": sample_rate,
        },
    })

# Tokenize, decode, or run the full round trip in batches.
batch_units = model.batch_extract_unit(data_list)
batch_audio = model.batch_decode_unit(batch_units)
results = model.batch_synth(data_list, local_save=False)
for item in results:
    print(item["unit"].shape, item["audio"]["array"].shape)
```
For better throughput, group audio samples with similar lengths before batching.
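A simple way to do that is a length-bucketing pass over the `data_list` structure used above (a sketch; `bucket_batches` is a hypothetical helper, not part of the Codec-SUPERB API):

```python
def bucket_batches(data_list, batch_size):
    """Sort items by waveform length, then chunk into batches.

    Similar-length items end up together, so each batch pads less.
    Assumes the data_list item structure shown above.
    """
    ordered = sorted(data_list, key=lambda d: len(d["audio"]["array"]))
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]
```

Each bucket can then be passed to `batch_synth` in turn.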
## Codec-SUPERB Evaluation
To evaluate LLM-Codec with Codec-SUPERB-tiny:
```bash
PYTHONPATH=. python3 scripts/dataset_creator.py \
    --dataset voidful/codec-superb-tiny
PYTHONPATH=. python3 scripts/benchmarking.py \
    --dataset datasets/voidful/codec-superb-tiny_synth \
    --models llmcodec
```
## Model Files
The model repository provides:

- codec weights as `llm-codec.pt`
- a tokenizer extended with `<CODEC_*>` audio tokens
- Qwen-compatible model artifacts containing trained audio-token embeddings

The codec uses 20,480 audio tokens with the canonical token format:

`<CODEC_0>`, `<CODEC_1>`, ..., `<CODEC_20479>`
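For token-level LM training, discrete unit ids map one-to-one onto these token strings; a minimal conversion (the helper name is hypothetical):

```python
def units_to_text(unit_ids):
    """Map discrete codec unit ids to their <CODEC_*> token strings."""
    return "".join(f"<CODEC_{i}>" for i in unit_ids)

# units_to_text([0, 7, 20479]) -> "<CODEC_0><CODEC_7><CODEC_20479>"
```

The resulting string can be fed through the extended tokenizer like ordinary text.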
## Training Data
The codec was trained on LibriSpeech train-clean-100 with paired transcripts; a LibriSpeech validation split was used during training.
Because training is speech-centric and transcript-supervised, performance may be weaker on non-English speech, conversational speech, music, environmental audio, or audio with strong noise and overlap.
## Training Procedure
Base components:
- Base codec: AUV
- Frozen LLM backbone: Qwen3-4B-Instruct
- Token rate: 50 Hz
- Audio vocabulary size: 20,480
- Segment length: 4 seconds
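Together, the token rate and segment length fix the token budget per training segment (simple arithmetic over the values above):

```python
token_rate_hz = 50     # audio tokens per second
segment_seconds = 4    # training segment length
tokens_per_segment = token_rate_hz * segment_seconds
print(tokens_per_segment)  # 200 tokens per 4-second segment
```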
Losses:
- reconstruction mel loss
- multi-scale mel loss
- multi-resolution STFT loss
- complex STFT loss with phase term
- VQ commitment loss
- Gumbel bridge cross entropy
- Future Token Prediction loss
- Semantic Alignment cosine loss
- Semantic Alignment contrastive loss with memory bank
- MPD/MSD GAN and feature matching losses
## Evaluation Results
### Token Learnability
SALMon speech coherence accuracy after token-level LM training:
| Tokenizer | Overall accuracy |
|---|---|
| WavTok-L | 48.3 |
| BigCodec | 49.4 |
| UniCodec | 50.1 |
| AUV | 49.4 |
| LLM-Codec | 61.6 |
Token-level perplexity on LibriSpeech after 3 epochs of LM training:
| Tokenizer | Eval loss | Perplexity |
|---|---|---|
| WavTok-L | 11.91 | 148,122 |
| UniCodec | 11.92 | 150,197 |
| BigCodec | 11.96 | 156,448 |
| AUV | 11.98 | 159,768 |
| LLM-Codec | 8.44 | 4,617 |
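The reported perplexities are consistent with exp(eval loss) up to rounding; a quick check over two rows of the table:

```python
import math

# Perplexity = exp(cross-entropy loss); compare against reported values.
for loss, reported_ppl in [(11.91, 148_122), (8.44, 4_617)]:
    assert math.isclose(math.exp(loss), reported_ppl, rel_tol=0.01)
```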
### Reconstruction Quality
Codec-SUPERB-tiny speech reconstruction:
| Model | Mel ↓ | STFT ↓ | PESQ ↑ | STOI ↑ |
|---|---|---|---|---|
| AUV base | 0.762 | 1.648 | 2.094 | 0.850 |
| LLM-Codec | 0.724 | 1.599 | 2.102 | 0.859 |
## Limitations
- The semantic alignment objective depends on paired speech and text.
- The model is primarily validated on read speech.
- Downstream generation quality depends on the separate speech language model.
- The model may preserve speaker identity information present in the input.
- The Hugging Face `transformers` artifacts are not a standalone text chatbot; they accompany the codec/tokenizer workflow.
## Citation
```bibtex
@article{chung2026llm,
  title={LLM-Codec: Neural Audio Codec Meets Language Model Objectives},
  author={Chung, Ho-Lam and Chen, Yiming and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2604.17852},
  note={Model and code available at https://github.com/voidful/llm-codec},
  year={2026}
}
```
If you use the Codec-SUPERB interface or benchmark, please also cite Codec-SUPERB:
```bibtex
@inproceedings{wu-etal-2024-codec,
  title={Codec-SUPERB: An In-Depth Analysis of Sound Codec Models},
  author={Wu, Haibin and Chung, Ho-Lam and Lin, Yi-Cheng and Wu, Yuan-Kuei and Chen, Xuanjun and Pai, Yu-Chi and Wang, Hsiu-Hsuan and Chang, Kai-Wei and Liu, Alexander and Lee, Hung-yi},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2024},
  year={2024},
  url={https://aclanthology.org/2024.findings-acl.616},
  doi={10.18653/v1/2024.findings-acl.616},
  pages={10330--10348}
}
```