# LLM-Codec
LLM-Codec is a neural audio codec checkpoint trained to produce discrete audio tokens that are both reconstructable and easier for autoregressive language models to predict.
- Model: https://huggingface.co/voidful/llm-codec
- Code: https://github.com/voidful/llm-codec
- Usage reference: https://github.com/voidful/Codec-SUPERB
## Model Description
Most neural audio codecs are trained for waveform reconstruction. Spoken language models, however, consume codec tokens with a next-token prediction objective. This mismatch can make acoustically valid variation appear as token uncertainty to the language model.
LLM-Codec adapts a codec with language-model-facing objectives while keeping the deployed codec interface unchanged. The model is trained with:
- Future Token Prediction (FTP): Medusa-style heads predict future audio tokens from frozen-LLM hidden states.
- Semantic Alignment (SA): audio-induced hidden states are aligned with paired text hidden states inside a frozen LLM.
- Differentiable Gumbel bridge: hard Gumbel-Softmax keeps discrete forward tokens while enabling gradients to flow to the codec encoder.
- Reconstruction losses: mel, multi-scale mel, multi-resolution STFT, complex STFT, VQ, GAN, and feature matching losses.
The deployed codec does not require the auxiliary FTP heads.
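The straight-through behavior of the Gumbel bridge can be illustrated with a minimal PyTorch sketch (the shapes here are hypothetical; `torch.nn.functional.gumbel_softmax` with `hard=True` emits one-hot vectors in the forward pass while gradients flow through the soft distribution):

```python
import torch
import torch.nn.functional as F

# Hypothetical encoder logits over the audio vocabulary:
# a batch of 2 frames, 20,480 candidate tokens each.
logits = torch.randn(2, 20480, requires_grad=True)

# hard=True: the forward pass produces discrete one-hot tokens,
# while the backward pass uses the soft softmax gradients
# (the straight-through estimator).
one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)
token_ids = one_hot.argmax(dim=-1)  # discrete ids for the LLM side

# Gradients still reach the encoder-side logits.
one_hot.sum().backward()
assert logits.grad is not None
```

This is why the codec encoder can receive gradients from LLM-facing losses even though the tokens it emits stay discrete.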
## Intended Use
This model is intended for research and development in:
- audio tokenization for spoken language modeling
- codec reconstruction experiments
- token-level speech LM training
- Codec-SUPERB style codec evaluation
- speech token analysis and ablation studies
It is not a full text-to-speech system by itself. For speech generation, use the codec as the tokenizer/decoder inside a separate speech language modeling pipeline.
## Out-of-Scope Use
Do not use this model for:
- impersonation or unauthorized voice cloning
- surveillance or speaker tracking without consent
- high-stakes speaker, language, or identity decisions
- generating deceptive audio content
## Installation
The easiest inference path is through the Codec-SUPERB `SoundCodec` interface.

```bash
git clone https://github.com/voidful/Codec-SUPERB.git
cd Codec-SUPERB
pip install -r requirements.txt
export PYTHONPATH=$PWD:$PYTHONPATH
```
If your environment supports editable installs, this is also convenient:

```bash
pip install -e .
```
## Quick Start
Load LLM-Codec through the Codec-SUPERB codec registry:

```python
from SoundCodec import codec

# List the registered codecs, then load LLM-Codec by name.
print(codec.list_codec())
model = codec.load_codec("llmcodec")
```
Encode and reconstruct one audio file:

```python
from SoundCodec import codec
import torchaudio
import soundfile as sf

model = codec.load_codec("llmcodec")

# Load the audio; the first channel is taken, since the codec expects mono.
waveform, sample_rate = torchaudio.load("sample_audio.wav")
data_item = {
    "audio": {
        "array": waveform.numpy()[0],
        "sampling_rate": sample_rate,
    }
}

# Discrete audio tokens for the file.
units = model.extract_unit(data_item).unit
print("Unit shape:", units.shape)

# Full encode-decode round trip.
result = model.synth(data_item, local_save=False)
reconstructed = result["audio"]["array"]
reconstructed_sr = result["audio"].get("sampling_rate", sample_rate)
sf.write("reconstructed.wav", reconstructed, reconstructed_sr)
```
## Batch Usage
Codec-SUPERB also provides batch APIs:

```python
from SoundCodec import codec
import torchaudio

model = codec.load_codec("llmcodec")

audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
data_list = []
for path in audio_files:
    waveform, sample_rate = torchaudio.load(path)
    data_list.append({
        "id": path,
        "audio": {
            "array": waveform.numpy()[0],
            "sampling_rate": sample_rate,
        },
    })

# Tokenize, decode, or run the full round trip in batches.
batch_units = model.batch_extract_unit(data_list)
batch_audio = model.batch_decode_unit(batch_units)
results = model.batch_synth(data_list, local_save=False)
for item in results:
    print(item["unit"].shape, item["audio"]["array"].shape)
```
For better throughput, group audio samples with similar lengths before batching.
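A simple way to do that is a length-bucketing pass over the `data_list` structure used above (a sketch; `bucket_batches` is a hypothetical helper, not part of the Codec-SUPERB API):

```python
def bucket_batches(data_list, batch_size):
    """Sort items by waveform length, then chunk into batches.

    Similar-length items end up together, so each batch pads less.
    Assumes the data_list item structure shown above.
    """
    ordered = sorted(data_list, key=lambda d: len(d["audio"]["array"]))
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]
```

Each bucket can then be passed to `batch_synth` in turn.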
## Codec-SUPERB Evaluation
To evaluate LLM-Codec with Codec-SUPERB-tiny:
```bash
PYTHONPATH=. python3 scripts/dataset_creator.py \
    --dataset voidful/codec-superb-tiny
PYTHONPATH=. python3 scripts/benchmarking.py \
    --dataset datasets/voidful/codec-superb-tiny_synth \
    --models llmcodec
```
## Model Files
The model repository provides:

- codec weights as `llm-codec.pt`
- a tokenizer extended with `<CODEC_*>` audio tokens
- Qwen-compatible model artifacts containing trained audio-token embeddings

The codec uses 20,480 audio tokens with the canonical token format:

`<CODEC_0>`, `<CODEC_1>`, ..., `<CODEC_20479>`
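For token-level LM training, discrete unit ids map one-to-one onto these token strings; a minimal conversion (the helper name is hypothetical):

```python
def units_to_text(unit_ids):
    """Map discrete codec unit ids to their <CODEC_*> token strings."""
    return "".join(f"<CODEC_{i}>" for i in unit_ids)

# units_to_text([0, 7, 20479]) -> "<CODEC_0><CODEC_7><CODEC_20479>"
```

The resulting string can be fed through the extended tokenizer like ordinary text.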
## Training Data
The codec was trained on LibriSpeech train-clean-100 with paired transcripts; a LibriSpeech validation split was used during training.
Because training is speech-centric and transcript-supervised, performance may be weaker on non-English speech, conversational speech, music, environmental audio, or audio with strong noise and overlap.
## Training Procedure
Base components:
- Base codec: AUV
- Frozen LLM backbone: Qwen3-4B-Instruct
- Token rate: 50 Hz
- Audio vocabulary size: 20,480
- Segment length: 4 seconds
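Together, the token rate and segment length fix the token budget per training segment (simple arithmetic over the values above):

```python
token_rate_hz = 50     # audio tokens per second
segment_seconds = 4    # training segment length
tokens_per_segment = token_rate_hz * segment_seconds
print(tokens_per_segment)  # 200 tokens per 4-second segment
```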
Losses:
- reconstruction mel loss
- multi-scale mel loss
- multi-resolution STFT loss
- complex STFT loss with phase term
- VQ commitment loss
- Gumbel bridge cross entropy
- Future Token Prediction loss
- Semantic Alignment cosine loss
- Semantic Alignment contrastive loss with memory bank
- MPD/MSD GAN and feature matching losses
## Evaluation Results
### Token Learnability
SALMon speech coherence accuracy after token-level LM training:
| Tokenizer | Overall accuracy |
|---|---|
| WavTok-L | 48.3 |
| BigCodec | 49.4 |
| UniCodec | 50.1 |
| AUV | 49.4 |
| LLM-Codec | 61.6 |
Token-level perplexity on LibriSpeech after 3 epochs of LM training:
| Tokenizer | Eval loss | Perplexity |
|---|---|---|
| WavTok-L | 11.91 | 148,122 |
| UniCodec | 11.92 | 150,197 |
| BigCodec | 11.96 | 156,448 |
| AUV | 11.98 | 159,768 |
| LLM-Codec | 8.44 | 4,617 |
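The reported perplexities are consistent with exp(eval loss) up to rounding; a quick check over two rows of the table:

```python
import math

# Perplexity = exp(cross-entropy loss); compare against reported values.
for loss, reported_ppl in [(11.91, 148_122), (8.44, 4_617)]:
    assert math.isclose(math.exp(loss), reported_ppl, rel_tol=0.01)
```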
### Reconstruction Quality
Codec-SUPERB-tiny speech reconstruction:
| Model | Mel ↓ | STFT ↓ | PESQ ↑ | STOI ↑ |
|---|---|---|---|---|
| AUV base | 0.762 | 1.648 | 2.094 | 0.850 |
| LLM-Codec | 0.724 | 1.599 | 2.102 | 0.859 |
## Limitations
- The semantic alignment objective depends on paired speech and text.
- The model is primarily validated on read speech.
- Downstream generation quality depends on the separate speech language model.
- The model may preserve speaker identity information present in the input.
- The Hugging Face `transformers` artifacts are not a standalone text chatbot; they accompany the codec/tokenizer workflow.
## Citation
```bibtex
@article{chung2026llm,
  title={LLM-Codec: Neural Audio Codec Meets Language Model Objectives},
  author={Chung, Ho-Lam and Chen, Yiming and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2604.17852},
  note={Model and code available at https://github.com/voidful/llm-codec},
  year={2026}
}
```
If you use the Codec-SUPERB interface or benchmark, please also cite Codec-SUPERB:
```bibtex
@inproceedings{wu-etal-2024-codec,
  title={Codec-SUPERB: An In-Depth Analysis of Sound Codec Models},
  author={Wu, Haibin and Chung, Ho-Lam and Lin, Yi-Cheng and Wu, Yuan-Kuei and Chen, Xuanjun and Pai, Yu-Chi and Wang, Hsiu-Hsuan and Chang, Kai-Wei and Liu, Alexander and Lee, Hung-yi},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2024},
  year={2024},
  url={https://aclanthology.org/2024.findings-acl.616},
  doi={10.18653/v1/2024.findings-acl.616},
  pages={10330--10348}
}
```