---
base_model:
- Qwen/Qwen3-4B-Instruct-2507
library_name: transformers
pipeline_tag: text-generation
tags:
- audio
- speech
- audio-codec
- neural-audio-codec
- spoken-language-modeling
- codec-superb
- qwen3
datasets:
- librispeech_asr
metrics:
- perplexity
- pesq
- stoi
---

# LLM-Codec

LLM-Codec is a neural audio codec checkpoint trained to produce discrete audio tokens that both reconstruct well and are easier for autoregressive language models to predict.

- Model: https://huggingface.co/voidful/llm-codec
- Code: https://github.com/voidful/llm-codec
- Usage reference: https://github.com/voidful/Codec-SUPERB

## Model Description

Most neural audio codecs are trained purely for waveform reconstruction. Spoken language models, however, consume codec tokens under a next-token prediction objective. This mismatch can make acoustically valid variation look like token uncertainty to the language model. LLM-Codec adapts a codec with language-model-facing objectives while keeping the deployed codec interface unchanged.

The model is trained with:

- Future Token Prediction (FTP): Medusa-style heads predict future audio tokens from frozen-LLM hidden states.
- Semantic Alignment (SA): audio-induced hidden states are aligned with paired text hidden states inside a frozen LLM.
- Differentiable Gumbel bridge: hard Gumbel-Softmax keeps the forward-pass tokens discrete while letting gradients flow back to the codec encoder (see the sketch after this list).
- Reconstruction losses: mel, multi-scale mel, multi-resolution STFT, complex STFT, VQ, GAN, and feature-matching losses.

The deployed codec does not require the auxiliary FTP heads.
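To make the Gumbel bridge bullet concrete, here is a minimal PyTorch sketch of a straight-through Gumbel-Softmax bridge. It is an illustration under assumed names and shapes (`encoder_logits`, `audio_embedding`, and the 2560-wide hidden size are placeholders), not the repository's actual implementation.

```python
import torch
import torch.nn.functional as F

vocab_size = 20480   # audio vocabulary size from this model card
hidden_size = 2560   # assumed LLM embedding width (illustrative)

# Stand-in for the codec encoder's per-frame logits over the audio vocabulary.
encoder_logits = torch.randn(1, 200, vocab_size, requires_grad=True)

# hard=True: the forward pass emits exact one-hot tokens, while the backward
# pass uses the soft relaxation, so gradients still reach the codec encoder.
one_hot_tokens = F.gumbel_softmax(encoder_logits, tau=1.0, hard=True, dim=-1)

# Discrete ids for anything that needs real tokens (logging, decoding, ...)
token_ids = one_hot_tokens.argmax(dim=-1)

# ...and differentiable LLM input embeddings via the one-hot matmul trick.
audio_embedding = torch.nn.Embedding(vocab_size, hidden_size)
llm_inputs = one_hot_tokens @ audio_embedding.weight  # (batch, frames, hidden)
llm_inputs.sum().backward()  # gradients flow into encoder_logits
```

This is the standard straight-through estimator: the language-model side always sees exact one-hot tokens, while the encoder still receives a usable gradient signal.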
## Intended Use

This model is intended for research and development in:

- audio tokenization for spoken language modeling
- codec reconstruction experiments
- token-level speech LM training
- Codec-SUPERB-style codec evaluation
- speech token analysis and ablation studies

It is not a full text-to-speech system by itself. For speech generation, use the codec as the tokenizer/decoder inside a separate speech language modeling pipeline.

## Out-of-Scope Use

Do not use this model for:

- impersonation or unauthorized voice cloning
- surveillance or speaker tracking without consent
- high-stakes speaker, language, or identity decisions
- generating deceptive audio content

## Installation

The easiest inference path is through the Codec-SUPERB `SoundCodec` interface.

```bash
git clone https://github.com/voidful/Codec-SUPERB.git
cd Codec-SUPERB
pip install -r requirements.txt
export PYTHONPATH=$PWD:$PYTHONPATH
```

If your environment supports editable installs, this is also convenient:

```bash
pip install -e .
```

## Quick Start

Load LLM-Codec through the Codec-SUPERB codec registry:

```python
from SoundCodec import codec

print(codec.list_codec())
model = codec.load_codec("llmcodec")
```

Encode and reconstruct one audio file:

```python
from SoundCodec import codec
import torchaudio
import soundfile as sf

model = codec.load_codec("llmcodec")

waveform, sample_rate = torchaudio.load("sample_audio.wav")
data_item = {
    "audio": {
        "array": waveform.numpy()[0],
        "sampling_rate": sample_rate,
    }
}

units = model.extract_unit(data_item).unit
print("Unit shape:", units.shape)

result = model.synth(data_item, local_save=False)
reconstructed = result["audio"]["array"]
reconstructed_sr = result["audio"].get("sampling_rate", sample_rate)
sf.write("reconstructed.wav", reconstructed, reconstructed_sr)
```

## Batch Usage

Codec-SUPERB also provides batch APIs:

```python
from SoundCodec import codec
import torchaudio

model = codec.load_codec("llmcodec")

audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
data_list = []
for path in audio_files:
    waveform, sample_rate = torchaudio.load(path)
    data_list.append({
        "id": path,
        "audio": {
            "array": waveform.numpy()[0],
            "sampling_rate": sample_rate,
        },
    })

batch_units = model.batch_extract_unit(data_list)
batch_audio = model.batch_decode_unit(batch_units)

results = model.batch_synth(data_list, local_save=False)
for item in results:
    print(item["unit"].shape, item["audio"]["array"].shape)
```

For better throughput, group audio samples with similar lengths before batching.
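For example, sorting clips by duration before chunking keeps per-batch padding low. The helper below is a hypothetical sketch (`make_length_sorted_batches` is not part of the Codec-SUPERB API); it simply feeds length-sorted groups into the batch calls shown above.

```python
from SoundCodec import codec
import torchaudio

model = codec.load_codec("llmcodec")

def make_length_sorted_batches(paths, batch_size=8):
    """Hypothetical helper: bucket clips by length to reduce padding waste."""
    items = []
    for path in paths:
        waveform, sample_rate = torchaudio.load(path)
        items.append({
            "id": path,
            "audio": {
                "array": waveform.numpy()[0],
                "sampling_rate": sample_rate,
            },
        })
    # Sort by sample count, then chunk into fixed-size batches.
    items.sort(key=lambda item: item["audio"]["array"].shape[0])
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

for batch in make_length_sorted_batches(["audio1.wav", "audio2.wav", "audio3.wav"]):
    batch_units = model.batch_extract_unit(batch)
```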
## Codec-SUPERB Evaluation

To evaluate LLM-Codec with Codec-SUPERB-tiny:

```bash
PYTHONPATH=. python3 scripts/dataset_creator.py \
  --dataset voidful/codec-superb-tiny

PYTHONPATH=. python3 scripts/benchmarking.py \
  --dataset datasets/voidful/codec-superb-tiny_synth \
  --models llmcodec
```

## Model Files

The model repository provides:

- codec weights as `llm-codec.pt`
- a tokenizer extended with dedicated audio special tokens
- Qwen-compatible model artifacts containing the trained audio-token embeddings

The codec uses an audio vocabulary of 20,480 tokens, each represented as its own special-token entry in the extended tokenizer.

## Training Data

The codec was trained on LibriSpeech `train-clean-100` with paired transcripts. The validation split used during training is LibriSpeech `validation`.

Because training is speech-centric and transcript-supervised, performance may be weaker on non-English speech, conversational speech, music, environmental audio, or audio with strong noise and overlap.

## Training Procedure

Base components:

- Base codec: AUV
- Frozen LLM backbone: Qwen3-4B-Instruct-2507
- Token rate: 50 Hz
- Audio vocabulary size: 20,480
- Segment length: 4 seconds

Losses:

- mel reconstruction loss
- multi-scale mel loss
- multi-resolution STFT loss
- complex STFT loss with phase term
- VQ commitment loss
- Gumbel bridge cross-entropy
- Future Token Prediction loss
- Semantic Alignment cosine loss
- Semantic Alignment contrastive loss with memory bank
- MPD/MSD GAN and feature-matching losses

## Evaluation Results

### Token Learnability

SALMon speech coherence accuracy after token-level LM training:

| Tokenizer | Overall accuracy (%) |
| --- | ---: |
| WavTok-L | 48.3 |
| BigCodec | 49.4 |
| UniCodec | 50.1 |
| AUV | 49.4 |
| LLM-Codec | 61.6 |

Token-level perplexity on LibriSpeech after 3 epochs of LM training:

| Tokenizer | Eval loss | Perplexity |
| --- | ---: | ---: |
| WavTok-L | 11.91 | 148,122 |
| UniCodec | 11.92 | 150,197 |
| BigCodec | 11.96 | 156,448 |
| AUV | 11.98 | 159,768 |
| LLM-Codec | 8.44 | 4,617 |

### Reconstruction Quality

Codec-SUPERB-tiny speech reconstruction:

| Model | Mel ↓ | STFT ↓ | PESQ ↑ | STOI ↑ |
| --- | ---: | ---: | ---: | ---: |
| AUV (base) | 0.762 | 1.648 | 2.094 | 0.850 |
| LLM-Codec | 0.724 | 1.599 | 2.102 | 0.859 |

## Limitations

- The semantic alignment objective depends on paired speech and text.
- The model is primarily validated on read speech.
- Downstream generation quality depends on the separate speech language model.
- The model may preserve speaker identity information present in the input.
- The Hugging Face `transformers` artifacts are not a standalone text chatbot; they accompany the codec/tokenizer workflow.

## Citation

```bibtex
@article{chung2026llm,
  title   = {LLM-Codec: Neural Audio Codec Meets Language Model Objectives},
  author  = {Chung, Ho-Lam and Chen, Yiming and Lee, Hung-yi},
  journal = {arXiv preprint arXiv:2604.17852},
  note    = {Model and code available at https://github.com/voidful/llm-codec},
  year    = {2026}
}
```

If you use the Codec-SUPERB interface or benchmark, please also cite Codec-SUPERB:

```bibtex
@inproceedings{wu-etal-2024-codec,
  title     = {Codec-SUPERB: An In-Depth Analysis of Sound Codec Models},
  author    = {Wu, Haibin and Chung, Ho-Lam and Lin, Yi-Cheng and Wu, Yuan-Kuei and Chen, Xuanjun and Pai, Yu-Chi and Wang, Hsiu-Hsuan and Chang, Kai-Wei and Liu, Alexander and Lee, Hung-yi},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
  year      = {2024},
  url       = {https://aclanthology.org/2024.findings-acl.616},
  doi       = {10.18653/v1/2024.findings-acl.616},
  pages     = {10330--10348}
}
```