---
base_model:
- Qwen/Qwen3-4B-Instruct-2507
library_name: transformers
pipeline_tag: text-generation
tags:
- audio
- speech
- audio-codec
- neural-audio-codec
- spoken-language-modeling
- codec-superb
- qwen3
datasets:
- librispeech_asr
metrics:
- perplexity
- pesq
- stoi
---

# LLM-Codec

LLM-Codec is a neural audio codec checkpoint trained to produce discrete audio tokens that both reconstruct well and are easier for autoregressive language models to predict.

- Model: https://huggingface.co/voidful/llm-codec
- Code: https://github.com/voidful/llm-codec
- Usage reference: https://github.com/voidful/Codec-SUPERB

## Model Description

Most neural audio codecs are trained purely for waveform reconstruction. Spoken language models, however, consume codec tokens under a next-token prediction objective. This mismatch can make acoustically valid variation look like token uncertainty to the language model. LLM-Codec adapts a codec with language-model-facing objectives while keeping the deployed codec interface unchanged.

The model is trained with:

- Future Token Prediction (FTP): Medusa-style heads predict future audio tokens from frozen-LLM hidden states.
- Semantic Alignment (SA): audio-induced hidden states are aligned with paired text hidden states inside a frozen LLM.
- Differentiable Gumbel bridge: hard Gumbel-Softmax keeps the forward-pass tokens discrete while letting gradients flow back to the codec encoder (see the sketch after this list).
- Reconstruction losses: mel, multi-scale mel, multi-resolution STFT, complex STFT, VQ, GAN, and feature-matching losses.

The deployed codec does not require the auxiliary FTP heads.
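To make the Gumbel bridge bullet concrete, here is a minimal PyTorch sketch of a straight-through Gumbel-Softmax bridge. It is an illustration under assumed names and shapes (`encoder_logits`, `audio_embedding`, and the 2560-wide hidden size are placeholders), not the repository's actual implementation.

```python
import torch
import torch.nn.functional as F

vocab_size = 20480   # audio vocabulary size from this model card
hidden_size = 2560   # assumed LLM embedding width (illustrative)

# Stand-in for the codec encoder's per-frame logits over the audio vocabulary.
encoder_logits = torch.randn(1, 200, vocab_size, requires_grad=True)

# hard=True: the forward pass emits exact one-hot tokens, while the backward
# pass uses the soft relaxation, so gradients still reach the codec encoder.
one_hot_tokens = F.gumbel_softmax(encoder_logits, tau=1.0, hard=True, dim=-1)

# Discrete ids for anything that needs real tokens (logging, decoding, ...)
token_ids = one_hot_tokens.argmax(dim=-1)

# ...and differentiable LLM input embeddings via the one-hot matmul trick.
audio_embedding = torch.nn.Embedding(vocab_size, hidden_size)
llm_inputs = one_hot_tokens @ audio_embedding.weight  # (batch, frames, hidden)
llm_inputs.sum().backward()  # gradients flow into encoder_logits
```

This is the standard straight-through estimator: the language-model side always sees exact one-hot tokens, while the encoder still receives a usable gradient signal.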
## Intended Use

This model is intended for research and development in:

- audio tokenization for spoken language modeling
- codec reconstruction experiments
- token-level speech LM training
- Codec-SUPERB-style codec evaluation
- speech token analysis and ablation studies

It is not a full text-to-speech system by itself. For speech generation, use the codec as the tokenizer/decoder inside a separate speech language modeling pipeline.

## Out-of-Scope Use

Do not use this model for:

- impersonation or unauthorized voice cloning
- surveillance or speaker tracking without consent
- high-stakes speaker, language, or identity decisions
- generating deceptive audio content

## Installation

The easiest inference path is through the Codec-SUPERB `SoundCodec` interface.

```bash
git clone https://github.com/voidful/Codec-SUPERB.git
cd Codec-SUPERB
pip install -r requirements.txt
export PYTHONPATH=$PWD:$PYTHONPATH
```

If your environment supports editable installs, this is also convenient:

```bash
pip install -e .
```

## Quick Start

Load LLM-Codec through the Codec-SUPERB codec registry:

```python
from SoundCodec import codec

print(codec.list_codec())
model = codec.load_codec("llmcodec")
```

Encode and reconstruct one audio file:

```python
from SoundCodec import codec
import torchaudio
import soundfile as sf

model = codec.load_codec("llmcodec")

waveform, sample_rate = torchaudio.load("sample_audio.wav")
data_item = {
    "audio": {
        "array": waveform.numpy()[0],
        "sampling_rate": sample_rate,
    }
}

units = model.extract_unit(data_item).unit
print("Unit shape:", units.shape)

result = model.synth(data_item, local_save=False)
reconstructed = result["audio"]["array"]
reconstructed_sr = result["audio"].get("sampling_rate", sample_rate)
sf.write("reconstructed.wav", reconstructed, reconstructed_sr)
```

## Batch Usage

Codec-SUPERB also provides batch APIs:

```python
from SoundCodec import codec
import torchaudio

model = codec.load_codec("llmcodec")

audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
data_list = []
for path in audio_files:
    waveform, sample_rate = torchaudio.load(path)
    data_list.append({
        "id": path,
        "audio": {
            "array": waveform.numpy()[0],
            "sampling_rate": sample_rate,
        },
    })

batch_units = model.batch_extract_unit(data_list)
batch_audio = model.batch_decode_unit(batch_units)

results = model.batch_synth(data_list, local_save=False)
for item in results:
    print(item["unit"].shape, item["audio"]["array"].shape)
```

For better throughput, group audio samples with similar lengths before batching.
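For example, sorting clips by duration before chunking keeps per-batch padding low. The helper below is a hypothetical sketch (`make_length_sorted_batches` is not part of the Codec-SUPERB API); it simply feeds length-sorted groups into the batch calls shown above.

```python
from SoundCodec import codec
import torchaudio

model = codec.load_codec("llmcodec")

def make_length_sorted_batches(paths, batch_size=8):
    """Hypothetical helper: bucket clips by length to reduce padding waste."""
    items = []
    for path in paths:
        waveform, sample_rate = torchaudio.load(path)
        items.append({
            "id": path,
            "audio": {
                "array": waveform.numpy()[0],
                "sampling_rate": sample_rate,
            },
        })
    # Sort by sample count, then chunk into fixed-size batches.
    items.sort(key=lambda item: item["audio"]["array"].shape[0])
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

for batch in make_length_sorted_batches(["audio1.wav", "audio2.wav", "audio3.wav"]):
    batch_units = model.batch_extract_unit(batch)
```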
## Codec-SUPERB Evaluation

To evaluate LLM-Codec with Codec-SUPERB-tiny:

```bash
PYTHONPATH=. python3 scripts/dataset_creator.py \
  --dataset voidful/codec-superb-tiny

PYTHONPATH=. python3 scripts/benchmarking.py \
  --dataset datasets/voidful/codec-superb-tiny_synth \
  --models llmcodec
```

## Model Files

The model repository provides:

- codec weights as `llm-codec.pt`
- a tokenizer extended with dedicated audio special tokens
- Qwen-compatible model artifacts containing the trained audio-token embeddings

The codec uses an audio vocabulary of 20,480 tokens, each represented as its own special-token entry in the extended tokenizer.

## Training Data

The codec was trained on LibriSpeech `train-clean-100` with paired transcripts. The validation split used during training is LibriSpeech `validation`.

Because training is speech-centric and transcript-supervised, performance may be weaker on non-English speech, conversational speech, music, environmental audio, or audio with strong noise and overlap.

## Training Procedure

Base components:

- Base codec: AUV
- Frozen LLM backbone: Qwen3-4B-Instruct-2507
- Token rate: 50 Hz
- Audio vocabulary size: 20,480
- Segment length: 4 seconds

Losses:

- mel reconstruction loss
- multi-scale mel loss
- multi-resolution STFT loss
- complex STFT loss with phase term
- VQ commitment loss
- Gumbel bridge cross-entropy
- Future Token Prediction loss
- Semantic Alignment cosine loss
- Semantic Alignment contrastive loss with memory bank
- MPD/MSD GAN and feature-matching losses

## Evaluation Results

### Token Learnability

SALMon speech coherence accuracy after token-level LM training:

| Tokenizer | Overall accuracy (%) |
| --- | ---: |
| WavTok-L | 48.3 |
| BigCodec | 49.4 |
| UniCodec | 50.1 |
| AUV | 49.4 |
| LLM-Codec | 61.6 |

Token-level perplexity on LibriSpeech after 3 epochs of LM training:

| Tokenizer | Eval loss | Perplexity |
| --- | ---: | ---: |
| WavTok-L | 11.91 | 148,122 |
| UniCodec | 11.92 | 150,197 |
| BigCodec | 11.96 | 156,448 |
| AUV | 11.98 | 159,768 |
| LLM-Codec | 8.44 | 4,617 |

### Reconstruction Quality

Codec-SUPERB-tiny speech reconstruction:

| Model | Mel ↓ | STFT ↓ | PESQ ↑ | STOI ↑ |
| --- | ---: | ---: | ---: | ---: |
| AUV (base) | 0.762 | 1.648 | 2.094 | 0.850 |
| LLM-Codec | 0.724 | 1.599 | 2.102 | 0.859 |

## Limitations

- The semantic alignment objective depends on paired speech and text.
- The model is primarily validated on read speech.
- Downstream generation quality depends on the separate speech language model.
- The model may preserve speaker identity information present in the input.
- The Hugging Face `transformers` artifacts are not a standalone text chatbot; they accompany the codec/tokenizer workflow.

## Citation

```bibtex
@article{chung2026llm,
  title   = {LLM-Codec: Neural Audio Codec Meets Language Model Objectives},
  author  = {Chung, Ho-Lam and Chen, Yiming and Lee, Hung-yi},
  journal = {arXiv preprint arXiv:2604.17852},
  note    = {Model and code available at https://github.com/voidful/llm-codec},
  year    = {2026}
}
```

If you use the Codec-SUPERB interface or benchmark, please also cite Codec-SUPERB:

```bibtex
@inproceedings{wu-etal-2024-codec,
  title     = {Codec-SUPERB: An In-Depth Analysis of Sound Codec Models},
  author    = {Wu, Haibin and Chung, Ho-Lam and Lin, Yi-Cheng and Wu, Yuan-Kuei and Chen, Xuanjun and Pai, Yu-Chi and Wang, Hsiu-Hsuan and Chang, Kai-Wei and Liu, Alexander and Lee, Hung-yi},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
  year      = {2024},
  url       = {https://aclanthology.org/2024.findings-acl.616},
  doi       = {10.18653/v1/2024.findings-acl.616},
  pages     = {10330--10348}
}
```