---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-7B-Instruct
---

# Fast-dLLM v2 (7B): Efficient Block-Diffusion LLM

## Introduction

Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their **inherently sequential decoding limits inference efficiency**.

We present **Fast-dLLM v2**, a carefully designed **block diffusion language model (dLLM)** that efficiently adapts a pretrained AR model (**Qwen2.5-7B-Instruct**) into a diffusion-style decoder for **parallel text generation**.

### Key Innovations
- **Block Diffusion Mechanism + Complementary Attention Mask**
  Enables **blockwise bidirectional context modeling** without sacrificing AR objectives.
- **Hierarchical Caching**
  - **Block-level cache**: stores historical context representations across blocks.
  - **Sub-block cache**: enables parallel decoding within partially generated blocks.
- **Token Shift Mechanism**
  Retains autoregressive characteristics while supporting bidirectional context within blocks.
- **Parallel Decoding Pipeline**
  Achieves up to a **2.5× speedup** over standard AR decoding **without compromising quality** (a conceptual sketch follows this list).
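
To make the parallel decoding idea concrete, below is a minimal, illustrative sketch of confidence-thresholded decoding for a single block. It is **not** the model's actual implementation (that ships with the checkpoint's remote code); `predict_masked` is a hypothetical stand-in for a forward pass that returns a token distribution for every position in the block.

```python
import torch

# Illustrative only: confidence-thresholded parallel decoding for ONE block.
# `predict_masked` is a hypothetical stand-in for the model forward pass.
def decode_block(predict_masked, block_len=32, threshold=0.9):
    MASK = -1                                  # sentinel for "not yet decoded"
    block = torch.full((block_len,), MASK, dtype=torch.long)
    while (block == MASK).any():
        probs = predict_masked(block)          # (block_len, vocab_size)
        conf, tokens = probs.max(dim=-1)       # best token + its confidence
        masked = block == MASK
        accept = masked & (conf >= threshold)  # fill all confident positions
        if not accept.any():                   # but always make progress:
            idx = torch.where(masked)[0]
            accept = torch.zeros_like(masked)
            accept[idx[conf[idx].argmax()]] = True  # take the single best one
        block[accept] = tokens[accept]
    return block

# Toy run with a fixed random "model"; a real model would recompute its
# predictions each iteration, conditioned on the tokens accepted so far.
fake_probs = torch.softmax(torch.randn(32, 1000), dim=-1)
print(decode_block(lambda blk: fake_probs, threshold=0.01))
```

In the actual model, decoding additionally conditions on the prompt and on previously completed blocks (the block-level cache) and proceeds over sub-blocks (the `small_block_size` knob in the quickstart below).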

> Fast-dLLM v2 uses **only ~1B tokens** for fine-tuning, a **500× reduction** vs. full-attention diffusion LLMs (Dream: 580B tokens), while **matching or surpassing AR baselines** in accuracy.

---

## Model Overview
- **Type**: Block diffusion language model (dLLM)
- **Base Model**: `Qwen/Qwen2.5-7B-Instruct`
- **Architecture**: Transformer with RoPE, SwiGLU activation, RMSNorm, and attention QKV bias
- **Params**: ~7B
- **Attention Heads**: 28 query heads, 4 key/value heads (GQA)
- **Layers**: 28
- **Diffusion Block Size**: 32 tokens
- **Key Feature**: Parallel **block-wise decoding** with **hierarchical caching (block-level and sub-block)**
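
These numbers can be sanity-checked against the published config. A quick check follows; the field names assume a Qwen2.5-style `config.json`, so verify them against the checkpoint:

```python
from transformers import AutoConfig

# Field names assume a Qwen2.5-style config; check the checkpoint's
# config.json if these attributes differ.
cfg = AutoConfig.from_pretrained(
    "Efficient-Large-Model/Fast_dLLM_7B", trust_remote_code=True
)
print(cfg.num_hidden_layers)      # expected: 28
print(cfg.num_attention_heads)    # expected: 28 query heads
print(cfg.num_key_value_heads)    # expected: 4 KV heads (GQA)
```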

---

## Installation
You will need `transformers`, `torch`, and our **custom generation function** (loaded automatically via `trust_remote_code=True`):

```bash
pip install transformers torch numpy
```
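
A quick check that the environment is usable (nothing model-specific here):

```python
# Confirm the dependencies import cleanly and print their versions.
import numpy, torch, transformers
print("transformers", transformers.__version__)
print("torch", torch.__version__)
print("numpy", numpy.__version__)
```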

---

## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Efficient-Large-Model/Fast_dLLM_7B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Fast-dLLM v2 parallel decoding
gen_ids = model.generate(
    inputs["input_ids"],
    tokenizer=tokenizer,
    max_new_tokens=512,
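    # The two kwargs below are consumed by the custom generate() shipped with
    # the checkpoint: small_block_size is (per the model card) the sub-block
    # width decoded in parallel inside each 32-token diffusion block, and
    # threshold is the confidence cutoff for accepting tokens in parallel.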
    small_block_size=8,
    threshold=0.9,
)

response = tokenizer.decode(
    gen_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(response)
```
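
A note on the two custom parameters: intuitively, lowering `threshold` lets the decoder accept more tokens per step (faster, but riskier), while raising it is more conservative. `small_block_size=8` and `threshold=0.9` simply follow the example above rather than a documented recommendation; the custom `generate()` shipped with the checkpoint defines their exact semantics.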

---

## Performance & Benchmarks

### Real-time Throughput
Fast-dLLM v2 offers **up to 2.54× higher throughput** than Qwen2.5-7B-Instruct, **without loss in quality**.
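
For a rough local measurement (wall clock, not the paper's benchmark harness), you can time generation directly, reusing `model`, `tokenizer`, and `inputs` from the quickstart:

```python
import time
import torch

# Rough wall-clock throughput; reuses `model`, `tokenizer`, `inputs` from the
# quickstart. Synchronize around the timer for accurate GPU numbers.
if torch.cuda.is_available():
    torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(
    inputs["input_ids"],
    tokenizer=tokenizer,
    max_new_tokens=256,
    small_block_size=8,
    threshold=0.9,
)
if torch.cuda.is_available():
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```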

---

### Benchmark Results
We compare Fast-dLLM v2 against AR baselines and prior diffusion LLMs on diverse tasks:
HumanEval and MBPP (code), GSM8K and MATH (math reasoning), IFEval (instruction following), MMLU and GPQA (knowledge QA).

- **1B group**: Fast-dLLM v2 (1B) achieves the **best average score: 45.0**.
- **7B group**: Fast-dLLM v2 (7B) achieves the **best average score: 60.3**, surpassing the LLaDA and Dream models.

---

## Citation

If you use Fast-dLLM v2 in your research or products, please cite:

```bibtex
@misc{wu2025fastdllmv2efficientblockdiffusion,
  title={Fast-dLLM v2: Efficient Block-Diffusion LLM},
  author={Chengyue Wu and Hao Zhang and Shuchen Xue and Shizhe Diao and Yonggan Fu and Zhijian Liu and Pavlo Molchanov and Ping Luo and Song Han and Enze Xie},
  year={2025},
  eprint={2509.26328},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.26328},
}
```

---

## License
Released under **Apache 2.0**, following the license of the base Qwen2.5 model.

---

## Resources
- [Paper](https://arxiv.org/abs/2509.26328)
- [Code](https://github.com/NVlabs/Fast-dLLM)
- [Hugging Face Model](https://huggingface.co/Efficient-Large-Model/Fast_dLLM_7B)