---
library_name: transformers
tags:
- mdlm
- diffusion
license: apache-2.0
datasets:
- turkish-nlp-suite/InstrucTurca
language:
- tr
base_model:
- diffutron/DiffutronLM-0.3B-Alpaca
pipeline_tag: text-generation
---
# DiffutronLM-0.3B-Instruct
**Diffutron** is a parameter-efficient Masked Diffusion Language Model (MDLM) designed specifically for Turkish. Unlike standard autoregressive models, which generate text one token at a time, Diffutron generates text by iteratively refining sequences in parallel, considering the entire sentence context simultaneously.
Despite its compact size of 307 million parameters, `DiffutronLM-0.3B-Instruct` achieves highly competitive performance against much larger, multi-billion-parameter autoregressive baselines on Turkish NLP benchmarks.
## 📌 Model Details
* **Model Type:** Masked Diffusion Language Model (MDLM)
* **Base Architecture:** `jhu-clsp/mmBERT-base` (Multilingual Encoder)
* **Language:** Turkish
* **Parameter Count:** 307M (0.3B)
* **Context Length:** 256 tokens (Instruct), 512 tokens (Base)
* **Training Libraries:** `dllm`, PyTorch
## 🚀 Architecture & Approach
Diffutron departs from traditional next-token prediction. It treats text generation as a discrete diffusion process:
1. **Forward Process:** Clean text is gradually corrupted into a sequence of `<mask>` tokens.
2. **Reverse Process:** The model learns to denoise the sequence globally, attending to visible context bi-directionally to predict the original tokens.
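The forward process above can be sketched with a toy, stdlib-only corruption function. This is an illustration of the idea, not the actual noising schedule used in training (the real MDLM samples the mask ratio from a continuous-time schedule):

```python
import random

MASK = "<mask>"

def corrupt(tokens, mask_ratio, rng):
    """Forward process (toy): independently replace each token with
    <mask> with probability mask_ratio."""
    return [MASK if rng.random() < mask_ratio else t for t in tokens]

rng = random.Random(0)
sentence = "bu model türkçe metin üretir".split()
for ratio in (0.25, 0.5, 0.75, 1.0):
    print(ratio, corrupt(sentence, ratio, rng))
```

At ratio 1.0 the sequence is fully masked; the reverse process starts from that state and denoises it back to text.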
This non-autoregressive paradigm compresses linguistic knowledge efficiently, allowing this 0.3B model to punch significantly above its weight class.
## 📚 Training Pipeline
The model was developed through a resource-efficient, multi-stage training pipeline:
### 1. Continual Pre-training (CPT)
To adapt the multilingual backbone to Turkish without catastrophic forgetting, we employed a high-rank LoRA strategy (r=256, α=256) targeting all linear modules (Attention and MLP).
* **Data:** ~2 million sequences sourced from Havadis (news), Temiz-OSCAR (web), and Turkish Wikipedia.
* **Result:** Perplexity on the Bilkent Turkish Writings Dataset dropped significantly from 3.42 (base) to 2.75.
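To give a sense of scale: a rank of 256 is unusually high for LoRA. For a square projection of hidden size 768 (used here purely for illustration; the exact mmBERT-base layer shapes may differ), the two low-rank factors add `r * (d_in + d_out)` parameters, roughly two thirds of the full matrix, which is why this setup can adapt the backbone aggressively while still training far fewer parameters than full fine-tuning across all modules:

```python
def lora_params(d_in, d_out, r):
    """LoRA adds two factors: A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r

# Illustration only: one square 768-dim projection, r=256.
full = 768 * 768           # 589,824 parameters in the frozen weight
lora = lora_params(768, 768, 256)
print(lora, lora / full)   # 393,216 trainable parameters, ratio ~0.67
```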
### 2. Progressive Instruction-Tuning (SFT)
To unlock generative instruction-following capabilities, we utilized a two-stage supervised fine-tuning approach:
* **Stage 1 (General Adaptation):** Trained on `metunlp/LlamaTurk-Instruction-Set` for 20 epochs to establish fundamental instruction-following behaviors.
* **Stage 2 (Complex Specialization):** Trained on the nuanced `turkish-nlp-suite/InstrucTurca` dataset for 8 epochs with an increased batch size, enhancing the model's ability to handle intricate, domain-specific Turkish commands.
## 📊 Evaluation Results
The model was evaluated on a representative subset of the **CETVEL Benchmark Suite**. DiffutronLM-0.3B (2nd Stage) demonstrates remarkable parameter efficiency, outperforming models up to 7× its size (e.g., Kumru-2B and TURNA-1.1B) in average score.
| Benchmark | Diffutron-1st-Stage (0.3B) | Diffutron-2nd-Stage (0.3B) | TURNA (1.1B) | Kumru (2B) | Kanarya (2B) | Llama-3.2 (3B) | Trendyol (7B) | Aya-101 (13B) |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Belebele_TR** | 22.22 | 27.00 | 22.56 | 29.00 | 28.11 | **55.78** | 36.22 | 22.89 |
| **EXAMS_TR** | 25.95 | 27.74 | 23.66 | **30.03** | **30.03** | 26.21 | 28.50 | 22.90 |
| **IronyTR** | 50.67 | **52.00** | 48.33 | 51.00 | 50.00 | 50.17 | 50.00 | **52.17** |
| **News_Cat** | 23.20 | 32.40 | 32.80 | 26.40 | 66.80 | 64.00 | **81.20** | 20.00 |
| **MNLI_TR** | 33.29 | 32.81 | 34.94 | **36.42** | 33.40 | 34.76 | 35.19 | 27.90 |
| **STS_TR** | 17.77 | **18.78** | 14.21 | 11.75 | 12.91 | 12.91 | 15.52 | 16.97 |
| **XCOPA_TR** | 53.80 | 52.00 | 55.80 | 54.00 | **64.20** | 54.60 | 61.00 | 59.60 |
| **Average** | 32.41 | **34.68** | 33.19 | 34.09 | 40.78 | 42.63 | **43.95** | 31.78 |
## 💻 Usage
Because Diffutron is a Masked Diffusion Language Model, it requires inference strategies distinct from standard causal generation. We recommend using the `dllm` library or custom generation loops tailored for discrete diffusion.
### Generation Parameters Used in the Paper
* **Longer Context:** Steps: 128, Temp: 0.1, Block Length: 32, Repetition Penalty: 1.2
* **Shorter Context:** Steps: 64, Remask: `low_conf`, Stochastic: `False`, CFG: 0.0
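The `low_conf` remasking strategy above can be sketched as follows. This is a schematic, stdlib-only loop with a dummy scorer standing in for the network; the real implementation lives in the `dllm` library and runs an actual bidirectional forward pass (with block-wise decoding, temperature, and repetition penalty, which are omitted here):

```python
import random

MASK = "<mask>"

def dummy_denoiser(tokens, rng, vocab=("bir", "model", "türkçe", "üretir")):
    """Stand-in for the real MDLM: returns a (prediction, confidence)
    pair for every position in the sequence."""
    return [(rng.choice(vocab), rng.random()) for _ in tokens]

def diffusion_generate(length, steps, rng):
    """Low-confidence remasking (toy): each step, predict all positions,
    commit the most confident ones, and re-mask the rest. The number of
    committed positions grows linearly until every position is filled."""
    tokens = [MASK] * length
    for step in range(steps):
        preds = dummy_denoiser(tokens, rng)
        keep = round(length * (step + 1) / steps)  # positions to commit
        ranked = sorted(range(length), key=lambda i: preds[i][1], reverse=True)
        tokens = [MASK] * length
        for i in ranked[:keep]:
            tokens[i] = preds[i][0]
    return tokens

out = diffusion_generate(length=8, steps=4, rng=random.Random(0))
print(out)  # after the final step every position is committed
```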
## ⚠️ Limitations
* **Multilingual Backbone:** Built upon a multilingual encoder rather than a native Turkish foundation model.
* **Context Window:** Restricted to a 256-token context window for generation, limiting its use in long-form summarization or document-level generation.
* **Data Nuances:** Instruction datasets rely heavily on translations or synthetic data, which may occasionally miss subtle cultural contexts.
## 📝 Citation
If you use Diffutron in your research, please cite our preprint:
```bibtex
@misc{diffutron2026,
  author       = {Kocabay, Şuayp Talha and Akkuş, Talha Rüzgar},
  title        = {Diffutron: A Masked Diffusion Language Model for Turkish Language},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/collections/diffutron/diffutronlm}}
}
```