	---
	license: apache-2.0
	datasets:
	- Alex11556666/Reason_Tuning
	base_model:
	- Qwen/Qwen2.5-VL-3B-Instruct
	pipeline_tag: text-to-image
	---

	# 💡 DeepGen 1.0 (Diffusers Format): A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

	This is the diffusers-compatible version of [DeepGen-1.0](https://huggingface.co/deepgenteam/DeepGen-1.0). The model weights are stored in safetensors format alongside a self-contained pipeline script (`deepgen_pipeline.py`), so there is no need to clone the DeepGen repository.

	DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities within a single model: general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering. Across multiple authoritative benchmarks, DeepGen 1.0 matches or surpasses state-of-the-art unified multimodal models that are 3× to 16× larger.

	## 🛠️ Quick Start

	### Installation

	```bash
	pip install torch diffusers transformers safetensors einops accelerate huggingface_hub
	# Flash Attention (recommended)
	pip install flash-attn --no-build-isolation
	```
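
Before loading the pipeline, it can help to confirm that the packages above are importable. The snippet below is a minimal sketch that uses only the standard library. The helper name `check_dependencies` is illustrative, and the import names (for example `flash_attn` for the `flash-attn` package) are assumptions based on how these packages are commonly imported.

```python
import importlib.util

# Import names for the packages installed above (assumed to match the pip names,
# except flash-attn, which is commonly imported as `flash_attn`).
REQUIRED = ["torch", "diffusers", "transformers", "safetensors",
            "einops", "accelerate", "huggingface_hub"]
OPTIONAL = ["flash_attn"]


def check_dependencies(names):
    """Return {import_name: bool} without actually importing the packages."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_dependencies(REQUIRED + OPTIONAL).items():
        print(f"{name:16s} {'found' if found else 'MISSING'}")
```

A missing optional entry only means Flash Attention is unavailable; the required entries are the ones the pip command above installs.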

	### Load Pipeline

	```python
	import torch
	from diffusers import DiffusionPipeline

	pipe = DiffusionPipeline.from_pretrained(
	    "deepgenteam/DeepGen-1.0-diffusers",
	    torch_dtype=torch.bfloat16,
	    trust_remote_code=True,
	)
	pipe.to("cuda")

	# Optional: enable CPU offload for GPUs with limited memory (< 24 GB)
	# pipe.enable_model_cpu_offload()
	```

	### Text-to-Image

	```python
	result = pipe(
	    prompt="a racoon holding a shiny red apple over its head",
	    height=512, width=512,
	    num_inference_steps=50,
	    guidance_scale=4.0,
	    seed=42,
	)
	result.images[0].save("output.png")
	```

	### Image Editing

	```python
	from PIL import Image

	source_image = Image.open("guitar.png").convert("RGB")
	result = pipe(
	    prompt="Take a photo of this guitar placed on a sandy beach with the sunset in the background.",
	    image=source_image,
	    height=512, width=512,
	    num_inference_steps=50,
	    guidance_scale=4.0,
	    seed=42,
	)
	result.images[0].save("edited.png")
	```

	## 📋 Parameters

	| Parameter | Default | Description |
	|-----------|---------|-------------|
	| `prompt` | required | Text prompt for generation or editing |
	| `image` | `None` | Input image for editing. If `None`, performs text-to-image generation |
	| `height` | 512 | Output image height |
	| `width` | 512 | Output image width |
	| `num_inference_steps` | 50 | Number of denoising steps |
	| `guidance_scale` | 4.0 | Classifier-free guidance scale |
	| `seed` | `None` | Random seed for reproducibility |
	| `negative_prompt` | `""` | Negative prompt for CFG |
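
The table can be mirrored in a small helper that fills in the documented defaults and rejects misspelled argument names before the pipeline is called. This is a sketch only: `DEFAULTS` and `make_call_kwargs` are hypothetical names, not part of `deepgen_pipeline.py`, and the prompt text is just an example.

```python
# Defaults copied from the parameter table above.
DEFAULTS = {
    "image": None,
    "height": 512,
    "width": 512,
    "num_inference_steps": 50,
    "guidance_scale": 4.0,
    "seed": None,
    "negative_prompt": "",
}


def make_call_kwargs(prompt, **overrides):
    """Merge user overrides into the documented defaults (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise TypeError(f"unknown parameter(s): {sorted(unknown)}")
    if not prompt:
        raise ValueError("`prompt` is required")
    return {"prompt": prompt, **DEFAULTS, **overrides}


kwargs = make_call_kwargs(
    "a red bicycle leaning against a brick wall",
    negative_prompt="blurry, low quality",
    seed=123,
)
# With a loaded pipeline: result = pipe(**kwargs)
```

Fixing `seed` while varying only `negative_prompt` or `guidance_scale` makes it easier to compare the effect of a single setting.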

	## 💾 Memory Requirements

	| Mode | VRAM |
	|------|------|
	| Full GPU | ~20 GB |
	| CPU Offload (`pipe.enable_model_cpu_offload()`) | ~14 GB |
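
One way to act on these figures is to pick the placement mode from the GPU's total memory. The sketch below treats the approximate values in the table as hard thresholds, which is an assumption; real usage varies with resolution and drivers, so leaving some headroom is sensible. `choose_mode` and `place_pipeline` are illustrative names.

```python
FULL_GPU_GB = 20.0     # "~20 GB" from the table (approximate)
CPU_OFFLOAD_GB = 14.0  # "~14 GB" from the table (approximate)


def choose_mode(vram_gb):
    """Map available VRAM in GB to a placement mode."""
    if vram_gb >= FULL_GPU_GB:
        return "full_gpu"
    if vram_gb >= CPU_OFFLOAD_GB:
        return "cpu_offload"
    return "insufficient"


def place_pipeline(pipe):
    """Apply the chosen mode to an already loaded pipeline."""
    import torch  # imported lazily so choose_mode works without a GPU stack

    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    mode = choose_mode(total_gb)
    if mode == "full_gpu":
        pipe.to("cuda")
    elif mode == "cpu_offload":
        pipe.enable_model_cpu_offload()
    else:
        raise RuntimeError(f"only {total_gb:.1f} GB of VRAM available")
    return mode
```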

	## 📁 Directory Structure

	```
	DeepGen-1.0-diffusers/
	├── transformer/           # SD3 DiT weights (safetensors)
	├── vae/                   # AutoencoderKL weights
	├── connector/             # SCB Connector weights + config
	├── scheduler/             # FlowMatchEulerDiscreteScheduler config
	├── tokenizer/             # Qwen2.5-VL tokenizer
	├── prompt_template.json   # Prompt formatting template
	├── model_index.json       # Model metadata
	└── deepgen_pipeline.py    # Self-contained pipeline script
	```

	> Note: The VLM (Qwen2.5-VL-3B-Instruct) is loaded separately from [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct). You can override the VLM path using the `vlm_model_path` parameter in `from_pretrained()`.
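
As a sketch of the override described in the note, the loader below forwards `vlm_model_path` only when one is given. It assumes the custom pipeline accepts that keyword through `from_pretrained()` as stated above. The helper names and the local path in the comment are hypothetical.

```python
def build_load_kwargs(vlm_model_path=None):
    """Collect keyword arguments for from_pretrained (hypothetical helper)."""
    kwargs = {"trust_remote_code": True}
    if vlm_model_path is not None:
        # Point the pipeline at a local or mirrored copy of the VLM.
        kwargs["vlm_model_path"] = vlm_model_path
    return kwargs


def load_deepgen(vlm_model_path=None):
    """Load the pipeline, optionally overriding where the VLM comes from."""
    import torch
    from diffusers import DiffusionPipeline

    return DiffusionPipeline.from_pretrained(
        "deepgenteam/DeepGen-1.0-diffusers",
        torch_dtype=torch.bfloat16,
        **build_load_kwargs(vlm_model_path),
    )


# Example with a hypothetical local checkout of the VLM:
# pipe = load_deepgen(vlm_model_path="/models/Qwen2.5-VL-3B-Instruct")
```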

	## 🧠 Method

	Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts. To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance.

	| Component | Parameters | Description |
	|-----------|-----------|-------------|
	| VLM (Qwen2.5-VL-3B) | 3B | Visual Language Model for understanding prompts and reference images |
	| Connector (SCB) | ~0.8B | 6-layer Transformer bridging VLM hidden states to DiT conditioning |
	| DiT (SD3.5M Kontext) | 2B | Diffusion Transformer for image generation |
	| VAE | ~80M | Image encoder/decoder |
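
For a rough sense of scale, the component sizes in the table can be summed directly. The figures are approximate (two carry a `~`), so the total is indicative only. The arithmetic shows that the headline 5B figure corresponds to the VLM plus the DiT, with the connector and VAE adding a little under 1B on top.

```python
# Approximate sizes in billions of parameters, copied from the table above.
COMPONENTS = {"VLM": 3.0, "Connector": 0.8, "DiT": 2.0, "VAE": 0.08}

headline = COMPONENTS["VLM"] + COMPONENTS["DiT"]  # the "3B VLM + 2B DiT" figure
total = sum(COMPONENTS.values())                  # includes connector and VAE
shares = {name: size / total for name, size in COMPONENTS.items()}

print(f"VLM + DiT:      {headline:.1f}B")
print(f"all components: {total:.2f}B")
for name, share in shares.items():
    print(f"{name:10s} {share:6.1%}")
```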

	## 📊 Benchmarks

	### 1. General Image Generation

	| Model | Params | Geneval ↑ | DPGBench ↑ | UniGenBench ↑ |
	| :--- | :--- | :--- | :--- | :--- |
	| OmniGen2 | 3B + 4B | 0.80 | 83.57 | 63.09 |
	| BAGEL | 14B | 0.82 | 85.10 | 61.53 |
	| X-Omni | 7B + 12B | 0.83 | 87.65 🥉 | 53.77 |
	| Lumina-DiMOO | 8B | 0.88 🥇 | 86.04 | 71.12 |
	| Hunyuan-Image-3.0 | 80B | 0.72 | 86.10 | - |
	| Qwen-Image | 7B + 20B | 0.87 🥈 | 88.32 🥇 | 78.81 🥇 |
	| LongCat-Image | 7B + 6B | 0.87 🥈 | 86.80 | - |
	| Z-Image-Turbo | 4B + 6B | 0.84 | 85.15 | 71.40 |
	| GLM-Image | 9B + 7B | - | 84.78 | - |
	| DeepGen 1.0 (SFT) | 3B + 2B | 0.86 🥉 | 87.05 | 74.18 🥉 |
	| DeepGen 1.0 (RL) | 3B + 2B | 0.87 🥈 | 87.90 🥈 | 75.74 🥈 |

	### 2. General Image Editing

	| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
	| :--- | :--- | :--- | :--- |
	| BAGEL | 14B | 6.52 | 3.20 |
	| Qwen-Image-Edit [2509] | 7B + 20B | 7.54 🥈 | 4.35 🥈 |
	| LongCat-Image-Edit | 7B + 6B | 7.60 🥇 | 4.50 🥇 |
	| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
	| DeepGen 1.0 (SFT) | 3B + 2B | 7.12 | 4.09 |
	| DeepGen 1.0 (RL) | 3B + 2B | 7.17 🥉 | 4.14 🥉 |

	### 3. Reasoning Image Generation

	| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
	| :--- | :--- | :--- | :--- |
	| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
	| BAGEL | 14B | 0.70 🥉 | 41.1 |
	| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
	| Qwen-Image | 7B + 20B | 0.62 | 46.3 🥉 |
	| LongCat-Image | 7B + 6B | 0.65 | 52.2 🥇 |
	| Z-Image-Turbo | 4B + 6B | - | 43.7 |
	| DeepGen 1.0 (SFT) | 3B + 2B | 0.72 🥈 | 45.7 |
	| DeepGen 1.0 (RL) | 3B + 2B | 0.73 🥇 | 46.5 🥈 |

	### 4. Reasoning Image Editing

	| Model | Params | RISE ↑ | UniREditBench ↑ |
	| :--- | :--- | :--- | :--- |
	| OmniGen2 | 3B + 4B | - | 43.4 |
	| BAGEL | 14B | 11.9 🥈 | 51.0 |
	| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | 56.5 🥉 |
	| DeepGen 1.0 (SFT) | 3B + 2B | 13.3 🥇 | 77.5 🥇 |
	| DeepGen 1.0 (RL) | 3B + 2B | 10.8 🥉 | 75.7 🥈 |

	## ⭐ Citation

	```bibtex
	@article{wang2026deepgen,
	  title={DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing},
	  author={Wang, Dianyi and Li, Ruihang and Han, Feng and Ma, Chaofan and Song, Wei and Wang, Siyuan and Wang, Yibin and Xin, Yi and Liu, Hongjian and Zhang, Zhixiong and others},
	  journal={arXiv preprint arXiv:2602.12205},
	  year={2026}
	}
	```

	## License

	Apache 2.0