---
library_name: transformers
tags:
- multimodal
- reasoning
- sft
- rl
datasets:
- LightChen2333/M3CoT
- ModalityDance/Omni-Bench
base_model:
- GAIR/Anole-7b-v0.1
pipeline_tag: any-to-any
---
# Omni-R1-Zero
[![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b?style=for-the-badge&logo=arxiv)](https://arxiv.org/abs/2601.09536)
[![Code](https://img.shields.io/badge/GitHub-Code-blue?style=for-the-badge&logo=github)](https://github.com/ModalityDance/Omni-R1)
[![Omni-Bench](https://img.shields.io/badge/Dataset-Omni--Bench-fcc21b?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/datasets/ModalityDance/Omni-Bench)
## Overview
**Omni-R1-Zero** is trained **without multimodal annotations**. It bootstraps **step-wise visualizations** from **text-only CoT seeds** (e.g., M3CoT), and then follows the same PeSFT+PeRPO recipe as Omni-R1 to learn interleaved multimodal reasoning.
## Usage
```python
import torch
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

# 1) Load the processor and model (bfloat16, sharded across available devices)
model_id = "ModalityDance/Omni-R1-Zero"  # or a local checkpoint path
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# 2) Prepare a single input. <reserved12856>/<reserved12857> enclose the image
#    reasoning process and <reserved12866>/<reserved12867> enclose the answer.
prompt = "You are a helpful assistant.\nUser: Which of these would appear shinier when polished? A. Metal spoon B. Wooden spoon\nThink with images first, the image reasoning process and answer are enclosed within <reserved12856> <reserved12857> and <reserved12866> <reserved12867> XML tags, respectively.\nAssistant:"
inputs = processor(
    prompt,
    padding=False,
    return_for_text_completion=True,
    return_tensors="pt",
).to(model.device)

# 3) Generate; "unrestricted" mode allows interleaved text and image tokens
outputs = model.generate(
    **inputs,
    max_length=4096,
    do_sample=True,
    temperature=1.0,
    top_p=0.9,
    pad_token_id=1,
    multimodal_generation_mode="unrestricted",
)

# 4) Decode the full sequence, keeping special tokens so the reasoning/answer tags remain visible
text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
print(text)
```
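Because the decoded string keeps the special tags, the final answer can be pulled out with plain string handling. Below is a minimal sketch; the tag names follow the prompt above, and the `extract_answer` helper is an illustration, not part of the released code.

```python
def extract_answer(decoded: str) -> str:
    """Return the span between the answer tags, or the full text if the tags are missing."""
    start_tag, end_tag = "<reserved12866>", "<reserved12867>"
    start = decoded.find(start_tag)
    end = decoded.find(end_tag, start + len(start_tag))
    if start == -1 or end == -1:
        return decoded.strip()  # fall back to the raw output
    return decoded[start + len(start_tag):end].strip()

print(extract_answer(text))
```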
For full scripts (batch JSONL inference, interleaved decoding, and vLLM-based evaluation), please refer to the official GitHub repository:
https://github.com/ModalityDance/Omni-R1
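For reference only, a minimal batch loop over a JSONL file of prompts could look like the sketch below. The file names (`prompts.jsonl`, `outputs.jsonl`), the per-line `prompt` field, and the one-prompt-at-a-time loop are assumptions for illustration; the official scripts add proper batching, interleaved image decoding, and vLLM-based evaluation.

```python
import json

# Assumes `processor` and `model` are already loaded as in the usage snippet above.
with open("prompts.jsonl") as f, open("outputs.jsonl", "w") as out:
    for line in f:
        record = json.loads(line)  # assumed schema: {"prompt": "..."} per line
        inputs = processor(
            record["prompt"],
            padding=False,
            return_for_text_completion=True,
            return_tensors="pt",
        ).to(model.device)
        outputs = model.generate(
            **inputs,
            max_length=4096,
            do_sample=True,
            temperature=1.0,
            top_p=0.9,
            pad_token_id=1,
            multimodal_generation_mode="unrestricted",
        )
        record["output"] = processor.batch_decode(outputs, skip_special_tokens=False)[0]
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
```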
## License
This project is licensed under the **MIT License**.
Use of this model must also comply with the licenses of the referenced third-party projects and dependencies, including the **Chameleon Research License**.
## Citation
```bibtex
@misc{cheng2026omnir1unifiedgenerativeparadigm,
      title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning},
      author={Dongjie Cheng and Yongqi Li and Zhixin Ma and Hongru Cai and Yupeng Hu and Wenjie Wang and Liqiang Nie and Wenjie Li},
      year={2026},
      eprint={2601.09536},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.09536},
}
```