---
library_name: transformers
tags:
- multimodal
- reasoning
- sft
- rl
datasets:
- multimodal-reasoning-lab/Zebra-CoT
- ModalityDance/Omni-Bench
base_model:
- GAIR/Anole-7b-v0.1
pipeline_tag: any-to-any
---
|
|
|
|
|
# Omni-R1 |
|
|
|
|
|
[Paper (arXiv)](https://arxiv.org/abs/2601.09536)
|
|
[Code (GitHub)](https://github.com/ModalityDance/Omni-R1)
|
|
[Omni-Bench (Hugging Face)](https://huggingface.co/datasets/ModalityDance/Omni-Bench)
|
|
|
|
|
## Overview |
|
|
|
|
|
**Omni-R1** is trained with multimodal interleaved supervision. It first applies **PeSFT** for stable functional image generation, then **PeRPO** for RL refinement on unified tasks, enabling the model to produce interleaved multimodal reasoning trajectories.
|
|
|
|
|
## Usage |
|
|
|
|
|
```python
import torch
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

# 1) Import & load
model_id = "ModalityDance/Omni-R1"  # or "ModalityDance/Omni-R1-Zero"
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# 2) Prepare a single input (prompt contains <image>)
prompt = "What is the smiling man in the image wearing? <image>"
image = Image.open("image.png").convert("RGB")

inputs = processor(
    prompt,
    images=[image],
    padding=False,
    return_for_text_completion=True,
    return_tensors="pt",
).to(model.device)

# --- minimal image token preprocessing: replace the <image> placeholder with image tokens ---
input_ids = inputs["input_ids"].long()
pixel_values = inputs["pixel_values"].to(model.dtype)  # match the bfloat16 weights of the VQ encoder

placeholder_id = processor.tokenizer.encode("<image>", add_special_tokens=False)[0]
image_tokens = model.get_image_tokens(pixel_values)  # shape: [1, N] (or compatible)

mask = (input_ids == placeholder_id)
input_ids = input_ids.clone()
input_ids[mask] = image_tokens.reshape(-1).to(dtype=torch.long, device=input_ids.device)

# 3) Call the model
outputs = model.generate(
    input_ids=input_ids,
    max_length=4096,
    do_sample=True,
    temperature=0.5,
    top_p=0.9,
    pad_token_id=1,
    multimodal_generation_mode="unrestricted",
)

# 4) Get results
text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
print(text)
```
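
The `outputs` tensor returned by `generate` contains the prompt tokens followed by the completion. If you only want the model's continuation, a minimal sketch (assuming the single-example setup above) is to slice off the prompt before decoding; `skip_special_tokens=False` is kept so any generated image tokens remain visible in the decoded string:

```python
# Keep only the newly generated tokens (everything after the prompt).
generated = outputs[:, input_ids.shape[1]:]
completion = processor.batch_decode(generated, skip_special_tokens=False)[0]
print(completion)
```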
|
|
|
|
|
For full scripts (batch JSONL inference, interleaved decoding, and vLLM-based evaluation), please refer to the official GitHub repository: |
|
|
https://github.com/ModalityDance/Omni-R1 |
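
As a rough illustration of the batch JSONL pattern (not the official script), the sketch below assumes one JSON object per line with hypothetical `prompt` and `image_path` fields, and reuses the `model` and `processor` loaded in the snippet above; the repository's scripts remain the reference implementation.

```python
import json

import torch
from PIL import Image

# Hypothetical schema: one JSON object per line, e.g. {"prompt": "... <image>", "image_path": "cat.png"}.
with open("inputs.jsonl") as f:
    records = [json.loads(line) for line in f]

placeholder_id = processor.tokenizer.encode("<image>", add_special_tokens=False)[0]

with open("outputs.jsonl", "w") as out:
    for rec in records:
        image = Image.open(rec["image_path"]).convert("RGB")
        inputs = processor(
            rec["prompt"],
            images=[image],
            return_for_text_completion=True,
            return_tensors="pt",
        ).to(model.device)

        # Same <image> placeholder substitution as in the single-example snippet above.
        input_ids = inputs["input_ids"].long().clone()
        image_tokens = model.get_image_tokens(inputs["pixel_values"].to(model.dtype))
        input_ids[input_ids == placeholder_id] = image_tokens.reshape(-1).to(
            dtype=torch.long, device=input_ids.device
        )

        with torch.no_grad():
            outputs = model.generate(
                input_ids=input_ids,
                max_length=4096,
                do_sample=True,
                temperature=0.5,
                top_p=0.9,
                pad_token_id=1,
                multimodal_generation_mode="unrestricted",
            )

        text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
        out.write(json.dumps({"prompt": rec["prompt"], "output": text}) + "\n")
```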
|
|
|
|
|
## License |
|
|
|
|
|
This project is licensed under the **MIT License**. |
|
|
It also complies with the licenses of referenced third-party projects and dependencies, including the **Chameleon Research License**. |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{cheng2026omnir1unifiedgenerativeparadigm, |
|
|
title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning}, |
|
|
author={Dongjie Cheng and Yongqi Li and Zhixin Ma and Hongru Cai and Yupeng Hu and Wenjie Wang and Liqiang Nie and Wenjie Li}, |
|
|
year={2026}, |
|
|
eprint={2601.09536}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.AI}, |
|
|
url={https://arxiv.org/abs/2601.09536}, |
|
|
} |
|
|
``` |
|
|
|