Omni-R1

Paper | Code | Omni-Bench

Overview

Omni-R1 is a 7B unified model built on GAIR/Anole-7b-v0.1 and trained with multimodal interleaved supervision. Training proceeds in two stages: PeSFT first establishes stable functional image generation, and PeRPO then performs RL refinement on unified tasks, enabling the model to produce interleaved multimodal reasoning trajectories.

Usage

import torch
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

# 1) Import & load
model_id = "ModalityDance/Omni-R1"  # or "ModalityDance/Omni-R1-Zero"
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# 2) Prepare a single input (prompt contains <image>)
prompt = "What is the smiling man in the image wearing? <image>"
image = Image.open("image.png").convert("RGB")

inputs = processor(
    prompt,
    images=[image],
    padding=False,
    return_for_text_completion=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)  # dtype cast only affects floating tensors (pixel_values)

# --- minimal image token preprocessing ---
# The processor expands the <image> placeholder into a run of image placeholder
# tokens; fill that run with the discrete VQ tokens of the input image.
input_ids = inputs["input_ids"].long()
pixel_values = inputs["pixel_values"]

placeholder_id = processor.tokenizer.encode("<image>", add_special_tokens=False)[0]
image_tokens = model.get_image_tokens(pixel_values)  # discrete VQ codebook ids, shape [1, N]

input_ids = input_ids.clone()
mask = (input_ids == placeholder_id)
input_ids[mask] = image_tokens.reshape(-1).to(dtype=torch.long, device=input_ids.device)

# 3) Generate; "unrestricted" lets the model emit interleaved text and image tokens
outputs = model.generate(
    input_ids=input_ids,
    max_length=4096,  # total budget, including the image-token-expanded prompt
    do_sample=True,
    temperature=0.5,
    top_p=0.9,
    pad_token_id=1,
    multimodal_generation_mode="unrestricted",
)

# 4) Decode the full sequence (prompt + generated continuation)
text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
print(text)
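
Note that generate() returns the prompt together with the continuation, so the decoded string above begins with the image-token-expanded prompt. A minimal sketch for decoding only the newly generated part, assuming the standard decoder-only layout where the prompt occupies the first input_ids.shape[1] positions of each output row:

# Keep only the tokens produced after the prompt before decoding.
prompt_len = input_ids.shape[1]
completion = processor.batch_decode(outputs[:, prompt_len:], skip_special_tokens=False)[0]
print(completion)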

For full scripts (batch JSONL inference, interleaved decoding, and vLLM-based evaluation), please refer to the official GitHub repository:
https://github.com/ModalityDance/Omni-R1
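
Before diving into those scripts, the following minimal sketch shows one way batch inference over a JSONL file can look. The file name inputs.jsonl and the "prompt"/"image" field names are assumptions for illustration only, not the schema of the official scripts; the loop simply reuses the single-example pipeline (model, processor, placeholder_id) from above.

import json

# Hypothetical JSONL layout: one {"prompt": ..., "image": ...} object per line.
with open("inputs.jsonl") as f:
    records = [json.loads(line) for line in f]

results = []
for rec in records:
    image = Image.open(rec["image"]).convert("RGB")
    inputs = processor(
        rec["prompt"],  # prompt must contain the <image> placeholder
        images=[image],
        padding=False,
        return_for_text_completion=True,
        return_tensors="pt",
    ).to(model.device, dtype=torch.bfloat16)

    input_ids = inputs["input_ids"].long().clone()
    image_tokens = model.get_image_tokens(inputs["pixel_values"])
    input_ids[input_ids == placeholder_id] = image_tokens.reshape(-1).to(
        dtype=torch.long, device=input_ids.device
    )

    outputs = model.generate(
        input_ids=input_ids,
        max_length=4096,
        do_sample=True,
        temperature=0.5,
        top_p=0.9,
        pad_token_id=1,
        multimodal_generation_mode="unrestricted",
    )
    results.append(processor.batch_decode(outputs, skip_special_tokens=False)[0])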

License

This project is licensed under the MIT License.
It also complies with the licenses of referenced third-party projects and dependencies, including the Chameleon Research License.

Citation

@misc{cheng2026omnir1unifiedgenerativeparadigm,
      title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning}, 
      author={Dongjie Cheng and Yongqi Li and Zhixin Ma and Hongru Cai and Yupeng Hu and Wenjie Wang and Liqiang Nie and Wenjie Li},
      year={2026},
      eprint={2601.09536},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.09536}, 
}