---
library_name: transformers
tags:
- multimodal
- reasoning
- sft
- rl
datasets:
- LightChen2333/M3CoT
- ModalityDance/Omni-Bench
base_model:
- GAIR/Anole-7b-v0.1
pipeline_tag: any-to-any
---

# Omni-R1-Zero

[![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b?style=for-the-badge&logo=arxiv)](https://arxiv.org/abs/2601.09536)
[![Code](https://img.shields.io/badge/GitHub-Code-blue?style=for-the-badge&logo=github)](https://github.com/ModalityDance/Omni-R1)
[![Omni-Bench](https://img.shields.io/badge/Dataset-Omni--Bench-fcc21b?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/datasets/ModalityDance/Omni-Bench)

## Overview

**Omni-R1-Zero** is trained **without multimodal annotations**. It bootstraps **step-wise visualizations** from **text-only CoT seeds** (e.g., M3CoT), and then follows the same PeSFT+PeRPO recipe as Omni-R1 to learn interleaved multimodal reasoning.

## Usage

```python
import torch
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

# 1) Import & load
model_id = "ModalityDance/Omni-R1-Zero"  # or a local checkpoint path
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# 2) Prepare a single input
prompt = "You are a helpful assistant.\nUser: Which of these would appear shinier when polished? A. Metal spoon B. Wooden spoon\nThink with images first, the image reasoning process and answer are enclosed within and XML tags, respectively.\nAssistant:"
inputs = processor(
    prompt,
    padding=False,
    return_for_text_completion=True,
    return_tensors="pt",
).to(model.device)

# 3) Call the model
outputs = model.generate(
    **inputs,
    max_length=4096,
    do_sample=True,
    temperature=1.0,
    top_p=0.9,
    pad_token_id=1,
    multimodal_generation_mode="unrestricted",
)

# 4) Get results
text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
print(text)
```

For full scripts (batch JSONL inference, interleaved decoding, and vLLM-based evaluation), please refer to the official GitHub repository: https://github.com/ModalityDance/Omni-R1

## License

This project is licensed under the **MIT License**. It also complies with the licenses of referenced third-party projects and dependencies, including the **Chameleon Research License**.

## Citation

```bibtex
@misc{cheng2026omnir1unifiedgenerativeparadigm,
      title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning},
      author={Dongjie Cheng and Yongqi Li and Zhixin Ma and Hongru Cai and Yupeng Hu and Wenjie Wang and Liqiang Nie and Wenjie Li},
      year={2026},
      eprint={2601.09536},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.09536},
}
```
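
## Appendix: Batch JSONL Inference (Sketch)

The official batch-inference scripts live in the GitHub repository linked above. The snippet below is only a minimal sketch of how such a loop could look using the same `ChameleonProcessor` / `ChameleonForConditionalGeneration` API shown in the Usage section; the file names (`questions.jsonl`, `generations.jsonl`) and the `"prompt"` / `"output"` field names are illustrative assumptions, not the repository's actual schema.

```python
import json

import torch
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

# Assumed file layout: one {"prompt": "..."} JSON object per line in the input file.
INPUT_JSONL = "questions.jsonl"
OUTPUT_JSONL = "generations.jsonl"

model_id = "ModalityDance/Omni-R1-Zero"
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

with open(INPUT_JSONL) as fin, open(OUTPUT_JSONL, "w") as fout:
    for line in fin:
        prompt = json.loads(line)["prompt"]
        inputs = processor(
            prompt,
            padding=False,
            return_for_text_completion=True,
            return_tensors="pt",
        ).to(model.device)

        # Same generation settings as the single-example usage above.
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_length=4096,
                do_sample=True,
                temperature=1.0,
                top_p=0.9,
                pad_token_id=1,
                multimodal_generation_mode="unrestricted",
            )

        # Keep special tokens so interleaved image tokens are preserved in the dump.
        text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
        fout.write(json.dumps({"prompt": prompt, "output": text}) + "\n")
```

As in the Usage example, `multimodal_generation_mode="unrestricted"` is assumed to be available in the Transformers build used by the repository; for throughput-oriented evaluation, the vLLM-based scripts in the repository are the better starting point.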