---
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# ViGoRL: Visually Grounded Reinforcement Learning for Visual Reasoning

This model card describes the ViGoRL (**Vi**sually **G**r**o**unded **R**einforcement **L**earning) model, introduced in our paper ["Grounded Reinforcement Learning for Visual Reasoning"](https://arxiv.org/abs/2505.23678).

**Authors:** Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J. Tarr, Aviral Kumar, Katerina Fragkiadaki

---

## Model Overview

ViGoRL is a vision-language model fine-tuned using reinforcement learning (RL) to explicitly anchor textual reasoning steps to visual coordinates. Inspired by human visual cognition, ViGoRL employs multi-turn visual grounding, dynamically zooming into image regions to perform fine-grained visual reasoning and grounding.

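
The zoom-in step can be pictured as cropping a window around the coordinate the model points at. The sketch below is illustrative only and is not taken from the ViGoRL code: the helper name `zoom_box`, the fixed square `window` of 512 pixels, and the clamping policy are all assumptions. It returns a `(left, top, right, bottom)` box that stays inside the image; a box in that form can be passed to an image library's crop routine, such as Pillow's `Image.crop`.

```python
def zoom_box(point, image_size, window=512):
    """Return a (left, top, right, bottom) crop box centred on `point`.

    Hypothetical helper: the window size and clamping rule are assumptions,
    not the crop policy used to train ViGoRL.
    """
    x, y = point
    width, height = image_size
    # The window cannot be larger than the image itself.
    w = min(window, width)
    h = min(window, height)
    # Centre the window on the point, then shift it back inside the image.
    left = min(max(x - w // 2, 0), width - w)
    top = min(max(y - h // 2, 0), height - h)
    return (left, top, left + w, top + h)


# A point in the middle of a 2048x1536 image keeps the window centred on it.
print(zoom_box((1028, 1093), (2048, 1536)))  # → (772, 837, 1284, 1349)
# A point near the corner is clamped so the box never leaves the image.
print(zoom_box((10, 10), (2048, 1536)))  # → (0, 0, 512, 512)
```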
This model was trained using supervised fine-tuning (SFT) on visually-grounded reasoning traces generated via Monte Carlo Tree Search (MCTS), followed by reinforcement learning with Group Relative Policy Optimization (GRPO).

---

## Model Details

* **Base Architecture:** Qwen2.5-VL (3B or 7B parameters, as indicated by the model name)
* **Training Paradigm:**
  * Supervised Fine-Tuning on MCTS-generated reasoning traces
  * Group Relative Policy Optimization (GRPO)
  * Multi-turn visual grounding with dynamic zoom-in feedback (only for checkpoints with "Multiturn" in the name)

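
For readers unfamiliar with GRPO, its core idea is that each sampled response is scored relative to the other responses drawn for the same prompt, so no separate value network is needed. The sketch below shows only that group-relative advantage step in its commonly described form (reward minus group mean, divided by group standard deviation). It is not the ViGoRL training code: the function name, the use of the population standard deviation, and the `eps` constant are assumptions, and the policy-gradient loss, clipping, and any KL term are omitted. Responses that beat the group average get a positive advantage, and a group with identical rewards yields all zeros.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise the rewards of one group of responses sampled for one prompt.

    Hypothetical helper illustrating the group-relative baseline in GRPO;
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to the same prompt: two rewarded, two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```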
---

## Use Cases

ViGoRL targets visual reasoning tasks that require precise visual grounding and region-level reasoning. The model name indicates the domain a given checkpoint targets.

* **Spatial Reasoning:** SAT-2, BLINK, RoboSpatial
* **Visual Search:** V\*Bench
* **Web Interaction and Grounding:** ScreenSpot (Pro and V2), VisualWebArena

---

## Usage

Load the model with Hugging Face's Transformers library. The example also needs the `qwen-vl-utils` package, which provides `process_vision_info`:
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Default: load the model on the available device(s).
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "gsarch/ViGoRL-Multiturn-3b-Visual-Search", torch_dtype="auto", device_map="auto"
# )  # replace with any of the ViGoRL models

# We recommend enabling flash_attention_2 for better acceleration and memory saving.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "gsarch/ViGoRL-Multiturn-3b-Visual-Search",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# Default processor
processor = AutoProcessor.from_pretrained("gsarch/ViGoRL-Multiturn-3b-Visual-Search")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range
# of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained(
#     "gsarch/ViGoRL-Multiturn-3b-Visual-Search", min_pixels=min_pixels, max_pixels=max_pixels
# )

# Example with a remote image:
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "image",
#                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
#             },
#             {"type": "text", "text": "What color is the leash."},
#         ],
#     }
# ]

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/image.png",
            },
            {"type": "text", "text": "QUERY HERE"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device the model was placed on by device_map="auto".
inputs = inputs.to(model.device)

# Inference: generate the output and strip the prompt tokens from each sequence
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)  # For multi-turn checkpoints, this prints a single tool-call turn.
# Example output of gsarch/ViGoRL-Multiturn-3b-Visual-Search: ['<think> The leash appears to be red, as seen near the dog\'s paw and the person\'s hand. (1028, 1093). </think>\n<tool_call>\n{"name": "search_coordinate", "arguments": {"coordinate": [1028, 1093]}}\n</tool_call>']
```
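
To act on a multi-turn response, the caller has to pull the tool call out of the generated text. The following is a minimal sketch of that step, assuming the output always follows the format of the example output above (a JSON object wrapped in `<tool_call>` tags with a `search_coordinate` name and a `coordinate` argument). The helper name `parse_tool_calls` is hypothetical and is not part of the ViGoRL code or of Transformers. Malformed JSON is skipped rather than raised, so a caller can treat an empty list as "no tool call in this turn".

```python
import json
import re

# Matches one JSON object wrapped in <tool_call> ... </tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Return the list of tool calls (as dicts) found in one generated turn."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            # Skip payloads that are not valid JSON instead of failing the turn.
            continue
    return calls


# The example output shown in the usage snippet above.
sample = (
    "<think> The leash appears to be red, as seen near the dog's paw "
    "and the person's hand. (1028, 1093). </think>\n"
    '<tool_call>\n{"name": "search_coordinate", '
    '"arguments": {"coordinate": [1028, 1093]}}\n</tool_call>'
)

calls = parse_tool_calls(sample)
print(calls[0]["name"])  # → search_coordinate
print(calls[0]["arguments"]["coordinate"])  # → [1028, 1093]
```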
**Important**: This model requires a system prompt for proper usage. Please see the model's chat template for details.

---

## Datasets and Training Data

Training datasets and generated reasoning chains are publicly available:

* [Code](https://github.com/Gabesarch/grounded-rl)
* [ViGoRL Datasets on Hugging Face](https://huggingface.co/datasets/gsarch/vigorl_datasets)

---

## Citation

If you use ViGoRL in your research or applications, please cite our paper:

```bibtex
@article{sarch2025vigorl,
  title={Grounded Reinforcement Learning for Visual Reasoning},
  author={Sarch, Gabriel and Saha, Snigdha and Khandelwal, Naitik and Jain, Ayush and Tarr, Michael J and Kumar, Aviral and Fragkiadaki, Katerina},
  year={2025}
}
```

---

## Contact

For questions, feedback, or collaborations, please reach out to Gabriel Sarch or open an issue in our [GitHub repository](https://github.com/Gabesarch/grounded-rl).

---