- Argus-Colqwen3.5-9b-v0 · fp32 release
- TL;DR: leaderboard standing
- What is novel here
- Model details
- Performance: ViDoRe v1 (English, nDCG@5, 10 tasks)
- Performance: ViDoRe v2 (English, nDCG@5, 4 tasks)
- ViDoRe v3
- Storage cost
- Installation
- Usage: text + image retrieval
- Reproduce the leaderboard ViDoRe results with MTEB
- Reproduce on the official ViDoRe-benchmark library
- Training
- Limitations
- License
- Citation
- Contact
Argus-Colqwen3.5-9b-v0 · fp32 release
Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval
University of Innsbruck, Data Science group · 2026
DataScience-UIBK/Argus-Colqwen3.5-9b-v0 is an 8.8-billion-parameter visual-document retriever built on Qwen3.5-VL-9B-Instruct. It uses a ColPali-style multi-vector (MaxSim) late-interaction head, and replaces the dense projection with a query-conditioned latent mixture of experts (MoE) that routes each region of visual tokens through the top 2 of 4 latent specialists, with the choice conditioned on the query.
This is the fp32 merged release: the LoRA adapter is folded into the base in float32 to preserve trained precision. A bfloat16 companion lives at DataScience-UIBK/Argus-Colqwen3.5-9b-v0-bf16 for memory-constrained deployment. Smaller siblings: 4B fp32, 2B fp32.
TL;DR: leaderboard standing
- Co-leads the ViDoRe v1 leaderboard at V1 = 0.9267, tied with nvidia/nemotron-vl-8b-v2 (0.927) within rounding noise and ahead of every other public retriever.
- Best Argus result on ViDoRe v2 (V2 = 0.6915), a +0.05 jump over the 4B sibling and ahead of the strongest 4B-class peer we compare against (Ops-Colqwen3-4B at 0.687).
- 8.8 B parameters, 1024-d per-token embedding, ≤ 2048 visual tokens / page; fits on a single 24 GB GPU at bf16 inference.
- Apache 2.0, trained on public data only (ViDoRe / ColPali train set, VDR-Multilingual subsets, and Docmatix).
What is novel here
Most ColPali-style retrievers project every visual token through the same dense head, no matter what the query is. Argus replaces that dense head with a sparse mixture in which the gates depend on both the visual token and a pooled query summary, so the same page gets routed differently for different queries:
- Region pooling. Visual tokens from the backbone are grouped into 4-token regions, giving the router a coarser but spatially-aware view of the page.
- Query-conditioned latent gating (`GateScalars`). The router input is `region + region_coord_proj(coords) + query_context_proj(pooled_query)`. The query summary makes routing task-aware: e.g. a financial-numbers query routes through a different expert than a layout query, even on the exact same page.
- Sparse top-k=2 of 4 latent specialists, fused with the always-on shared dense expert via two learnable gating scalars: `final = base + sigmoid(g_s)·shared_out + sigmoid(g_e)·specialist_out`.
- Region-aware load balancing. Auxiliary losses combine load balance + KL-uniform + 0.01·router-z² to keep all 4 experts useful and suppress routing collapse.
- 3-stage curriculum. (a) Dense baseline (no MoE, also serves as teacher) → (b) MoE balance warmup (gates frozen, no PEFT, just stop expert collapse) → (c) joint retrieval with KL distillation from the dense baseline (`distillation_weight=0.5`).
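The routing and fusion steps in the list above can be sketched in plain Python. This is a toy illustration under assumptions, not the released implementation: the helper names (`router_input`, `top_k_route`, `fuse`), the linear router weights, and the softmax renormalisation over the selected experts are all hypothetical. Only the three-term router input, the top-2-of-4 selection, and the fusion formula come from the card.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def router_input(region, coord_emb, query_emb):
    """region + region_coord_proj(coords) + query_context_proj(pooled_query).
    Both projections are assumed to have been applied already."""
    return [r + c + q for r, c, q in zip(region, coord_emb, query_emb)]

def top_k_route(logits, k=2):
    """Keep the k largest router logits and softmax-normalise them (assumption)."""
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def fuse(base, shared_out, expert_outs, routes, g_s, g_e):
    """final = base + sigmoid(g_s)*shared_out + sigmoid(g_e)*specialist_out."""
    dim = len(base)
    specialist_out = [sum(w * expert_outs[i][d] for i, w in routes) for d in range(dim)]
    return [
        base[d] + sigmoid(g_s) * shared_out[d] + sigmoid(g_e) * specialist_out[d]
        for d in range(dim)
    ]

# Hypothetical linear router: one weight vector per latent specialist.
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

def route(region, coord_emb, query_emb):
    x = router_input(region, coord_emb, query_emb)
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    return top_k_route(logits, k=2)

region, coords = [0.2, 0.1], [0.0, 0.0]
routes_a = route(region, coords, [1.0, 0.0])   # pooled summary of query A
routes_b = route(region, coords, [-1.0, 0.0])  # pooled summary of query B
print([i for i, _ in routes_a], [i for i, _ in routes_b])

expert_outs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, -1.0]]
final = fuse([1.0, 0.0], [0.5, 0.5], expert_outs, routes_a, g_s=0.0, g_e=0.0)
print(final)
```

In this toy setup the same region selects a different pair of specialists for the two queries, which is the behaviour the query-conditioned gate is meant to provide.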
For the 9B release, the joint stage was extended on the larger VDR1.5M + Docmatix mixture (vdr_docmatix_full), giving the MoE more diverse layouts to specialise on.
The router sits near the top of the backbone (layer −5) so the gating decision is informed by deep visual semantics rather than raw patch features.
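The card names the three auxiliary routing terms (load balance, KL-uniform, and router-z² weighted by 0.01) but does not give their exact definitions. The sketch below shows one common way to compute such terms, assuming a Switch-Transformer-style load-balance term, a KL divergence from the mean router distribution to the uniform distribution, and a squared log-sum-exp z-loss. The function name `routing_aux_loss` and the unit weights on the first two terms are hypothetical.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def routing_aux_loss(all_logits, k=2, z_weight=0.01):
    """all_logits: one list of router logits per region (n_regions x n_experts)."""
    n_regions = len(all_logits)
    n_experts = len(all_logits[0])
    probs = [softmax(row) for row in all_logits]

    # Mean router probability assigned to each expert.
    mean_prob = [sum(p[e] for p in probs) / n_regions for e in range(n_experts)]

    # Fraction of top-k slots that each expert actually receives.
    counts = [0] * n_experts
    for row in all_logits:
        for e in sorted(range(n_experts), key=lambda i: row[i], reverse=True)[:k]:
            counts[e] += 1
    frac = [c / (n_regions * k) for c in counts]

    # Assumed Switch-style balance term: n_experts * sum_e frac_e * mean_prob_e.
    load_balance = n_experts * sum(f * p for f, p in zip(frac, mean_prob))
    # KL(mean_prob || uniform) = sum_e p_e * log(p_e * n_experts).
    kl_uniform = sum(p * math.log(p * n_experts) for p in mean_prob if p > 0)
    # Router z-loss: mean squared log-sum-exp of the logits.
    z_values = [math.log(sum(math.exp(x) for x in row)) for row in all_logits]
    router_z = sum(z * z for z in z_values) / n_regions

    return load_balance + kl_uniform + z_weight * router_z

balanced = routing_aux_loss([[0.0, 0.0, 0.0, 0.0]] * 4)
collapsed = routing_aux_loss([[5.0, 0.0, 0.0, 0.0]] * 4)
print(balanced, collapsed)
```

Under these assumed definitions, a router that always prefers one expert is penalised more heavily than one that spreads probability evenly, which is the effect the card attributes to the auxiliary losses.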
Model details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-VL-9B-Instruct |
| Total parameters | 8.82 B |
| Per-token embedding dim | 1024 |
| Max visual tokens / page | 2048 |
| Max text tokens | 32 768 |
| Similarity function | MaxSim (ColBERT / ColPali-style late interaction) |
| MoE specialists | 4 latent + 1 shared dense |
| Top-k experts per token | 2 |
| Region size (visual chunking) | 4 (so each region = 4 visual tokens) |
| Router placement | backbone layer −5 |
| Routing aux losses | load balance + KL-uniform + 0.01 · router-z² |
| Weight precision (this release) | float32 |
| License | Apache 2.0 |
| Model size on disk | ~33 GB |
| VRAM @ bf16 inference | ~17 GB |
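The MaxSim similarity listed in the table can be illustrated with a minimal pure-Python sketch: for each query token, take the best dot product against any page token, then sum over query tokens. The function name `maxsim` and the tiny 2-d embeddings are illustrative only; the released model uses 1024-d token embeddings and exposes scoring through `processor.score`.

```python
def maxsim(query_emb, doc_emb):
    """Late-interaction score: sum over query tokens of the max dot product
    against any document token."""
    return sum(
        max(sum(q * d for q, d in zip(q_tok, d_tok)) for d_tok in doc_emb)
        for q_tok in query_emb
    )

# Toy multi-vector embeddings: one vector per token.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.5, 0.5]]
page_b = [[0.0, 1.0], [0.0, 0.9], [0.1, 0.0]]

scores = [maxsim(query, page) for page in (page_a, page_b)]
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

Because every query token is matched independently, a page scores well only if it contains a good match for each part of the query, which is why the per-token embeddings have to be stored for every page.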
Performance: ViDoRe v1 (English, nDCG@5, 10 tasks)
Per-task scores measured with the official mteb 2.12 library on the published weights, side-by-side with every Argus sibling for transparency.
| Task | 2B fp32 | 2B bf16 | 4B fp32 | 4B bf16 | 9B fp32 (this) | 9B bf16 |
|---|---|---|---|---|---|---|
| ArxivQA | 0.9027 | 0.9027 | 0.9095 | 0.9126 | 0.9228 | 0.9217 |
| DocVQA | 0.6747 | 0.6747 | 0.6770 | 0.6779 | 0.6809 | 0.6826 |
| InfoVQA | 0.9497 | 0.9497 | 0.9463 | 0.9447 | 0.9426 | 0.9449 |
| ShiftProject | 0.9133 | 0.9133 | 0.9470 | 0.9346 | 0.9365 | 0.9298 |
| SyntheticDocQA-AI | 0.9963 | 0.9963 | 0.9963 | 0.9926 | 0.9963 | 0.9926 |
| SyntheticDocQA-Energy | 0.9726 | 0.9726 | 0.9789 | 0.9750 | 0.9732 | 0.9769 |
| SyntheticDocQA-Government | 0.9729 | 0.9729 | 0.9779 | 0.9779 | 0.9889 | 0.9889 |
| SyntheticDocQA-Healthcare | 0.9926 | 0.9926 | 0.9963 | 0.9963 | 0.9963 | 0.9926 |
| TabFQuAD | 0.9336 | 0.9336 | 0.9533 | 0.9544 | 0.9750 | 0.9724 |
| TatDQA | 0.8403 | 0.8403 | 0.8480 | 0.8485 | 0.8545 | 0.8567 |
| Average | 0.9149 | 0.9149 | 0.9230 | 0.9214 | 0.9267 | 0.9259 |
The 9B model leads the smaller siblings outright on 5 of 10 V1 tasks (ArxivQA, DocVQA, SyntheticDocQA-Government, TabFQuAD, TatDQA) and ties for the top score on two more (SyntheticDocQA-AI and SyntheticDocQA-Healthcare). The 4B sibling still wins on ShiftProject and SyntheticDocQA-Energy (by ~0.005–0.010, at noise level). The 2B sibling has a small edge on InfoVQA, likely a regularisation effect on smaller backbones for layout-driven QA.
ViDoRe v1: overall leaderboard comparison
| Rank | Model | Params | dim | V1 avg |
|---|---|---|---|---|
| 1 | Argus-Colqwen3.5-9b-v0 (this, fp32) | 8.8 B | 1024 | 0.9267 |
| 1 | nvidia/nemotron-vl-8b-v2 | 8.0 B | hidden | 0.927 |
| 3 | Argus-Colqwen3.5-4b-v0 (sibling, fp32) | 4.0 B | 1024 | 0.9230 |
| 4 | nvidia/llama-nemotron-colembed-vl-3b-v2 | 3.0 B | hidden | 0.917 |
| 5 | nvidia/nemotron-colembed-vl-4b-v2 | 4.0 B | hidden | 0.916 |
| 6 | athrael-soju/colqwen3.5-4.5B-v3 | 4.5 B | 320 | 0.915 |
| 7 | Argus-Colqwen3.5-2b-v0 (sibling, fp32) | 2.3 B | 1024 | 0.9149 |
| 8 | OpenSearch-AI/Ops-Colqwen3-4B | 4.0 B | 2560 | 0.914 |
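(0.9267 vs 0.927 is a gap of 0.0003, below the three-decimal precision of the competing entry and within eval noise, so we treat it as a tie. Argus leads by a larger margin on V2, +0.0045 over Ops-Colqwen3-4B; see below.)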
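For readers unfamiliar with the metric in these tables, the following is a minimal sketch of nDCG@5 for a single query. It assumes linear gain (relevance used directly) and that the ranked list covers every relevant page; the reported numbers come from the mteb library, not from this code, and the function names are illustrative.

```python
import math

def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=5):
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# One relevant page per query, retrieved at rank 1, rank 2, and rank 6.
print(ndcg_at_k([1, 0, 0, 0, 0, 0]))
print(ndcg_at_k([0, 1, 0, 0, 0, 0]))
print(ndcg_at_k([0, 0, 0, 0, 0, 1]))
```

With a single relevant page, the score is 1.0 at rank 1, drops to 1/log2(3) at rank 2, and is 0 when the page falls outside the top 5. The task score is the mean over all queries.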
Performance: ViDoRe v2 (English, nDCG@5, 4 tasks)
| Task | 2B fp32 | 2B bf16 | 4B fp32 | 4B bf16 | 9B fp32 (this) | 9B bf16 |
|---|---|---|---|---|---|---|
| BioMedicalLectures | 0.6499 | 0.6499 | 0.6438 | 0.6349 | 0.6619 | 0.6633 |
| ESGReports-HighLevel | 0.6936 | 0.6936 | 0.6991 | 0.7079 | 0.7905 | 0.7912 |
| ESGReports | 0.5988 | 0.5988 | 0.6218 | 0.6175 | 0.6760 | 0.6764 |
| EconomicsReports | 0.5186 | 0.5186 | 0.5980 | 0.5918 | 0.6377 | 0.6278 |
| Average | 0.6152 | 0.6152 | 0.6407 | 0.6380 | 0.6915 | 0.6897 |
The V2 jump from 4B to 9B (+0.05 on average) is the largest improvement we see across the Argus family: the bigger backbone helps on layout-heavy ESG reports and dense numeric economics pages, where the 4B was visibly behind Ops-Colqwen3-4B.
ViDoRe v2: overall context
| Model | V2 avg |
|---|---|
| Argus-Colqwen3.5-9b-v0 (fp32, this) | 0.6915 |
| Ops-Colqwen3-4B (dim 2560) | 0.687 |
| TomoroAI/tomoro-colqwen3-embed-4b | 0.660 |
| Argus-Colqwen3.5-4b-v0 (sibling, fp32) | 0.6407 |
| Argus-Colqwen3.5-2b-v0 (sibling, fp32) | 0.6152 |
Argus 9B is the first sub-10B retriever to clear V2 = 0.69 while keeping the per-token embedding at 1024-d (vs Ops's 2560-d, a 2.5× per-token storage cost).
ViDoRe v3
Not yet evaluated for this release. Numbers will be added in a follow-up commit once the v3 reproducer run completes.
Storage cost
Per-page storage for an indexed corpus, assuming bf16 token embeddings (2 bytes per value) and the maximum token count per page:
| Model | Tokens/page | Dim | Size/page |
|---|---|---|---|
| Ops-Colqwen3-4B | 1280 | 2560 | 6.6 MB |
| Argus-Colqwen3.5-9b-v0 | 2048 | 1024 | 4.2 MB |
| Argus-Colqwen3.5-4b-v0 | 2048 | 1024 | 4.2 MB |
| Argus-Colqwen3.5-2b-v0 | 2048 | 1024 | 4.2 MB |
| TomoroAI/tomoro-colqwen3-embed-4b | 1280 | 320 | 0.8 MB |
Per-page corpus storage is identical across the Argus family. The choice between siblings is about inference cost (9B is the slowest) and GPU memory, not corpus size on disk.
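The figures in the storage table follow from tokens × dim × 2 bytes. A short sketch of that arithmetic is below; the helper name `page_bytes` is illustrative, and the result is an upper bound because a page may produce fewer tokens than the maximum.

```python
def page_bytes(tokens_per_page, dim, bytes_per_value=2):
    """Storage for one page of multi-vector embeddings (bf16 = 2 bytes/value)."""
    return tokens_per_page * dim * bytes_per_value

models = [
    ("Ops-Colqwen3-4B", 1280, 2560),
    ("Argus-Colqwen3.5-9b-v0", 2048, 1024),
    ("TomoroAI/tomoro-colqwen3-embed-4b", 1280, 320),
]

for name, tokens, dim in models:
    size = page_bytes(tokens, dim)
    print(f"{name}: {size:,} bytes = {size / 1e6:.1f} MB per page")
```

Scaling is linear in corpus size, so multiplying the per-page figure by the number of indexed pages gives a rough index size before any compression.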
Installation
```bash
# Qwen3.5-VL is only in transformers 5.x
pip install "transformers>=5.0.0,<6.0.0"

# MTEB 2.12 ships transformers 4.57.6 by default, so upgrade explicitly afterwards
pip install "mteb>=2.12,<3.0.0"
pip install -U "transformers>=5.0,<6.0"

# Optional: faster attention on Hopper / Ampere
pip install flash-attn==2.6.3 --no-build-isolation
```
After upgrading transformers, wipe the cached remote-code modules so the new ones load:
```bash
rm -rf ~/.cache/huggingface/modules/transformers_modules
```
Usage: text + image retrieval
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "DataScience-UIBK/Argus-Colqwen3.5-9b-v0"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.bfloat16  # or torch.float32 for max precision

model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    dtype=DTYPE,  # `dtype` replaces the deprecated `torch_dtype` in recent transformers
    attn_implementation="flash_attention_2",  # needs flash-attn and a GPU; else use "sdpa" or None
    device_map=DEVICE,
).eval()

processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=2048,
)

queries = [
    "What is the company's revenue in 2019?",
    "How does the proposed model compare to baselines?",
]
documents = [
    Image.open("page_a.png").convert("RGB"),
    Image.open("page_b.png").convert("RGB"),
]

# Multi-vector embeddings for queries and page images
q_emb = model.encode_queries(processor, queries)
d_emb = model.encode_images(processor, documents)

# Late-interaction (MaxSim) scores between queries and pages
scores = processor.score(q_emb, d_emb)
print(scores)
```
Reproduce the leaderboard ViDoRe results with MTEB
```python
import mteb

m = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-9b-v0")
v1 = mteb.get_benchmark("ViDoRe(v1)").tasks
v2 = mteb.get_benchmark("ViDoRe(v2)").tasks
mteb.MTEB(tasks=v1 + v2).run(m, encode_kwargs={"batch_size": 2})
```
A single H100 80 GB completes the full V1 + V2 run in roughly 6–8 hours for the 9B fp32 (about 2× the 4B runtime). Use batch_size=2 for safety; 4 may OOM on 80 GB once activations + KV cache stack up.
Reproduce on the official ViDoRe-benchmark library
```bash
pip install vidore-benchmark
vidore-benchmark evaluate-retriever \
    --model-class colqwen2 \
    --model-name DataScience-UIBK/Argus-Colqwen3.5-9b-v0 \
    --collection-name vidore-v1
```
Training
| Setting | Value |
|---|---|
| Backbone | Qwen/Qwen3.5-VL-9B-Instruct (Apache-2.0) |
| Stage 1: dense baseline | trains the standard ColPali head; serves as the teacher |
| Stage 2: MoE balance warmup | gates frozen, no PEFT, short; only goal is to prevent expert collapse |
| Stage 3: joint retrieval w/ distillation | PEFT on, gates trainable, KL distillation from stage-1 teacher (distillation_weight=0.5); train mix = vdr_docmatix_full (VDR1.5M + Docmatix) |
| LoRA rank | 32 (folded into base for this release via merge_and_unload() in fp32) |
| Datasets | vidore/colpali_train_set + llamaindex/vdr-multilingual-train (subsets) + Docmatix-IR (in-domain) |
| Hardware | 4 × NVIDIA H100 80 GB (zen4_0768_h100x4 partition, UIBK LEO5 cluster) |
| Optimiser | AdamW, lr = 5e-5 with linear warmup |
| Precision | bf16 forward / fp32 master + LoRA |
| Effective batch size | 64 |
The merge step that produced this release was run in float32 throughout (merge_and_unload() on the LoRA adapter, then sharded to safetensors). The companion bf16 release ran the same merge in bfloat16 β see the bf16 sibling card.
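The card states that stage 3 combines retrieval training with KL distillation from the stage-1 teacher at `distillation_weight=0.5`, but not how the two terms are defined. The sketch below is one plausible reading, under loud assumptions: an in-batch softmax cross-entropy as the retrieval loss, KL(teacher ‖ student) over the per-query score distributions, and a plain weighted sum. The names `log_softmax` and `joint_loss` are hypothetical.

```python
import math

def log_softmax(scores):
    peak = max(scores)
    lse = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - lse for s in scores]

def joint_loss(student_scores, teacher_scores, positive_idx, distillation_weight=0.5):
    """Assumed form: cross-entropy on the positive page
    plus distillation_weight * KL(teacher || student)."""
    s_logp = log_softmax(student_scores)
    t_logp = log_softmax(teacher_scores)
    retrieval = -s_logp[positive_idx]
    kl = sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))
    return retrieval + distillation_weight * kl

# Scores of one query against three in-batch pages; page 0 is the positive.
student = [2.0, 0.0, 0.0]
teacher = [3.0, 1.0, -1.0]

matched = joint_loss(student, student, positive_idx=0)  # KL term vanishes
distilled = joint_loss(student, teacher, positive_idx=0)
print(matched, distilled)
```

When the student's score distribution already matches the teacher's, the KL term is zero and only the retrieval loss remains; any disagreement adds a non-negative penalty scaled by the distillation weight.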
Limitations
- English-dominant; the multilingual training subset is small and we omit multilingual eval from this release.
- 4 experts × top-2 routing adds ~5 % to total inference latency vs the dense backbone (the LLM dominates total cost).
- 9B at bf16 needs ~17 GB VRAM just for weights, so single-GPU inference requires a ≥ 24 GB GPU.
- ViDoRe v3 numbers are pending; will be added once the public reproducer run finishes.
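The weight-memory figures quoted in this card can be checked with simple arithmetic: parameters × bytes per parameter. The sketch below uses the 8.82 B parameter count from the model details table; the helper name `weight_bytes` is illustrative, and the numbers cover weights only, not activations or framework overhead.

```python
PARAMS = 8.82e9  # total parameters, from the model details table

def weight_bytes(params, bytes_per_param):
    """Memory needed for the weights alone."""
    return params * bytes_per_param

bf16_gb = weight_bytes(PARAMS, 2) / 1e9        # decimal gigabytes
fp32_gb = weight_bytes(PARAMS, 4) / 1e9
fp32_gib = weight_bytes(PARAMS, 4) / 2**30     # binary gibibytes

print(f"bf16 weights: {bf16_gb:.2f} GB")
print(f"fp32 weights: {fp32_gb:.2f} GB ({fp32_gib:.1f} GiB)")
```

The bf16 result is in line with the ~17 GB quoted above, and the fp32 result expressed in GiB is consistent with the ~33 GB on-disk size, which leaves headroom on a 24 GB GPU for activations at bf16.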
License
Apache 2.0, inherited from Qwen3.5-VL-9B-Instruct. You may use, modify, and redistribute this model commercially, with attribution.
Citation
```bibtex
@misc{argus2026,
  title  = {Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval},
  author = {DataScience-UIBK team},
  year   = {2026},
  url    = {https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-9b-v0},
}
```
Contact
- Org: DataScience-UIBK, University of Innsbruck
- Issues: open one on this repo's Community tab.