Argus-Colqwen3.5-9b-v0 · fp32 release

Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval
University of Innsbruck, Data Science group · 2026

DataScience-UIBK/Argus-Colqwen3.5-9b-v0 is an 8.8-billion-parameter visual-document retriever built on Qwen3.5-VL-9B-Instruct. It uses a ColPali-style multi-vector (MaxSim) late-interaction head, and replaces the dense projection with a query-conditioned latent mixture of experts (MoE) that routes each region of visual tokens through the top two of four specialists, with the choice conditioned on the query.

This is the fp32 merged release — the LoRA adapter is folded into the base in float32 to preserve trained precision. A bfloat16 companion lives at DataScience-UIBK/Argus-Colqwen3.5-9b-v0-bf16 for memory-constrained deployment. Smaller siblings: 4B fp32, 2B fp32.

TL;DR — leaderboard standing

Co-leads the ViDoRe v1 leaderboard at V1 = 0.9267, tied with nvidia/nemotron-vl-8b-v2 (0.927) within rounding noise and ahead of the other public retrievers compared below.
Best Argus result on ViDoRe v2 (V2 = 0.6915), a +0.05 jump over the 4B sibling and narrowly ahead of the strongest 4B-class peer, Ops-Colqwen3-4B (0.687).
8.8 B parameters, 1024-d per-token embedding, ≤ 2048 visual tokens / page — fits on a single 24 GB GPU at bf16 inference.
Apache 2.0, trained on public data only: the ViDoRe train set, VDR-Multilingual subsets, and Docmatix.

What is novel here

Most ColPali-style retrievers project every visual token through the same dense head, no matter what the query is. Argus replaces that dense head with a sparse mixture in which the gates depend on both the visual token and a pooled query summary, so the same page gets routed differently for different queries:

Region pooling. Visual tokens from the backbone are grouped into 4-token regions, giving the router a coarser but spatially-aware view of the page.
Query-conditioned latent gating (GateScalars). The router input is region + region_coord_proj(coords) + query_context_proj(pooled_query). The query summary makes routing task-aware — e.g. a financial-numbers query routes through a different expert than a layout query, even on the exact same page.
Sparse top-k=2 of 4 latent specialists, fused with the always-on shared dense expert via two learnable gating scalars: final = base + sigmoid(g_s)·shared_out + sigmoid(g_e)·specialist_out.
Region-aware load balancing. Auxiliary losses combine load balance + KL-uniform + 0.01·router-z² to keep all 4 experts useful and suppress routing collapse.
3-stage curriculum. (a) Dense baseline (no MoE, also serves as teacher) → (b) MoE balance warmup (gates frozen, no PEFT, just stop expert collapse) → (c) joint retrieval with KL distillation from the dense baseline (distillation_weight=0.5).
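
The routing and fusion steps in the list above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the released implementation: the toy hidden size, the random weights, mean pooling over each region, the tanh "experts", and the `base` projection are all placeholders. Only the router input sum, the top-2-of-4 selection, and the sigmoid-gated fusion follow the description.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_EXPERTS, TOP_K, REGION = 16, 4, 2, 4   # D is a toy hidden size (assumption)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def route_page(tokens, coords, pooled_query, params):
    """tokens: (T, D) visual tokens, coords: (T // REGION, 2), pooled_query: (D,)."""
    # Region pooling: average each group of REGION consecutive tokens (pooling op is assumed).
    regions = tokens.reshape(-1, REGION, D).mean(axis=1)                 # (R, D)
    # Router input = region + region_coord_proj(coords) + query_context_proj(pooled_query).
    router_in = regions + coords @ params["coord_proj"] + pooled_query @ params["query_proj"]
    logits = router_in @ params["router"]                                # (R, N_EXPERTS)
    # Sparse top-k: keep the TOP_K largest logits per region, softmax over those only.
    top_idx = np.argsort(logits, axis=1)[:, -TOP_K:]                     # (R, TOP_K)
    top_logits = np.take_along_axis(logits, top_idx, axis=1)
    weights = np.exp(top_logits - top_logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Specialist output: weighted sum of the selected toy experts.
    specialist_out = np.zeros_like(regions)
    for r in range(regions.shape[0]):
        for k in range(TOP_K):
            e = top_idx[r, k]
            specialist_out[r] += weights[r, k] * np.tanh(regions[r] @ params["experts"][e])
    shared_out = np.tanh(regions @ params["shared"])                     # always-on shared expert
    base = regions @ params["base"]                                      # stand-in for `base`
    # final = base + sigmoid(g_s) * shared_out + sigmoid(g_e) * specialist_out
    final = base + sigmoid(params["g_s"]) * shared_out + sigmoid(params["g_e"]) * specialist_out
    return final, top_idx

params = {
    "coord_proj": rng.normal(size=(2, D)),
    "query_proj": rng.normal(size=(D, D)) * 0.1,
    "router": rng.normal(size=(D, N_EXPERTS)),
    "experts": rng.normal(size=(N_EXPERTS, D, D)) * 0.1,
    "shared": rng.normal(size=(D, D)) * 0.1,
    "base": rng.normal(size=(D, D)) * 0.1,
    "g_s": 0.0,
    "g_e": 0.0,
}
tokens = rng.normal(size=(32, D))        # one page: 32 visual tokens -> 8 regions
coords = rng.uniform(size=(8, 2))        # normalised (x, y) per region (assumed format)
out_a, experts_a = route_page(tokens, coords, rng.normal(size=D), params)
out_b, experts_b = route_page(tokens, coords, rng.normal(size=D), params)
print(experts_a[:3], experts_b[:3])      # same page, two queries: expert choices may differ
```

Because the pooled query enters the router input additively, changing the query shifts every region's logits and can change which two specialists are selected, even though the page tokens are identical.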

For the 9B release, the joint stage was extended on the larger VDR1.5M + Docmatix mixture (vdr_docmatix_full), giving the MoE more diverse layouts to specialise on.

The router sits near the top of the backbone (layer −5) so the gating decision is informed by deep visual semantics rather than raw patch features.
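
The auxiliary routing objective listed above (load balance + KL-uniform + 0.01·router-z²) could look roughly like the sketch below. The card gives only the names and the 0.01 weight. The concrete formulas used here (a Switch-Transformer-style load-balance term, KL from the mean routing distribution to uniform, and a squared log-sum-exp z-loss) are assumptions chosen because they are the common forms of these terms.

```python
import numpy as np

def routing_aux_loss(logits, top_k=2, z_weight=0.01):
    """logits: (R, E) router logits for R regions and E experts."""
    n_regions, n_experts = logits.shape
    # Numerically stable log-sum-exp and softmax over experts.
    m = logits.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))          # (R,)
    probs = np.exp(logits - lse[:, None])                            # (R, E)
    # f_e: fraction of top-k assignments that land on expert e.
    top_idx = np.argsort(logits, axis=1)[:, -top_k:]
    counts = np.bincount(top_idx.ravel(), minlength=n_experts)
    frac = counts / counts.sum()
    mean_prob = probs.mean(axis=0)                                   # P_e
    # Load balance: equals 1.0 when assignments and probabilities are both uniform.
    load_balance = n_experts * float(np.sum(frac * mean_prob))
    # KL(mean routing distribution || uniform).
    kl_uniform = float(np.sum(mean_prob * np.log(mean_prob * n_experts + 1e-12)))
    # Router z-loss: mean squared log-partition, which discourages very large logits.
    z_loss = float(np.mean(lse ** 2))
    return load_balance + kl_uniform + z_weight * z_loss

balanced = np.zeros((8, 4))              # router has no preference
collapsed = np.zeros((8, 4))
collapsed[:, :2] = 5.0                   # two experts take almost all the probability mass
print(routing_aux_loss(balanced), routing_aux_loss(collapsed))
```

The collapsed router scores higher on all three terms, which is the pressure that keeps all four specialists in use.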

Model details

Property	Value
Base model	`Qwen/Qwen3.5-VL-9B-Instruct`
Total parameters	8.82 B
Per-token embedding dim	1024
Max visual tokens / page	2048
Max text tokens	32 768
Similarity function	MaxSim (ColBERT / ColPali-style late interaction)
MoE specialists	4 latent + 1 shared dense
Top-k experts per region	2
Region size (visual chunking)	4 (so each region = 4 visual tokens)
Router placement	backbone layer −5
Routing aux losses	load balance + KL-uniform + 0.01 · router-z²
Weight precision (this release)	float32
License	Apache 2.0
Model size on disk	~33 GB
VRAM @ bf16 inference	~17 GB
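
The disk and VRAM rows above follow roughly from parameter count times bytes per parameter. The sketch below reproduces that arithmetic. It covers weights only (activations and any index are extra), and since the table does not state whether "GB" means 10^9 bytes or GiB, both are shown.

```python
PARAMS = 8.82e9   # total parameters, from the table above

def weight_bytes(n_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone."""
    return n_params * bytes_per_param

for name, width in (("float32", 4), ("bfloat16", 2)):
    b = weight_bytes(PARAMS, width)
    print(f"{name}: {b / 1e9:.1f} GB = {b / 2**30:.1f} GiB")
# float32: 35.3 GB = 32.9 GiB
# bfloat16: 17.6 GB = 16.4 GiB
```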

Performance — ViDoRe v1 (English, nDCG@5, 10 tasks)

Per-task scores measured with the official mteb 2.12 library on the published weights, side-by-side with every Argus sibling for transparency.

Task	2B fp32	2B bf16	4B fp32	4B bf16	9B fp32 (this)	9B bf16
ArxivQA	0.9027	0.9027	0.9095	0.9126	0.9228	0.9217
DocVQA	0.6747	0.6747	0.6770	0.6779	0.6809	0.6826
InfoVQA	0.9497	0.9497	0.9463	0.9447	0.9426	0.9449
ShiftProject	0.9133	0.9133	0.9470	0.9346	0.9365	0.9298
SyntheticDocQA-AI	0.9963	0.9963	0.9963	0.9926	0.9963	0.9926
SyntheticDocQA-Energy	0.9726	0.9726	0.9789	0.9750	0.9732	0.9769
SyntheticDocQA-Government	0.9729	0.9729	0.9779	0.9779	0.9889	0.9889
SyntheticDocQA-Healthcare	0.9926	0.9926	0.9963	0.9963	0.9963	0.9926
TabFQuAD	0.9336	0.9336	0.9533	0.9544	0.9750	0.9724
TatDQA	0.8403	0.8403	0.8480	0.8485	0.8545	0.8567
Average	0.9149	0.9149	0.9230	0.9214	0.9267	0.9259

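The 9B model has the top score outright on 5 of 10 V1 tasks (ArxivQA, DocVQA, SyntheticDocQA-Government, TabFQuAD, TatDQA) and ties for the top on SyntheticDocQA-AI and SyntheticDocQA-Healthcare. The 4B sibling still wins on ShiftProject and SyntheticDocQA-Energy by about 0.006 to 0.011, which is at noise level. The 2B sibling has a small edge on InfoVQA (about 0.007), possibly a regularisation effect on smaller backbones for layout-driven QA.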
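
All scores in this section are nDCG@5. As a reminder of what that metric measures, here is a minimal pure-Python sketch. It assumes linear gain and a log2 rank discount, and it takes the relevance labels of the full ranked list so that the ideal ordering can be derived from it. The numbers in the table come from the mteb library, not from this function.

```python
import math

def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=5):
    """ranked_relevances: relevance label of each page, in the order the retriever ranked them."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([1, 0, 0, 0, 0]))     # relevant page ranked first  -> 1.0
print(ndcg_at_k([0, 0, 1, 0, 0]))     # relevant page ranked third  -> 0.5
print(ndcg_at_k([0, 0, 0, 0, 0, 1]))  # relevant page outside top 5 -> 0.0
```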

ViDoRe v1 — overall leaderboard comparison

Rank	Model	Params	dim	V1 avg
1	Argus-Colqwen3.5-9b-v0 (this, fp32)	8.8 B	1024	0.9267
1	nvidia/nemotron-vl-8b-v2	8.0 B	hidden	0.927
3	Argus-Colqwen3.5-4b-v0 (sibling, fp32)	4.0 B	1024	0.9230
4	nvidia/llama-nemotron-colembed-vl-3b-v2	3.0 B	hidden	0.917
5	nvidia/nemotron-colembed-vl-4b-v2	4.0 B	hidden	0.916
6	athrael-soju/colqwen3.5-4.5B-v3	4.5 B	320	0.915
7	Argus-Colqwen3.5-2b-v0 (sibling, fp32)	2.3 B	1024	0.9149
8	OpenSearch-AI/Ops-Colqwen3-4B	4.0 B	2560	0.914

(0.9267 vs 0.927 is a difference of 0.0003, within rounding and eval noise, so we report a tie. On V2 the margin over the next-best listed model is clearer, at about 0.005; see below.)

Performance — ViDoRe v2 (English, nDCG@5, 4 tasks)

Task	2B fp32	2B bf16	4B fp32	4B bf16	9B fp32 (this)	9B bf16
BioMedicalLectures	0.6499	0.6499	0.6438	0.6349	0.6619	0.6633
ESGReports-HighLevel	0.6936	0.6936	0.6991	0.7079	0.7905	0.7912
ESGReports	0.5988	0.5988	0.6218	0.6175	0.6760	0.6764
EconomicsReports	0.5186	0.5186	0.5980	0.5918	0.6377	0.6278
Average	0.6152	0.6152	0.6407	0.6380	0.6915	0.6897

The V2 jump from 4B to 9B (+0.05 on average) is the largest improvement we see across the Argus family — the bigger backbone helps on layout-heavy ESG reports + dense numeric economics pages where the 4B was visibly behind Ops-Colqwen3-4B.

ViDoRe v2 — overall context

Model	V2 avg
Argus-Colqwen3.5-9b-v0 (fp32, this)	0.6915
Ops-Colqwen3-4B (dim 2560)	0.687
TomoroAI/tomoro-colqwen3-embed-4b	0.660
Argus-Colqwen3.5-4b-v0 (sibling, fp32)	0.6407
Argus-Colqwen3.5-2b-v0 (sibling, fp32)	0.6152

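Argus 9B is the first sub-10B retriever to clear V2 = 0.69 while keeping the per-token embedding at 1024-d (vs Ops's 2560-d, which costs 2.5× more storage per token; per page the gap is about 1.6× because Ops emits fewer tokens, as the storage table below shows).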

ViDoRe v3

Not yet evaluated for this release. Numbers will be added in a follow-up commit once the v3 reproducer run completes.

Storage cost

Per-page storage for an indexed corpus, assuming bf16 (2-byte) token embeddings and the maximum token count per page:

Model	Tokens/page	Dim	Storage/page
Ops-Colqwen3-4B	1280	2560	6.6 MB
Argus-Colqwen3.5-9b-v0	2048	1024	4.2 MB
Argus-Colqwen3.5-4b-v0	2048	1024	4.2 MB
Argus-Colqwen3.5-2b-v0	2048	1024	4.2 MB
TomoroAI/tomoro-colqwen3-embed-4b	1280	320	0.8 MB

Per-page corpus storage is identical across the Argus family — the choice is inference cost (9B is the slowest) and GPU memory, not corpus size on disk.
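
The storage figures above are tokens × dim × 2 bytes. The sketch below reproduces them and scales to a corpus. The one-million-page corpus is a hypothetical example, and real pages may produce fewer than the maximum number of tokens.

```python
def bytes_per_page(tokens: int, dim: int, bytes_per_value: int = 2) -> int:
    """Multi-vector index size for one page (bf16 = 2 bytes per value)."""
    return tokens * dim * bytes_per_value

models = {
    "Ops-Colqwen3-4B": (1280, 2560),
    "Argus-Colqwen3.5-9b-v0": (2048, 1024),
    "TomoroAI/tomoro-colqwen3-embed-4b": (1280, 320),
}
for name, (tokens, dim) in models.items():
    print(f"{name}: {bytes_per_page(tokens, dim) / 1e6:.1f} MB/page")
# Ops-Colqwen3-4B: 6.6 MB/page
# Argus-Colqwen3.5-9b-v0: 4.2 MB/page
# TomoroAI/tomoro-colqwen3-embed-4b: 0.8 MB/page

PAGES = 1_000_000   # hypothetical corpus size
print(f"Argus, {PAGES:,} pages: {bytes_per_page(2048, 1024) * PAGES / 1e12:.2f} TB")
# Argus, 1,000,000 pages: 4.19 TB
```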

Installation

# Qwen3.5-VL is only in transformers 5.x
pip install "transformers>=5.0.0,<6.0.0"

# MTEB 2.12 ships transformers 4.57.6 by default — upgrade explicitly afterwards
pip install "mteb>=2.12,<3.0.0"
pip install -U "transformers>=5.0,<6.0"

# Optional: faster attention on Hopper / Ampere
pip install flash-attn==2.6.3 --no-build-isolation

After upgrading transformers, wipe the cached remote-code modules so the new ones load:

rm -rf ~/.cache/huggingface/modules/transformers_modules
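
Because installing mteb can pull transformers back to 4.x, it may be worth confirming which major version ended up in the environment before loading the model. The helper below is a small sketch using only the standard library. The function names are illustrative and not part of any package.

```python
from importlib.metadata import PackageNotFoundError, version

def major(version_string: str) -> int:
    """Return the major component of a version string such as '5.0.0'."""
    return int(version_string.split(".")[0])

def transformers_is_5x() -> bool:
    """True if an installed transformers distribution reports major version 5."""
    try:
        return major(version("transformers")) == 5
    except PackageNotFoundError:
        return False

print("transformers 5.x installed:", transformers_is_5x())
```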

Usage — text + image retrieval

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "DataScience-UIBK/Argus-Colqwen3.5-9b-v0"
DEVICE   = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE    = torch.bfloat16    # or torch.float32 for max precision

model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=DTYPE,
    attn_implementation="flash_attention_2",   # needs flash-attn and a CUDA GPU; otherwise use "sdpa" or None
    device_map=DEVICE,
).eval()

processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=2048,
)

queries = [
    "What is the company's revenue in 2019?",
    "How does the proposed model compare to baselines?",
]
documents = [
    Image.open("page_a.png").convert("RGB"),
    Image.open("page_b.png").convert("RGB"),
]

q_emb  = model.encode_queries(processor, queries)
d_emb  = model.encode_images(processor, documents)
scores = processor.score(q_emb, d_emb)
print(scores)
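
The card describes the similarity function as MaxSim late interaction, so the `processor.score` call above presumably computes something equivalent to the sketch below. This is an assumption about the remote code, not a copy of it. The sketch takes plain NumPy arrays and does not handle padding, batching, or normalisation.

```python
import numpy as np

def maxsim_scores(query_embs, doc_embs):
    """query_embs: list of (Tq, D) arrays; doc_embs: list of (Td, D) arrays -> (Q, N) scores."""
    scores = np.zeros((len(query_embs), len(doc_embs)))
    for i, q in enumerate(query_embs):
        for j, d in enumerate(doc_embs):
            sim = q @ d.T                          # (Tq, Td) token-to-token dot products
            scores[i, j] = sim.max(axis=1).sum()   # best page token per query token, summed
    return scores

query = np.array([[1.0, 0.0], [0.0, 1.0]])                 # two query tokens
page_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])    # matches both query tokens
page_b = np.array([[1.0, 0.0]])                            # matches only the first
print(maxsim_scores([query], [page_a, page_b]))            # [[2. 1.]]
```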

Reproduce the leaderboard ViDoRe results with MTEB

import mteb

m  = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-9b-v0")
v1 = mteb.get_benchmark("ViDoRe(v1)").tasks
v2 = mteb.get_benchmark("ViDoRe(v2)").tasks
mteb.MTEB(tasks=v1 + v2).run(m, encode_kwargs={"batch_size": 2})

A single H100 80 GB completes the full V1 + V2 run in roughly 6–8 hours for the 9B fp32 (about 2× the 4B runtime). Use batch_size=2 for safety; 4 may OOM on 80 GB once activations + KV cache stack up.

Reproduce on the official ViDoRe-benchmark library

pip install vidore-benchmark
vidore-benchmark evaluate-retriever \
  --model-class colqwen2 \
  --model-name DataScience-UIBK/Argus-Colqwen3.5-9b-v0 \
  --collection-name vidore-v1

Training

Setting	Value
Backbone	`Qwen/Qwen3.5-VL-9B-Instruct` (Apache-2.0)
Stage 1 — dense baseline	trains the standard ColPali head; serves as the teacher
Stage 2 — MoE balance warmup	gates frozen, no PEFT, short — only goal is to prevent expert collapse
Stage 3 — joint retrieval w/ distillation	PEFT on, gates trainable, KL distillation from stage-1 teacher (`distillation_weight=0.5`); train mix = `vdr_docmatix_full` (VDR1.5M + Docmatix)
LoRA rank	32 (folded into base for this release via `merge_and_unload()` in fp32)
Datasets	`vidore/colpali_train_set` + `llamaindex/vdr-multilingual-train` (subsets) + Docmatix-IR (in-domain)
Hardware	4 × NVIDIA H100 80 GB (zen4_0768_h100x4 partition, UIBK LEO5 cluster)
Optimiser	AdamW, lr = 5e-5 with linear warmup
Precision	bf16 forward / fp32 master + LoRA
Effective batch size	64
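
The stage-3 objective in the table combines a retrieval loss with KL distillation from the stage-1 teacher at `distillation_weight=0.5`. A minimal sketch of one way to combine them is shown below. The in-batch softmax contrastive loss, the KL direction (teacher to student), and the absence of a temperature are assumptions, since the card does not specify them.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def joint_loss(student_scores, teacher_scores, distillation_weight=0.5):
    """scores: (B, B) query-by-page score matrices; page i is the positive for query i."""
    s_logp = log_softmax(student_scores)
    t_logp = log_softmax(teacher_scores)
    retrieval = -np.mean(np.diag(s_logp))                               # in-batch contrastive loss
    kl = np.mean(np.sum(np.exp(t_logp) * (t_logp - s_logp), axis=1))    # KL(teacher || student)
    return retrieval + distillation_weight * kl

student = np.array([[5.0, 1.0], [0.0, 4.0]])   # toy MoE-model scores
teacher = np.array([[3.0, 2.0], [1.0, 2.0]])   # toy dense-baseline scores
print(joint_loss(student, teacher))
```

When the student's score distribution matches the teacher's, the KL term vanishes and only the retrieval loss remains.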

The merge step that produced this release was run in float32 throughout (merge_and_unload() on the LoRA adapter, then sharded to safetensors). The companion bf16 release ran the same merge in bfloat16 — see the bf16 sibling card.
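
For readers unfamiliar with what merging a LoRA adapter does, the sketch below shows the underlying arithmetic on a single toy layer: the low-rank update is added into the base weight so that the adapter branch is no longer needed at inference. The sizes and the `alpha` value are hypothetical (this card states only the rank, 32), and the sketch runs in float32 NumPy rather than through PEFT.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 8, 2     # toy sizes; the release uses rank 32
alpha = 2 * rank                # hypothetical lora_alpha, not stated in this card
scale = alpha / rank

W = rng.normal(size=(d_out, d_in)).astype(np.float32)   # frozen base weight
A = rng.normal(size=(rank, d_in)).astype(np.float32)    # LoRA down-projection
B = rng.normal(size=(d_out, rank)).astype(np.float32)   # LoRA up-projection
x = rng.normal(size=d_in).astype(np.float32)

adapter_out = W @ x + scale * (B @ (A @ x))   # base layer plus separate LoRA branch
W_merged = W + scale * (B @ A)                # update folded into the base weight
merged_out = W_merged @ x

print(np.max(np.abs(adapter_out - merged_out)))   # difference is float32 rounding only
```

Doing the addition in float32 means the sum `W + scale * (B @ A)` is rounded once at float32 precision. A bf16 merge rounds the same sum to far fewer mantissa bits, which is the motivation given above for keeping this release in fp32.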

Limitations

English-dominant; the multilingual training subset is small and we omit multilingual eval from this release.
4 experts × top-2 routing adds ~5 % to total inference latency vs the dense backbone (the LLM dominates total cost).
9B at bf16 needs ~17 GB VRAM just for weights — single-GPU inference requires a ≥ 24 GB GPU.
ViDoRe v3 numbers are pending; will be added once the public reproducer run finishes.

License

Apache 2.0, inherited from Qwen3.5-VL-9B-Instruct. You may use, modify, and redistribute this model commercially, with attribution.

Citation

@misc{argus2026,
  title  = {Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval},
  author = {DataScience-UIBK team},
  year   = {2026},
  url    = {https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-9b-v0},
}

Contact

Org: DataScience-UIBK, University of Innsbruck
Issues: open one on this repo's Community tab.
