OmniParser Florence-2 Fine-tuned Icon Captioner

Fine-tuned Florence-2-base-ft for UI icon captioning, used as the caption model in OmniParser v2.

This model extends the original OmniParser icon captioning weights with:

1,970 Font Awesome icons, each trained with rotating synonym labels and diverse augmentations
Custom Google Maps icons: Street View pegman, Street View rotation controls
Hard negative training to prevent false positives on similar-looking icons

What changed from the base OmniParser weights

Training data

Font Awesome Free: 1,970 icons across solid/regular/brands styles, each with multiple synonym labels from FA metadata (e.g., sleigh icon trained with labels "sleigh", "christmas", "sled", "santa", "reindeer")
70/30 weighted sampling: Primary icon name gets 70% of training steps, alternate synonyms get 30%
Label smoothing (0.1): Prevents overconfidence on any single synonym
Screenshot anchors: 91 icons cropped from real Google Maps screenshots, labeled with the original model's own captions (prevents vocabulary drift)
Hard negatives: 3 specific crops that were false positives in earlier training rounds
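
The 70/30 sampling and label smoothing items above can be illustrated with a minimal sketch. It is not the training code used for this model: the function names are hypothetical, the uniform split of the 30% among synonyms is an assumption, and the smoothing formula follows the common convention of spreading the smoothing mass uniformly over all classes.

```python
import random
from collections import Counter


def sample_label(primary, synonyms, rng, primary_weight=0.7):
    """Pick the caption target for one training step.

    The primary name is returned with probability `primary_weight`.
    Otherwise one synonym is drawn uniformly (an assumption: the source
    does not say how the remaining 30% is split among synonyms).
    """
    if not synonyms or rng.random() < primary_weight:
        return primary
    return rng.choice(synonyms)


def smoothed_target(num_classes, target_index, smoothing=0.1):
    """Label-smoothed distribution: (1 - s) on the target plus s / K on every class."""
    base = smoothing / num_classes
    dist = [base] * num_classes
    dist[target_index] += 1.0 - smoothing
    return dist


rng = random.Random(0)
synonyms = ["christmas", "sled", "santa", "reindeer"]
counts = Counter(sample_label("sleigh", synonyms, rng) for _ in range(10_000))
primary_share = counts["sleigh"] / 10_000
print(f"primary share: {primary_share:.3f}")
print(smoothed_target(4, 0))
```

In a real decoder the smoothing would apply over the token vocabulary at each output position; the four-class list here only keeps the numbers readable.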

Augmentations (training-time)

Color inversion (black/white swap)
Random foreground recoloring on white/gray backgrounds
White foreground on random colored backgrounds
Brightness, contrast, rotation, blur, tint
Random rescale (downscale then upscale for aliasing artifacts)
JPEG compression (quality 15-60 for SVG icons, 50-85 for photo crops)
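
As a rough illustration of how some of the augmentation parameters above might be drawn per sample, here is a small standard-library sketch. The helper names, the 50% inversion chance, and the `min_scale` bound are hypothetical; only the JPEG quality ranges come from the list above, and treating their bounds as inclusive is an assumption.

```python
import random

# JPEG quality ranges quoted above; inclusive bounds are an assumption.
JPEG_QUALITY = {"svg": (15, 60), "photo": (50, 85)}


def invert_pixel(rgb):
    """Color inversion for one 8-bit RGB pixel (black becomes white and vice versa)."""
    return tuple(255 - c for c in rgb)


def sample_jpeg_quality(source, rng):
    """Draw a JPEG quality for an SVG-rendered icon ("svg") or a photo crop ("photo")."""
    low, high = JPEG_QUALITY[source]
    return rng.randint(low, high)


def sample_rescale(size, rng, min_scale=0.5):
    """Pick the intermediate size for a downscale-then-upscale pass.

    The caller would resize to this size and then back to `size`.
    `min_scale` is a hypothetical lower bound, not a value from the source.
    """
    scale = rng.uniform(min_scale, 1.0)
    width, height = size
    return max(1, round(width * scale)), max(1, round(height * scale))


rng = random.Random(0)
plan = {
    "invert": rng.random() < 0.5,  # hypothetical 50% chance
    "jpeg_quality": sample_jpeg_quality("svg", rng),
    "intermediate_size": sample_rescale((64, 64), rng),
}
print(plan)
```

The resulting plan would then be applied with an image library of choice; the sketch stops at parameter selection so that it stays dependency-free.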

Architecture

Vision encoder: frozen (90.4M params) — preserves general icon feature extraction
Language decoder: trained (141.0M params) — learns new caption mappings
4 epochs, LR 2e-6, AdamW, batch size 8
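
The frozen-encoder setup above can be sketched as follows. This is a minimal illustration, not the actual training script: `FakeParam` stands in for a PyTorch parameter so the example runs without PyTorch, and the `vision_tower.` and `language_model.` name prefixes are assumptions about how the model's parameters are named.

```python
class FakeParam:
    """Stand-in for a torch.nn.Parameter so the sketch runs without PyTorch."""

    def __init__(self, count):
        self._count = count
        self.requires_grad = True

    def numel(self):
        return self._count


def freeze_by_prefix(named_params, prefix):
    """Disable gradients for parameters whose name starts with `prefix`.

    Returns (frozen_count, trainable_count) in number of scalar parameters.
    """
    frozen = trainable = 0
    for name, param in named_params:
        if name.startswith(prefix):
            param.requires_grad = False
        if param.requires_grad:
            trainable += param.numel()
        else:
            frozen += param.numel()
    return frozen, trainable


# Toy parameter list sized to match the counts quoted above.
params = [
    ("vision_tower.weight", FakeParam(90_400_000)),
    ("language_model.weight", FakeParam(141_000_000)),
]
frozen, trainable = freeze_by_prefix(params, "vision_tower.")
print(f"frozen: {frozen / 1e6:.1f}M, trainable: {trainable / 1e6:.1f}M")

# With PyTorch, the same helper would take model.named_parameters(), and the
# optimizer would receive only the trainable parameters, for example:
#   torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=2e-6)
```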

Benchmark: Google Maps screenshot (2570x2002, H100)

Metric	Before	After
Florence-2 latency	342ms	148ms
Total elements	316	316
Icons to Florence	71	71
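
For reference, the speedup and per-icon cost implied by the table can be worked out as below. The sketch assumes the Florence-2 latency is the total time for captioning all 71 icons, which the table does not state explicitly.

```python
before_ms, after_ms, icons = 342, 148, 71

speedup = before_ms / after_ms
per_icon_before = before_ms / icons
per_icon_after = after_ms / icons

print(f"speedup: {speedup:.2f}x")
print(f"per icon: {per_icon_before:.2f} ms -> {per_icon_after:.2f} ms")
```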

Key caption improvements

Icon	Before	After
Street View pegman	"A notification or alert."	"pegman"
Rotation control (BL)	"Refresh or reload."	"street view rotation"
Rotation control (BR)	"A painting or painting tool."	"street view rotation"
Location marker	"Location or location marker."	"Location"
User profile	"a user profile or account."	"user profile"
Record player	"a record player."	"record player"
Suitcase	"a suitcase or baggage."	"suitcase"

47 icon captions changed in total. Most changes are minor wording improvements (shorter, more precise). 3 new custom icon types were learned.

Usage

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# The processor comes from the base Florence-2 repo; the weights come from this repo.
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "proteus-computer-use/omniparser-finetuned",
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda")

# Caption a 64x64 icon crop ("icon.png" is a placeholder path)
icon_crop = Image.open("icon.png").convert("RGB")
# Cast floating-point inputs to float16 so they match the model weights
inputs = processor(images=icon_crop, text="<CAPTION>", return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=20,
    num_beams=1,
)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

Related links

OmniParser v2: full UI parsing pipeline
omniparser-fast: low-latency GPU server with this model
Florence-2: base model
