OmniParser Florence-2 Fine-tuned Icon Captioner
Fine-tuned Florence-2-base-ft for UI icon captioning, used as the caption model in OmniParser v2.
This model extends the original OmniParser icon captioning weights with:
- 1,970 Font Awesome icons with rotated synonym labels and diverse augmentations
- Custom Google Maps icons: Street View pegman, Street View rotation controls
- Hard negative training to prevent false positives on similar-looking icons
What changed from the base OmniParser weights
Training data
- Font Awesome Free: 1,970 icons across solid/regular/brands styles, each with multiple synonym labels from FA metadata (e.g., sleigh icon trained with labels "sleigh", "christmas", "sled", "santa", "reindeer")
- 70/30 weighted sampling: Primary icon name gets 70% of training steps, alternate synonyms get 30%
- Label smoothing (0.1): Prevents overconfidence on any single synonym
- Screenshot anchors: 91 icons from real Google Maps screenshots, labeled with the original model's own captions so that the fine-tuned vocabulary does not drift
- Hard negatives: 3 specific crops that were false positives in earlier training rounds
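The 70/30 sampling and the label smoothing above can be sketched in a few lines of plain Python. This is a minimal illustration, not the actual training code: the function names are hypothetical, the 30% share is assumed to be split evenly across synonyms, and the smoothing is assumed to be the standard uniform form.

```python
import random

def sample_label(primary, synonyms, rng, primary_weight=0.7):
    """Pick the caption target for one training step.

    Assumption: the remaining 30% is spread evenly over the synonyms.
    """
    if not synonyms or rng.random() < primary_weight:
        return primary
    return rng.choice(synonyms)

def smoothed_target(true_index, num_classes, epsilon=0.1):
    """Standard label smoothing: (1 - eps) on the true class plus eps spread uniformly."""
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if i == true_index else uniform
        for i in range(num_classes)
    ]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
synonyms = ["christmas", "sled", "santa", "reindeer"]
draws = [sample_label("sleigh", synonyms, rng) for _ in range(10_000)]
primary_share = draws.count("sleigh") / len(draws)
print(f"primary label share: {primary_share:.3f}")
```

With enough draws the primary name's share settles near 0.7, and the smoothed target keeps a small amount of probability mass on every other class, which is what discourages overconfidence in a single synonym.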
Augmentations (training-time)
- Color inversion (black/white swap)
- Random foreground recoloring on white/gray backgrounds
- White foreground on random colored backgrounds
- Brightness, contrast, rotation, blur, tint
- Random rescale (downscale then upscale for aliasing artifacts)
- JPEG compression (quality 15-60 for SVG icons, 50-85 for photo crops)
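Two of the augmentations above are simple enough to sketch without an imaging library: the black/white swap and the per-source JPEG quality range. The helper names and the dictionary keys below are hypothetical, and the real pipeline presumably operates on image objects rather than pixel lists.

```python
import random

# Quality ranges quoted in the list above; the key names are illustrative only.
JPEG_QUALITY_RANGES = {"svg_icon": (15, 60), "photo_crop": (50, 85)}

def sample_jpeg_quality(source_kind, rng):
    """Draw a JPEG quality for one sample (both bounds inclusive)."""
    low, high = JPEG_QUALITY_RANGES[source_kind]
    return rng.randint(low, high)

def invert_colors(pixels):
    """Swap black and white by inverting each 8-bit RGB channel."""
    return [(255 - r, 255 - g, 255 - b) for r, g, b in pixels]

rng = random.Random(0)
print(sample_jpeg_quality("svg_icon", rng))
print(invert_colors([(0, 0, 0), (255, 255, 255)]))
```

SVG-rendered icons get the harsher quality range, presumably because they start out perfectly clean and need stronger degradation to resemble icons cropped from real screenshots.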
Architecture
- Vision encoder: frozen (90.4M params), which preserves general icon feature extraction
- Language decoder: trained (141.0M params), which learns new caption mappings
- 4 epochs, LR 2e-6, AdamW, batch size 8
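As a quick sanity check on the parameter counts above, the arithmetic below derives the total size and the share of weights that receive gradient updates. It uses only the two figures quoted in the list; the variable names are illustrative.

```python
frozen_params_m = 90.4    # vision encoder, frozen (millions of parameters)
trained_params_m = 141.0  # language decoder, trained (millions of parameters)

total_params_m = frozen_params_m + trained_params_m
trainable_share = trained_params_m / total_params_m

print(f"total: {total_params_m:.1f}M, trainable: {trainable_share:.1%}")
# total: 231.4M, trainable: 60.9%
```

The total of roughly 231M is consistent with the approximately 0.23B size usually quoted for Florence-2-base.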
Benchmark: Google Maps screenshot (2570x2002, H100)
| Metric | Before | After |
|---|---|---|
| Florence-2 latency | 342ms | 148ms |
| Total elements | 316 | 316 |
| Icons to Florence | 71 | 71 |
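The latency row can be turned into a speedup and a rough per-icon cost with a little arithmetic. This sketch assumes the reported Florence-2 latency covers all 71 icon crops in the screenshot, which the table suggests but does not state outright.

```python
before_ms = 342.0
after_ms = 148.0
icons = 71  # icon crops sent to Florence-2 in this screenshot

speedup = before_ms / after_ms
reduction = 1.0 - after_ms / before_ms
per_icon_before_ms = before_ms / icons
per_icon_after_ms = after_ms / icons

print(f"speedup: {speedup:.2f}x, latency reduction: {reduction:.1%}")
print(f"per icon: {per_icon_before_ms:.2f} ms before, {per_icon_after_ms:.2f} ms after")
```

Because the element and icon counts are identical in both columns, the latency difference is not explained by a change in how many crops reach the caption model.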
Key caption improvements
| Icon | Before | After |
|---|---|---|
| Street View pegman | "A notification or alert." | "pegman" |
| Rotation control (BL) | "Refresh or reload." | "street view rotation" |
| Rotation control (BR) | "A painting or painting tool." | "street view rotation" |
| Location marker | "Location or location marker." | "Location" |
| User profile | "a user profile or account." | "user profile" |
| Record player | "a record player." | "record player" |
| Suitcase | "a suitcase or baggage." | "suitcase" |
In total, 47 icon captions changed. Most changes are minor wording improvements (shorter, more precise), and the model learned 3 new custom icon types.
Usage
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "proteus-computer-use/omniparser-finetuned",
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda")

# Caption a 64x64 icon crop. "icon.png" is a placeholder path; any RGB PIL image works.
icon_crop = Image.open("icon.png").convert("RGB")
# Cast pixel_values to float16 to match the model weights (input_ids stay integer).
inputs = processor(images=icon_crop, text="<CAPTION>", return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=20,
    num_beams=1,  # greedy decoding
)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
Related
- OmniParser v2: full UI parsing pipeline
- omniparser-fast: low-latency GPU server with this model
- Florence-2: base model