OmniParser Florence-2 Fine-tuned Icon Captioner

Fine-tuned Florence-2-base-ft for UI icon captioning, used as the caption model in OmniParser v2.

This model extends the original OmniParser icon captioning weights with:

  • 1,970 Font Awesome icons with rotated synonym labels and diverse augmentations
  • Custom Google Maps icons: Street View pegman, Street View rotation controls
  • Hard negative training to prevent false positives on similar-looking icons

What changed from the base OmniParser weights

Training data

  • Font Awesome Free: 1,970 icons across solid/regular/brands styles, each with multiple synonym labels from FA metadata (e.g., sleigh icon trained with labels "sleigh", "christmas", "sled", "santa", "reindeer")
  • 70/30 weighted sampling: Primary icon name gets 70% of training steps, alternate synonyms get 30%
  • Label smoothing (0.1): Prevents overconfidence on any single synonym
  • Screenshot anchors: 91 icons from real Google Maps screenshots with the original model's own captions (prevents vocabulary drift)
  • Hard negatives: 3 specific crops that were false positives in earlier training rounds
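
The 70/30 sampling and the 0.1 label smoothing described above can be sketched roughly as follows. This is a minimal illustration, not the actual training code: the function names `pick_label` and `smoothed_targets` are hypothetical, the choice of a uniform draw among synonyms is an assumption, and the smoothing formula shown is one common convention (the one used by PyTorch's cross-entropy loss), which may differ from what was used here.

```python
import random


def pick_label(primary, synonyms, rng, primary_weight=0.7):
    """Return the primary name with probability `primary_weight`,
    otherwise one of the synonyms chosen uniformly (assumed)."""
    if not synonyms or rng.random() < primary_weight:
        return primary
    return rng.choice(synonyms)


def smoothed_targets(num_classes, target_index, smoothing=0.1):
    """Label-smoothed target distribution: the smoothing mass is spread
    uniformly over all classes and the rest goes to the true class."""
    base = smoothing / num_classes
    targets = [base] * num_classes
    targets[target_index] += 1.0 - smoothing
    return targets


rng = random.Random(0)
labels = [
    pick_label("sleigh", ["christmas", "sled", "santa", "reindeer"], rng)
    for _ in range(10_000)
]
# Share of draws that used the primary name; should be close to 0.7.
primary_share = labels.count("sleigh") / len(labels)
print(round(primary_share, 2))
print(smoothed_targets(4, target_index=1))
```

With smoothing, the true token keeps most of the probability mass while every other token receives a small share, which is what discourages the decoder from becoming overconfident in a single synonym.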

Augmentations (training-time)

  • Color inversion (black/white swap)
  • Random foreground recoloring on white/gray backgrounds
  • White foreground on random colored backgrounds
  • Brightness, contrast, rotation, blur, tint
  • Random rescale (downscale then upscale for aliasing artifacts)
  • JPEG compression (quality 15-60 for SVG icons, 50-85 for photo crops)
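
A few of these augmentations could be written with Pillow along the following lines. This is a sketch under stated assumptions: the helper names, the 50% inversion probability, and the 0.5 to 1.0 rescale range are hypothetical, and only the JPEG quality ranges come from the list above. Pillow is imported lazily so that the parameter-sampling part runs without it.

```python
import io
import random

# JPEG quality ranges quoted above; inclusive bounds are an assumption.
JPEG_QUALITY = {"svg": (15, 60), "photo": (50, 85)}


def sample_jpeg_quality(source, rng):
    """Draw a JPEG quality for an 'svg' icon or a 'photo' crop."""
    low, high = JPEG_QUALITY[source]
    return rng.randint(low, high)


def invert_colors(img):
    """Color inversion (black/white swap) on an RGB copy of the image."""
    from PIL import ImageOps
    return ImageOps.invert(img.convert("RGB"))


def rescale_roundtrip(img, factor):
    """Downscale by `factor`, then upscale back to the original size."""
    from PIL import Image
    w, h = img.size
    small_size = (max(1, int(w * factor)), max(1, int(h * factor)))
    return img.resize(small_size, Image.BILINEAR).resize((w, h), Image.BILINEAR)


def jpeg_roundtrip(img, quality):
    """Encode to JPEG in memory and decode again to add compression artifacts."""
    from PIL import Image
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")


def augment(img, source, rng):
    """Apply a hypothetical subset of the augmentations to one image."""
    if rng.random() < 0.5:  # inversion probability is an assumption
        img = invert_colors(img)
    img = rescale_roundtrip(img, rng.uniform(0.5, 1.0))  # range is an assumption
    return jpeg_roundtrip(img, sample_jpeg_quality(source, rng))
```

Calling `augment(icon, "svg", random.Random(0))` on a PIL image returns an RGB image of the same size with the selected distortions applied.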

Architecture

  • Vision encoder: frozen (90.4M params), which preserves general icon feature extraction
  • Language decoder: trained (141.0M params), which learns new caption mappings
  • 4 epochs, LR 2e-6, AdamW, batch size 8
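
Freezing one part of a model while training the rest usually comes down to switching off gradients by parameter name. The helper below is a hedged sketch of that idea: `freeze_by_prefix` is a hypothetical function, and the `vision_tower.` prefix and the parameter names in the demo are assumptions about how the Florence-2 implementation names its modules, so check `model.named_parameters()` before relying on them.

```python
def freeze_by_prefix(named_parameters, frozen_prefixes):
    """Disable gradients for parameters whose name starts with any prefix.

    Works on any iterable of (name, param) pairs where `param` has a
    `requires_grad` attribute, such as `model.named_parameters()` in PyTorch.
    Returns the lists of frozen and still-trainable parameter names.
    """
    prefixes = tuple(frozen_prefixes)
    frozen, trainable = [], []
    for name, param in named_parameters:
        if name.startswith(prefixes):
            param.requires_grad = False
            frozen.append(name)
        elif param.requires_grad:
            trainable.append(name)
    return frozen, trainable


class FakeParam:
    """Stand-in for a torch parameter so the sketch runs without PyTorch."""

    def __init__(self):
        self.requires_grad = True


# Hypothetical parameter names, for illustration only.
demo = [
    ("vision_tower.convs.0.weight", FakeParam()),
    ("language_model.lm_head.weight", FakeParam()),
]
frozen, trainable = freeze_by_prefix(demo, ["vision_tower."])
print(frozen, trainable)

# With a real model (untested sketch, prefix is an assumption):
# import torch
# freeze_by_prefix(model.named_parameters(), ["vision_tower."])
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=2e-6
# )
```

Passing only the trainable parameters to the optimizer keeps the frozen encoder out of the update step entirely.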

Benchmark: Google Maps screenshot (2570x2002, H100)

Metric             | Before | After
Florence-2 latency | 342 ms | 148 ms
Total elements     | 316    | 316
Icons to Florence  | 71     | 71
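
For reference, the relative change implied by these figures can be worked out directly. The per-icon number below simply divides total latency by the icon count, which is an assumption of convenience, since batched inference does not necessarily spend equal time on each icon.

```python
before_ms, after_ms, icons = 342, 148, 71

speedup = before_ms / after_ms
reduction_pct = 100 * (before_ms - after_ms) / before_ms
per_icon_after_ms = after_ms / icons  # naive average, see note above

print(
    f"{speedup:.2f}x faster, {reduction_pct:.1f}% lower latency, "
    f"~{per_icon_after_ms:.2f} ms per icon"
)
# prints: 2.31x faster, 56.7% lower latency, ~2.08 ms per icon
```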

Key caption improvements

Icon                  | Before                         | After
Street View pegman    | "A notification or alert."     | "pegman"
Rotation control (BL) | "Refresh or reload."           | "street view rotation"
Rotation control (BR) | "A painting or painting tool." | "street view rotation"
Location marker       | "Location or location marker." | "Location"
User profile          | "a user profile or account."   | "user profile"
Record player         | "a record player."             | "record player"
Suitcase              | "a suitcase or baggage."       | "suitcase"

In total, 47 icon captions changed. Most changes are minor wording improvements (shorter, more precise), and the model learned 3 new custom icon types.

Usage

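import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# The processor is loaded from the base Florence-2 repo; the weights come from this fine-tune.
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "proteus-computer-use/omniparser-finetuned",
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda")

# Caption a 64x64 icon crop ("icon.png" is a placeholder path)
icon_crop = Image.open("icon.png").convert("RGB")
# Move inputs to the GPU and cast pixel_values to float16 to match the model weights
inputs = processor(images=icon_crop, text="<CAPTION>", return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=20,
    num_beams=1,  # greedy decoding
)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]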
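
When a screenshot yields many icon crops (71 in the benchmark above), captioning them in batches is usually faster than one call per crop. The helper below is a minimal sketch of the batching loop only: `caption_in_batches` and the default batch size of 32 are hypothetical, and the model call is passed in as a function so the loop can be tried without a GPU. The commented `caption_batch` shows how the processor and `generate` calls from the snippet above might be wrapped; it is untested.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def caption_in_batches(
    crops: Sequence[T],
    caption_batch: Callable[[Sequence[T]], List[str]],
    batch_size: int = 32,  # hypothetical default, tune for your GPU
) -> List[str]:
    """Split `crops` into chunks and collect captions in the original order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    captions: List[str] = []
    for start in range(0, len(crops), batch_size):
        captions.extend(caption_batch(crops[start:start + batch_size]))
    return captions


# Possible wrapper around the model (untested sketch):
# def caption_batch(batch):
#     inputs = processor(
#         images=list(batch),
#         text=["<CAPTION>"] * len(batch),
#         return_tensors="pt",
#     ).to("cuda", torch.float16)
#     ids = model.generate(
#         input_ids=inputs["input_ids"],
#         pixel_values=inputs["pixel_values"],
#         max_new_tokens=20,
#         num_beams=1,
#     )
#     return [c.strip() for c in processor.batch_decode(ids, skip_special_tokens=True)]

# Stand-in captioner so the loop can be exercised without the model.
demo_captions = caption_in_batches(
    list(range(5)), lambda batch: [f"icon-{i}" for i in batch], batch_size=2
)
print(demo_captions)
```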
