# PDF-OCR-RL: Qwen3-VL-2B SFT + GRPO (Best Model)

Fine-tuned Qwen3-VL-2B-Instruct for PDF-to-markdown conversion using a two-stage pipeline: Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO).

This is a LoRA adapter (r=32, alpha=64, 2.18% trainable parameters). Load it on top of the base model using PEFT.

## Evaluation Results

Evaluated on 20 held-out test samples from `blazeofchi/pdf-ocr-rl-dataset`. All inference was run in bf16 on an NVIDIA A40.

| Metric | Base Model | This Model | Delta |
|---|---|---|---|
| Heading Precision | 0.8550 | 0.9300 | +7.5% |
| Heading F1 | 0.8400 | 0.8943 | +5.4% |
| Code Block Similarity | 0.5775 | 0.7571 | +18.0% |
| Code Block Count Match | 0.3333 | 0.5150 | +18.2% |
| Word Precision | 0.7557 | 0.7903 | +3.5% |
| Word F1 | 0.7151 | 0.7308 | +1.6% |
| Edit Distance | 0.7525 | 0.7346 | -1.8% |
| Table Count Match | 1.0000 | 0.9500 | -5.0% |

The small edit-distance regression is expected: the fine-tuned model generates better-structured markdown (more headings, better code blocks), which differs at the character level from the reference. Word F1 and the structural metrics are more meaningful for judging document-conversion quality.
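The effect above can be illustrated with simple metric stand-ins (these are assumptions for illustration, not the exact evaluation code): adding markdown structure lowers character-level similarity even when every word of the reference is preserved.

```python
import difflib
import re

def char_similarity(a, b):
    # difflib's ratio as a stand-in for Levenshtein similarity.
    return difflib.SequenceMatcher(None, a, b).ratio()

def word_f1(pred, ref):
    # Word-level F1 over alphanumeric tokens, ignoring markup characters
    # (an assumption about how the word metric tokenizes).
    ps = set(re.findall(r"[A-Za-z0-9']+", pred))
    rs = set(re.findall(r"[A-Za-z0-9']+", ref))
    if not ps or not rs:
        return float(ps == rs)
    p, r = len(ps & rs) / len(ps), len(ps & rs) / len(rs)
    return 2 * p * r / (p + r) if p + r else 0.0

ref = "Intro Setup requires Python. Run make build."
pred = "# Intro\n\n## Setup\n\nrequires Python. Run make build."

assert word_f1(pred, ref) == 1.0          # every word preserved
assert char_similarity(pred, ref) < 1.0   # added structure changes the characters
```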

## Training Pipeline

### Stage 1: SFT Warm-up (100 steps)

Teaches the model the image-to-markdown mapping using supervised examples.

| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Batch size | 2 |
| Framework | Unsloth + TRL `SFTTrainer` |
| Loss curve | 1.295 → 0.78 |
| Grad norm | ~1.85 (strong learning signal) |
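A minimal sketch of the Stage 1 setup, assuming Unsloth's vision API (`FastVisionModel`, `UnslothVisionDataCollator`) and TRL's `SFTTrainer`; hyperparameters come from the table above, while the dataset variable and any kwargs not listed there are assumptions.

```python
from unsloth import FastVisionModel
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

# Load the base VLM and attach a fresh LoRA adapter (r/alpha from the
# Technical Details section).
model, processor = FastVisionModel.from_pretrained(
    "unsloth/Qwen3-VL-2B-Instruct", load_in_4bit=False
)
model = FastVisionModel.get_peft_model(
    model, r=32, lora_alpha=64, lora_dropout=0.0
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,  # image + markdown pairs (assumed variable)
    # Completion-only masking: loss is computed on the assistant reply,
    # not on user/system tokens (see Key Learnings, item 2).
    data_collator=UnslothVisionDataCollator(model, processor),
    args=SFTConfig(
        learning_rate=2e-5,
        per_device_train_batch_size=2,
        max_steps=100,
        bf16=True,
        remove_unused_columns=False,  # keep image columns (assumption)
    ),
)
trainer.train()
```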

### Stage 2: GRPO Refinement (100 steps)

Optimizes a composite reward function using group relative policy optimization.

| Parameter | Value |
|---|---|
| Learning rate | 5e-6 |
| Num generations | 4 per prompt |
| Max completion length | 1024 tokens |
| Optimizer | AdamW 8-bit |
| Framework | Unsloth + TRL `GRPOTrainer` |
| Reward curve | 0.66 → 0.74 |
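The Stage 2 setup might look roughly like the following, assuming TRL's `GRPOConfig`/`GRPOTrainer`; the hyperparameters come from the table above, while `model`, `reward_fn`, and `train_dataset` are placeholders.

```python
from trl import GRPOTrainer, GRPOConfig

config = GRPOConfig(
    learning_rate=5e-6,
    num_generations=4,          # completions sampled per prompt
    max_completion_length=1024,
    max_prompt_length=4096,     # must cover the image tokens (see Key Learnings)
    optim="adamw_8bit",
    max_steps=100,
    bf16=True,
)

trainer = GRPOTrainer(
    model=model,                 # the SFT checkpoint, loaded trainable
    reward_funcs=[reward_fn],    # the weighted composite reward
    train_dataset=train_dataset,
    args=config,
)
trainer.train()
```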

Reward weights:

| Component | Weight | What it measures |
|---|---|---|
| `edit_distance` | 0.4 | Character-level Levenshtein similarity |
| `reading_order` | 0.25 | Correct ordering of content blocks |
| `heading` | 0.2 | Heading detection precision and recall |
| `structural` | 0.15 | Markdown structure validity |
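An illustrative sketch of how such a composite reward could be assembled from the weights above. The component implementations here (difflib ratio for edit distance, a toy set-based heading F1, constants for the other two) are stand-ins, not the published reward code.

```python
import difflib

WEIGHTS = {"edit_distance": 0.40, "reading_order": 0.25,
           "heading": 0.20, "structural": 0.15}

def edit_distance_reward(pred, ref):
    # Character-level similarity in [0, 1].
    return difflib.SequenceMatcher(None, pred, ref).ratio()

def heading_reward(pred, ref):
    # F1 over the set of markdown heading lines (toy stand-in).
    ph = {l.strip() for l in pred.splitlines() if l.lstrip().startswith("#")}
    rh = {l.strip() for l in ref.splitlines() if l.lstrip().startswith("#")}
    if not ph or not rh:
        return float(ph == rh)
    p, r = len(ph & rh) / len(ph), len(ph & rh) / len(rh)
    return 2 * p * r / (p + r) if p + r else 0.0

def composite_reward(pred, ref, reading_order=1.0, structural=1.0):
    # Weighted sum; reading_order and structural scores are passed in
    # as placeholders here.
    return (WEIGHTS["edit_distance"] * edit_distance_reward(pred, ref)
            + WEIGHTS["reading_order"] * reading_order
            + WEIGHTS["heading"] * heading_reward(pred, ref)
            + WEIGHTS["structural"] * structural)

# A perfect prediction scores 1.0 under these stand-ins, since the
# weights sum to 1.
```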

## Technical Details

| Detail | Value |
|---|---|
| Base model | `unsloth/Qwen3-VL-2B-Instruct` (2.15B params) |
| LoRA rank (r) | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.0 |
| Target modules | All linear layers (attention + MLP) |
| Trainable parameters | 2.18% of total |
| Precision | bf16 |
| Hardware | NVIDIA A40 48GB (RunPod) |
| Training time | ~1.5 hours total |
| Training cost | ~$2 |
| Dataset | 500 train / 20 test from `blazeofchi/pdf-ocr-rl-dataset` |
| PEFT version | 0.18.1 |
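For reference, the adapter settings above correspond roughly to the following PEFT config. The `target_modules` list is an assumption spelled out from "all linear layers"; the exact module names depend on the Qwen3-VL implementation.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.0,
    # Assumed names for the attention + MLP linear layers.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```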

## Key Learnings from Experiments

We ran four training configurations before arriving at this model. Key findings:

1. **GRPO alone fails on vision-language models.** Without SFT warm-up, GRPO produces near-zero gradients (`grad_norm` ≈ 4.7e-6) because the base model generates outputs that are too similar to each other (`reward_std` = 0.017). SFT first diversifies the model's outputs enough for GRPO to work.

2. **Unsloth's vision data collator is critical.** Manual training loops that train on the full conversation (including user/system tokens) are significantly less effective than Unsloth's completion-only masking via `UnslothVisionDataCollator`. Our manual v3 training showed a -5.5% heading F1 regression vs. this model's +5.4% improvement.

3. **Character-level edit distance is too strict.** Levenshtein ratio penalizes any reformatting even when it is semantically correct. Word-level F1 better captures content preservation.

4. **Vision models need special GRPO config.** `max_prompt_length` must be ≥4096 (the default of 1024 truncates image tokens). `PeftModel.from_pretrained()` needs `is_trainable=True` when loading for continued training.
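Finding 1 can be illustrated numerically. GRPO normalizes each reward against its group's mean and standard deviation, so when all completions in a group score identically, every advantage (and hence the policy gradient) collapses to zero. The reward numbers below are made up for illustration.

```python
def group_relative_advantages(rewards, eps=1e-4):
    # Group-relative advantage: (reward - group mean) / (group std + eps).
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

# Near-identical completions (base model before SFT): no learning signal.
flat = group_relative_advantages([0.70, 0.70, 0.70, 0.70])

# Diverse completions (after SFT warm-up): a usable learning signal.
diverse = group_relative_advantages([0.55, 0.80, 0.62, 0.71])

assert all(abs(a) < 1e-9 for a in flat)
assert any(abs(a) > 0.5 for a in diverse)
```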

## All Configurations Compared

| Model | edit_dist | heading_f1 | code_sim | word_f1 |
|---|---|---|---|---|
| Base (no fine-tuning) | 0.7526 | 0.8400 | 0.5775 | 0.7151 |
| SFT+GRPO v2 (this model) | 0.7346 | 0.8943 | 0.7571 | 0.7308 |
| GRPO-only (no SFT) | 0.7526 | 0.8400 | 0.5775 | 0.7151 |
| SFT+GRPO v3 (manual loop) | 0.7490 | 0.7852 | 0.5682 | 0.7155 |
| Extended SFT 200 steps | 0.7609 | 0.8067 | 0.5505 | 0.7170 |

## Usage

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from qwen_vl_utils import process_vision_info
from PIL import Image
import torch

# Load base model
base_model = Qwen3VLForConditionalGeneration.from_pretrained(
    "unsloth/Qwen3-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load this LoRA adapter
model = PeftModel.from_pretrained(
    base_model, "blazeofchi/pdf-ocr-rl-qwen3vl2b-sft-grpo"
)
processor = AutoProcessor.from_pretrained(
    "blazeofchi/pdf-ocr-rl-qwen3vl2b-sft-grpo"
)

# Prepare input
image = Image.open("page.png")
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Convert this PDF page to well-structured markdown."}
    ]}
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
result = processor.decode(
    output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(result)
```

## Other Models in This Series

| Model | Description |
|---|---|
| `blazeofchi/pdf-ocr-rl-qwen3vl2b-sft-grpo` | SFT + GRPO (this model, best) |
| `blazeofchi/pdf-ocr-rl-qwen3vl2b-sft-only` | SFT-only checkpoint (intermediate) |

## Citation

```bibtex
@misc{pdf-ocr-rl-2026,
  title={PDF-OCR-RL: Fine-tuning Vision-Language Models for PDF-to-Markdown with GRPO},
  author={Paras Sharma},
  year={2026},
  url={https://github.com/Parassharmaa/pdf-ocr-rl}
}
```

## License

Apache 2.0 (same as base model)
