# PDF-OCR-RL: Qwen3-VL-2B SFT + GRPO (Best Model)

Fine-tuned Qwen3-VL-2B-Instruct for PDF-to-markdown conversion using a two-stage pipeline: Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO).

This is a LoRA adapter (r=32, alpha=64, 2.18% trainable parameters). Load it on top of the base model using PEFT.

## Evaluation Results

Evaluated on 20 held-out test samples from `blazeofchi/pdf-ocr-rl-dataset`. All inference was run in bf16 on an NVIDIA A40.

| Metric | Base Model | This Model | Delta |
|---|---|---|---|
| Heading Precision | 0.8550 | 0.9300 | +7.5% |
| Heading F1 | 0.8400 | 0.8943 | +5.4% |
| Code Block Similarity | 0.5775 | 0.7571 | +18.0% |
| Code Block Count Match | 0.3333 | 0.5150 | +18.2% |
| Word Precision | 0.7557 | 0.7903 | +3.5% |
| Word F1 | 0.7151 | 0.7308 | +1.6% |
| Edit Distance | 0.7525 | 0.7346 | -1.8% |
| Table Count Match | 1.0000 | 0.9500 | -5.0% |

The small edit-distance regression is expected: the fine-tuned model generates better-structured markdown (more headings, better code blocks), which differs at the character level from the reference. Word F1 and the structural metrics are more meaningful for judging document-conversion quality.
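The effect above can be illustrated with simple metric stand-ins (these are assumptions for illustration, not the exact evaluation code): adding markdown structure lowers character-level similarity even when every word of the reference is preserved.

```python
import difflib
import re

def char_similarity(a, b):
    # difflib's ratio as a stand-in for Levenshtein similarity.
    return difflib.SequenceMatcher(None, a, b).ratio()

def word_f1(pred, ref):
    # Word-level F1 over alphanumeric tokens, ignoring markup characters
    # (an assumption about how the word metric tokenizes).
    ps = set(re.findall(r"[A-Za-z0-9']+", pred))
    rs = set(re.findall(r"[A-Za-z0-9']+", ref))
    if not ps or not rs:
        return float(ps == rs)
    p, r = len(ps & rs) / len(ps), len(ps & rs) / len(rs)
    return 2 * p * r / (p + r) if p + r else 0.0

ref = "Intro Setup requires Python. Run make build."
pred = "# Intro\n\n## Setup\n\nrequires Python. Run make build."

assert word_f1(pred, ref) == 1.0          # every word preserved
assert char_similarity(pred, ref) < 1.0   # added structure changes the characters
```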

## Training Pipeline

### Stage 1: SFT Warm-up (100 steps)

Teaches the model the image-to-markdown mapping using supervised examples.

| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Batch size | 2 |
| Framework | Unsloth + TRL `SFTTrainer` |
| Loss curve | 1.295 → 0.78 |
| Grad norm | ~1.85 (strong learning signal) |
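A minimal sketch of the Stage 1 setup, assuming Unsloth's vision API (`FastVisionModel`, `UnslothVisionDataCollator`) and TRL's `SFTTrainer`; hyperparameters come from the table above, while the dataset variable and any kwargs not listed there are assumptions.

```python
from unsloth import FastVisionModel
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

# Load the base VLM and attach a fresh LoRA adapter (r/alpha from the
# Technical Details section).
model, processor = FastVisionModel.from_pretrained(
    "unsloth/Qwen3-VL-2B-Instruct", load_in_4bit=False
)
model = FastVisionModel.get_peft_model(
    model, r=32, lora_alpha=64, lora_dropout=0.0
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,  # image + markdown pairs (assumed variable)
    # Completion-only masking: loss is computed on the assistant reply,
    # not on user/system tokens (see Key Learnings, item 2).
    data_collator=UnslothVisionDataCollator(model, processor),
    args=SFTConfig(
        learning_rate=2e-5,
        per_device_train_batch_size=2,
        max_steps=100,
        bf16=True,
        remove_unused_columns=False,  # keep image columns (assumption)
    ),
)
trainer.train()
```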

### Stage 2: GRPO Refinement (100 steps)

Optimizes a composite reward function using group relative policy optimization.

| Parameter | Value |
|---|---|
| Learning rate | 5e-6 |
| Num generations | 4 per prompt |
| Max completion length | 1024 tokens |
| Optimizer | AdamW 8-bit |
| Framework | Unsloth + TRL `GRPOTrainer` |
| Reward curve | 0.66 → 0.74 |
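The Stage 2 setup might look roughly like the following, assuming TRL's `GRPOConfig`/`GRPOTrainer`; the hyperparameters come from the table above, while `model`, `reward_fn`, and `train_dataset` are placeholders.

```python
from trl import GRPOTrainer, GRPOConfig

config = GRPOConfig(
    learning_rate=5e-6,
    num_generations=4,          # completions sampled per prompt
    max_completion_length=1024,
    max_prompt_length=4096,     # must cover the image tokens (see Key Learnings)
    optim="adamw_8bit",
    max_steps=100,
    bf16=True,
)

trainer = GRPOTrainer(
    model=model,                 # the SFT checkpoint, loaded trainable
    reward_funcs=[reward_fn],    # the weighted composite reward
    train_dataset=train_dataset,
    args=config,
)
trainer.train()
```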

Reward weights:

| Component | Weight | What it measures |
|---|---|---|
| `edit_distance` | 0.4 | Character-level Levenshtein similarity |
| `reading_order` | 0.25 | Correct ordering of content blocks |
| `heading` | 0.2 | Heading detection precision and recall |
| `structural` | 0.15 | Markdown structure validity |
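An illustrative sketch of how such a composite reward could be assembled from the weights above. The component implementations here (difflib ratio for edit distance, a toy set-based heading F1, constants for the other two) are stand-ins, not the published reward code.

```python
import difflib

WEIGHTS = {"edit_distance": 0.40, "reading_order": 0.25,
           "heading": 0.20, "structural": 0.15}

def edit_distance_reward(pred, ref):
    # Character-level similarity in [0, 1].
    return difflib.SequenceMatcher(None, pred, ref).ratio()

def heading_reward(pred, ref):
    # F1 over the set of markdown heading lines (toy stand-in).
    ph = {l.strip() for l in pred.splitlines() if l.lstrip().startswith("#")}
    rh = {l.strip() for l in ref.splitlines() if l.lstrip().startswith("#")}
    if not ph or not rh:
        return float(ph == rh)
    p, r = len(ph & rh) / len(ph), len(ph & rh) / len(rh)
    return 2 * p * r / (p + r) if p + r else 0.0

def composite_reward(pred, ref, reading_order=1.0, structural=1.0):
    # Weighted sum; reading_order and structural scores are passed in
    # as placeholders here.
    return (WEIGHTS["edit_distance"] * edit_distance_reward(pred, ref)
            + WEIGHTS["reading_order"] * reading_order
            + WEIGHTS["heading"] * heading_reward(pred, ref)
            + WEIGHTS["structural"] * structural)

# A perfect prediction scores 1.0 under these stand-ins, since the
# weights sum to 1.
```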

## Technical Details

| Detail | Value |
|---|---|
| Base model | `unsloth/Qwen3-VL-2B-Instruct` (2.15B params) |
| LoRA rank (r) | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.0 |
| Target modules | All linear layers (attention + MLP) |
| Trainable parameters | 2.18% of total |
| Precision | bf16 |
| Hardware | NVIDIA A40 48GB (RunPod) |
| Training time | ~1.5 hours total |
| Training cost | ~$2 |
| Dataset | 500 train / 20 test from `blazeofchi/pdf-ocr-rl-dataset` |
| PEFT version | 0.18.1 |
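For reference, the adapter settings above correspond roughly to the following PEFT config. The `target_modules` list is an assumption spelled out from "all linear layers"; the exact module names depend on the Qwen3-VL implementation.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.0,
    # Assumed names for the attention + MLP linear layers.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```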

## Key Learnings from Experiments

We ran four training configurations before arriving at this model. Key findings:

1. **GRPO alone fails on vision-language models.** Without SFT warm-up, GRPO produces near-zero gradients (`grad_norm` ≈ 4.7e-6) because the base model generates outputs that are too similar to each other (`reward_std` = 0.017). SFT first diversifies the model's outputs enough for GRPO to work.

2. **Unsloth's vision data collator is critical.** Manual training loops that train on the full conversation (including user/system tokens) are significantly less effective than Unsloth's completion-only masking via `UnslothVisionDataCollator`. Our manual v3 training showed a -5.5% heading F1 regression vs. this model's +5.4% improvement.

3. **Character-level edit distance is too strict.** Levenshtein ratio penalizes any reformatting even when it is semantically correct. Word-level F1 better captures content preservation.

4. **Vision models need special GRPO config.** `max_prompt_length` must be ≥4096 (the default of 1024 truncates image tokens). `PeftModel.from_pretrained()` needs `is_trainable=True` when loading for continued training.
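Finding 1 can be illustrated numerically. GRPO normalizes each reward against its group's mean and standard deviation, so when all completions in a group score identically, every advantage (and hence the policy gradient) collapses to zero. The reward numbers below are made up for illustration.

```python
def group_relative_advantages(rewards, eps=1e-4):
    # Group-relative advantage: (reward - group mean) / (group std + eps).
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

# Near-identical completions (base model before SFT): no learning signal.
flat = group_relative_advantages([0.70, 0.70, 0.70, 0.70])

# Diverse completions (after SFT warm-up): a usable learning signal.
diverse = group_relative_advantages([0.55, 0.80, 0.62, 0.71])

assert all(abs(a) < 1e-9 for a in flat)
assert any(abs(a) > 0.5 for a in diverse)
```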

## All Configurations Compared

| Model | edit_dist | heading_f1 | code_sim | word_f1 |
|---|---|---|---|---|
| Base (no fine-tuning) | 0.7526 | 0.8400 | 0.5775 | 0.7151 |
| SFT+GRPO v2 (this model) | 0.7346 | 0.8943 | 0.7571 | 0.7308 |
| GRPO-only (no SFT) | 0.7526 | 0.8400 | 0.5775 | 0.7151 |
| SFT+GRPO v3 (manual loop) | 0.7490 | 0.7852 | 0.5682 | 0.7155 |
| Extended SFT 200 steps | 0.7609 | 0.8067 | 0.5505 | 0.7170 |

## Usage

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from qwen_vl_utils import process_vision_info
from PIL import Image
import torch

# Load base model
base_model = Qwen3VLForConditionalGeneration.from_pretrained(
    "unsloth/Qwen3-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load this LoRA adapter
model = PeftModel.from_pretrained(
    base_model, "blazeofchi/pdf-ocr-rl-qwen3vl2b-sft-grpo"
)
processor = AutoProcessor.from_pretrained(
    "blazeofchi/pdf-ocr-rl-qwen3vl2b-sft-grpo"
)

# Prepare input
image = Image.open("page.png")
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Convert this PDF page to well-structured markdown."}
    ]}
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
result = processor.decode(
    output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(result)
```

## Other Models in This Series

| Model | Description |
|---|---|
| `blazeofchi/pdf-ocr-rl-qwen3vl2b-sft-grpo` | SFT + GRPO (this model, best) |
| `blazeofchi/pdf-ocr-rl-qwen3vl2b-sft-only` | SFT-only checkpoint (intermediate) |

## Citation

```bibtex
@misc{pdf-ocr-rl-2026,
  title={PDF-OCR-RL: Fine-tuning Vision-Language Models for PDF-to-Markdown with GRPO},
  author={Paras Sharma},
  year={2026},
  url={https://github.com/Parassharmaa/pdf-ocr-rl}
}
```

## License

Apache 2.0 (same as base model)
