PDF-OCR-RL: Qwen3-VL-2B SFT + GRPO (Best Model)
Fine-tuned Qwen3-VL-2B-Instruct for PDF-to-markdown conversion using a two-stage pipeline: Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO).
This is a LoRA adapter (r=32, alpha=64, 2.18% trainable parameters). Load it on top of the base model using PEFT.
Evaluation Results
Evaluated on 20 held-out test samples from blazeofchi/pdf-ocr-rl-dataset. All inference in bf16 on NVIDIA A40.
| Metric | Base Model | This Model | Delta (pp) |
|---|---|---|---|
| Heading Precision | 0.8550 | 0.9300 | +7.5 |
| Heading F1 | 0.8400 | 0.8943 | +5.4 |
| Code Block Similarity | 0.5775 | 0.7571 | +18.0 |
| Code Block Count Match | 0.3333 | 0.5150 | +18.2 |
| Word Precision | 0.7557 | 0.7903 | +3.5 |
| Word F1 | 0.7151 | 0.7308 | +1.6 |
| Edit Distance | 0.7525 | 0.7346 | -1.8 |
| Table Count Match | 1.0000 | 0.9500 | -5.0 |
The small edit-distance regression is expected: the fine-tuned model generates better-structured markdown (more headings, better code blocks), which differs at the character level from the reference even when the content is preserved. Word F1 and the structural metrics are more meaningful for evaluating document-conversion quality.
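For intuition, the word-level metrics above can be sketched as a bag-of-words comparison. This is a minimal hypothetical reimplementation, not the project's exact evaluation code (which lives in the linked repo); `word_prf` is an illustrative name.

```python
from collections import Counter

def word_prf(prediction: str, reference: str) -> tuple[float, float, float]:
    """Bag-of-words precision, recall, and F1 between two markdown strings."""
    pred, ref = Counter(prediction.split()), Counter(reference.split())
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    overlap = sum((pred & ref).values())  # multiset intersection of word counts
    p = overlap / sum(pred.values())
    r = overlap / sum(ref.values())
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Unlike character-level edit distance, this score is unaffected by whitespace or line-break reformatting, which is why it tracks content preservation more faithfully.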
Training Pipeline
Stage 1: SFT Warm-up (100 steps)
Teaches the model the image-to-markdown mapping using supervised examples.
| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Batch size | 2 |
| Framework | Unsloth + TRL SFTTrainer |
| Loss curve | 1.295 → 0.78 |
| Grad norm | ~1.85 (strong learning signal) |
Stage 2: GRPO Refinement (100 steps)
Optimizes a composite reward function using group relative policy optimization.
| Parameter | Value |
|---|---|
| Learning rate | 5e-6 |
| Num generations | 4 per prompt |
| Max completion length | 1024 tokens |
| Optimizer | AdamW 8-bit |
| Framework | Unsloth + TRL GRPOTrainer |
| Reward curve | 0.66 → 0.74 |
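The table above maps roughly onto TRL's `GRPOConfig` as follows. This is a sketch, not the exact training script: only the fields named in this card are shown, everything else is left at TRL defaults.

```python
from trl import GRPOConfig

# Sketch: GRPO-stage hyperparameters from the table above.
grpo_config = GRPOConfig(
    learning_rate=5e-6,
    num_generations=4,           # completions sampled per prompt
    max_completion_length=1024,  # token budget per sampled completion
    max_prompt_length=4096,      # must be >= 4096 so image tokens survive
    optim="adamw_8bit",          # 8-bit AdamW
)
```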
Reward weights:
| Component | Weight | What it measures |
|---|---|---|
| edit_distance | 0.4 | Character-level Levenshtein similarity |
| reading_order | 0.25 | Correct ordering of content blocks |
| heading | 0.2 | Heading detection precision and recall |
| structural | 0.15 | Markdown structure validity |
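The weighted combination can be sketched as below. The component scorers in the project are more involved; here `SequenceMatcher.ratio()` stands in for Levenshtein similarity, and the other three components are assumed to arrive pre-computed in `[0, 1]`.

```python
from difflib import SequenceMatcher

# Reward weights from the table above.
WEIGHTS = {"edit_distance": 0.4, "reading_order": 0.25,
           "heading": 0.2, "structural": 0.15}

def edit_distance_reward(pred: str, ref: str) -> float:
    # Character-level similarity ratio in [0, 1] (stand-in for Levenshtein ratio).
    return SequenceMatcher(None, pred, ref).ratio()

def composite_reward(components: dict[str, float]) -> float:
    """Weighted sum of per-component rewards, each expected in [0, 1]."""
    return sum(WEIGHTS[name] * score for name, score in components.items())
```

Because the weights sum to 1.0, the composite reward stays in `[0, 1]`, matching the 0.66 → 0.74 reward curve reported above.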
Technical Details
| Detail | Value |
|---|---|
| Base model | unsloth/Qwen3-VL-2B-Instruct (2.15B params) |
| LoRA rank (r) | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.0 |
| Target modules | All linear layers (attention + MLP) |
| Trainable parameters | 2.18% of total |
| Precision | bf16 |
| Hardware | NVIDIA A40 48GB (RunPod) |
| Training time | ~1.5 hours total |
| Training cost | ~$2 |
| Dataset | 500 train / 20 test from blazeofchi/pdf-ocr-rl-dataset |
| PEFT version | 0.18.1 |
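For reference, the adapter settings in the table correspond roughly to a PEFT `LoraConfig` like the one below. The authoritative module list ships in the adapter's `adapter_config.json`; the module names here are the usual Qwen linear-layer names and are an assumption.

```python
from peft import LoraConfig

# Hypothetical reconstruction of the adapter config from the table above.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",    # attention projections
        "gate_proj", "up_proj", "down_proj",       # MLP projections
    ],
)
```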
Key Learnings from Experiments
We ran 4 different training configurations before arriving at this best model. Key findings:
**GRPO alone fails on vision-language models.** Without SFT warm-up, GRPO produces near-zero gradients (grad_norm ≈ 4.7e-6) because the base model's generations are too similar to each other (reward_std = 0.017), leaving no relative advantage signal within each group. SFT first diversifies the model's outputs enough for GRPO to work.

**Unsloth's vision data collator is critical.** Manual training loops that compute loss on the full conversation (including user/system tokens) are significantly less effective than Unsloth's completion-only masking via `UnslothVisionDataCollator`. Our manual v3 training showed a -5.5% heading F1 regression vs. this model's +5.4% improvement.

**Character-level edit distance is too strict.** Levenshtein ratio penalizes any reformatting even when it is semantically correct. Word-level F1 better captures content preservation quality.

**Vision models need special GRPO config.** `max_prompt_length` must be ≥ 4096 (the default of 1024 truncates image tokens), and `PeftModel.from_pretrained()` needs `is_trainable=True` when loading an adapter for continued training.
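The adapter-reloading gotcha from the last point, sketched in code. This assumes PEFT is installed; `base_model` and the adapter path are placeholders, not real values from this project.

```python
from peft import PeftModel

# Reload the SFT adapter as trainable before starting the GRPO stage.
# Without is_trainable=True, PEFT loads the adapter frozen (inference mode)
# and GRPO silently has nothing to update.
# `base_model` and the path below are placeholders.
model = PeftModel.from_pretrained(
    base_model,
    "path/to/sft-adapter",
    is_trainable=True,
)
```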
All Configurations Compared
| Model | edit_dist | heading_f1 | code_sim | word_f1 |
|---|---|---|---|---|
| Base (no fine-tuning) | 0.7526 | 0.8400 | 0.5775 | 0.7151 |
| SFT+GRPO v2 (this model) | 0.7346 | 0.8943 | 0.7571 | 0.7308 |
| GRPO-only (no SFT) | 0.7526 | 0.8400 | 0.5775 | 0.7151 |
| SFT+GRPO v3 (manual loop) | 0.7490 | 0.7852 | 0.5682 | 0.7155 |
| Extended SFT 200 steps | 0.7609 | 0.8067 | 0.5505 | 0.7170 |
Usage
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from qwen_vl_utils import process_vision_info
from PIL import Image
import torch

# Load base model
base_model = Qwen3VLForConditionalGeneration.from_pretrained(
    "unsloth/Qwen3-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load this LoRA adapter
model = PeftModel.from_pretrained(
    base_model, "blazeofchi/pdf-ocr-rl-qwen3vl2b-sft-grpo"
)
processor = AutoProcessor.from_pretrained(
    "blazeofchi/pdf-ocr-rl-qwen3vl2b-sft-grpo"
)

# Prepare input
image = Image.open("page.png")
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Convert this PDF page to well-structured markdown."},
    ]}
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate, then decode only the newly generated tokens
output = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
result = processor.decode(
    output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(result)
```
Other Models in This Series
| Model | Description |
|---|---|
| blazeofchi/pdf-ocr-rl-qwen3vl2b-sft-grpo | SFT + GRPO (this model, best) |
| blazeofchi/pdf-ocr-rl-qwen3vl2b-sft-only | SFT-only checkpoint (intermediate) |
Citation
```bibtex
@misc{pdf-ocr-rl-2026,
  title={PDF-OCR-RL: Fine-tuning Vision-Language Models for PDF-to-Markdown with GRPO},
  author={Paras Sharma},
  year={2026},
  url={https://github.com/Parassharmaa/pdf-ocr-rl}
}
```
License
Apache 2.0 (same as base model)