I wanted to try out GRPO and LoRA on a vision-language model, just to see how far you can push the capabilities of a small VLM with minimal compute. The task I picked was PDF-to-markdown conversion: show the model an image of a PDF page and have it output well-structured markdown with headings, tables, code blocks, and correct reading order.
Here's what happened across 22 commits, 3 training configurations, and about $4 on RunPod.
Building the Dataset
Before training anything, I needed pairs of PDF page images and their corresponding markdown, so I built a pipeline that collects real markdown from diverse sources, renders it to styled PDFs, and then converts those PDFs into images.
Collecting Markdown
The collection pipeline pulls from four different sources to get a good variety of document styles and content types:
| Source | What | Count |
|---|---|---|
| GitHub repos | READMEs, docs, guides from 28 EN + 13 JA repos (React, PyTorch, Kubernetes, etc.) | ~600 files |
| Wikipedia | Programming/CS articles in English and Japanese | 45 articles |
| arXiv papers | LaTeX papers from 10 CS categories, converted to markdown via pandoc | 100 papers |
| GitHub LaTeX | Textbooks, templates, theses converted via pandoc | 15 repos |
Each file was filtered for quality, requiring a minimum of 200 characters and the presence of markdown markers like headings, tables, code fences, or lists, with a maximum of 50k characters per file to keep things manageable.
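The filter can be sketched as a small predicate. The thresholds are the ones above; the marker regexes and the function name are my own approximation, not the project's actual code:

```python
import re

def passes_quality_filter(md: str) -> bool:
    """Length bounds plus at least one structural markdown marker.
    Thresholds (200 chars min, 50k max) are from the post; the
    regexes are an approximation of "markdown markers"."""
    if not (200 <= len(md) <= 50_000):
        return False
    markers = [
        r"^#{1,6} ",                # headings
        r"^\|.*\|",                 # table rows
        r"^```",                    # code fences
        r"^\s*[-*+] |^\s*\d+\. ",   # bullet or numbered lists
    ]
    return any(re.search(p, md, re.MULTILINE) for p in markers)
```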
Rendering to PDF Images
The rendering pipeline converts markdown to HTML with styled CSS, then to PDF, and finally to PNG images at 150 DPI using PyMuPDF. To make the model more robust to visual variation, I randomized the rendering parameters per file, varying font sizes between 11px, 12px, and 14px, margins between 20-25mm, and using different font families for English (Georgia, Times) and Japanese (Hiragino Kaku Gothic Pro, Noto Sans CJK).
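The per-file randomization can be sketched like this; the parameter values are the ones above, but the function name and sampling scheme are my own guesses at the pipeline's shape:

```python
import random

EN_FONTS = ["Georgia", "Times New Roman"]
JA_FONTS = ["Hiragino Kaku Gothic Pro", "Noto Sans CJK JP"]

def sample_render_params(lang, seed=None):
    """Pick per-file rendering parameters. Values come from the post;
    the exact sampling logic is a hypothetical reconstruction."""
    rng = random.Random(seed)
    return {
        "font_size_px": rng.choice([11, 12, 14]),
        "margin_mm": rng.randint(20, 25),  # inclusive 20-25mm range
        "font_family": rng.choice(JA_FONTS if lang == "ja" else EN_FONTS),
    }
```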
The tricky part was page-level splitting, since a single markdown document might render to 5+ PDF pages but the model needs to know which portion of the markdown corresponds to each page image. I split the markdown at paragraph boundaries (double newlines) and tracked character offsets for each page as page_start_char to page_end_char, so each image pairs with its specific markdown chunk rather than the entire document.
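A minimal sketch of that splitter, assuming a fixed per-page character budget stands in for the real layout-driven page breaks (the actual pipeline derives page capacity from the rendered PDF):

```python
def split_pages(markdown, max_chars_per_page):
    """Greedy split at paragraph boundaries (double newlines), tracking
    character offsets so each page's chunk maps back into the source.
    max_chars_per_page is a hypothetical stand-in for real page layout."""
    pages = []
    page_start = 0
    cursor = 0
    for para in markdown.split("\n\n"):
        end = cursor + len(para)
        if end - page_start > max_chars_per_page and cursor > page_start:
            pages.append({"page_start_char": page_start, "page_end_char": cursor})
            page_start = cursor
        cursor = end + 2  # account for the "\n\n" separator
    pages.append({"page_start_char": page_start, "page_end_char": len(markdown)})
    return pages
```

Because only offsets are stored, concatenating the slices reconstructs the original document exactly, so no characters are lost at page boundaries.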
The final dataset came out to 500 training pairs and 20 test pairs, stratified by language so both English and Japanese are represented in the test set, and uploaded to HuggingFace as blazeofchi/pdf-ocr-rl-dataset.
The Model and Setup
I went with Qwen3-VL-2B-Instruct at 2.15 billion parameters, one of the smallest vision-language models available, running on a single NVIDIA A40 (48GB) on RunPod at $0.79/hour. I wrote a script that provisions the pod via the RunPod GraphQL API, installs dependencies, and sets up SSH access automatically.
For fine-tuning I used QLoRA with r=32 and alpha=64, making only 2.18% of the model's parameters trainable while keeping the rest frozen.
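For intuition on where a trainable fraction like 2.18% comes from: each adapted weight matrix of shape (d_out, d_in) stays frozen, and LoRA adds two small trainable matrices of rank r. A back-of-envelope sketch (the module shapes below are purely illustrative, not Qwen3-VL's actual dimensions):

```python
def lora_param_count(target_shapes, r):
    """Trainable parameters added by LoRA: for each frozen weight of
    shape (d_out, d_in), LoRA trains B (d_out x r) and A (r x d_in),
    i.e. r * (d_out + d_in) extra parameters."""
    return sum(r * (d_out + d_in) for d_out, d_in in target_shapes)

# Hypothetical example: four 2048x2048 projections in one layer
shapes = [(2048, 2048)] * 4
print(lora_param_count(shapes, r=32))  # 4 * 32 * 4096 = 524288
```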
The GRPO reward function measures four aspects of conversion quality:
| Component | Weight | What it measures |
|---|---|---|
| Edit distance | 0.4 | Character-level Levenshtein similarity to reference |
| Reading order | 0.25 | Content blocks in correct sequence |
| Heading accuracy | 0.2 | Heading detection precision and recall |
| Structural validity | 0.15 | Well-formed markdown structure |
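A sketch of how such a composite reward could be assembled. The weights match the table above; the edit-distance term uses difflib's ratio as a stand-in for Levenshtein similarity, and the other three components are reduced to crude proxies of what the real implementation measures:

```python
import difflib

WEIGHTS = {"edit_distance": 0.4, "reading_order": 0.25,
           "heading": 0.2, "structural": 0.15}

def composite_reward(pred, ref):
    """Weighted sum of four quality signals. All four components are
    simplified proxies; the project's actual scorers are richer."""
    edit = difflib.SequenceMatcher(None, pred, ref).ratio()

    # Heading overlap (Jaccard, as a crude proxy for precision/recall)
    pred_heads = [ln for ln in pred.splitlines() if ln.startswith("#")]
    ref_heads = [ln for ln in ref.splitlines() if ln.startswith("#")]
    heading = (len(set(pred_heads) & set(ref_heads))
               / max(len(set(pred_heads) | set(ref_heads)), 1))

    # Reading-order proxy: reference headings that appear in the
    # prediction should appear in the same relative order
    idx = [pred.find(h) for h in ref_heads if h in pred]
    order = 1.0 if idx == sorted(idx) else 0.0

    # Structural proxy: code fences must be balanced
    structural = 1.0 if pred.count("```") % 2 == 0 else 0.0

    return (WEIGHTS["edit_distance"] * edit
            + WEIGHTS["reading_order"] * order
            + WEIGHTS["heading"] * heading
            + WEIGHTS["structural"] * structural)
```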
Iterating on the Config
The training config evolved across commits as I hit problems and learned what worked:
| Change | Why |
|---|---|
| Qwen2.5-VL-3B to Qwen3-VL-2B | Smaller, faster, better Unsloth support |
| LoRA r=16 to r=32 | More capacity for the adapter to learn the task |
| 6 generations to 4 | Memory constraints on A40 |
| Added strip_thinking() | Qwen3 wraps output in `<think>` blocks, and the reward function was scoring thinking tokens instead of actual markdown |
| Added SFT warm-up | GRPO alone completely failed (the big one) |
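The strip_thinking() fix can be sketched in a few lines; a minimal version, assuming the helper only needs to remove closed `<think>` blocks before the output reaches the reward function:

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_thinking(text):
    """Drop <think>...</think> blocks so the reward scores only the
    emitted markdown. A minimal sketch; the real helper may also need
    to handle unclosed tags from truncated generations."""
    return THINK_RE.sub("", text).lstrip()
```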
The reward weights also changed between experiments, and this turned out to matter more than I expected:
| Component | v2 (best) | v3 (worse) |
|---|---|---|
| edit_distance | 0.4 | 0.6 |
| reading_order | 0.25 | 0.15 |
| heading | 0.2 | 0.15 |
| structural | 0.15 | 0.1 |
The v3 config weighted edit distance more heavily (0.6 vs 0.4), on the assumption that character fidelity was the most important signal, but it produced worse results across the board. The v2 weights gave more room to the structural and reading-order metrics, and that produced a model with a better grasp of markdown structure, which is what actually matters for document conversion.
Attempt 1: GRPO Alone (It Failed)
The first thing I tried was pure GRPO directly on the base model without any supervised warm-up, just giving it PDF images, letting it generate markdown, scoring the outputs with the composite reward function, and hoping the gradients would do their thing.
The result was that nothing changed at all. The gradient norm was 0.0000047, essentially zero, and the model wasn't learning anything.
The reason turned out to be fundamental to how GRPO works: it samples multiple outputs for the same input and compares their rewards to compute advantages, which means it needs variance in the outputs. But the base model, when shown a PDF image, generates outputs that are almost identical to each other, and the reward standard deviation across a group of 4 generations was only 0.017, meaning everything scored roughly the same and there was no signal to learn from.
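To see why low reward variance kills the signal, here is the group-relative baseline GRPO starts from (the full algorithm also normalizes by the group standard deviation); the reward values below are illustrative, not from the run:

```python
def group_advantages(rewards):
    """Mean-centered group rewards: the raw preference signal GRPO
    builds on before std-normalization."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Diverse group: a clear ranking to push toward
print(group_advantages([0.3, 0.5, 0.7, 0.9]))       # spread of roughly +/- 0.3
# Collapsed group, like the failed run's std-0.017 groups:
print(group_advantages([0.660, 0.655, 0.648, 0.672]))  # everything near zero
```

With a collapsed group, every generation looks equally good, so there is nothing to prefer and the policy gradient has essentially nothing to move on.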
Text models generate diverse reasoning chains for the same problem, which is why GRPO works well there. Vision models generating markdown from a specific image converge to nearly the same output every time, leaving GRPO with nothing to differentiate.
This was my first major learning: GRPO alone won't work when the model's outputs have no diversity.
Attempt 2: SFT + GRPO (It Worked)
The fix was obvious in hindsight: teach the model the task first with supervised fine-tuning, and then optimize with RL.
Stage 1, SFT warm-up (100 steps): show the model image-markdown pairs and train it supervised with Unsloth's SFTTrainer. This moved the weights enough that the model started generating diverse outputs, with loss dropping from 1.295 to 0.78 and a healthy gradient norm of 1.85.
Stage 2, GRPO refinement (100 steps): now the optimizer had something to work with, because the model's outputs varied enough to produce real reward differences, and the reward climbed from 0.66 to 0.74.
The gradient difference between the two approaches was 400,000x, which makes it clear that SFT warm-up isn't optional for GRPO on vision-language models.
Results
Evaluated on 20 held-out test samples with bf16 inference on A40:
| Metric | Base Model | Fine-tuned | Change (percentage points) |
|---|---|---|---|
| Heading Precision | 0.855 | 0.930 | +7.5 |
| Heading F1 | 0.840 | 0.894 | +5.4 |
| Code Block Similarity | 0.578 | 0.757 | +18.0 |
| Code Block Count | 0.333 | 0.515 | +18.2 |
| Word Precision | 0.756 | 0.790 | +3.5 |
| Word F1 | 0.715 | 0.731 | +1.6 |
| Edit Distance | 0.753 | 0.735 | -1.8 |
The edit distance dipped slightly, but that's expected and fine: the fine-tuned model generates better-structured markdown with proper headings and code blocks, which differs character-by-character from the reference even when it's semantically superior. Word-level F1, which measures content overlap rather than exact character alignment, confirms the improvement.
Attempt 3: More SFT (Diminishing Returns)
I also tried extending SFT to 200 steps (2x the original) to see if more supervised training alone would help. Edit distance improved slightly (+0.8 points), but heading F1 dropped 3.3 points and code similarity dropped 2.7, which suggests that more SFT without GRPO doesn't improve structural understanding; it just makes the model better at character-level copying.
The Full Comparison
| Config | edit_dist | heading_f1 | code_sim | word_f1 |
|---|---|---|---|---|
| Base (no training) | 0.753 | 0.840 | 0.578 | 0.715 |
| GRPO only | 0.753 | 0.840 | 0.578 | 0.715 |
| SFT + GRPO | 0.735 | 0.894 | 0.757 | 0.731 |
| Extended SFT (200 steps) | 0.761 | 0.807 | 0.551 | 0.717 |
Evaluation Metrics
One meta-learning from this project is that choosing the right evaluation metric matters as much as choosing the right training method:
| Metric | What it measures | Verdict |
|---|---|---|
| Levenshtein edit distance | Character-level similarity | Too strict, penalizes better formatting |
| Word F1 | Word-level overlap (precision + recall) | Better for measuring content quality |
| Heading F1 | Heading detection (precision + recall) | Good for structural evaluation |
| Code block similarity | Code content accuracy | Good for code-heavy documents |
Levenshtein ratio penalizes any reformatting even when the content is semantically correct, so a model that adds proper headings or code blocks will score lower on edit distance despite being better. Word-level F1 is more forgiving and better captures what actually matters for document conversion quality.
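A toy illustration of the gap between the two metrics, using difflib's ratio as a stand-in for normalized Levenshtein similarity and a simple bag-of-words F1 (the project's exact tokenization may differ). Reformatting plain text into a markdown table drags down the character-level score while the word-level score stays high:

```python
import difflib

def levenshtein_ratio(a, b):
    """difflib ratio as a stand-in for normalized Levenshtein similarity."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def word_f1(pred, ref):
    """F1 over bag-of-words overlap; a simplified variant."""
    p, r = set(pred.split()), set(ref.split())
    common = len(p & r)
    if not common:
        return 0.0
    prec, rec = common / len(p), common / len(r)
    return 2 * prec * rec / (prec + rec)

ref = "Name Age Bob 30"                      # flattened reference
pred = "| Name | Age |\n| Bob | 30 |"        # model adds table structure

# Character similarity drops sharply from the inserted pipes and newline,
# while word-level F1 stays high: all the content is still there
print(levenshtein_ratio(pred, ref))
print(word_f1(pred, ref))
```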
What I'd Do Differently
A larger test set would be the first priority, since 20 samples is enough to see trends but not enough for statistical significance, and 100-200 would give much more confidence in the results. The dataset is also entirely synthetic (rendered markdown to PDF to image), so real scanned documents with noise, varying layouts, and actual typography would be a harder and more useful test. And I didn't benchmark against existing OCR systems like Nougat, GOT-OCR, or olmOCR, which would provide useful context for how these results compare to the state of the art.
Models and Code
Everything is open:
| Resource | Link |
|---|---|
| Best model (SFT+GRPO) | blazeofchi/pdf-ocr-rl-qwen3vl2b-sft-grpo |
| SFT checkpoint | blazeofchi/pdf-ocr-rl-qwen3vl2b-sft-only |
| Dataset | blazeofchi/pdf-ocr-rl-dataset |
| Code | github.com/Parassharmaa/pdf-ocr-rl |
Total cost was ~$4 on RunPod, and the entire thing including dataset creation, 3 training runs, and evaluation took about 12 hours of GPU time on an A40.
The main takeaway is that GRPO is powerful but on vision-language models it needs SFT to bootstrap the process. Pure RL works when your model naturally generates diverse outputs, but when it doesn't, and vision models generating structured output from a specific image usually don't, you need supervised training to create that diversity first.
What's Next
Next I want to collect a much larger and more diverse dataset with real-world PDFs, not just synthetically rendered ones, and properly benchmark these fine-tuned models against established systems using the olmOCR-bench dataset to see where a small fine-tuned VLM actually stands in the broader landscape.