Miaow-Lab
/

RLVR-Linearity-Checkpoints

Text Generation

Model card Files Files and versions

RLVR-Linearity-Checkpoints / README.md

nielsr's picture

nielsr HF Staff

Add library_name and improve model card metadata

2c1d702 verified 4 months ago

|

2.1 kB

	---
	base_model:
	- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
	datasets:
	- Miaow-Lab/RLVR-Linearity-Dataset
	license: apache-2.0
	pipeline_tag: text-generation
	library_name: transformers
	tags:
	- reasoning
	- grpo
	- reinforcement-learning
	---

	# Model Card

	## 1. Model Details
	This model is a fine-tuned checkpoint described in the paper "Not All Steps are Informative: On the Linearity of LLMs’ RLVR Training". It was trained using Reinforcement Learning (RL) to investigate the phenomenon of linear evolution in model weights and output log-probabilities during RLVR training.

	- Paper: [Not All Steps are Informative: On the Linearity of LLMs' RLVR Training](https://huggingface.co/papers/2601.04537)
	- Code: [GitHub Repository](https://github.com/Miaow-Lab/RLVR-Linearity)
	- Base Model: [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
	- Training Method: GRPO (using the `verl` framework)


	## 2. Training Details

	- Hyperparameters:
	- Learning Rate: `1e-6`
	- Train Batch Size: `128`
	- PPO Mini Batch Size: `64`
	- RL Algorithm: `GRPO`
	- Rollout Temperature: 1.0
	- Group Size: 16
	- Compute: Trained on `32 x H100` GPUs for about `150` hours.

	For full training configurations, please refer to the `config.json` or the training scripts in the official [GitHub repository](https://github.com/Miaow-Lab/RLVR-Linearity).

	## 3. Citation

	If you use this model in your research, please cite our paper:

	```bibtex
	@misc{wang2026stepsinformativelinearityllms,
	title={Not All Steps are Informative: On the Linearity of LLMs' RLVR Training},
	author={Tianle Wang and Zhongyuan Wu and Shenghao Jin and Hao Xu and Wei Chen and Ning Miao},
	year={2026},
	eprint={2601.04537},
	archivePrefix={arXiv},
	primaryClass={cs.LG},
	url={https://arxiv.org/abs/2601.04537},
	}
	```

	> [!TIP]
	> Motivation for this Model
	> This checkpoint is released primarily as a research artifact to facilitate the analysis of linearity in model outputs and weight updates during RLVR fine‑tuning.