| --- |
| base_model: |
| - deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B |
| datasets: |
| - Miaow-Lab/RLVR-Linearity-Dataset |
| license: apache-2.0 |
| pipeline_tag: text-generation |
| library_name: transformers |
| tags: |
| - reasoning |
| - grpo |
| - reinforcement-learning |
| --- |
| |
| # Model Card |
|
|
| ## 1. Model Details |
| This model is a fine-tuned checkpoint described in the paper **"Not All Steps are Informative: On the Linearity of LLMs’ RLVR Training"**. It was trained using Reinforcement Learning (RL) to investigate the phenomenon of linear evolution in model weights and output log-probabilities during RLVR training. |
|
|
| - **Paper:** [Not All Steps are Informative: On the Linearity of LLMs' RLVR Training](https://huggingface.co/papers/2601.04537) |
| - **Code:** [GitHub Repository](https://github.com/Miaow-Lab/RLVR-Linearity) |
| - **Base Model:** [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) |
| - **Training Method:** GRPO (using the `verl` framework) |
|
|
|
|
| ## 2. Training Details |
|
|
| - **Hyperparameters:** |
| - Learning Rate: `1e-6` |
| - Train Batch Size: `128` |
| - PPO Mini Batch Size: `64` |
| - RL Algorithm: `GRPO` |
| - Rollout Temperature: 1.0 |
| - Group Size: 16 |
| - **Compute:** Trained on `32 x H100` GPUs for about `150` hours. |
|
|
| For full training configurations, please refer to the `config.json` or the training scripts in the official [GitHub repository](https://github.com/Miaow-Lab/RLVR-Linearity). |
|
|
| ## 3. Citation |
|
|
| If you use this model in your research, please cite our paper: |
|
|
| ```bibtex |
| @misc{wang2026stepsinformativelinearityllms, |
| title={Not All Steps are Informative: On the Linearity of LLMs' RLVR Training}, |
| author={Tianle Wang and Zhongyuan Wu and Shenghao Jin and Hao Xu and Wei Chen and Ning Miao}, |
| year={2026}, |
| eprint={2601.04537}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.LG}, |
| url={https://arxiv.org/abs/2601.04537}, |
| } |
| ``` |
|
|
| > [!TIP] |
| > **Motivation for this Model** |
| > This checkpoint is released primarily as a research artifact to facilitate the analysis of linearity in model outputs and weight updates during RLVR fine‑tuning. |