---
license: apache-2.0
language:
- en
- code
library_name: PyLate
tags:
- ColBERT
- PyLate
- sentence-transformers
- code-search
- code-retrieval
- late-interaction
- reasoning
base_model: lightonai/GTE-ModernColBERT-v1
datasets:
- nomic-ai/cornstack-python-v1
- nomic-ai/cornstack-java-v1
- nomic-ai/cornstack-javascript-v1
- nomic-ai/cornstack-php-v1
- nomic-ai/cornstack-go-v1
- nomic-ai/cornstack-ruby-v1
pipeline_tag: sentence-similarity
---
# Reason-Code-ModernColBERT
The **first reasoning-enhanced ColBERT model for code search and retrieval**.
Extends the [ReasonIR methodology](https://arxiv.org/abs/2504.20595) to the code domain — generating reasoning-intensive queries that require understanding algorithms, edge cases, and design patterns, not just keyword matching. Built on research from [LightOn AI](https://huggingface.co/lightonai) (ColBERT for code) and [Facebook Research](https://github.com/facebookresearch/ReasonIR) (reasoning-enhanced retrieval).
## Why Reasoning-Enhanced Training for Code?
Standard code search training uses docstring→code pairs. Our approach generates **reasoning-intensive queries** that require understanding the code's algorithm, behavior, and edge cases — not just surface-level keyword matching. This is the same methodology that enabled [Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) to outperform 7B dense models on reasoning tasks at only 150M parameters.
## Model Details
| Property | Value |
|---|---|
| **Base model** | [lightonai/GTE-ModernColBERT-v1](https://huggingface.co/lightonai/GTE-ModernColBERT-v1) |
| **Architecture** | ColBERT (late-interaction, multi-vector) |
| **Parameters** | 150M |
| **Embedding dim** | 128 per token |
| **Document length** | 512 tokens |
| **Query length** | 128 tokens |
| **Similarity** | MaxSim |
| **Languages** | Python, Java, JavaScript, PHP, Go, Ruby |
| **License** | Apache 2.0 |
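MaxSim, the late-interaction scoring rule in the table, sums over query tokens the maximum dot product against any document token. A minimal sketch in plain Python, using toy 2-dimensional vectors in place of the model's 128-dimensional token embeddings:

```python
# Late-interaction MaxSim scoring, sketched without PyLate.
# Toy 2-dim vectors stand in for real 128-dim token embeddings.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_embeddings, doc_embeddings):
    """For each query token, take the max dot product over all
    document tokens, then sum these maxima into one score."""
    return sum(
        max(dot(q, d) for d in doc_embeddings)
        for q in query_embeddings
    )

# Two query tokens, three document tokens.
q = [[1.0, 0.0], [0.0, 1.0]]
d = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
score = maxsim(q, d)  # 0.9 + 0.8 ≈ 1.7
```

Because each query token independently picks its best-matching document token, MaxSim rewards fine-grained token-level matches that a single pooled vector would average away.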
## Training
### Two-Stage Training Pipeline
**Stage 1: CoRNStack Base (1 epoch)**
- 100,000 high-quality code search pairs from [CoRNStack](https://huggingface.co/collections/nomic-ai/cornstack-67c60fda17322ce742fe9dac) (Apache 2.0)
- 6 languages: Python (25K), Java (20K), JavaScript (15K), PHP (15K), Go (15K), Ruby (10K)
- Loss: 2.42 → 0.63

**Stage 2: Reasoning-Enhanced Fine-Tuning (3 epochs)**
- 9,959 reasoning-intensive code search queries generated from CoRNStack code samples
- Queries require understanding algorithms, edge cases, design patterns, and complexity
- Each query includes a chain-of-thought reasoning prefix (ReasonIR methodology)
- Loss: 2.36 → 0.54
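To make the contrast with docstring-style queries concrete, the two styles might look like this. These strings are illustrative examples written for this card, not samples from the actual training set:

```python
# Hypothetical examples of the two query styles (not real training data).

# Typical docstring-style query: surface-level keyword match.
docstring_query = "sort a list in descending order"

# Reasoning-intensive query: a chain-of-thought prefix about behavior
# and constraints, followed by the actual retrieval request.
reasoning_query = (
    "I need a comparison-based sort that does not mutate its input and "
    "accepts a flag to reverse the ordering. "
    "Find code that returns a new list sorted from largest to smallest."
)
```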
### Training Configuration
```python
# Both stages
model = ColBERT(document_length=512, query_length=128)
loss = CachedContrastive(temperature=1.0, mini_batch_size=32)
batch_size = 256
optim = "adamw_torch"
bf16 = True
# Stage 1: lr=1e-5, 1 epoch, warmup=5%
# Stage 2: lr=5e-6, 3 epochs, warmup=5%
```
### Hardware
Trained on a single NVIDIA DGX Spark (GB10 Blackwell, 128GB unified memory).
- Stage 1: ~130 min (391 steps)
- Stage 2: ~37 min (117 steps)
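The step counts follow directly from the dataset sizes, the batch size of 256, and the epoch counts; a quick sanity check:

```python
import math

batch_size = 256

# Stage 1: 100,000 pairs, 1 epoch
stage1_steps = math.ceil(100_000 / batch_size)  # 391

# Stage 2: 9,959 pairs, 3 epochs
stage2_steps = math.ceil(9_959 / batch_size) * 3  # 117
```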
## Benchmark Results
### CodeSearchNet MRR (500 queries per language, 500 candidates)
| Language | GTE-ModernColBERT (base) | **Reason-Code-ModernColBERT (ours)** | Δ |
|------------|:---:|:---:|:---:|
| Python | 0.991 | 0.989 | -0.002 |
| Java | 0.829 | **0.866** | +0.037 |
| JavaScript | 0.802 | **0.839** | +0.037 |
| PHP | 0.841 | **0.862** | +0.021 |
| Go | 0.879 | **0.887** | +0.008 |
| Ruby | 0.773 | **0.831** | +0.058 |
| **Average** | 0.853 | **0.879** | **+0.026** |
Improves on the base model in 5 of 6 languages. Largest gains in Ruby (+5.8pp), Java (+3.7pp), and JavaScript (+3.7pp) — languages that benefited most from reasoning-enhanced training data. Python is near-ceiling at 0.99.
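The reported averages can be reproduced from the per-language MRRs in the table:

```python
# Per-language MRR from the table above.
base = {"Python": 0.991, "Java": 0.829, "JavaScript": 0.802,
        "PHP": 0.841, "Go": 0.879, "Ruby": 0.773}
ours = {"Python": 0.989, "Java": 0.866, "JavaScript": 0.839,
        "PHP": 0.862, "Go": 0.887, "Ruby": 0.831}

avg_base = sum(base.values()) / len(base)  # ≈ 0.853
avg_ours = sum(ours.values()) / len(ours)  # ≈ 0.879
delta = avg_ours - avg_base                # ≈ +0.026
```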
## Usage
```python
from pylate import models

# Load the model from the Hugging Face Hub
model = models.ColBERT(model_name_or_path="ctrltokyo/Reason-Code-ModernColBERT")

queries = ["function that sorts an array in descending order using a comparison-based algorithm"]
code_docs = ["def sort_desc(arr):\n    return sorted(arr, reverse=True)"]

# Queries and documents are encoded separately; each input yields one
# 128-dim embedding per token, scored later via MaxSim.
query_embeddings = model.encode(queries, is_query=True)
doc_embeddings = model.encode(code_docs, is_query=False)
```
## Citation
This model extends the methodology from:
```bibtex
@article{shao2025reasonir,
  title={ReasonIR: Training Retrievers for Reasoning Tasks},
  author={Shao, Rulin and Jiang, Rui and Yu, Tao and Hashimoto, Tatsunori},
  journal={arXiv preprint arXiv:2504.20595},
  year={2025}
}

@misc{Reason-ModernColBERT,
  title={Reason-ModernColBERT},
  author={LightOn AI},
  year={2025},
  url={https://huggingface.co/lightonai/Reason-ModernColBERT}
}

@inproceedings{cornstack2025,
  title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
  author={Gangisetty, Zach and others},
  booktitle={ICLR},
  year={2025}
}
```
Built with [PyLate](https://github.com/lightonai/pylate) and [Sentence Transformers](https://www.sbert.net/).