---
license: apache-2.0
language:
- en
- code
library_name: PyLate
tags:
- ColBERT
- PyLate
- sentence-transformers
- code-search
- code-retrieval
- late-interaction
- reasoning
base_model: lightonai/GTE-ModernColBERT-v1
datasets:
- nomic-ai/cornstack-python-v1
- nomic-ai/cornstack-java-v1
- nomic-ai/cornstack-javascript-v1
- nomic-ai/cornstack-php-v1
- nomic-ai/cornstack-go-v1
- nomic-ai/cornstack-ruby-v1
pipeline_tag: sentence-similarity
---

# Reason-Code-ModernColBERT

The **first reasoning-enhanced ColBERT model for code search and retrieval**.

Extends the [ReasonIR methodology](https://arxiv.org/abs/2504.20595) to the code domain — generating reasoning-intensive queries that require understanding algorithms, edge cases, and design patterns, not just keyword matching. Built on research from [LightOn AI](https://huggingface.co/lightonai) (ColBERT for code) and [Facebook Research](https://github.com/facebookresearch/ReasonIR) (reasoning-enhanced retrieval).

## Why Reasoning-Enhanced Training for Code?

Standard code search training uses docstring→code pairs. Our approach generates **reasoning-intensive queries** that require understanding the code's algorithm, behavior, and edge cases — not just surface-level keyword matching. This is the same methodology that enabled [Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) to outperform 7B dense models on reasoning tasks at only 150M parameters.

## Model Details

| Property | Value |
|---|---|
| **Base model** | [lightonai/GTE-ModernColBERT-v1](https://huggingface.co/lightonai/GTE-ModernColBERT-v1) |
| **Architecture** | ColBERT (late-interaction, multi-vector) |
| **Parameters** | 150M |
| **Embedding dim** | 128 per token |
| **Document length** | 512 tokens |
| **Query length** | 128 tokens |
| **Similarity** | MaxSim |
| **Languages** | Python, Java, JavaScript, PHP, Go, Ruby |
| **License** | Apache 2.0 |
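The MaxSim similarity listed above is the standard ColBERT late-interaction score: each query token embedding is matched against its best-scoring document token embedding, and the per-token maxima are summed. A minimal NumPy sketch with toy vectors (not the model's actual 128-dim embeddings):

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColBERT late-interaction score: for each query token embedding,
    take its best dot-product match among the document token embeddings,
    then sum over query tokens."""
    # (num_query_tokens, num_doc_tokens) matrix of token-level similarities
    token_sims = query_emb @ doc_emb.T
    return float(token_sims.max(axis=1).sum())

# Toy example: 2 query tokens, 3 document tokens, 4-dim embeddings.
query = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
doc = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 0.5, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]])
print(maxsim(query, doc))  # 1.0 + 0.5 = 1.5
```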

## Training

### Two-Stage Training Pipeline

**Stage 1: CoRNStack Base (1 epoch)**
- 100,000 high-quality code search pairs from [CoRNStack](https://huggingface.co/collections/nomic-ai/cornstack-67c60fda17322ce742fe9dac) (Apache 2.0)
- 6 languages: Python (25K), Java (20K), JavaScript (15K), PHP (15K), Go (15K), Ruby (10K)
- Loss: 2.42 → 0.63

**Stage 2: Reasoning-Enhanced Fine-Tuning (3 epochs)**
- 9,959 reasoning-intensive code search queries generated from CoRNStack code samples
- Queries require understanding algorithms, edge cases, design patterns, and complexity
- Each query includes a chain-of-thought reasoning prefix (ReasonIR methodology)
- Loss: 2.36 → 0.54
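To illustrate the chain-of-thought prefix idea, here is a hypothetical sketch of how a reasoning trace might be prepended to a raw query; the actual prompt format and generation pipeline used in training are not specified here:

```python
# Hypothetical illustration only: the exact reasoning-prefix format used
# during Stage 2 training is an assumption, not the published recipe.
def build_reasoning_query(reasoning: str, question: str) -> str:
    """Prepend a chain-of-thought reasoning trace to the raw search query,
    in the spirit of ReasonIR's reasoning-intensive queries."""
    return f"{reasoning.strip()} {question.strip()}"

query = build_reasoning_query(
    reasoning=("I need code that sorts without comparisons, so a counting "
               "or radix approach; it must handle duplicate keys stably."),
    question="stable integer sort with linear time complexity",
)
print(query)
```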

### Training Configuration

```python
# Both stages
model = ColBERT(document_length=512, query_length=128)
loss = CachedContrastive(temperature=1.0, mini_batch_size=32)
batch_size = 256
optim = "adamw_torch"
bf16 = True

# Stage 1: lr=1e-5, 1 epoch, warmup=5%
# Stage 2: lr=5e-6, 3 epochs, warmup=5%
```

### Hardware

Trained on a single NVIDIA DGX Spark (GB10 Blackwell, 128GB unified memory).
- Stage 1: ~130 min (391 steps)
- Stage 2: ~37 min (117 steps)

## Benchmark Results

### CodeSearchNet MRR (500 queries per language, 500 candidates)

| Language   | GTE-ModernColBERT (base) | **Reason-Code-ModernColBERT (ours)** | Δ |
|------------|:---:|:---:|:---:|
| Python     | 0.991 | 0.989 | -0.002 |
| Java       | 0.829 | **0.866** | +0.037 |
| JavaScript | 0.802 | **0.839** | +0.037 |
| PHP        | 0.841 | **0.862** | +0.021 |
| Go         | 0.879 | **0.887** | +0.008 |
| Ruby       | 0.773 | **0.831** | +0.058 |
| **Average** | 0.853 | **0.879** | **+0.026** |

Improves on the base model in 5 of 6 languages. Largest gains in Ruby (+5.8pp), Java (+3.7pp), and JavaScript (+3.7pp) — languages that benefited most from reasoning-enhanced training data. Python is near-ceiling at 0.99.
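MRR in the table above is the mean of the reciprocal 1-based rank at which each query's relevant code snippet appears among its candidates. A self-contained sketch of the metric (toy IDs, not the actual evaluation harness):

```python
def mean_reciprocal_rank(rankings: list[list[str]], relevant_ids: list[str]) -> float:
    """MRR over a batch of queries: for each query, 1 / (1-based rank of the
    relevant document in its ranked candidate list); 0 if absent; averaged."""
    total = 0.0
    for ranking, relevant in zip(rankings, relevant_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id == relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Two queries: relevant doc ranked 1st and 2nd -> (1 + 0.5) / 2 = 0.75
print(mean_reciprocal_rank([["a", "b"], ["b", "a"]], ["a", "a"]))  # 0.75
```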

## Usage

```python
from pylate import models

model = models.ColBERT(model_name_or_path="ctrltokyo/Reason-Code-ModernColBERT")

queries = ["function that sorts an array in descending order using a comparison-based algorithm"]
code_docs = ["def sort_desc(arr):\n    return sorted(arr, reverse=True)"]

query_embeddings = model.encode(queries, is_query=True)
doc_embeddings = model.encode(code_docs, is_query=False)
```

## Citation

This model extends the methodology from:

```bibtex
@article{shao2025reasonir,
  title={ReasonIR: Training Retrievers for Reasoning Tasks},
  author={Shao, Rulin and others},
  journal={arXiv preprint arXiv:2504.20595},
  year={2025}
}

@misc{Reason-ModernColBERT,
  title={Reason-ModernColBERT},
  author={LightOn AI},
  year={2025},
  url={https://huggingface.co/lightonai/Reason-ModernColBERT}
}

@inproceedings{cornstack2025,
  title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
  author={Suresh, Tarun and others},
  booktitle={ICLR},
  year={2025}
}
```

Built with [PyLate](https://github.com/lightonai/pylate) and [Sentence Transformers](https://www.sbert.net/).