---
license: mit
language:
- en
base_model:
- microsoft/codebert-base
pipeline_tag: text-classification
tags:
- code-quality
- bug-detection
- codebert
- python
---
# codepulse-codebert

Fine-tuned binary classifier on top of `microsoft/codebert-base` that
scores code snippets by P(buggy). Used in the CodePulse analysis engine
as a confidence validator: it filters GPT-predicted bugs by checking
whether the flagged line is statistically likely to be buggy, reducing
false positives before they reach the end user.

## Model Details

### Model Description

CodePulse-CodeBERT is a binary sequence classifier fine-tuned from
`microsoft/codebert-base`. Given a short code snippet (typically one bug
line plus optional surrounding context), the model outputs a probability
that the snippet contains a bug. Predictions below a configurable
threshold are marked as low-confidence and excluded from the final
quality score.

-   **Developed by:** Aiden Cary, Keller Willhite, Zachery Atchley
-   **Model type:** Transformer-based binary sequence classifier
    (CodeBERT fine-tune)
-   **Language(s) (NLP):** Code (Python primary)
-   **License:** MIT
-   **Finetuned from model:**
    [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base)

### Model Sources

-   **Repository:** https://github.com/aidencary/CodePulse

## Uses

### Direct Use

Classify short code snippets as buggy or not buggy:

``` python
from transformers import pipeline

clf = pipeline("text-classification", model="aidencary/codepulse-codebert")
result = clf("return user_list[index]")
# [{'label': 'buggy', 'score': 0.87}]
```

### Downstream Use

Integrated into the CodePulse backend
(`app/services/codebert_validator.py`) as a post-processing layer over
GPT-generated bug predictions. Each predicted bug line is extracted,
comment-stripped, and scored. Bugs whose P(buggy) falls below the
configured threshold are flagged and excluded from the penalty applied
to the code quality score.
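The filtering step described above can be sketched as follows. This is an illustrative outline, not the actual `codebert_validator.py` API; the function names and the default threshold are assumptions.

``` python
# Hypothetical sketch of the validator filtering step: each GPT-predicted
# bug line is comment-stripped, scored, and dropped if P(buggy) falls
# below a configurable threshold.
from typing import Callable

DEFAULT_THRESHOLD = 0.5  # illustrative; the real threshold is configurable


def strip_inline_comment(line: str) -> str:
    """Remove a trailing '# ...' comment (naive: ignores '#' in strings)."""
    return line.split("#", 1)[0].rstrip()


def filter_bugs(predicted_lines: list[str],
                score_fn: Callable[[str], float],
                threshold: float = DEFAULT_THRESHOLD) -> list[str]:
    """Keep only predicted bug lines the classifier also considers buggy."""
    kept = []
    for line in predicted_lines:
        snippet = strip_inline_comment(line)
        if snippet and score_fn(snippet) >= threshold:
            kept.append(line)
    return kept
```

In the real pipeline `score_fn` would call the classifier; any callable returning P(buggy) works for testing.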

### Out-of-Scope Use

-   Full-file classification --- model expects single-line or
    short-window snippets (≤512 tokens). Long inputs are truncated.
-   Languages other than Python --- training data was Python-focused;
    results on other languages are unreliable.
-   Security vulnerability detection --- trained for general bug
    patterns, not security-specific flaws (SQLi, XSS, etc.).
-   Production safety gate without human review --- false negative rate
    is non-zero.

## Bias, Risks, and Limitations

-   Training data skews toward certain bug patterns; rare bug types will
    have lower recall.
-   Comment stripping is applied at inference time (inline `# ...`
    comments are removed before scoring) to prevent label leakage from
    annotated datasets. Code with semantically meaningful comments may
    lose signal.
-   Confidence contrast remapping is applied in the CodePulse pipeline
    --- raw model probabilities are spread apart via a sigmoid transform
    before thresholding. Direct use of the model outside that pipeline
    will see unmodified softmax probabilities.
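The remapping mentioned above amounts to a logistic transform centered near the decision boundary; a minimal sketch follows, with the center and steepness values as assumptions rather than the pipeline's actual parameters.

``` python
import math


def contrast_remap(p: float, center: float = 0.5, steepness: float = 10.0) -> float:
    """Spread probabilities away from the center via a sigmoid.

    Values above `center` are pushed toward 1, values below toward 0,
    sharpening the contrast before thresholding. `center` and
    `steepness` are illustrative defaults.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (p - center)))
```

At `p = center` the transform is the identity (0.5 maps to 0.5); elsewhere it amplifies the distance from the boundary.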

## Recommendations

Use P(buggy) as a soft signal, not a hard gate. Combine with static
analysis or human review for critical codepaths.

## How to Get Started with the Model

``` python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("aidencary/codepulse-codebert")
model = AutoModelForSequenceClassification.from_pretrained("aidencary/codepulse-codebert")
model.eval()

snippet = "items[i] = value"
inputs = tokenizer(snippet, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
p_buggy = float(F.softmax(logits, dim=-1)[0][model.config.label2id["buggy"]])
print(f"P(buggy): {p_buggy:.3f}")
```

## Training Details

### Training Data

Fine-tuned on labeled code snippets where each sample is a short code
line or block annotated as buggy or clean. Training data sourced from
public bug datasets and synthetic bug injection into clean Python code.

### Training Procedure

#### Preprocessing

-   Inline `#` comments stripped to prevent label leakage
-   Common leading indentation removed (dedented to column 0)
-   Tokenized with microsoft/codebert-base tokenizer, max length 512
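The first two preprocessing steps can be sketched in a few lines; this is a naive illustration (it does not handle `#` inside string literals), not the exact training code.

``` python
import textwrap


def preprocess(snippet: str) -> str:
    """Strip inline '#' comments, then remove common leading indentation."""
    # Drop everything after the first '#' on each line (naive sketch).
    lines = [line.split("#", 1)[0].rstrip() for line in snippet.splitlines()]
    # Dedent the block to column 0 and trim surrounding blank lines.
    return textwrap.dedent("\n".join(lines)).strip("\n")
```

The cleaned text is then tokenized with the base tokenizer at max length 512.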

#### Training Hyperparameters

-   Training regime: fp32
-   Base model: microsoft/codebert-base
-   Task head: AutoModelForSequenceClassification (2 labels)

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Held-out split from the same labeled snippet dataset used for training.

#### Metrics

-   Accuracy
-   F1 (macro)
-   P(buggy) calibration --- model confidence should correlate with
    actual bug rate
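One common way to quantify the calibration criterion above is expected calibration error (ECE): bin predictions by confidence and compare each bin's mean P(buggy) to its observed bug rate. This is a generic sketch, not the evaluation code used for this model.

``` python
def expected_calibration_error(probs: list[float],
                               labels: list[int],
                               n_bins: int = 10) -> float:
    """Weighted mean gap between confidence and observed bug rate per bin.

    Lower is better; 0 means confidence exactly tracks the bug rate.
    """
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        confidence = sum(p for p, _ in b) / len(b)
        bug_rate = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(confidence - bug_rate)
    return ece
```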

#### Results

| Metric     | Value       |
|------------|-------------|
| Accuracy   | [add yours] |
| F1 (macro) | [add yours] |

### Summary

Model performs well on Python snippets matching training distribution.
Performance degrades on heavily commented code (comments stripped at
inference) and on languages outside the training set.

## Technical Specifications

### Model Architecture and Objective

RobertaForSequenceClassification (CodeBERT backbone) with a 2-class
classification head. Objective: cross-entropy over two classes, labels =
{clean, buggy}.

### Compute Infrastructure

#### Hardware

Consumer GPU (training)

#### Software

-   transformers
-   torch
-   Python 3.11+

## Model Card Authors

Aiden Cary, Keller Willhite, Zachery Atchley

## Model Card Contact

aiden4786@gmail.com