---
license: mit
language:
- en
base_model:
- microsoft/codebert-base
pipeline_tag: text-classification
tags:
- code-quality
- bug-detection
- codebert
- python
---
# codepulse-codebert
Fine-tuned binary classifier on top of `microsoft/codebert-base` that
scores code snippets by P(buggy). Used in the CodePulse analysis engine
as a confidence validator: it filters GPT-predicted bugs by checking
whether the flagged line is statistically likely to be buggy, reducing
false positives before they reach the end user.
## Model Details
### Model Description
CodePulse-CodeBERT is a binary sequence classifier fine-tuned from
`microsoft/codebert-base`. Given a short code snippet (typically one bug
line plus optional surrounding context), the model outputs a probability
that the snippet contains a bug. Predictions below a configurable
threshold are marked as low-confidence and excluded from the final
quality score.
- **Developed by:** Aiden Cary, Keller Willhite, Zachery Atchley
- **Model type:** Transformer-based binary sequence classifier
(CodeBERT fine-tune)
- **Language(s) (NLP):** Code (Python primary)
- **License:** MIT
- **Finetuned from model:**
[microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base)
### Model Sources
- **Repository:** https://github.com/aidencary/CodePulse
## Uses
### Direct Use
Classify short code snippets as buggy or not buggy:
``` python
from transformers import pipeline
clf = pipeline("text-classification", model="aidencary/codepulse-codebert")
result = clf("return user_list[index]")
# Example output (score is illustrative): [{'label': 'buggy', 'score': 0.87}]
```
### Downstream Use
Integrated into the CodePulse backend
(`app/services/codebert_validator.py`) as a post-processing layer over
GPT-generated bug predictions. Each predicted bug line is extracted,
comment-stripped, and scored. Bugs whose P(buggy) falls below the
configured threshold are flagged and excluded from the penalty applied
to the code quality score.
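As an illustration, a minimal sketch of this filtering step. The helper names and the 0.5 threshold are hypothetical and do not mirror the actual `codebert_validator.py` API:

``` python
from transformers import pipeline

clf = pipeline("text-classification", model="aidencary/codepulse-codebert")

def strip_inline_comment(line: str) -> str:
    # Mirrors training-time preprocessing: drop trailing '# ...' comments before scoring.
    return line.split("#", 1)[0].rstrip()

def filter_predicted_bugs(bug_lines, threshold=0.5):
    # Keep only GPT-predicted bug lines that the classifier also scores as likely buggy.
    confirmed = []
    for line in bug_lines:
        result = clf(strip_inline_comment(line))[0]
        p_buggy = result["score"] if result["label"] == "buggy" else 1.0 - result["score"]
        if p_buggy >= threshold:
            confirmed.append((line, p_buggy))
    return confirmed
```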
### Out-of-Scope Use
- Full-file classification: the model expects single-line or
  short-window snippets (≤512 tokens); longer inputs are truncated.
- Languages other than Python: training data was Python-focused, so
  results on other languages are unreliable.
- Security vulnerability detection: the model was trained on general bug
  patterns, not security-specific flaws (SQLi, XSS, etc.).
- Production safety gate without human review: the false negative rate
  is non-zero.
## Bias, Risks, and Limitations
- Training data skews toward certain bug patterns; rare bug types will
have lower recall.
- Comment stripping is applied at inference time (inline `# ...`
comments are removed before scoring) to prevent label leakage from
annotated datasets. Code with semantically meaningful comments may
lose signal.
- Confidence contrast remapping is applied in the CodePulse pipeline:
  raw model probabilities are spread apart via a sigmoid transform
  before thresholding (a rough sketch follows this list). Direct use of
  the model outside that pipeline will see unmodified softmax
  probabilities.
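For reference, a minimal sketch of this kind of contrast remapping. The midpoint and steepness values here are assumptions, not the constants used in the CodePulse validator:

``` python
import math

def contrast_remap(p: float, midpoint: float = 0.5, steepness: float = 10.0) -> float:
    # Spread probabilities away from the midpoint with a sigmoid before thresholding.
    # Illustrative only: midpoint and steepness are assumed defaults, not CodePulse's values.
    return 1.0 / (1.0 + math.exp(-steepness * (p - midpoint)))

# A borderline raw score of 0.55 maps to ~0.62, while 0.45 maps to ~0.38,
# pushing scores away from the decision boundary before thresholding.
```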
## Recommendations
Use P(buggy) as a soft signal, not a hard gate. Combine with static
analysis or human review for critical codepaths.
## How to Get Started with the Model
``` python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F
tokenizer = AutoTokenizer.from_pretrained("aidencary/codepulse-codebert")
model = AutoModelForSequenceClassification.from_pretrained("aidencary/codepulse-codebert")
model.eval()

snippet = "items[i] = value"
inputs = tokenizer(snippet, return_tensors="pt", truncation=True, max_length=512)

# Run the model without tracking gradients.
with torch.no_grad():
    logits = model(**inputs).logits

# Softmax the logits and read off the probability of the "buggy" label.
p_buggy = float(F.softmax(logits, dim=-1)[0][model.config.label2id["buggy"]])
print(f"P(buggy): {p_buggy:.3f}")
```
## Training Details
### Training Data
Fine-tuned on labeled code snippets where each sample is a short code
line or block annotated as buggy or clean. Training data sourced from
public bug datasets and synthetic bug injection into clean Python code.
### Training Procedure
#### Preprocessing
- Inline `#` comments stripped to prevent label leakage
- Common leading indentation removed (dedented to column 0)
- Tokenized with microsoft/codebert-base tokenizer, max length 512
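A rough sketch of equivalent preprocessing. The naive `#` split does not handle `#` inside string literals, which the actual training script may treat differently:

``` python
import textwrap
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

def preprocess(snippet: str):
    # Strip inline '#' comments (naively), dedent to column 0, then tokenize.
    lines = [line.split("#", 1)[0].rstrip() for line in snippet.splitlines()]
    cleaned = textwrap.dedent("\n".join(lines)).strip()
    return tokenizer(cleaned, truncation=True, max_length=512)
```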
#### Training Hyperparameters
- Training regime: fp32
- Base model: microsoft/codebert-base
- Task head: AutoModelForSequenceClassification (2 labels)
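For context, a sketch of what the fine-tuning setup might look like with the Hugging Face `Trainer`. Batch size, learning rate, and epoch count are assumptions, not the values used for the released checkpoint:

``` python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base",
    num_labels=2,
    id2label={0: "clean", 1: "buggy"},
    label2id={"clean": 0, "buggy": 1},
)

args = TrainingArguments(
    output_dir="codepulse-codebert",
    per_device_train_batch_size=16,  # assumption
    learning_rate=2e-5,              # assumption
    num_train_epochs=3,              # assumption
    fp16=False,                      # fp32 training regime, as stated above
)

# train_ds / eval_ds would be tokenized datasets of labeled snippets (not shown here).
# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```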
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
Held-out split from the same labeled snippet dataset used for training.
#### Metrics
- Accuracy
- F1 (macro)
- P(buggy) calibration: model confidence should correlate with the
  actual bug rate
#### Results

| Metric     | Value       |
|------------|-------------|
| Accuracy   | [add yours] |
| F1 (macro) | [add yours] |
### Summary
Model performs well on Python snippets matching training distribution.
Performance degrades on heavily commented code (comments stripped at
inference) and on languages outside the training set.
## Technical Specifications
### Model Architecture and Objective
RobertaForSequenceClassification (CodeBERT backbone) with a 2-class
classification head. Objective: cross-entropy loss over two labels,
{clean, buggy}.
### Compute Infrastructure
#### Hardware
Consumer GPU (training)
#### Software
- transformers
- torch
- Python 3.11+
## Model Card Authors
Aiden Cary, Keller Willhite, Zachery Atchley
## Model Card Contact
aiden4786@gmail.com