# Model Card for codepulse-codebert
A binary classifier fine-tuned on top of microsoft/codebert-base that
scores code snippets by P(buggy). It is used in the CodePulse analysis
engine as a confidence validator: it filters GPT-predicted bugs by
checking whether the flagged line is statistically likely to be buggy,
reducing false positives before they reach the end user.
## Model Details

### Model Description
CodePulse-CodeBERT is a binary sequence classifier fine-tuned from
microsoft/codebert-base. Given a short code snippet (typically one bug
line plus optional surrounding context), the model outputs a probability
that the snippet contains a bug. Predictions below a configurable
threshold are marked as low-confidence and excluded from the final
quality score.
- Developed by: Aiden Cary, Keller Willhite, Zachery Atchley
- Model type: Transformer-based binary sequence classifier (CodeBERT fine-tune)
- Language(s) (NLP): Code (Python primary)
- License: MIT
- Finetuned from model: microsoft/codebert-base
### Model Sources

- Repository: https://github.com/aidencary/CodePulse

## Uses

### Direct Use

Classify short code snippets as buggy or not buggy:
```python
from transformers import pipeline

clf = pipeline("text-classification", model="aidencary/codepulse-codebert")
result = clf("return user_list[index]")
# [{'label': 'buggy', 'score': 0.87}]
```
### Downstream Use
Integrated into the CodePulse backend
(app/services/codebert_validator.py) as a post-processing layer over
GPT-generated bug predictions. Each predicted bug line is extracted,
comment-stripped, and scored. Bugs whose P(buggy) falls below the
configured threshold are flagged and excluded from the penalty applied
to the code quality score.
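The filtering step described above can be sketched roughly as follows. The function and field names here are illustrative assumptions, not the actual `codebert_validator.py` API:

```python
def strip_comments(line: str) -> str:
    # Naive inline-comment removal (ignores '#' inside string literals).
    return line.split("#", 1)[0].rstrip()

def validate_bugs(predicted_bugs, score_fn, threshold=0.5):
    """Split GPT-predicted bugs into confirmed vs. low-confidence.

    predicted_bugs: dicts with a "line" key holding the flagged code line.
    score_fn: callable returning P(buggy) for a snippet.
    """
    confirmed, low_confidence = [], []
    for bug in predicted_bugs:
        p_buggy = score_fn(strip_comments(bug["line"]))
        target = confirmed if p_buggy >= threshold else low_confidence
        target.append({**bug, "p_buggy": p_buggy})
    return confirmed, low_confidence
```

Only the `confirmed` list would then contribute to the quality-score penalty.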
### Out-of-Scope Use
- Full-file classification --- model expects single-line or short-window snippets (≤512 tokens). Long inputs are truncated.
- Languages other than Python --- training data was Python-focused; results on other languages are unreliable.
- Security vulnerability detection --- trained for general bug patterns, not security-specific flaws (SQLi, XSS, etc.).
- Production safety gate without human review --- false negative rate is non-zero.
## Bias, Risks, and Limitations
- Training data skews toward certain bug patterns; rare bug types will have lower recall.
- Comment stripping is applied at inference time (inline `#` comments are removed before scoring) to prevent label leakage from annotated datasets. Code with semantically meaningful comments may lose signal.
- Confidence contrast remapping is applied in the CodePulse pipeline --- raw model probabilities are spread apart via a sigmoid transform before thresholding. Direct use of the model outside that pipeline will see unmodified softmax probabilities.
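The exact remapping parameters are not published here, but a generic sigmoid contrast transform of this kind (with an assumed gain `k` and center of 0.5) looks like:

```python
import math

def contrast_remap(p: float, k: float = 10.0, center: float = 0.5) -> float:
    """Push probabilities away from `center` with a logistic curve.

    k and center are illustrative values; the CodePulse pipeline's actual
    parameters may differ.
    """
    return 1.0 / (1.0 + math.exp(-k * (p - center)))
```

With `k = 10`, an input of 0.6 maps to roughly 0.73 and 0.4 to roughly 0.27, while 0.5 stays at 0.5, widening the gap around the decision threshold.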
### Recommendations
Use P(buggy) as a soft signal, not a hard gate. Combine with static analysis or human review for critical codepaths.
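One hypothetical way to keep P(buggy) a soft signal is to weight each bug's penalty by its confidence rather than dropping low-confidence bugs outright (the `base_penalty` value below is made up for illustration):

```python
def soft_penalty(bugs, base_penalty: float = 5.0) -> float:
    """Scale each predicted bug's penalty by its P(buggy) score."""
    return sum(base_penalty * bug["p_buggy"] for bug in bugs)

# Two bugs scored 0.9 and 0.3 contribute 4.5 + 1.5 = 6.0 penalty points
# instead of an all-or-nothing 10.0 or 5.0.
```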
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("aidencary/codepulse-codebert")
model = AutoModelForSequenceClassification.from_pretrained("aidencary/codepulse-codebert")
model.eval()

snippet = "items[i] = value"
inputs = tokenizer(snippet, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
p_buggy = float(F.softmax(logits, dim=-1)[0][model.config.label2id["buggy"]])
print(f"P(buggy): {p_buggy:.3f}")
```
## Training Details

### Training Data
Fine-tuned on labeled code snippets where each sample is a short code line or block annotated as buggy or clean. Training data sourced from public bug datasets and synthetic bug injection into clean Python code.
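The card does not specify the injection rules, but a typical synthetic bug injector mutates a clean line with small operator swaps, for example:

```python
import random

# Illustrative mutation table; the actual injection rules used for this
# model are not documented here.
MUTATIONS = {"<=": "<", ">=": ">", "==": "!=", " + ": " - "}

def inject_bug(clean_line: str, rng=random):
    """Return (line, label): mutate one matching operator, else keep clean."""
    candidates = [op for op in MUTATIONS if op in clean_line]
    if not candidates:
        return clean_line, "clean"
    op = rng.choice(candidates)
    return clean_line.replace(op, MUTATIONS[op], 1), "buggy"
```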
### Training Procedure

#### Preprocessing

- Inline `#` comments stripped to prevent label leakage
- Common leading indentation removed (dedented to column 0)
- Tokenized with the microsoft/codebert-base tokenizer, max length 512
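A minimal version of the first two preprocessing steps (assuming naive comment handling that ignores `#` inside string literals):

```python
import textwrap

def preprocess(snippet: str) -> str:
    """Strip inline '#' comments, then dedent to column 0."""
    no_comments = "\n".join(
        line.split("#", 1)[0].rstrip() for line in snippet.splitlines()
    )
    return textwrap.dedent(no_comments)
```

The result is then tokenized with `truncation=True, max_length=512`.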
#### Training Hyperparameters
- Training regime: fp32
- Base model: microsoft/codebert-base
- Task head: AutoModelForSequenceClassification (2 labels)
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
Held-out split from the same labeled snippet dataset used for training.
#### Metrics
- Accuracy
- F1 (macro)
- P(buggy) calibration --- model confidence should correlate with actual bug rate
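The calibration check can be made concrete by binning predictions and comparing mean confidence to the empirical bug rate per bin; a minimal sketch:

```python
def calibration_bins(probs, labels, n_bins: int = 10):
    """Return (mean confidence, empirical bug rate, count) per non-empty
    equal-width probability bin; labels are 1 = buggy, 0 = clean."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    return [
        (sum(p for p, _ in b) / len(b), sum(y for _, y in b) / len(b), len(b))
        for b in bins
        if b
    ]
```

A well-calibrated model shows mean confidence close to the empirical bug rate in each bin.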
### Results

| Metric     | Value       |
|------------|-------------|
| Accuracy   | [add yours] |
| F1 (macro) | [add yours] |
#### Summary
Model performs well on Python snippets matching training distribution. Performance degrades on heavily commented code (comments stripped at inference) and on languages outside the training set.
## Technical Specifications

### Model Architecture and Objective
RobertaForSequenceClassification (CodeBERT backbone) with a 2-class classification head. Objective: cross-entropy over two labels, {clean, buggy}.
### Compute Infrastructure

#### Hardware

Consumer GPU (training)

#### Software
- transformers
- torch
- Python 3.11+
## Model Card Authors
Aiden Cary, Keller Willhite, Zachery Atchley
## Model Card Contact

Open an issue at https://github.com/aidencary/CodePulse.