Model Card for codepulse-codebert

Fine-tuned binary classifier on top of microsoft/codebert-base that scores code snippets by P(buggy). Used in the CodePulse analysis engine as a confidence validator: it filters GPT-predicted bugs by checking whether the flagged line is statistically likely to be buggy, reducing false positives before they reach the end user.

Model Details

Model Description

CodePulse-CodeBERT is a binary sequence classifier fine-tuned from microsoft/codebert-base. Given a short code snippet (typically one bug line plus optional surrounding context), the model outputs a probability that the snippet contains a bug. Predictions below a configurable threshold are marked as low-confidence and excluded from the final quality score.

  • Developed by: Aiden Cary, Keller Willhite, Zachery Atchley
  • Model type: Transformer-based binary sequence classifier (CodeBERT fine-tune)
  • Language(s) (NLP): Code (Python primary)
  • License: MIT
  • Finetuned from model: microsoft/codebert-base

Model Sources

  • Repository: https://huggingface.co/aidencary/codepulse-codebert

Uses

Direct Use

Classify short code snippets as buggy or not buggy:

from transformers import pipeline

clf = pipeline("text-classification", model="aidencary/codepulse-codebert")
result = clf("return user_list[index]")
# [{'label': 'buggy', 'score': 0.87}]

Downstream Use

Integrated into the CodePulse backend (app/services/codebert_validator.py) as a post-processing layer over GPT-generated bug predictions. Each predicted bug line is extracted, comment-stripped, and scored. Bugs whose P(buggy) falls below the configured threshold are flagged and excluded from the penalty applied to the code quality score.
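In pseudocode terms, the validator pass looks roughly like this. The `strip_comments` helper, the prediction dict shape, and the 0.5 threshold are illustrative assumptions, not the actual `codebert_validator.py` implementation:

```python
import re

def strip_comments(line: str) -> str:
    # Remove inline '#' comments (naive: ignores '#' inside string literals).
    return re.sub(r"#.*$", "", line).rstrip()

def filter_predictions(predicted_bugs, score_fn, threshold=0.5):
    """Keep only GPT-flagged lines the classifier also considers likely buggy.

    predicted_bugs: list of dicts with a 'line' key (hypothetical shape).
    score_fn: callable returning P(buggy) for a comment-stripped snippet.
    """
    confirmed, low_confidence = [], []
    for bug in predicted_bugs:
        snippet = strip_comments(bug["line"])
        p_buggy = score_fn(snippet)
        (confirmed if p_buggy >= threshold else low_confidence).append(bug)
    return confirmed, low_confidence

# Example with a stubbed scorer standing in for the model:
bugs = [{"line": "return user_list[index]  # possible off-by-one"},
        {"line": "x = 1"}]
scores = {"return user_list[index]": 0.87, "x = 1": 0.12}
confirmed, low = filter_predictions(bugs, lambda s: scores[s])
```

Only the confirmed bugs then contribute to the penalty applied to the quality score.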

Out-of-Scope Use

  • Full-file classification --- model expects single-line or short-window snippets (≤512 tokens). Long inputs are truncated.
  • Languages other than Python --- training data was Python-focused; results on other languages are unreliable.
  • Security vulnerability detection --- trained for general bug patterns, not security-specific flaws (SQLi, XSS, etc.).
  • Production safety gate without human review --- false negative rate is non-zero.

Bias, Risks, and Limitations

  • Training data skews toward certain bug patterns; rare bug types will have lower recall.
  • Comment stripping is applied at inference time (inline # ... comments are removed before scoring) to prevent label leakage from annotated datasets. Code with semantically meaningful comments may lose signal.
  • Confidence contrast remapping is applied in the CodePulse pipeline --- raw model probabilities are spread apart via a sigmoid transform before thresholding. Direct use of the model outside that pipeline will see unmodified softmax probabilities.
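A hypothetical sketch of such a contrast remap; the center and steepness values below are made up for illustration and are not CodePulse's actual parameters:

```python
import math

def contrast_remap(p: float, center: float = 0.5, k: float = 10.0) -> float:
    """Spread probabilities away from the center via a logistic curve.

    center and k (steepness) are illustrative values, not CodePulse's.
    """
    return 1.0 / (1.0 + math.exp(-k * (p - center)))

# Raw probabilities near 0.5 stay ambiguous; confident ones are pushed outward.
for p in (0.40, 0.50, 0.60, 0.90):
    print(f"{p:.2f} -> {contrast_remap(p):.3f}")
```

The effect is that thresholding after the remap separates confident and ambiguous predictions more cleanly than thresholding raw softmax outputs.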

Recommendations

Use P(buggy) as a soft signal, not a hard gate. Combine with static analysis or human review for critical codepaths.
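One way to keep the signal soft is to blend P(buggy) with other evidence instead of thresholding it alone. The weights and the `static_flagged` input below are illustrative assumptions, not part of CodePulse:

```python
def combined_confidence(p_buggy: float, static_flagged: bool,
                        w_model: float = 0.7, w_static: float = 0.3) -> float:
    """Blend the classifier score with a static-analysis hit into one
    confidence value (weights are illustrative, not tuned)."""
    return w_model * p_buggy + w_static * (1.0 if static_flagged else 0.0)

# A snippet the classifier doubts but a linter flags still surfaces
# for human review instead of being silently dropped.
score = combined_confidence(0.35, static_flagged=True)
```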

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("aidencary/codepulse-codebert")
model = AutoModelForSequenceClassification.from_pretrained("aidencary/codepulse-codebert")
model.eval()

snippet = "items[i] = value"
inputs = tokenizer(snippet, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
p_buggy = float(F.softmax(logits, dim=-1)[0][model.config.label2id["buggy"]])
print(f"P(buggy): {p_buggy:.3f}")

Training Details

Training Data

Fine-tuned on labeled code snippets where each sample is a short code line or block annotated as buggy or clean. Training data sourced from public bug datasets and synthetic bug injection into clean Python code.

Training Procedure

Preprocessing

  • Inline # comments stripped to prevent label leakage
  • Common leading indentation removed (dedented to column 0)
  • Tokenized with microsoft/codebert-base tokenizer, max length 512
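The first two preprocessing steps can be sketched as below. This is a naive version: the regex ignores '#' inside string literals, and the real pipeline may handle that case differently:

```python
import re
import textwrap

def preprocess(snippet: str) -> str:
    """Comment-strip and dedent a snippet before tokenization
    (illustrative sketch of the two text-level steps)."""
    # 1. Strip inline '#' comments line by line.
    no_comments = "\n".join(re.sub(r"#.*$", "", ln).rstrip()
                            for ln in snippet.splitlines())
    # 2. Remove common leading indentation (dedent to column 0).
    return textwrap.dedent(no_comments)

cleaned = preprocess("    total = a + b  # sum inputs\n    return total")
```

The cleaned string is then tokenized with the microsoft/codebert-base tokenizer (max length 512).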

Training Hyperparameters

  • Training regime: fp32
  • Base model: microsoft/codebert-base
  • Task head: AutoModelForSequenceClassification (2 labels)

Evaluation

Testing Data, Factors & Metrics

Testing Data

Held-out split from the same labeled snippet dataset used for training.

Metrics

  • Accuracy
  • F1 (macro)
  • P(buggy) calibration --- model confidence should correlate with actual bug rate
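Calibration can be checked with a simple reliability table: bin predictions by confidence and compare the mean predicted P(buggy) to the observed bug rate per bin. This is a generic sketch, not the project's evaluation code:

```python
def calibration_table(probs, labels, n_bins=5):
    """Group predictions into confidence bins and report
    (mean predicted P(buggy), observed bug rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    return [
        (sum(p for p, _ in b) / len(b), sum(y for _, y in b) / len(b), len(b))
        for b in bins if b
    ]

# For a well-calibrated model, high-confidence bins show high actual bug rates.
table = calibration_table([0.1, 0.2, 0.9, 0.95], [0, 0, 1, 1])
```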

Results

Metric        Value
Accuracy      [add yours]
F1 (macro)    [add yours]

Summary

Model performs well on Python snippets matching training distribution. Performance degrades on heavily commented code (comments stripped at inference) and on languages outside the training set.

Technical Specifications

Model Architecture and Objective

RobertaForSequenceClassification (CodeBERT backbone, ~0.1B parameters) with a 2-class classification head. Objective: cross-entropy over the two labels {clean, buggy} (the standard loss for a 2-label sequence-classification head).
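The head and objective can be illustrated in plain torch. Dimensions match roberta-base's 768-dim hidden state, but the weights below are random stand-ins, and the real RobertaForSequenceClassification head additionally applies a dense + tanh layer before the final projection:

```python
import torch
import torch.nn as nn

# Sketch of a 2-class head over pooled encoder output (random weights,
# not the fine-tuned ones; simplified relative to the actual head).
hidden_size, num_labels = 768, 2
head = nn.Linear(hidden_size, num_labels)
criterion = nn.CrossEntropyLoss()  # softmax cross-entropy over {clean, buggy}

pooled = torch.randn(4, hidden_size)   # fake [CLS] representations, batch of 4
labels = torch.tensor([0, 1, 1, 0])    # 0 = clean, 1 = buggy
logits = head(pooled)
loss = criterion(logits, labels)
```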

Compute Infrastructure

Hardware

Consumer GPU (training)

Software

  • transformers
  • torch
  • Python 3.11+

Model Card Authors

Aiden Cary, Keller Willhite, Zachery Atchley

Model Card Contact

aiden4786@gmail.com
