---
license: mit
language:
- en
base_model:
- microsoft/codebert-base
pipeline_tag: text-classification
tags:
- code-quality
- bug-detection
- codebert
- python
---
# codepulse-codebert

Fine-tuned binary classifier on top of `microsoft/codebert-base` that
scores code snippets by P(buggy). Used in the CodePulse analysis engine
as a confidence validator: it filters GPT-predicted bugs by checking
whether the flagged line is statistically likely to be buggy, reducing
false positives before they reach the end user.

## Model Details

### Model Description

CodePulse-CodeBERT is a binary sequence classifier fine-tuned from
`microsoft/codebert-base`. Given a short code snippet (typically one bug
line plus optional surrounding context), the model outputs a probability
that the snippet contains a bug. Predictions below a configurable
threshold are marked as low-confidence and excluded from the final
quality score.

-   **Developed by:** Aiden Cary, Keller Willhite, Zachery Atchley
-   **Model type:** Transformer-based binary sequence classifier
    (CodeBERT fine-tune)
-   **Language(s) (NLP):** Code (Python primary)
-   **License:** MIT
-   **Finetuned from model:**
    [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base)

### Model Sources

-   **Repository:** https://github.com/aidencary/CodePulse

## Uses

### Direct Use

Classify short code snippets as buggy or not buggy:

``` python
from transformers import pipeline

clf = pipeline("text-classification", model="aidencary/codepulse-codebert")
result = clf("return user_list[index]")
# [{'label': 'buggy', 'score': 0.87}]
```

### Downstream Use

Integrated into the CodePulse backend
(`app/services/codebert_validator.py`) as a post-processing layer over
GPT-generated bug predictions. Each predicted bug line is extracted,
comment-stripped, and scored. Bugs whose P(buggy) falls below the
configured threshold are flagged and excluded from the penalty applied
to the code quality score.
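The filtering step described above can be sketched as follows. This is an illustrative outline, not the actual `codebert_validator.py` API; the function names and the default threshold are assumptions.

``` python
# Hypothetical sketch of the validator filtering step: each GPT-predicted
# bug line is comment-stripped, scored, and dropped if P(buggy) falls
# below a configurable threshold.
from typing import Callable

DEFAULT_THRESHOLD = 0.5  # illustrative; the real threshold is configurable


def strip_inline_comment(line: str) -> str:
    """Remove a trailing '# ...' comment (naive: ignores '#' in strings)."""
    return line.split("#", 1)[0].rstrip()


def filter_bugs(predicted_lines: list[str],
                score_fn: Callable[[str], float],
                threshold: float = DEFAULT_THRESHOLD) -> list[str]:
    """Keep only predicted bug lines the classifier also considers buggy."""
    kept = []
    for line in predicted_lines:
        snippet = strip_inline_comment(line)
        if snippet and score_fn(snippet) >= threshold:
            kept.append(line)
    return kept
```

In the real pipeline `score_fn` would call the classifier; any callable returning P(buggy) works for testing.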

### Out-of-Scope Use

-   Full-file classification --- model expects single-line or
    short-window snippets (≤512 tokens). Long inputs are truncated.
-   Languages other than Python --- training data was Python-focused;
    results on other languages are unreliable.
-   Security vulnerability detection --- trained for general bug
    patterns, not security-specific flaws (SQLi, XSS, etc.).
-   Production safety gate without human review --- false negative rate
    is non-zero.

## Bias, Risks, and Limitations

-   Training data skews toward certain bug patterns; rare bug types will
    have lower recall.
-   Comment stripping is applied at inference time (inline `# ...`
    comments are removed before scoring) to prevent label leakage from
    annotated datasets. Code with semantically meaningful comments may
    lose signal.
-   Confidence contrast remapping is applied in the CodePulse pipeline
    --- raw model probabilities are spread apart via a sigmoid transform
    before thresholding. Direct use of the model outside that pipeline
    will see unmodified softmax probabilities.
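The remapping mentioned above amounts to a logistic transform centered near the decision boundary; a minimal sketch follows, with the center and steepness values as assumptions rather than the pipeline's actual parameters.

``` python
import math


def contrast_remap(p: float, center: float = 0.5, steepness: float = 10.0) -> float:
    """Spread probabilities away from the center via a sigmoid.

    Values above `center` are pushed toward 1, values below toward 0,
    sharpening the contrast before thresholding. `center` and
    `steepness` are illustrative defaults.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (p - center)))
```

At `p = center` the transform is the identity (0.5 maps to 0.5); elsewhere it amplifies the distance from the boundary.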

## Recommendations

Use P(buggy) as a soft signal, not a hard gate. Combine with static
analysis or human review for critical codepaths.

## How to Get Started with the Model

``` python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("aidencary/codepulse-codebert")
model = AutoModelForSequenceClassification.from_pretrained("aidencary/codepulse-codebert")
model.eval()

snippet = "items[i] = value"
inputs = tokenizer(snippet, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
p_buggy = float(F.softmax(logits, dim=-1)[0][model.config.label2id["buggy"]])
print(f"P(buggy): {p_buggy:.3f}")
```

## Training Details

### Training Data

Fine-tuned on labeled code snippets where each sample is a short code
line or block annotated as buggy or clean. Training data sourced from
public bug datasets and synthetic bug injection into clean Python code.

### Training Procedure

#### Preprocessing

-   Inline `#` comments stripped to prevent label leakage
-   Common leading indentation removed (dedented to column 0)
-   Tokenized with microsoft/codebert-base tokenizer, max length 512
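The first two preprocessing steps can be sketched in a few lines; this is a naive illustration (it does not handle `#` inside string literals), not the exact training code.

``` python
import textwrap


def preprocess(snippet: str) -> str:
    """Strip inline '#' comments, then remove common leading indentation."""
    # Drop everything after the first '#' on each line (naive sketch).
    lines = [line.split("#", 1)[0].rstrip() for line in snippet.splitlines()]
    # Dedent the block to column 0 and trim surrounding blank lines.
    return textwrap.dedent("\n".join(lines)).strip("\n")
```

The cleaned text is then tokenized with the base tokenizer at max length 512.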

#### Training Hyperparameters

-   Training regime: fp32
-   Base model: microsoft/codebert-base
-   Task head: AutoModelForSequenceClassification (2 labels)

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Held-out split from the same labeled snippet dataset used for training.

#### Metrics

-   Accuracy
-   F1 (macro)
-   P(buggy) calibration --- model confidence should correlate with
    actual bug rate
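One common way to quantify the calibration criterion above is expected calibration error (ECE): bin predictions by confidence and compare each bin's mean P(buggy) to its observed bug rate. This is a generic sketch, not the evaluation code used for this model.

``` python
def expected_calibration_error(probs: list[float],
                               labels: list[int],
                               n_bins: int = 10) -> float:
    """Weighted mean gap between confidence and observed bug rate per bin.

    Lower is better; 0 means confidence exactly tracks the bug rate.
    """
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        confidence = sum(p for p, _ in b) / len(b)
        bug_rate = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(confidence - bug_rate)
    return ece
```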

#### Results

| Metric     | Value       |
|------------|-------------|
| Accuracy   | [add yours] |
| F1 (macro) | [add yours] |

### Summary

Model performs well on Python snippets matching training distribution.
Performance degrades on heavily commented code (comments stripped at
inference) and on languages outside the training set.

## Technical Specifications

### Model Architecture and Objective

RobertaForSequenceClassification (CodeBERT backbone) with a 2-class
classification head. Objective: cross-entropy over two classes, labels =
{clean, buggy}.

### Compute Infrastructure

#### Hardware

Consumer GPU (training)

#### Software

-   transformers
-   torch
-   Python 3.11+

## Model Card Authors

Aiden Cary, Keller Willhite, Zachery Atchley

## Model Card Contact

aiden4786@gmail.com