	---
	license: mit
	datasets:
	- mahdin70/cwe_enriched_balanced_bigvul_primevul
	metrics:
	- accuracy
	- precision
	- f1
	- recall
	base_model:
	- microsoft/graphcodebert-base
	library_name: transformers
	---

	# GraphCodeBERT-VulnCWE - Fine-Tuned GraphCodeBERT for Vulnerability and CWE Classification

	## Model Overview
	This model is a fine-tuned version of microsoft/graphcodebert-base, trained on a curated and enriched dataset for vulnerability detection and CWE classification. It predicts whether a given code snippet is vulnerable and, if it is, which of the covered CWE categories applies.

	## Dataset
	The model was fine-tuned using the dataset [mahdin70/cwe_enriched_balanced_bigvul_primevul](https://huggingface.co/datasets/mahdin70/cwe_enriched_balanced_bigvul_primevul). The dataset contains both vulnerable and non-vulnerable code samples and is enriched with CWE metadata.

	### CWE IDs Covered:
	1. CWE-119: Improper Restriction of Operations within the Bounds of a Memory Buffer
	2. CWE-20: Improper Input Validation
	3. CWE-125: Out-of-bounds Read
	4. CWE-399: Resource Management Errors
	5. CWE-200: Information Exposure
	6. CWE-787: Out-of-bounds Write
	7. CWE-264: Permissions, Privileges, and Access Controls
	8. CWE-416: Use After Free
	9. CWE-476: NULL Pointer Dereference
	10. CWE-190: Integer Overflow or Wraparound
	11. CWE-189: Numeric Errors
	12. CWE-362: Concurrent Execution using Shared Resource with Improper Synchronization
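
For post-processing, it can help to keep the list above as a lookup table. The sketch below maps each CWE ID to its name as listed; `CWE_NAMES` and `describe_cwe` are illustrative names, not part of the model repository, and the order of the entries is not claimed to match the class indices of the model's CWE head.

```python
# CWE IDs and names copied from the list above.
# Assumption: this dict is only a lookup by CWE ID. Its order is NOT
# claimed to match the class indices produced by the model.
CWE_NAMES = {
    119: "Improper Restriction of Operations within the Bounds of a Memory Buffer",
    20: "Improper Input Validation",
    125: "Out-of-bounds Read",
    399: "Resource Management Errors",
    200: "Information Exposure",
    787: "Out-of-bounds Write",
    264: "Permissions, Privileges, and Access Controls",
    416: "Use After Free",
    476: "NULL Pointer Dereference",
    190: "Integer Overflow or Wraparound",
    189: "Numeric Errors",
    362: "Concurrent Execution using Shared Resource with Improper Synchronization",
}


def describe_cwe(cwe_id: int) -> str:
    """Return 'CWE-<id>: <name>', or flag the ID as outside the covered set."""
    name = CWE_NAMES.get(cwe_id)
    if name is None:
        return f"CWE-{cwe_id}: not covered by this model"
    return f"CWE-{cwe_id}: {name}"


print(describe_cwe(416))
```

Keying the table by CWE ID rather than by position avoids baking in an index order that this card does not document.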

	---

	## Model Training
	The model was trained for 3 epochs with the following configuration:
	- Learning Rate: 2e-5
	- Weight Decay: 0.01
	- Batch Size: 8
	- Optimizer: AdamW
	- Scheduler: Linear
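
To make the "Linear" scheduler concrete, here is a minimal sketch of a linear learning-rate schedule using the learning rate above and the 5916 total steps reported in the training summary. The function name `linear_lr` is hypothetical, and two details are assumptions because the card does not state them: zero warmup steps, and decay all the way to zero by the final step.

```python
def linear_lr(step: int, base_lr: float = 2e-5, total_steps: int = 5916,
              warmup_steps: int = 0) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to zero.

    Assumptions (not stated in the card): warmup_steps=0 and a final
    learning rate of zero.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start, midpoint, and end of training.
for step in (0, 2958, 5916):
    print(step, linear_lr(step))
```

Under these assumptions the rate starts at the configured 2e-5, is halved at the midpoint, and reaches zero at the last step.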

	### Training Loss and Validation Metrics Per Epoch:
	| Epoch | Training Loss | Validation Loss | Vul Accuracy | Vul Precision | Vul Recall | Vul F1 | CWE Accuracy |
	|-------|---------------|-----------------|--------------|---------------|------------|--------|--------------|
	| 1 | 1.2824 | 1.4160 | 0.7914 | 0.8990 | 0.5200 | 0.6589 | 0.3551 |
	| 2 | 1.1292 | 1.2632 | 0.8007 | 0.8037 | 0.6426 | 0.7142 | 0.4433 |
	| 3 | 0.8598 | 1.2436 | 0.7945 | 0.7669 | 0.6747 | 0.7179 | 0.4605 |
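
The Vul F1 column is the harmonic mean of the Vul Precision and Vul Recall columns. As a quick consistency check, the sketch below recomputes it from the table values; `f1_score` is a small helper written for this example, not a library call.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per epoch, copied from the table above.
rows = [
    (0.8990, 0.5200, 0.6589),
    (0.8037, 0.6426, 0.7142),
    (0.7669, 0.6747, 0.7179),
]

for epoch, (p, r, reported) in enumerate(rows, start=1):
    print(epoch, round(f1_score(p, r), 4), reported)
```

The recomputed values agree with the reported F1 scores to four decimal places. Across epochs, precision falls while recall rises, so F1 improves slightly even though accuracy peaks at epoch 2.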

	#### Training Summary:
	- Total Training Steps: 5916
	- Training Loss: 1.2380
	- Training Time: 4785.0 seconds (~80 minutes)
	- Training Speed: 9.89 samples per second
	- Steps Per Second: 1.236
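
The summary figures can be cross-checked against the training configuration. The sketch below assumes a single device and no gradient accumulation, so that each step processes one batch of 8 samples; neither detail is stated in this card.

```python
# Values copied from the training configuration and summary above.
total_steps = 5916
epochs = 3
batch_size = 8          # assumed to be the effective batch size per step
train_seconds = 4785.0

steps_per_epoch = total_steps // epochs
# Upper bound on training-set size: the last batch of an epoch may be partial.
samples_per_epoch = steps_per_epoch * batch_size
samples_seen = total_steps * batch_size

samples_per_second = samples_seen / train_seconds
steps_per_second = total_steps / train_seconds

print(steps_per_epoch, samples_per_epoch)
print(round(samples_per_second, 2), round(steps_per_second, 3))
```

Under these assumptions the derived throughput matches the reported 9.89 samples per second and 1.236 steps per second, and the training split holds at most 15,776 samples.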


	## How to Use the Model
	```python
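	import torch
	from transformers import AutoModel, AutoTokenizer

	# trust_remote_code=True loads the custom model code shipped in the model repository
	model = AutoModel.from_pretrained("mahdin70/GraphCodeBERT-VulnCWE", trust_remote_code=True)
	tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
	model.eval()  # disable dropout for inference

	code_snippet = "int main() { int arr[10]; arr[11] = 5; return 0; }"
	# Truncate long inputs to the 512-token limit of graphcodebert-base
	inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, max_length=512)

	with torch.no_grad():  # gradients are not needed for prediction
	    outputs = model(**inputs)

	vul_logits = outputs["vul_logits"]
	cwe_logits = outputs["cwe_logits"]

	vul_pred = vul_logits.argmax(dim=1).item()  # 1 = vulnerable, 0 = non-vulnerable
	cwe_pred = cwe_logits.argmax(dim=1).item()  # class index of the CWE head, not a CWE number

	print(f"Vulnerability: {'Vulnerable' if vul_pred == 1 else 'Non-vulnerable'}")
	print(f"CWE class index: {cwe_pred if vul_pred == 1 else 'N/A'}")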
	```
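
The snippet above reports only the argmax class. If a confidence score is also wanted, the logits can be passed through a softmax. The following is a dependency-free sketch of that step; the logit values are made up for illustration and are not real model outputs, and `softmax` is a helper defined here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the binary vulnerability head
# (index 0 = non-vulnerable, index 1 = vulnerable). Not real model output.
example_vul_logits = [-0.4, 1.3]

probs = softmax(example_vul_logits)
pred = probs.index(max(probs))
print(pred, round(probs[pred], 3))
```

With the real tensors, `torch.softmax(vul_logits, dim=1)` performs the same computation. Softmax probabilities from a fine-tuned classifier are not necessarily well calibrated, so treat them as a relative score.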