115 186

Kalyan KS PRO

kalyan-ks

AI & ML interests

NLP (LLMs)

Recent Activity

upvoted an article about 1 hour ago

Introduction to Trimming ✂

liked a dataset about 1 hour ago

openbmb/UltraData-SFT-2605

upvoted a collection about 7 hours ago

AgentDoG1.5

View all activity

Organizations

Posts 2

Post

1576

LLM Guardrail Models are Less Robust Against Text Mutation Attacks

Blog post - https://huggingface.co/blog/kalyan-ks/llm-guardrail-models-less-robust

Evaluated the robustness of three LLM guardrail models (GLiGuard, LlamaGuard3 and MiniGuard).

Evaluation is done using 16 text mutation attacks over three datasets (AEGIS 2.0, WildGuard and ExpGuard).

Achieved average Unsafe ASR score of up to 33% and average Safe ASR score of up to 25% against GLiGuard model.

Achieved average Unsafe ASR score of up to 35% and average Safe ASR score of up to 17% against LlamaGuard3-8B model.

Achieved average Unsafe ASR score of up to 45% and average Safe ASR score of up to 15% against MiniGuard v0.1 model.