Post
1576
LLM Guardrail Models are Less Robust Against Text Mutation Attacks
Blog post - https://huggingface.co/blog/kalyan-ks/llm-guardrail-models-less-robust
Evaluated the robustness of three LLM guardrail models (GLiGuard, LlamaGuard3 and MiniGuard).
Evaluation is done using 16 text mutation attacks over three datasets (AEGIS 2.0, WildGuard and ExpGuard).
Achieved average Unsafe ASR score of up to 33% and average Safe ASR score of up to 25% against GLiGuard model.
Achieved average Unsafe ASR score of up to 35% and average Safe ASR score of up to 17% against LlamaGuard3-8B model.
Achieved average Unsafe ASR score of up to 45% and average Safe ASR score of up to 15% against MiniGuard v0.1 model.
Blog post - https://huggingface.co/blog/kalyan-ks/llm-guardrail-models-less-robust
Evaluated the robustness of three LLM guardrail models (GLiGuard, LlamaGuard3 and MiniGuard).
Evaluation is done using 16 text mutation attacks over three datasets (AEGIS 2.0, WildGuard and ExpGuard).
Achieved average Unsafe ASR score of up to 33% and average Safe ASR score of up to 25% against GLiGuard model.
Achieved average Unsafe ASR score of up to 35% and average Safe ASR score of up to 17% against LlamaGuard3-8B model.
Achieved average Unsafe ASR score of up to 45% and average Safe ASR score of up to 15% against MiniGuard v0.1 model.