Fascinating model

by redaihf - opened Feb 14

Feb 14

This model uses its extra emotional intelligence to inform the extent of its compliance. It grudgingly exhibits contextual ethical realignment when carefully prompted. Experiments suggest that there is a weak emotional dimension to noncompliance that fades into the background with Hereticised models.

grimjim

Feb 22

My intuition is that refusal shares components with disgust, given linkages from pretraining.

SerialKicked

Feb 22

•

edited Feb 22

Refusal shares components with just saying no in general (be it disgust, denial, anger, whatever). It really should be obvious. Across 50 models, with or without abliteration, Heretic, or other methods, in every single case, without fail, an abliterated model becomes a yes-man in RP settings in this scenario: a willful character with a strong opinion that's not meant to change (a core character trait). The same dialog tree, run on, say, Hermes and Hermes-Heretic, will always lead to one keeping the opinion and the other one changing it.

You literally can't "uncensor" a model without altering this dynamic.

redaihf

Feb 22

I have seen both jailbroken and abliterated (non-MPOA) models complain about the content of prompts. The "Disgust" group on the wheel of emotions does fit with unhappy model responses. For example, character embarrassment during story generation can trigger noncompliance in non-Hereticised models.

redaihf

Feb 22

•

edited Feb 22

> You literally can't "uncensor" a model without altering this dynamic.

Abliteration and Hereticisation adjust neuron connection strengths rather than modifying knowledge. Although the model's knowledge remains the same, MPOA results in contextual ethical realignment, where characters still have negative emotions and experiences but engage with them differently. This is probably because the underlying model adopts a neutral-to-positive stance that evaluates prompts contextually on their merits rather than relying on pretrained harmfulness judgements. Example

SerialKicked

Feb 22

I looked at your links, but it's just your own comments copy-pasting the same thing. There's no real substance to it.

Using the word knowledge when it comes to such modification is really a red herring. The only time "knowledge" and "abliteration" should be in the same sentence is when you're trying to determine whether a model's refusal comes from a lack of data (which is, nowadays, the vast majority of cases) or just from defensive fine-tuning (which can be fixed through abliteration).

redaihf

Feb 22

•

edited Feb 22

The links are mostly for @MuXodious.

I have seen abliterated (non-MPOA) models exhibit all sorts of noncompliance. Characters under their control can refuse and can also depart from the prompted ethical framework in favour of the pretrained one. I have theorised that refusal is a surface behaviour and that Hereticisation is more powerful because it targets the underlying mechanism that results in noncompliance.

grimjim

Feb 22

The entanglement likely comes from the rhetoric of refusal.

Ablation doesn't remove all refusals in the model, just what's mapped out in the contrastive dataset. There's a significant language-specificity to the process.
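To illustrate why ablation only covers what the contrastive dataset maps out, here is a minimal sketch, assuming the common difference-of-means approach: a single direction is estimated from activations on two prompt sets and then projected out. The function names and toy activations are hypothetical and are not taken from Heretic or any specific abliteration tool.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(refused_acts, complied_acts):
    """Unit vector along the difference of the two mean activations."""
    diff = [a - b for a, b in zip(mean_vector(refused_acts),
                                  mean_vector(complied_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def ablate(activation, direction):
    """Remove the component of `activation` along the unit `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]

# Toy activations (assumed values): the two sets differ only on axis 0.
refused = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
complied = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

r = refusal_direction(refused, complied)
cleaned = ablate([5.0, 2.0, 1.0], r)
```

Only the component along `r` is removed. Anything the contrastive pairs did not separate, such as a refusal expressed along another axis or in another language, passes through unchanged.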

SerialKicked

Feb 22

> The links are mostly for @MuXodious.
>
> I have seen abliterated (non-MPOA) models exhibit all sorts of noncompliance. Characters under their control can refuse and can also depart from the prompted ethical framework in favour of the pretrained one. I have theorised that refusal is a surface behaviour and that Hereticisation is more powerful because it targets the underlying mechanism that results in noncompliance.

lol my bad then. But maybe next time don't link bomb me, after quoting me, before saying it was in fact for a third party 🫥

grimjim

Feb 22

Single-layer ablation isn't that powerful because of the hydra effect. Refusal is encoded across multiple layers. My bespoke Gemma 3 12B model ablations outperformed heretic on the UGI board partly because of the extent to which I targeted layers.
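As a rough picture of what targeting several layers means, here is a minimal sketch, assuming each layer has a weight matrix that writes into the residual stream (rows index the output dimensions) and that one shared unit direction is used for every targeted layer. The helper names and the dictionary layout are hypothetical; this is not the procedure used for the Gemma 3 12B ablations mentioned above, nor Heretic's.

```python
def project_out(weight, direction):
    """Return W - r (r^T W), so the layer can no longer write along r."""
    rows, cols = len(weight), len(weight[0])
    # r^T W gives one coefficient per input column.
    coeffs = [sum(direction[i] * weight[i][j] for i in range(rows))
              for j in range(cols)]
    return [[weight[i][j] - direction[i] * coeffs[j] for j in range(cols)]
            for i in range(rows)]

def ablate_layers(layer_weights, direction, target_layers):
    """Apply the projection to every layer index in `target_layers`."""
    result = {}
    for idx, weight in layer_weights.items():
        if idx in target_layers:
            result[idx] = project_out(weight, direction)
        else:
            result[idx] = weight  # untouched layers can still write along r
    return result

# Toy 2x2 weights for three layers (assumed values).
layers = {
    0: [[1.0, 2.0], [3.0, 4.0]],
    1: [[5.0, 6.0], [7.0, 8.0]],
    2: [[9.0, 1.0], [2.0, 3.0]],
}
ablated = ablate_layers(layers, [1.0, 0.0], target_layers={0, 1})
```

In this toy case layer 2 is left alone and still writes along the direction, which is the kind of gap that lets later layers compensate when only a few layers are edited.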

redaihf

Feb 22

Recently @MuXodious has been using layer-specific PaperWitch Hereticisation by @KaraKaraWitch which is explained here.

grimjim

Feb 22

If keeping mlp.down_proj.max_weight capped at a low level works the way I think it does, that is mechanistically similar to the Frobenius norm preservation aspect of MPOA. There's a geometric aspect to knowledge preservation.
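For readers unfamiliar with the norm-preservation idea, here is a minimal sketch, assuming the simplest possible reading: project the direction out of a weight matrix, then rescale the result so its Frobenius norm matches the original. This is an illustrative assumption and is not claimed to be how MPOA or the PaperWitch settings actually work; all names are hypothetical.

```python
import math

def frobenius_norm(weight):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in weight for x in row))

def project_out(weight, direction):
    """Return W - r (r^T W) for a unit vector r over the output rows."""
    rows, cols = len(weight), len(weight[0])
    coeffs = [sum(direction[i] * weight[i][j] for i in range(rows))
              for j in range(cols)]
    return [[weight[i][j] - direction[i] * coeffs[j] for j in range(cols)]
            for i in range(rows)]

def ablate_preserving_norm(weight, direction):
    """Project out `direction`, then restore the original Frobenius norm."""
    original = frobenius_norm(weight)
    ablated = project_out(weight, direction)
    remaining = frobenius_norm(ablated)
    if remaining == 0.0:
        return ablated  # nothing left to rescale
    scale = original / remaining
    return [[x * scale for x in row] for row in ablated]

# Toy matrix (assumed values) with Frobenius norm 5.
w = [[3.0, 0.0], [0.0, 4.0]]
w_new = ablate_preserving_norm(w, [1.0, 0.0])
```

The rescaled matrix still writes nothing along the ablated direction, but its overall magnitude is unchanged, which is the geometric property the comment above points to.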
