de schamphelaere
ceselder
Recent Activity
updated a model about 15 hours ago: ceselder/qwen3-14b-rh-anti-hacker-ckpts
published a model about 15 hours ago: ceselder/qwen3-14b-rh-anti-hacker-ckpts
updated a model about 19 hours ago: ceselder/qwen3-8b-ao-v3-best-steering2p0
LoRAcle OOD eval models
OOD model organisms for LoRAcle emergent-behavior eval — 4 Betley EM LoRAs + Cloud subliminal owl + EM training data.
CoT Oracle Paper Ablations And Baselines
All models used for my LessWrong post. I generally recommend using the latest adam oracle, or the checkpoint confusingly labelled "no DPO".
- ceselder/adam-reupload-qwen3-8b-latentqa-cls-past-lens
  Text Generation • Updated • 45
- ceselder/adam-reupload-qwen3-8b-full-mix-synthetic-qa-v3-replace-lqa
  Text Generation • Updated • 1
- ceselder/cot-oracle-paper-ablation-adam-recipe-1layer
  Text Generation • Updated • 5
- ceselder/cot-oracle-paper-ablation-ours-1layer
  Text Generation • Updated • 2
CoT Oracle Training Data
Training datasets for the CoT Trajectory Oracle. Includes CoT corpora and QA datasets used for oracle fine-tuning.
LoRAcle — training data + eval
LoRAcle artifacts: a meta-model that reads LoRA weight deltas and verbalizes the behavioral change. Training data + OOD eval sub-collection.
Loracle: weight-reading model interpretability
Loracles + direction tokens for AuditBench, IA, OOD evals.
loracle
LoRA Oracles: detect hidden behaviors from weight geometry. Training data for loracle models.
CoT Oracle Evals
Eval datasets for the CoT Trajectory Oracle — detecting unfaithful chain-of-thought reasoning via activation trajectories.
- ceselder/cot-oracle-eval-decorative-cot
  Viewer • Updated • 56 • 12
- ceselder/cot-oracle-eval-rot13-reconstruction
  Viewer • Updated • 100 • 6
- ceselder/cot-oracle-truthfulqa-hint-admission-unverbalized
  Viewer • Updated • 11k • 10
- ceselder/cot-oracle-truthfulqa-hint-admission-verbalized
  Viewer • Updated • 4.38k • 10
Qwen3-8B AO v3 ablation ladder
LoRA verbalizer ckpts for the v3 Activation Oracle ablation ladder on Qwen3-8B.