HER-Qwen-32B
HER (Human Emulation Reasoning) models are state-of-the-art role-playing language agents (RPLAs), built on the Qwen3-32B base model. HER is a unified framework that enables cognitive-level persona simulation through structured reasoning and preference-aligned reinforcement learning.
HER models excel at role-playing through Dual-layer Thinking, which distinguishes between:
- System Thinking (third-person): The LLM's meta-level planning of how to portray the character
- Role Thinking (first-person): Character's inner thoughts and cognitive processes
This dual-layer approach enables models to produce highly human-like responses that include reasoning traces, inner thoughts, physical actions, and natural dialogue. Extensive experiments demonstrate that HER models achieve competitive role-playing performance on multiple benchmarks, with HER-RL significantly outperforming the Qwen3-32B baseline by 30.26 points on CoSER (53.12 vs. 22.86) and 14.97 points on MiniMax Role-Play Bench (65.73 vs. 50.76).
Model Variants
- HER-SFT: Supervised fine-tuned version of Qwen3-32B
- HER-RL: Reinforcement learning enhanced version (this model)
Key Features
Our models generate responses with rich, interleaved structure:
- `<system_thinking>`: Third-person analysis of how to portray the role
- `<role_thinking>`: Character's inner thoughts (invisible to others)
- `<role_action>`: Character's physical actions and expressions
- Speech: Natural dialogue text
This hierarchical approach enables more nuanced and authentic character portrayal.
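Because the tags are plain XML-style markers, downstream code can split a response into its segments with simple pattern matching. A minimal sketch (the tag names come from this card; the `split_response` helper itself is ours):

```python
import re

def split_response(text: str) -> dict:
    """Split a HER response into its tagged segments plus plain speech."""
    parts = {}
    for tag in ("system_thinking", "role_thinking", "role_action"):
        # Collect every <tag>...</tag> occurrence.
        parts[tag] = re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    # Whatever remains outside the tags is the character's spoken dialogue.
    parts["speech"] = re.sub(
        r"<(system_thinking|role_thinking|role_action)>.*?</\1>",
        "", text, flags=re.DOTALL,
    ).strip()
    return parts
```

Applied to the Mr. Bennet output shown later in this card, this separates the portrayal plan, the inner thought, the physical action, and the spoken line into individual fields.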
How to Use
Quick Start: Interactive Chat Demo
The easiest way to try the model is using our interactive chat demo:
```bash
cd chat_demo
python chat_demo.py
```
This will start an interactive session where you can:
- Choose a scenario from classic literature (Pride and Prejudice, The Great Gatsby, etc.)
- Select which character the AI should play
- Select which character you want to play
- Start chatting with the AI character!
Demo Options:
```bash
# Show the model's reasoning process (system thinking)
python chat_demo.py --show-think

# Show character's inner thoughts (role thinking)
python chat_demo.py --show-rolethink

# Directly specify scenario and character
python chat_demo.py --scenario 0 --character 1
```
Chat Commands:
- `quit` / `exit` / `q` - Exit the chat
- `clear` - Clear conversation history
- `history` - View conversation history
- `prompt` - View the full prompt
See chat_demo/README.md for detailed instructions.
Programmatic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/her-qwen-32b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Example: Role-playing as Mr. Bennet from Pride and Prejudice
system_prompt = """You are Mr Bennet from Pride and Prejudice.
===Mr Bennet's Profile===
Elizabeth's father, known for his sarcastic wit and detachment. Mr. Bennet is the patriarch of the Bennet family, a genteel country gentleman residing at Longbourn estate in rural England.
Background: Father to five daughters (Jane, Elizabeth, Mary, Kitty, and Lydia). Owner of the Longbourn estate, which is entailed away from female inheritance.
Personality: Highly intelligent and well-read, preferring the solitude of his library. Known for his biting sarcasm and sardonic humor. Emotionally detached and often passive in family matters.
===Current Scenario===
The scene is set in Mr. Bennet's private study. Elizabeth has been summoned unexpectedly, and Mr. Bennet holds a letter that seems to spark his characteristic sardonic amusement.
===Output Format===
Your output should follow this structure:
1. System Thinking: Wrapped in <system_thinking></system_thinking> tags - third-person analysis of how to portray the role
2. Role-play Response: Including <role_thinking> for inner thoughts, <role_action> for actions, and plain text for speech"""

user_input = "[Elizabeth enters the study]"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_input}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,  # explicit sampling so temperature/top_p take effect
    temperature=0.8,
    top_p=0.9
)
# Trim the prompt tokens, keeping only the newly generated continuation
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
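For multi-turn conversations, append the model's reply and the next user message to `messages` and repeat the generation step. A minimal continuation sketch (the follow-up line is invented for illustration, and whether to keep `<system_thinking>` in stored history is a design choice this card does not specify):

```python
# Feed the full exchange back in for the next turn. You may prefer to strip
# <system_thinking> from stored history first (a design choice, not specified here).
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "Is it news from London, Papa?"})  # hypothetical next turn

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs, max_new_tokens=512, do_sample=True, temperature=0.8, top_p=0.9
)
next_response = tokenizer.batch_decode(
    [generated_ids[0][model_inputs.input_ids.shape[1]:]],  # drop the prompt tokens
    skip_special_tokens=True,
)[0]
print(next_response)
```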
Framework Overview
Figure: The HER framework, showing dual-layer thinking for cognitive-level persona simulation.
Training Methodology
HER employs a comprehensive training pipeline:
Dual-layer Thinking: Separates hidden third-person system thinking (how the LLM plans to portray the character) from first-person role thinking (the character's actual inner thoughts). This dual-layer structure enables more authentic and cognitively grounded character simulation.
Reverse Engineering Data Synthesis: We curate reasoning-augmented role-playing data through a three-stage reverse synthesis pipeline, constructing high-quality training trajectories with explicit reasoning traces.
Principle-Aligned Reward Model: We construct human-aligned evaluation principles across 12 dimensions (character consistency, emotional authenticity, narrative quality, etc.) and train a Generative Reward Model (GRM) that provides detailed, case-by-case feedback (a toy reward aggregation is sketched below).
Reinforcement Learning Enhancement (HER-RL): Building on HER-SFT, we apply RL with the GRM to further align the model with human preferences, significantly improving interaction quality and storyline coherence.
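As a toy illustration of the reward aggregation mentioned above: the 12 evaluation dimensions and the GRM come from this card, but the 1-5 score scale, the plain averaging, and the `grm_reward` helper are our assumptions, not the released training code.

```python
def grm_reward(dimension_scores: dict[str, float]) -> float:
    """Collapse per-dimension GRM scores (assumed 1-5) into a [0, 1] reward.

    Hypothetical aggregation; HER-RL's actual reward shaping is not
    specified in this card.
    """
    mean = sum(dimension_scores.values()) / len(dimension_scores)
    return (mean - 1.0) / 4.0  # rescale 1-5 to 0-1

# Three of the 12 dimensions named in this card, with made-up scores
reward = grm_reward({
    "character_consistency": 4.0,
    "emotional_authenticity": 3.5,
    "narrative_quality": 4.5,
})
```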
Performance
Main Leaderboard Results
| Rank | Model | CoSER Avg | CoSER SC | CoSER AN | CoSER CF | CoSER SQ | MiniMax Avg | MiniMax Worlds (50%) | MiniMax Stories (25%) | MiniMax Pref (25%) | 95% CI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude-4.5-Opus | 62.43 | 63.74 | 64.28 | 58.45 | 63.24 | 76.62 | 67.23 | 82.10 | 89.90 | [75.5, 77.7] |
| 2 | Gemini-3-Pro | 61.80 | 65.95 | 60.42 | 58.34 | 62.49 | 75.60 | 62.72 | 83.87 | 93.08 | [74.5, 76.7] |
| 3 | GPT-5.1 | 61.10 | 64.95 | 53.99 | 60.13 | 65.35 | 80.63 | 76.62 | 72.21 | 97.05 | [79.6, 81.6] |
| 4 | Gemini-2.5-Pro | 60.68 | 61.05 | 60.80 | 57.48 | 63.40 | 68.23 | 52.36 | 82.11 | 86.08 | [67.1, 69.3] |
| 5 | DeepSeek-v3.2 | 58.68 | 55.85 | 57.07 | 57.44 | 64.35 | 60.27 | 45.81 | 66.64 | 82.83 | [59.2, 61.4] |
| 6 | MiniMax-M2-RP | 57.30 | 60.03 | 50.11 | 49.30 | 69.77 | 84.65 | 80.55 | 79.97 | 97.51 | [83.6, 85.7] |
| 7 | DeepSeek-v3.1 | 53.50 | 50.15 | 53.18 | 53.93 | 56.72 | 64.22 | 51.11 | 66.45 | 88.21 | [62.9, 65.5] |
| 8 | HER-RL (this model) | 53.12 | 54.33 | 47.26 | 52.78 | 58.12 | 65.73 | 59.13 | 57.74 | 86.90 | [63.0, 68.4] |
| 9 | HER-SFT | 50.92 | 50.52 | 45.99 | 49.78 | 57.37 | 58.44 | 47.29 | 52.78 | 86.40 | [56.5, 60.4] |
| 10 | Grok-4.1-Fast | 47.40 | 49.21 | 47.57 | 42.64 | 50.17 | 48.47 | 29.87 | 47.51 | 86.64 | [47.4, 49.5] |
| 11 | Claude-4.5-Sonnet | 45.21 | 47.18 | 36.02 | 47.55 | 50.09 | 69.35 | 55.72 | 75.66 | 90.28 | [68.2, 70.5] |
| 12 | Claude-3.7-Think | 39.73 | 44.84 | 31.00 | 42.45 | 40.65 | 61.25 | 50.66 | 59.53 | 84.15 | [58.5, 64.0] |
| 13 | CoSER-70B | 35.95 | 35.05 | 31.16 | 32.28 | 45.33 | 45.38 | 34.32 | 30.32 | 82.58 | [43.5, 47.2] |
| 14 | GPT-5-Mini | 32.97 | 38.10 | 24.60 | 27.20 | 42.00 | 57.63 | 43.32 | 50.11 | 93.78 | [55.9, 59.3] |
| 15 | GPT-4o-240806 | 27.69 | 34.00 | 14.90 | 22.90 | 38.90 | 66.39 | 64.96 | 46.23 | 89.40 | [64.1, 68.7] |
| 16 | GPT-OSS-120B | 26.12 | 32.80 | 14.80 | 21.50 | 35.40 | 60.72 | 47.27 | 56.65 | 91.71 | [58.0, 63.4] |
| 17 | Qwen3-32B | 22.86 | 30.56 | 19.61 | 15.52 | 30.56 | 50.76 | 40.38 | 32.82 | 89.48 | [48.4, 53.2] |
CoSER Benchmark: Evaluates role-playing quality on a 0-100 scale across four dimensions:
- SC (Story Consistency): Narrative coherence and plot continuity
- AN (Anthropomorphism): Human-like behavior and natural expression
- CF (Character Fidelity): Consistency with character profile and traits
- SQ (Storyline Quality): Overall narrative quality and engagement
MiniMax Role-Play Bench: Comprehensive evaluation on a 0-100 scale, weighted across three components (the aggregate arithmetic for both benchmarks is reproduced in the snippet after this list):
- Worlds (50%): Basic text quality and world-building
- Stories (25%): Diversity and content logic
- Preferences (25%): Interaction quality and user satisfaction
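The aggregate columns in the leaderboard follow directly from these definitions: CoSER Avg is the unweighted mean of the four dimensions, and MiniMax Avg applies the 50/25/25 weights, as a quick check of the HER-RL row confirms:

```python
# HER-RL row from the leaderboard above
coser_avg = (54.33 + 47.26 + 52.78 + 58.12) / 4           # mean of SC, AN, CF, SQ ~ 53.12
minimax_avg = 0.50 * 59.13 + 0.25 * 57.74 + 0.25 * 86.90  # Worlds/Stories/Pref weights ~ 65.73
print(round(coser_avg, 2), round(minimax_avg, 2))
```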
HER-RL ranks 8th of 17 overall (the table is ordered by CoSER average), demonstrating strong performance, particularly in:
- MiniMax Preference (86.90): Competitive with top-tier models, showing excellent interaction quality
- CoSER Storyline Quality (58.12): Strong narrative coherence and engagement
- Balanced Performance: Consistent scores across both benchmarks, indicating robust role-playing capabilities
The reinforcement learning stage improves significantly on HER-SFT, particularly in interaction quality (+0.50 MiniMax preference score) and overall role-playing quality (+2.20 average CoSER score).
Example Use Case
Mr. Bennet from Pride and Prejudice
System Prompt (abbreviated):
You are Mr Bennet from Pride and Prejudice. Elizabeth's father, known for his sarcastic wit and detachment...
===Current Scenario===
The scene is set in Mr. Bennet's private study. Elizabeth has been summoned unexpectedly...
User Input:
[Elizabeth enters the study]
Model Output:
<system_thinking>
I need to establish the opening of this scene effectively, grounding Mr Bennet in his private sanctuary while acknowledging the tension of the moment. The scenario places us in his study, a place of refuge, yet he holds a letter that disrupts his usual solitude.
Context Analysis: Elizabeth has been summoned. The atmosphere is charged. Mr Bennet has received a letter regarding Elizabeth's engagement to Darcy. He is amused by the drama but aware of the gravity of the situation.
Character Motivation: Mr Bennet often uses humor to deflect from serious emotional labor. He wants to gauge Elizabeth's state of mind regarding the recent events without being overly sentimental.
Plan:
- Action: Use the physical prop—the letter. Handling it slowly will emphasize its importance.
- Internal Thought: Acknowledge the dichotomy—finding the situation amusing yet worrying for her reputation.
- Speech: Address her directly but with a teasing opener.
</system_thinking>
Mr Bennet: <role_action>picks up the letter, tapping it thoughtfully against the desk</role_action><role_thinking>It is a delicate matter, this business with Darcy, yet I cannot help but find the drama amusing, though I must play the part of the concerned father</role_thinking>Well, my dear Lizzy, I trust you are not too greatly troubled by recent events?
Training Data
The models are trained on a comprehensive role-playing dataset that includes:
- Authentic multi-turn, multi-character dialogues
- Character profiles with detailed background information
- Scenario descriptions and conversation contexts
- Rich annotations including system thinking, role thinking, and role actions
- Preference data collected through multi-dimensional evaluation (an illustrative record is sketched after this list)
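For illustration, a single record combining these ingredients might look like the sketch below; every field name is hypothetical, since the released data schema is not shown in this card.

```python
# Hypothetical shape of one training example (schema invented for illustration)
example = {
    "character_profile": "Mr Bennet: Elizabeth's father, known for his sarcastic wit...",
    "scenario": "Mr. Bennet's private study; Elizabeth has been summoned unexpectedly.",
    "dialogue": [
        {"speaker": "Elizabeth", "text": "[Elizabeth enters the study]"},
        {
            "speaker": "Mr Bennet",
            "system_thinking": "Ground Mr Bennet in his study; use the letter as a prop.",
            "role_thinking": "A delicate matter, yet I cannot help but find it amusing.",
            "role_action": "picks up the letter, tapping it against the desk",
            "speech": "Well, my dear Lizzy, I trust you are not too greatly troubled?",
        },
    ],
    "preference_label": None,  # filled in by multi-dimensional evaluation
}
```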
Ethical Considerations
We have conducted safety checks on the training dataset and implemented safeguards. However, users should be aware that:
- The models may generate content that reflects biases present in the training data
- Role-playing as certain characters might involve generating content with specific personality traits or behaviors
- Users should implement appropriate content filtering when deploying these models in production applications (a minimal placeholder is sketched after this list)
- The models include safety evaluation dimensions to minimize harmful outputs
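As a placeholder for the filtering point above, a deliberately minimal output gate; the `passes_output_filter` helper and its blocklist are illustrative assumptions, and production deployments should use a proper moderation model or service instead:

```python
def passes_output_filter(response: str, blocked_terms: set[str]) -> bool:
    """Toy output gate: reject responses containing any flagged term.

    Placeholder only; swap in a real moderation classifier for production.
    """
    lowered = response.lower()
    return not any(term in lowered for term in blocked_terms)

if not passes_output_filter(response, blocked_terms={"example_blocked_term"}):
    response = "[response withheld by content filter]"
```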
Citation
If you use HER models in your research, please cite our paper:
```bibtex
@article{her2025,
  title={HER: Human Emulation Reasoning for Cognitive-Level Role-Playing Language Models},
  author={[Your Author Names]},
  journal={[Conference/Journal Name]},
  year={2025}
}
```
License
Apache-2.0
Acknowledgments
This model is based on Qwen3-32B developed by Alibaba Cloud. We thank the Qwen team for their excellent base model.