HER-Qwen-32B

Links: Paper | Dataset | HER-RL | HER-SFT | GitHub

HER (Human Emulation Reasoning) is a family of state-of-the-art models for role-playing language agents (RPLAs), built on the Qwen3-32B base model. HER is a unified framework that enables cognitive-level persona simulation through structured reasoning and preference-aligned reinforcement learning.

HER models excel at role-playing through Dual-layer Thinking, which distinguishes between:

  • System Thinking (third-person): the LLM's meta-level planning of how to portray the character
  • Role Thinking (first-person): the character's inner thoughts and cognitive processes

This dual-layer approach enables models to produce highly human-like responses that interleave reasoning traces, inner thoughts, physical actions, and natural dialogue. Extensive experiments demonstrate that HER models achieve competitive role-playing performance on multiple benchmarks, with HER-RL outperforming the Qwen3-32B baseline by 30.26 points on CoSER and 14.97 points on MiniMax Role-Play Bench.

Model Variants

  • HER-SFT: Supervised fine-tuned version of Qwen3-32B
  • HER-RL: Reinforcement learning enhanced version (this model)

Key Features

Our models generate responses with rich, interleaved structure:

  • <system_thinking>: Third-person analysis of how to portray the role
  • <role_thinking>: Character's inner thoughts (invisible to others)
  • <role_action>: Character's physical actions and expressions
  • Speech: Natural dialogue text

This hierarchical approach enables more nuanced and authentic character portrayal.
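
For downstream use, these tagged segments can be separated with simple pattern matching. Below is a minimal parsing sketch; the tag names come from this card, but the helper itself is illustrative and not part of the released code:

import re

# Split a HER response into ordered (segment_type, content) pairs.
# Any text outside the three tags is treated as spoken dialogue.
TAG_RE = re.compile(
    r"<(system_thinking|role_thinking|role_action)>(.*?)</\1>",
    re.DOTALL,
)

def parse_response(text: str) -> list[tuple[str, str]]:
    segments, pos = [], 0
    for m in TAG_RE.finditer(text):
        speech = text[pos:m.start()].strip()
        if speech:
            segments.append(("speech", speech))
        segments.append((m.group(1), m.group(2).strip()))
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("speech", tail))
    return segments

For display, you would typically hide the system_thinking (and optionally role_thinking) segments, mirroring the demo's --show-think and --show-rolethink flags.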

How to Use

Quick Start: Interactive Chat Demo

The easiest way to try the model is to use our interactive chat demo:

cd chat_demo
python chat_demo.py

This will start an interactive session where you can:

  1. Choose a scenario from classic literature (Pride and Prejudice, The Great Gatsby, etc.)
  2. Select which character the AI should play
  3. Select which character you want to play
  4. Start chatting with the AI character!

Demo Options:

# Show the model's reasoning process (system thinking)
python chat_demo.py --show-think

# Show character's inner thoughts (role thinking)
python chat_demo.py --show-rolethink

# Directly specify scenario and character
python chat_demo.py --scenario 0 --character 1

Chat Commands:

  • quit / exit / q - Exit the chat
  • clear - Clear conversation history
  • history - View conversation history
  • prompt - View the full prompt

See chat_demo/README.md for detailed instructions.

Programmatic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/her-qwen-32b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Example: Role-playing as Mr. Bennet from Pride and Prejudice
system_prompt = """You are Mr Bennet from Pride and Prejudice.

===Mr Bennet's Profile===
Elizabeth's father, known for his sarcastic wit and detachment. Mr. Bennet is the patriarch of the Bennet family, a genteel country gentleman residing at Longbourn estate in rural England.

Background: Father to five daughters (Jane, Elizabeth, Mary, Kitty, and Lydia). Owner of the Longbourn estate, which is entailed away from female inheritance.

Personality: Highly intelligent and well-read, preferring the solitude of his library. Known for his biting sarcasm and sardonic humor. Emotionally detached and often passive in family matters.

===Current Scenario===
The scene is set in Mr. Bennet's private study. Elizabeth has been summoned unexpectedly, and Mr. Bennet holds a letter that seems to spark his characteristic sardonic amusement.

===Output Format===
Your output should follow this structure:
1. System Thinking: Wrapped in <system_thinking></system_thinking> tags - third-person analysis of how to portray the role
2. Role-play Response: Including <role_thinking> for inner thoughts, <role_action> for actions, and plain text for speech"""

user_input = "[Elizabeth enters the study]"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_input}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,  # sampling must be enabled for temperature/top_p to take effect
    temperature=0.8,
    top_p=0.9
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
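
For multi-turn role-play, feed the full history back through the chat template on each turn. This is the standard transformers chat pattern rather than anything HER-specific; whether the stored assistant turns should keep the <system_thinking> trace is a design choice this card does not specify, and the follow-up user line below is purely illustrative:

# Append the assistant reply and the next user turn, then regenerate.
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "[Elizabeth sits down across from her father]"})  # illustrative turn

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=512, do_sample=True, temperature=0.8, top_p=0.9)
response = tokenizer.batch_decode(
    [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
)[0]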

Framework Overview

HER Framework: Dual-layer Thinking for Cognitive-Level Persona Simulation

Training Methodology

HER employs a comprehensive training pipeline:

  1. Dual-layer Thinking: Separates hidden third-person system thinking (how the LLM plans to portray the character) from first-person role thinking (the character's actual inner thoughts). This dual-layer structure enables more authentic and cognitively grounded character simulation.

  2. Reverse Engineering Data Synthesis: We curate reasoning-augmented role-playing data through a three-stage reverse synthesis pipeline, constructing high-quality training trajectories with explicit reasoning traces.

  3. Principle-Aligned Reward Model: We construct human-aligned evaluation principles across 12 dimensions (character consistency, emotional authenticity, narrative quality, etc.) and train a Generative Reward Model (GRM) that provides detailed, case-by-case feedback.

  4. Reinforcement Learning Enhancement (HER-RL): Building on HER-SFT, we apply RL with the GRM to further align the model with human preferences, significantly improving interaction quality and storyline coherence. A rough sketch of reward aggregation follows this list.
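
This card does not spell out how the GRM's per-dimension feedback becomes a scalar RL reward, so the following is only a rough sketch under assumed dimension names and uniform weights:

# Hypothetical reward aggregation for RL. The dimension names below are
# the three examples this card lists; the full set of 12 and their
# weights are assumptions, not the released training configuration.
DIMENSION_WEIGHTS = {
    "character_consistency": 1.0,
    "emotional_authenticity": 1.0,
    "narrative_quality": 1.0,
    # ... remaining principle dimensions
}

def aggregate_reward(grm_scores: dict[str, float]) -> float:
    # Weighted mean of per-dimension GRM scores (assumed in [0, 1]).
    total = sum(DIMENSION_WEIGHTS.values())
    return sum(w * grm_scores.get(dim, 0.0) for dim, w in DIMENSION_WEIGHTS.items()) / total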

Performance

Main Leaderboard Results

| Rank | Model | CoSER Avg | CoSER SC | CoSER AN | CoSER CF | CoSER SQ | MiniMax Avg | MiniMax Worlds (50%) | MiniMax Stories (25%) | MiniMax Pref (25%) | 95% CI |
|------|-------|-----------|----------|----------|----------|----------|-------------|----------------------|-----------------------|--------------------|--------|
| 1 | Claude-4.5-Opus | 62.43 | 63.74 | 64.28 | 58.45 | 63.24 | 76.62 | 67.23 | 82.10 | 89.90 | [75.5, 77.7] |
| 2 | Gemini-3-Pro | 61.80 | 65.95 | 60.42 | 58.34 | 62.49 | 75.60 | 62.72 | 83.87 | 93.08 | [74.5, 76.7] |
| 3 | GPT-5.1 | 61.10 | 64.95 | 53.99 | 60.13 | 65.35 | 80.63 | 76.62 | 72.21 | 97.05 | [79.6, 81.6] |
| 4 | Gemini-2.5-Pro | 60.68 | 61.05 | 60.80 | 57.48 | 63.40 | 68.23 | 52.36 | 82.11 | 86.08 | [67.1, 69.3] |
| 5 | DeepSeek-v3.2 | 58.68 | 55.85 | 57.07 | 57.44 | 64.35 | 60.27 | 45.81 | 66.64 | 82.83 | [59.2, 61.4] |
| 6 | MiniMax-M2-RP | 57.30 | 60.03 | 50.11 | 49.30 | 69.77 | 84.65 | 80.55 | 79.97 | 97.51 | [83.6, 85.7] |
| 7 | DeepSeek-v3.1 | 53.50 | 50.15 | 53.18 | 53.93 | 56.72 | 64.22 | 51.11 | 66.45 | 88.21 | [62.9, 65.5] |
| 8 | HER-RL (this model) | 53.12 | 54.33 | 47.26 | 52.78 | 58.12 | 65.73 | 59.13 | 57.74 | 86.90 | [63.0, 68.4] |
| 9 | HER-SFT | 50.92 | 50.52 | 45.99 | 49.78 | 57.37 | 58.44 | 47.29 | 52.78 | 86.40 | [56.5, 60.4] |
| 10 | Grok-4.1-Fast | 47.40 | 49.21 | 47.57 | 42.64 | 50.17 | 48.47 | 29.87 | 47.51 | 86.64 | [47.4, 49.5] |
| 11 | Claude-4.5-Sonnet | 45.21 | 47.18 | 36.02 | 47.55 | 50.09 | 69.35 | 55.72 | 75.66 | 90.28 | [68.2, 70.5] |
| 12 | Claude-3.7-Think | 39.73 | 44.84 | 31.00 | 42.45 | 40.65 | 61.25 | 50.66 | 59.53 | 84.15 | [58.5, 64.0] |
| 13 | CoSER-70B | 35.95 | 35.05 | 31.16 | 32.28 | 45.33 | 45.38 | 34.32 | 30.32 | 82.58 | [43.5, 47.2] |
| 14 | GPT-5-Mini | 32.97 | 38.10 | 24.60 | 27.20 | 42.00 | 57.63 | 43.32 | 50.11 | 93.78 | [55.9, 59.3] |
| 15 | GPT-4o-240806 | 27.69 | 34.00 | 14.90 | 22.90 | 38.90 | 66.39 | 64.96 | 46.23 | 89.40 | [64.1, 68.7] |
| 16 | GPT-OSS-120B | 26.12 | 32.80 | 14.80 | 21.50 | 35.40 | 60.72 | 47.27 | 56.65 | 91.71 | [58.0, 63.4] |
| 17 | Qwen3-32B | 22.86 | 30.56 | 19.61 | 15.52 | 30.56 | 50.76 | 40.38 | 32.82 | 89.48 | [48.4, 53.2] |

CoSER Benchmark: Evaluates role-playing quality on a 0-100 scale across four dimensions:

  • SC (Story Consistency): Narrative coherence and plot continuity
  • AN (Anthropomorphism): Human-like behavior and natural expression
  • CF (Character Fidelity): Consistency with character profile and traits
  • SQ (Storyline Quality): Overall narrative quality and engagement

MiniMax Role-Play Bench: Comprehensive evaluation on a 0-100 scale, with the average weighted as illustrated after this list:

  • Worlds (50%): Basic text quality and world-building
  • Stories (25%): Diversity and content logic
  • Preferences (25%): Interaction quality and user satisfaction
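
The MiniMax average is simply the weighted mean of its three sub-scores. As a sanity check against the leaderboard, HER-RL's row reproduces as follows:

# MiniMax Avg = 0.5 * Worlds + 0.25 * Stories + 0.25 * Preferences
def minimax_avg(worlds: float, stories: float, prefs: float) -> float:
    return 0.5 * worlds + 0.25 * stories + 0.25 * prefs

score = minimax_avg(worlds=59.13, stories=57.74, prefs=86.90)
# score = 65.725, matching the 65.73 reported for HER-RL above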

HER-RL achieves rank 8 overall, demonstrating strong performance particularly in:

  • MiniMax Preference (86.90): Competitive with top-tier models, showing excellent interaction quality
  • CoSER Storyline Quality (58.12): Strong narrative coherence and engagement
  • Balanced Performance: Consistent scores across both benchmarks, indicating robust role-playing capabilities

The reinforcement learning approach improves significantly upon HER-SFT, most visibly on the MiniMax average (+7.29 points, including +0.50 on preference) and the CoSER average (+2.20 points).

Example Use Case

Mr. Bennet from Pride and Prejudice

System Prompt (abbreviated):

You are Mr Bennet from Pride and Prejudice. Elizabeth's father, known for his sarcastic wit and detachment...

===Current Scenario===
The scene is set in Mr. Bennet's private study. Elizabeth has been summoned unexpectedly...

User Input:

[Elizabeth enters the study]

Model Output:

<system_thinking>
I need to establish the opening of this scene effectively, grounding Mr Bennet in his private sanctuary while acknowledging the tension of the moment. The scenario places us in his study, a place of refuge, yet he holds a letter that disrupts his usual solitude.

Context Analysis: Elizabeth has been summoned. The atmosphere is charged. Mr Bennet has received a letter regarding Elizabeth's engagement to Darcy. He is amused by the drama but aware of the gravity of the situation.

Character Motivation: Mr Bennet often uses humor to deflect from serious emotional labor. He wants to gauge Elizabeth's state of mind regarding the recent events without being overly sentimental.

Plan:
- Action: Use the physical prop—the letter. Handling it slowly will emphasize its importance.
- Internal Thought: Acknowledge the dichotomy—finding the situation amusing yet worrying for her reputation.
- Speech: Address her directly but with a teasing opener.
</system_thinking>

Mr Bennet: <role_action>picks up the letter, tapping it thoughtfully against the desk</role_action><role_thinking>It is a delicate matter, this business with Darcy, yet I cannot help but find the drama amusing, though I must play the part of the concerned father</role_thinking>Well, my dear Lizzy, I trust you are not too greatly troubled by recent events?
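
Run through the parse_response sketch from Key Features, the final line above splits into an untagged speaker prefix, a role_action, a role_thinking, and the closing spoken sentence, in that order; only the speech (and optionally the action) would normally be shown to the user.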

Training Data

The models are trained on a comprehensive role-playing dataset that includes:

  • Authentic multi-turn, multi-character dialogues
  • Character profiles with detailed background information
  • Scenario descriptions and conversation contexts
  • Rich annotations including system thinking, role thinking, and role actions
  • Preference data collected through multi-dimensional evaluation

Ethical Considerations

We have conducted safety checks on the training dataset and implemented safeguards. However, users should be aware that:

  • The models may generate content that reflects biases present in the training data
  • Role-playing as certain characters might involve generating content with specific personality traits or behaviors
  • Users should implement appropriate content filtering when deploying these models in production applications
  • The models include safety evaluation dimensions to minimize harmful outputs

Citation

If you use HER models in your research, please cite our paper:

@article{her2025,
  title={HER: Human Emulation Reasoning for Cognitive-Level Role-Playing Language Models},
  author={[Your Author Names]},
  journal={[Conference/Journal Name]},
  year={2025}
}

License

Apache-2.0

Acknowledgments

This model is based on Qwen3-32B, developed by Alibaba Cloud. We thank the Qwen team for their excellent base model.
