Overview

This model implements Multiple Subject Reference (MSR), an approach to multi-reference video generation. Instead of adding encoder branches or fusion modules, it converts multiple static reference images into a pseudo-video sequence that shares the same representation space as the target video.

Usage

This LoRA requires the ComfyUI-Licon-MSR plugin for ComfyUI. A sample workflow is included in the model files for easy testing and experimentation.

Key Features

Multi-Reference Visual Memory

- Token-level reference preservation: Multiple reference images are encoded as video latents, which preserves fine-grained visual information at the token level rather than compressing it into a single embedding.
- Native self-attention retrieval: Target video tokens access reference tokens directly through the model's existing self-attention mechanism, so no new architectural components are needed.
- In-context conditioning: References serve as "visual memory" within the main token sequence, not as external conditioning inputs.
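The retrieval idea above can be illustrated with a toy example. This is a minimal sketch, not the model's implementation: it assumes identity query/key projections, two-dimensional token vectors, and made-up values, and the helper names (`joint_sequence`, `attention_weights`) are hypothetical. It only shows that once reference tokens sit in the same sequence as target tokens, ordinary scaled dot-product attention lets a target token place weight on them.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_sequence(reference_frames: List[List[Vector]],
                   target_frames: List[List[Vector]]) -> List[Vector]:
    """Flatten reference frames (the pseudo-video) and target frames into one token list."""
    frames = reference_frames + target_frames
    return [token for frame in frames for token in frame]


def attention_weights(query: Vector, keys: List[Vector]) -> List[float]:
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


# Two reference "images", each treated as one frame of two toy tokens.
refs = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
# One target frame with a single token that resembles reference 0.
target = [[[1.0, 0.0]]]

tokens = joint_sequence(refs, target)
weights = attention_weights(target[0][0], tokens)

# Share of the target token's attention that lands on reference tokens.
num_ref_tokens = sum(len(frame) for frame in refs)
ref_mass = sum(weights[:num_ref_tokens])
```

In this toy setup the target token puts more weight on the tokens of reference 0 than on those of reference 1, simply because they are more similar; no separate reference encoder or fusion step is involved.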

Flexible Reference Composition

- 2 to 5 reference images: Supports 2 to 5 reference inputs, with composition complexity increasing as more references are added.
- Complementary semantic roles: Each reference image can carry different information:
  - Subject identity
  - Object/prop details
  - Scene/background
  - Local textures
  - Multiple viewpoints
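One way to keep track of a reference set before loading it into a workflow is sketched below. The `Role` enum, `Reference` dataclass, and `validate_references` helper are hypothetical names for illustration only; they are not part of the ComfyUI-Licon-MSR plugin. The sketch simply encodes the 2-to-5 limit and the roles listed above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Role(Enum):
    """Semantic roles a reference image can play (taken from the list above)."""
    SUBJECT = "subject identity"
    OBJECT = "object/prop details"
    SCENE = "scene/background"
    TEXTURE = "local textures"
    VIEWPOINT = "additional viewpoint"


@dataclass(frozen=True)
class Reference:
    image_path: str  # hypothetical: path to the reference image on disk
    role: Role


MIN_REFERENCES = 2
MAX_REFERENCES = 5


def validate_references(references: List[Reference]) -> List[Reference]:
    """Raise ValueError unless the set holds between 2 and 5 references."""
    count = len(references)
    if not MIN_REFERENCES <= count <= MAX_REFERENCES:
        raise ValueError(
            f"expected {MIN_REFERENCES} to {MAX_REFERENCES} references, got {count}"
        )
    return references


# Example: one subject, one prop, one background.
reference_set = validate_references([
    Reference("person.png", Role.SUBJECT),
    Reference("guitar.png", Role.OBJECT),
    Reference("stage.png", Role.SCENE),
])
```

Tagging each image with a role also makes it easier to write a prompt that describes every reference once, without leaving any out.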

What It Can Do

Identity Preservation Across References

Generate videos where multiple reference identities are simultaneously preserved:

- Multiple characters from different reference images
- Character + object combinations
- Object + scene compositions

Relation-Based Composition

Beyond mere identity preservation, the model can compose references based on textual relation descriptions:

- Action interactions (handing, picking up, pushing)
- Spatial relationships (left-right, foreground-background)
- Temporal event structures (start → process → result)

Cross-Reference Attribute Selection

The model learns to selectively retrieve attributes from different references:

- Face from reference A, clothing from reference B
- Object identity from one reference, pose/position from another
- Background elements from scene references

Usage Tips (V1)

- Prompt description: Describe the reference images concisely but accurately. Both over-description and under-description degrade consistency.
- High-motion scenes: 50 fps is recommended for smooth, coherent motion.
- Generation reliability: Accurate results typically take 2 to 3 sampling runs.
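The last two tips can be sketched as a small helper. This is an illustration under stated assumptions: `generate` and `score` are hypothetical callables you would supply yourself (for example, a wrapper that queues the ComfyUI workflow with a given seed, and a manual or automated quality rating), and `frame_count` is plain arithmetic that does not account for any frame-count constraints of the underlying video model.

```python
from typing import Callable, List, Tuple, TypeVar

Video = TypeVar("Video")


def frame_count(duration_seconds: float, fps: int = 50) -> int:
    """Number of frames needed for a clip of the given length at the given frame rate."""
    return round(duration_seconds * fps)


def best_of_n(generate: Callable[[int], Video],
              score: Callable[[Video], float],
              seeds: List[int]) -> Tuple[int, Video]:
    """Run one sampling pass per seed and return the (seed, result) pair with the highest score."""
    if not seeds:
        raise ValueError("at least one seed is required")
    candidates = [(seed, generate(seed)) for seed in seeds]
    return max(candidates, key=lambda pair: score(pair[1]))


# Stand-in callables so the sketch runs without a video model:
# "generation" doubles the seed, and the score prefers results close to 4.
seed, result = best_of_n(
    generate=lambda s: s * 2,
    score=lambda value: -abs(value - 4),
    seeds=[1, 2, 3],
)
```

With three seeds this mirrors the 2 to 3 sampling runs suggested above; in practice the scoring step is often just a visual check of each output.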