AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling
Abstract
AnyMo is a unified multimodal framework for human motion generation that combines a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer to enable high-quality synthesis across arbitrary modality combinations.
Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.
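The abstract names a Residual FSQ-based motion tokenizer but does not describe its internals here. As a rough illustration only, the sketch below assumes "Residual FSQ" means finite scalar quantization (rounding each latent channel to a small fixed grid) applied in several stages, with each stage coding the error left by the previous one. The function names, the level count, the stage count, and the per-stage rescaling are hypothetical choices made for this example, not details taken from the paper.

```python
import numpy as np


def fsq_round(x, levels):
    """Snap each value of x (expected in [-1, 1]) to one of `levels` evenly spaced grid points."""
    half = (levels - 1) / 2.0
    return np.round(np.clip(x, -1.0, 1.0) * half) / half


def residual_fsq(z, levels=5, num_stages=3):
    """Quantize z in stages; each stage codes the residual left by the previous stage.

    Assumes z lies in [-1, 1] and `levels` is odd (hypothetical simplifications).
    Returns a list of integer code arrays (one per stage) and the reconstruction.
    """
    half = (levels - 1) / 2.0
    residual = np.asarray(z, dtype=np.float64)
    recon = np.zeros_like(residual)
    codes = []
    scale = 1.0
    for _ in range(num_stages):
        # Rescale the residual to [-1, 1] so the same grid can be reused.
        q = fsq_round(residual / scale, levels)
        # Store grid positions as integer indices in 0..levels-1.
        codes.append(np.round(q * half).astype(np.int64) + int(half))
        recon = recon + q * scale
        residual = residual - q * scale
        # After rounding, the residual is at most half a grid step,
        # i.e. scale / (levels - 1), so shrink the scale by that factor.
        scale = scale / (levels - 1)
    return codes, recon


rng = np.random.default_rng(0)
z = rng.uniform(-1.0, 1.0, size=(4, 8))  # stand-in for encoder latents
codes, recon = residual_fsq(z, levels=5, num_stages=3)
print(len(codes), codes[0].shape, float(np.max(np.abs(z - recon))))
```

Under these assumptions each added stage shrinks the worst-case reconstruction error by a factor of `levels - 1`, which is the usual motivation for stacking residual quantizers: finer detail without enlarging the per-stage codebook. A real tokenizer would apply this to learned encoder outputs and train through the rounding step, which this sketch does not attempt.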
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Coordinate-Based Dual-Constrained Autoregressive Motion Generation (2026)
- EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation (2026)
- ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion Generation (2026)
- IAM: Identity-Aware Human Motion and Shape Joint Generation (2026)
- Archon: A Unified Multimodal Model for Holistic Digital Human Generation (2026)
- MotionMERGE: A Multi-granular Framework for Human Motion Editing, Reasoning, Generation, and Explanation (2026)
- MAGE: Modality-Agnostic Music Generation and Editing (2026)
