ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding
Figure: ROMA processes streaming inputs as aligned multimodal units, using a 'Speak Head' to decide when to respond.
Model Summary
ROMA is a Real-time Omni-Multimodal Assistant designed for unified streaming audio-video understanding. Unlike traditional video LLMs, which respond only after receiving a query, ROMA integrates both Reactive (Question Answering) and Proactive (Event-Driven Alert, Real-Time Narration) capabilities within a single framework.
ROMA introduces a "Speak Head" mechanism to decouple response timing from content generation, allowing it to autonomously decide when to speak based on the continuous audio-visual stream.
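To illustrate the decoupling idea, the sketch below models a speak head as a small binary probe over the backbone's pooled hidden state for each streaming unit: it only decides *whether* to respond now, while content generation is left to the language model. All names, dimensions, and weights here are illustrative assumptions, not details from the ROMA paper.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class SpeakHead:
    """Hypothetical sketch of a 'speak head': a linear probe over the
    pooled hidden state of one audio-visual streaming unit. It outputs
    the probability that the assistant should start speaking now; the
    actual response text is generated separately by the LM."""

    def __init__(self, weights, bias=0.0, threshold=0.5):
        self.weights = weights      # illustrative probe weights
        self.bias = bias
        self.threshold = threshold  # decision cutoff on speak probability

    def speak_prob(self, hidden_state) -> float:
        z = sum(w * h for w, h in zip(self.weights, hidden_state)) + self.bias
        return sigmoid(z)

    def should_speak(self, hidden_state) -> bool:
        return self.speak_prob(hidden_state) >= self.threshold


# One gating decision per incoming multimodal unit (toy numbers).
head = SpeakHead(weights=[0.8, -0.2, 0.5], bias=-0.1)
unit = [1.0, 0.3, 0.7]  # placeholder pooled features for one unit
decision = head.should_speak(unit)
```

In a streaming loop, this check would run after every aligned audio-visual unit; the backbone keeps encoding silently until the head fires, at which point decoding begins. This is one plausible reading of timing/content decoupling, not the paper's exact formulation.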
- Paper: ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding
- Project Page: Link
- Repository: GitHub (coming soon)