BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
Abstract
BEVFusion unifies multi-modal sensor features in a bird's-eye view for efficient and accurate 3D perception tasks, achieving state-of-the-art results with reduced computational cost.
Multi-sensor fusion is essential for an accurate and reliable autonomous driving system. Recent approaches are based on point-level fusion: augmenting the LiDAR point cloud with camera features. However, the camera-to-LiDAR projection throws away the semantic density of camera features, hindering the effectiveness of such methods, especially for semantic-oriented tasks (such as 3D scene segmentation). In this paper, we break this deeply-rooted convention with BEVFusion, an efficient and generic multi-task multi-sensor fusion framework. It unifies multi-modal features in the shared bird's-eye view (BEV) representation space, which nicely preserves both geometric and semantic information. To achieve this, we diagnose and lift key efficiency bottlenecks in the view transformation with optimized BEV pooling, reducing latency by more than 40x. BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes. It establishes the new state of the art on nuScenes, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower computation cost. Code to reproduce our results is available at https://github.com/mit-han-lab/bevfusion.
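To make the idea of a shared BEV representation concrete, the following is a minimal NumPy sketch of sum-pooling per-point features into a BEV grid and then combining two modalities on that grid. It is an illustration under stated assumptions, not the authors' optimized BEV pooling kernel: the function names `bev_pool` and `fuse_bev`, the sum aggregation, the grid parameters, and the channel-wise concatenation are all hypothetical choices made for this example.

```python
import numpy as np


def bev_pool(points_xy, feats, x_range, y_range, cell_size):
    """Sum-pool per-point features into a BEV grid (illustrative only).

    points_xy: (N, 2) array of x, y positions in meters (assumed ego frame).
    feats:     (N, C) array of features attached to those positions.
    Returns an (H, W, C) grid where each cell holds the sum of the
    features of all points that fall inside it.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    width = int(round((x_max - x_min) / cell_size))
    height = int(round((y_max - y_min) / cell_size))

    # Map metric coordinates to integer cell indices.
    ix = np.floor((points_xy[:, 0] - x_min) / cell_size).astype(np.int64)
    iy = np.floor((points_xy[:, 1] - y_min) / cell_size).astype(np.int64)

    # Discard points that land outside the grid.
    keep = (ix >= 0) & (ix < width) & (iy >= 0) & (iy < height)
    flat = iy[keep] * width + ix[keep]

    # Unbuffered scatter-add, so repeated cell indices accumulate correctly.
    grid = np.zeros((height * width, feats.shape[1]), dtype=feats.dtype)
    np.add.at(grid, flat, feats[keep])
    return grid.reshape(height, width, feats.shape[1])


def fuse_bev(camera_bev, lidar_bev):
    """Combine two BEV maps of equal spatial size along the channel axis.

    Concatenation is an assumption made for this sketch.
    """
    return np.concatenate([camera_bev, lidar_bev], axis=-1)


if __name__ == "__main__":
    pts = np.array([[0.5, 0.5], [0.6, 0.4], [1.5, 0.5], [5.0, 5.0]])
    cam_feats = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
    cam_bev = bev_pool(pts, cam_feats, (0.0, 2.0), (0.0, 2.0), 1.0)
    lidar_bev = np.ones((2, 2, 3))  # placeholder LiDAR BEV features
    print(fuse_bev(cam_bev, lidar_bev).shape)
```

Because both modalities end up on the same grid, any downstream task head can consume the fused map without knowing which sensor contributed which channels, which is the property the abstract describes as task-agnostic.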
Community
The following papers were recommended by the Semantic Scholar API
- Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation (2026)
- Boosting Instance Awareness via Cross-View Correlation with 4D Radar and Camera for 3D Object Detection (2026)
- Multi-Modal Sensor Fusion using Hybrid Attention for Autonomous Driving (2026)
- Post Fusion Bird's Eye View Feature Stabilization for Robust Multimodal 3D Detection (2026)
- Fusion4CA: Boosting 3D Object Detection via Comprehensive Image Exploitation (2026)
- 4DRC-OCC: Robust Semantic Occupancy Prediction Through Fusion of 4D Radar and Camera (2026)
- FOMO-3D: Using Vision Foundation Models for Long-Tailed 3D Object Detection (2026)