CLAP • Collection • Pretrained models for "CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos" • 2 items • Updated about 16 hours ago
FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model • Paper • 2410.13925 • Published Oct 17, 2024