Replied to merve's post (about 19 hours ago):
Don't sleep on the new AI at Meta Vision-Language release!
https://huggingface.co/collections/facebook/perception-encoder-67f977c9a65ca5895a7f6ba1
https://huggingface.co/collections/facebook/perception-lm-67f9783f171948c383ee7498
Meta dropped Swiss Army knives for vision under an Apache 2.0 license:
> Image/video encoders for vision-language modelling and spatial understanding (object detection, etc.)
> The vision LM outperforms InternVL3 and Qwen2.5-VL
> They also release gigantic video and image datasets
The authors set out to build a single versatile vision encoder that can be aligned to a diverse set of tasks.
They trained Perception Encoder (PE) Core, a new state-of-the-art family of vision encoders that can be aligned for both vision-language and spatial tasks. On zero-shot image tasks, it outperforms SigLIP2, the latest state of the art.
> Among the fine-tuned models, the first is PE-Spatial, a model for bounding-box detection, segmentation, and depth estimation. It outperforms all other models.
> The second is PLM (Perception Language Model), which combines PE-Core with the Qwen2.5 7B LM. It outperforms all other models (including InternVL3, which was also trained with a Qwen2.5 LM!)
The authors release the following checkpoints in base, large, and giant sizes:
> 3 PE-Core checkpoints (224, 336, 448)
> 2 PE-Lang checkpoints (L, G)
> 1 PE-Spatial checkpoint (G, 448)
> 3 PLM checkpoints (1B, 3B, 8B)
Datasets
The authors release the following datasets:
> PE Video: a gigantic video dataset of 1M videos with 120k expert annotations
> PLM-Video and PLM-Image: human- and auto-annotated image and video datasets for region-based tasks
> PLM-VideoBench: a new video benchmark for multiple-choice question answering (MCQA)

Replied to fdaudens's post (about 2 months ago):
Ever wanted 45 minutes with one of AI's most fascinating minds? I was with @thomwolf at HumanX Vegas. Here are my notes from his Q&A with the press, which completely changed how I think about AI's future:
1. The next wave of successful AI companies won't be defined by who has the best model but by who builds the most useful real-world solutions. "We all have engines in our cars, but that's rarely the only reason we buy one. We expect it to work well, and that's enough. LLMs will be the same."
2. Big players are pivoting: "Closed-source companies (OpenAI being the first) have largely shifted from LLM announcements to product announcements."
3. Open source is changing everything: "DeepSeek was open source AI's ChatGPT moment. Basically, everyone outside the bubble realized you can get a model for free, and it's just as good as the paid ones."
4. Product innovation is being democratized: take Manus, for example. They built a product on top of Anthropic's models that's "actually better than Anthropic's own product for now, in terms of agents." This proves that anyone can build great products with existing models.
We're entering a "multi-LLM world," where models are becoming commoditized and all the tools to build with are readily available: just look at the flurry of new releases on Hugging Face every day.
Thom's comparison to the internet era is spot-on: "In the beginning you made a lot of money by making websites... but nowadays the huge internet companies are not the companies that built websites. Like Airbnb, Uber, Facebook, they just use the internet as a medium to make something for real life use cases."
Love to hear your thoughts on this shift!