StepFun has been focused on multimodal AI from the very beginning. Their latest release is a new foundation model: STEP3-VL🔥
https://huggingface.co/collections/stepfun-ai/step3-vl-10b
✨ 10B - Apache 2.0
✨ Leads in the 10B class and competes with models 10–20× larger
✨ Hybrid architecture: a combined autoregressive + diffusion design delivers strong semantic alignment with high-fidelity details
✨ Strong performance in long, dense, and multilingual text rendering
✨ MIT licensed (VQ tokenizer & ViT weights under Apache 2.0)
✨ Now live on Hugging Face Inference Providers 🤗
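Since the model is live on Inference Providers, here is a minimal sketch of querying it from Python with `huggingface_hub`. The repo id `stepfun-ai/STEP3-VL-10B` and the image URL are assumptions for illustration only; check the collection above for the exact identifier and which providers currently host it.

```python
# Minimal sketch: calling STEP3-VL via Hugging Face Inference Providers.
# The repo id below is an assumption based on the collection name; check the
# model card for the exact identifier before running this.
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="stepfun-ai/STEP3-VL-10B",  # hypothetical repo id
    provider="auto",                  # route to any available provider
)

response = client.chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/chart.png"}},  # placeholder image
            {"type": "text", "text": "What trend does this chart show?"},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```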
AgentCPM-Explore🔥, an on-device agent foundation model released by OpenBMB
openbmb/AgentCPM-Explore
✨ 4B - Apache 2.0
✨ Supports 100+ multi-turn environment interactions with search + verification (see the sketch below)
✨ Full training/inference stack is openly shared as well
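To make the "multi-turn interaction with search + verification" pattern concrete, here is a generic agent-loop sketch. It is not the AgentCPM-Explore API; the model policy, search tool, and verifier are toy stand-ins so the control flow runs end to end.

```python
# Illustrative multi-turn agent loop with search + verification.
# NOT the AgentCPM-Explore API: model_step, run_search and verify are toy
# stand-ins showing only the act -> observe -> verify control flow.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    task: str
    history: list = field(default_factory=list)  # (action, observation) pairs

def model_step(state: AgentState) -> dict:
    # Toy policy: search once, then propose an answer.
    if not state.history:
        return {"type": "search", "query": state.task}
    return {"type": "answer", "text": f"Answer based on {len(state.history)} observation(s)"}

def run_search(query: str) -> str:
    return f"(search results for: {query})"  # toy search tool

def verify(task: str, answer: str) -> bool:
    return bool(answer)  # toy verifier: accept any non-empty answer

def run_agent(task: str, max_turns: int = 100) -> str | None:
    state = AgentState(task=task)
    for _ in range(max_turns):
        action = model_step(state)
        if action["type"] == "search":
            observation = run_search(action["query"])
            state.history.append((action, observation))
        else:  # "answer": gate the final response behind verification
            if verify(task, action["text"]):
                return action["text"]
            state.history.append((action, "verification failed, keep exploring"))
    return None  # interaction budget exhausted

print(run_agent("What year was the transformer paper published?"))
```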
✨ Big wave of foundation models: still scaling, but efficiency, reasoning, and deployment now matter more than size
- DeepSeek-V3.2
- Z.ai GLM-4.7
- MiniMax-M2.1
- Xiaomi: MiMo-V2-Flash
✨ Multimodal reasoning is now the default
- Z.ai GLM-4.6V
- Z.ai AutoGLM-Phone 9B
- Bytedance: Dolphin-v2
Only a year into open source, MiniMax is already making a great impact, not only through solid models and products, but also through how well the team uses community platforms like Hugging Face: HF Teams, blogs, Daily Papers, Spaces as project pages, and always experimenting with new ways to engage. Super impressive!
Following up on LLaDA 2.0, the paper is now out on Daily Papers🔥 It has sparked a lot of discussion in the community for showing how discrete diffusion LLMs can scale to 100B and run faster than traditional AR models.
LLaDA2.0: Scaling Up Diffusion Language Models to 100B (2512.15745)
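For readers unfamiliar with the decoding difference behind the speed claim: an AR model emits one token per forward pass, while a discrete diffusion model starts from a fully masked sequence and reveals several positions per denoising step. The toy loop below only counts steps under an assumed block size; it is not LLaDA's actual sampler.

```python
# Toy illustration of why discrete diffusion decoding can take fewer steps
# than autoregressive decoding: it fills several masked positions per step.
# This is a counting exercise only, not LLaDA's actual sampler.

SEQ_LEN = 32          # tokens to generate
TOKENS_PER_STEP = 4   # masked positions revealed per denoising step (assumption)

# Autoregressive: one forward pass per generated token.
ar_steps = SEQ_LEN

# Discrete diffusion: start fully masked, reveal a block of positions per step.
sequence = ["[MASK]"] * SEQ_LEN
diff_steps = 0
while "[MASK]" in sequence:
    # A real denoiser predicts all positions and keeps the most confident ones;
    # here we simply reveal the next block left to right.
    masked = [i for i, t in enumerate(sequence) if t == "[MASK]"]
    for i in masked[:TOKENS_PER_STEP]:
        sequence[i] = f"tok{i}"
    diff_steps += 1

print(f"AR forward passes:       {ar_steps}")    # 32
print(f"Diffusion denoise steps: {diff_steps}")  # 8
```

Each denoising step still processes the full sequence, so the practical speedup depends on how many tokens can be confidently revealed per step.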
✨ Built from real enterprise data (Enron + financial institutions), not synthetic tasks
✨ Tests end-to-end finance workflows
✨ Multimodal & cross-file reasoning
✨ Expert annotated (700+ hours) and genuinely challenging