AI & ML interests
Exploring smol models (for text, vision and video) and high-quality web and synthetic datasets
Smol, multilingual, long-context reasoner
Datasets to decontaminate the post-training mixtures against. Use the subset and column values described per entry
State-of-the-art compact LLMs for on-device applications: 1.7B, 360M, 135M
- HuggingFaceTB/SmolLM2-1.7B-Instruct
  Text Generation • 2B • Updated • 122k • 733
- HuggingFaceTB/SmolLM2-1.7B
  Text Generation • 2B • Updated • 168k • 152
- HuggingFaceTB/SmolLM2-360M-Instruct
  Text Generation • 0.4B • Updated • 233k • 194
- HuggingFaceTB/SmolLM2-360M
  Text Generation • 0.4B • Updated • 46.3k • 107
A collection of datasets for LLM pretraining
Collection of models & demos for the even smoller SmolVLM release
- HuggingFaceTB/SmolVLM-256M-Instruct
  Image-Text-to-Text • 0.3B • Updated • 674k • 366
- HuggingFaceTB/SmolVLM-500M-Instruct
  Image-Text-to-Text • 0.5B • Updated • 227k • 195
- SmolVLM
  📊 67 • Generate descriptions from images and text prompts
- HuggingFaceTB/SmolVLM-256M-Base
  Image-Text-to-Text • 0.3B • Updated • 230 • 23
SmolLM models in MLC, ONNX and GGUF format for local applications + in-browser demos
- HuggingFaceTB/everyday-conversations-llama3.1-2k
  Viewer • Updated • 2.38k • 1.67k • 131
- HuggingFaceTB/Magpie-Pro-300K-Filtered-H4
  Viewer • Updated • 300k • 133 • 5
- HuggingFaceTB/OpenHermes-2.5-H4
  Viewer • Updated • 1M • 116 • 6
- HuggingFaceTB/self-oss-instruct-sc2-H4
  Viewer • Updated • 50.7k • 59 • 5
Datasets used in SmolLM3 pretraining
State-of-the-art compact VLMs for on-device applications: Base, Synthetic, and Instruct. Check our blog: https://huggingface.co/blog/smolvlm
- HuggingFaceTB/SmolVLM-Instruct
  Image-Text-to-Text • 2B • Updated • 17.7k • 588
- HuggingFaceTB/SmolVLM-Base
  Image-Text-to-Text • 2B • Updated • 1.3k • 88
- HuggingFaceTB/SmolVLM-Synthetic
  Image-Text-to-Text • 2B • Updated • 41 • 12
- HuggingFaceTB/SmolVLM-Instruct-DPO
  Image-Text-to-Text • Updated • 7 • 22
🔥 15 classifiers, 124M parameters, one per programming language, for assessing the educational value of GitHub code
FineMath datasets and ablation models
A series of smol LLMs: 135M, 360M and 1.7B. We release base and Instruct models as well as the training corpus and some WebGPU demos
Resources for the Cosmopedia dataset