Instructions to use HuggingFaceTB/SmolLM-135M-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use HuggingFaceTB/SmolLM-135M-Instruct with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceTB/SmolLM-135M-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
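For chat-style use, sampled decoding often reads better than the greedy default. A minimal sketch reusing `model`, `tokenizer`, and `inputs` from the block above (the `temperature`/`top_p` values are illustrative starting points, not official recommendations):

```python
# Sampled generation: do_sample=True enables temperature/top_p,
# which usually produces more natural chat replies than greedy decoding.
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.2,  # illustrative value; tune for your use case
    top_p=0.9,        # illustrative value
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```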
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use HuggingFaceTB/SmolLM-135M-Instruct with vLLM:
Install from pip and serve the model
```sh
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "HuggingFaceTB/SmolLM-135M-Instruct"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "HuggingFaceTB/SmolLM-135M-Instruct",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
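Because the vLLM server exposes an OpenAI-compatible API, you can also query it from Python with the `openai` client. A small sketch, assuming `pip install openai` (the API key is a placeholder, since the local server does not check it):

```python
# Query the local vLLM server through its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM-135M-Instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```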
- SGLang
How to use HuggingFaceTB/SmolLM-135M-Instruct with SGLang:
Install from pip and serve the model
```sh
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "HuggingFaceTB/SmolLM-135M-Instruct" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "HuggingFaceTB/SmolLM-135M-Instruct",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
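If you would rather skip the HTTP server, SGLang also ships an offline engine that can be driven directly from Python. A minimal sketch, assuming a recent SGLang release where `sgl.Engine` is available (for an instruct model you would normally apply the chat template to the prompt first):

```python
# Offline batch inference with SGLang's Python engine (no server needed).
import sglang as sgl

llm = sgl.Engine(model_path="HuggingFaceTB/SmolLM-135M-Instruct")

# Raw prompts for brevity; apply the model's chat template for best results.
prompts = ["What is the capital of France?"]
sampling_params = {"temperature": 0.2, "top_p": 0.9, "max_new_tokens": 64}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])

llm.shutdown()  # release GPU memory held by the engine
```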
Use Docker images

```sh
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "HuggingFaceTB/SmolLM-135M-Instruct" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "HuggingFaceTB/SmolLM-135M-Instruct",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
- Docker Model Runner
How to use HuggingFaceTB/SmolLM-135M-Instruct with Docker Model Runner:
```sh
docker model run hf.co/HuggingFaceTB/SmolLM-135M-Instruct
```
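Docker Model Runner also exposes an OpenAI-compatible API. A hedged sketch for calling it from Python: the base URL below assumes TCP host access on Docker's documented default port 12434, and the model reference mirrors the `docker model run` command; both may differ on your setup.

```python
# Call Docker Model Runner's OpenAI-compatible endpoint from the host.
# Assumption: TCP host access is enabled on the default port 12434;
# check your Docker Model Runner settings if this URL does not respond.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/v1",  # assumed default endpoint
    api_key="unused",  # local endpoint; no real key required
)

response = client.chat.completions.create(
    model="hf.co/HuggingFaceTB/SmolLM-135M-Instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```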
SmolLM Performance
I've been working with SmolLM recently, and the performance has been far below expectations; it's practically unusable. Even simple prompts such as greetings or "who are you?" questions get incoherent or off-topic answers.
Could it be that I'm loading the model incorrectly, or is this a known issue with SmolLM? Any advice on what might be going wrong would be greatly appreciated.
Hi, we just updated the Instruct Models and the outputs should be better. You can also try the larger 360M model for better performance in these demos:
https://huggingface.co/spaces/HuggingFaceTB/instant-smollm
https://huggingface.co/spaces/HuggingFaceTB/SmolLM-360M-Instruct-WebGPU
Thanks for the update! Could you please share what changes were made that led to the performance improvement? Was the model retrained with the original data, or were there other adjustments? Any details you can provide would be greatly appreciated. Thanks again for your help!
We changed the SFT mix (see changelog):
- it seems that using WebInstruct data for SFT sometimes confused the models, since it contained advanced science content beyond the model's capacity (which is why the models sometimes bring up off-topic math equations), so we switched to the Magpie dataset
- with Magpie the model would answer knowledge prompts but still failed at answering greetings and "who are you?" questions, so we built a dataset of 2k simple everyday conversations to fix this behavior: https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k
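If you want to take a quick look at that dataset, here is a minimal sketch with the `datasets` library (it prints whatever splits exist rather than assuming their names):

```python
# Peek at the everyday-conversations dataset used in the updated SFT mix.
from datasets import load_dataset

ds = load_dataset("HuggingFaceTB/everyday-conversations-llama3.1-2k")
print(ds)  # shows available splits and row counts

# Print the first example of each split without assuming split names.
for split in ds:
    print(split, ds[split][0])
```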
Thanks for your quick reply!

