SmolLM-125M: A Lightweight Language Model for Consumer Hardware

SmolLM-125M is a 125M-parameter language model designed to be trained and run on consumer GPUs with as little as 4GB of VRAM. It follows a GPT-style architecture, with training choices aimed at keeping memory usage low.

Model Details

Architecture: GPT-style Transformer
Parameters: 125M
Context Length: 512 tokens
Vocabulary: 50,257 tokens (GPT-2 tokenizer)
Training Data: WikiText-2
Hardware Requirements: 4GB+ VRAM GPU
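
As a rough sanity check on the 4GB figure, the sketch below estimates the memory taken by weights, gradients and AdamW optimizer state during training. It assumes 32-bit floats throughout and ignores activations, neither of which this card specifies; the function name is illustrative.

```python
def training_state_bytes(n_params: int, bytes_per_value: int = 4) -> dict:
    """Estimate memory for weights, gradients and AdamW moments (no activations)."""
    weights = n_params * bytes_per_value
    grads = n_params * bytes_per_value
    # AdamW keeps two running moments (first and second) per parameter.
    adam_moments = 2 * n_params * bytes_per_value
    return {
        "weights": weights,
        "grads": grads,
        "adam_moments": adam_moments,
        "total": weights + grads + adam_moments,
    }


est = training_state_bytes(125_000_000)  # nominal parameter count from this card
for name, size in est.items():
    print(f"{name:>12}: {size / 2**30:.2f} GiB")
```

Under these assumptions the fixed training state comes to about 1.9 GiB, which leaves the rest of a 4GB card for activations and framework overhead.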

Architecture Specifications

Layers: 12 transformer blocks
Attention Heads: 12
Embedding Dimension: 768
Activation: GELU
Layer Normalization: Pre-norm
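
The 125M figure can be cross-checked from the specifications above. The sketch below counts parameters for a GPT-2-style decoder and makes several assumptions this card does not confirm: a fused Q/K/V projection, a feed-forward width of 4x the embedding dimension, learned position embeddings, a final layer norm, and an output head tied to the token embedding. The function name is hypothetical.

```python
def gpt_param_count(vocab_size, block_size, n_layer, n_embd, bias=True, tied=True):
    """Count parameters of a GPT-2-style decoder (assumed layout, not the author's code)."""
    b = 1 if bias else 0
    tok_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd
    layer_norm = n_embd + b * n_embd                   # scale (+ shift)
    attn = (n_embd * 3 * n_embd + b * 3 * n_embd       # fused Q, K, V projection
            + n_embd * n_embd + b * n_embd)            # attention output projection
    mlp = (n_embd * 4 * n_embd + b * 4 * n_embd        # expand to 4 * n_embd
           + 4 * n_embd * n_embd + b * n_embd)         # project back to n_embd
    block = 2 * layer_norm + attn + mlp                # two layer norms per block
    lm_head = 0 if tied else vocab_size * n_embd       # tied head adds no parameters
    return tok_emb + pos_emb + n_layer * block + layer_norm + lm_head


total = gpt_param_count(vocab_size=50257, block_size=512, n_layer=12, n_embd=768)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

The number of attention heads does not appear in the count, because the heads only partition the 768-dimensional projections. With an untied output head the same layout would add another 50,257 x 768 weights.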

Training Details

Hardware Used: GTX 1650 (4GB VRAM)
Training Time: ~4 hours
Batch Size: 4 (16 with gradient accumulation)
Learning Rate: 3e-4 with cosine decay
Weight Decay: 0.1
Optimizer: AdamW
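
A minimal sketch of the learning-rate schedule and effective batch size listed above. It assumes plain cosine decay from 3e-4 down to zero with no warmup; the card does not state a minimum learning rate, a warmup period, or the total number of steps, so `min_lr` and `TOTAL_STEPS` are placeholders.

```python
import math


def cosine_lr(step: int, total_steps: int, max_lr: float = 3e-4, min_lr: float = 0.0) -> float:
    """Cosine decay from max_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


MICRO_BATCH = 4      # sequences per forward/backward pass
ACCUM_STEPS = 4      # micro-batches accumulated per optimizer step
EFFECTIVE_BATCH = MICRO_BATCH * ACCUM_STEPS  # 16 sequences per optimizer step

TOTAL_STEPS = 1000   # hypothetical; the card does not state the step count
for step in (0, 250, 500, 750, 1000):
    print(step, f"{cosine_lr(step, TOTAL_STEPS):.2e}")
```

In a PyTorch training loop the same curve is available through `torch.optim.lr_scheduler.CosineAnnealingLR`; the hand-written version is shown only to make the formula explicit.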

Memory Optimizations

Length-based batch scheduling
Gradient accumulation (4 steps)
Dynamic batch scheduling
Pre-padded sequences
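
To illustrate why length-based batching saves memory, the sketch below groups sequences of similar length so that each batch is padded only to its own longest member. It is one plausible reading of "length-based batch scheduling", not the author's implementation; the helper names, the sample lengths and the choice of padding id are assumptions.

```python
from typing import List


def length_bucketed_batches(seqs: List[List[int]], batch_size: int, pad_id: int) -> List[List[List[int]]]:
    """Sort sequences by length, then pad each batch only to its own longest member."""
    ordered = sorted(seqs, key=len)
    batches = []
    for i in range(0, len(ordered), batch_size):
        chunk = ordered[i:i + batch_size]
        width = len(chunk[-1])  # chunk is sorted, so the last item is the longest
        batches.append([s + [pad_id] * (width - len(s)) for s in chunk])
    return batches


def padded_tokens(batches: List[List[List[int]]]) -> int:
    """Total number of token slots (real + padding) across all batches."""
    return sum(len(row) for batch in batches for row in batch)


PAD_ID = 50256  # GPT-2's end-of-text id, reused as padding here (an assumption)
lengths = [5, 120, 8, 512, 130, 7, 500, 6]   # made-up sequence lengths
seqs = [[1] * n for n in lengths]

bucketed = length_bucketed_batches(seqs, batch_size=4, pad_id=PAD_ID)
naive = len(seqs) * max(lengths)  # every sequence padded to the global maximum
print("naive:", naive, "bucketed:", padded_tokens(bucketed))
```

In this toy case the short sequences land in one batch of width 8 instead of 512, roughly halving the number of token slots. A real training loop would typically shuffle the order of the resulting batches so the model does not see lengths in sorted order.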

Usage

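from transformers import AutoTokenizer
# model.py is the local module that defines the model (not part of transformers)
from model import SmallLanguageModel, ModelConfig

# Initialize the model with the architecture described above
config = ModelConfig(
    vocab_size=50257,   # GPT-2 tokenizer vocabulary
    block_size=512,     # context length in tokens
    n_layer=12,
    n_head=12,
    n_embd=768,
    dropout=0.1,
    bias=True
)
# Note: this creates randomly initialized weights; load a trained checkpoint before generating
model = SmallLanguageModel(config)
model.eval()  # disable dropout (assumes SmallLanguageModel is a torch.nn.Module)

# Generate text
tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_text = "Once upon a time"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_length=100)
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)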

Limitations

Limited context window (512 tokens)
Smaller capacity compared to larger models
Training data limited to WikiText-2

License

This model is released under the MIT License.



Evaluation Results

Perplexity on WikiText-2 (self-reported): to be updated
Loss on WikiText-2 (self-reported): to be updated
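
The two metrics above are directly related: perplexity is the exponential of the mean per-token cross-entropy loss (in nats). The sketch below shows the conversion; the loss values are made up for illustration and are not results from this model.

```python
import math


def perplexity(token_losses):
    """Perplexity = exp(mean per-token cross-entropy, measured in nats)."""
    mean_loss = sum(token_losses) / len(token_losses)
    return math.exp(mean_loss)


# Made-up per-token losses for illustration only; not results from this model.
example_losses = [2.0, 3.0, 4.0]
print(f"mean loss = 3.0 -> perplexity = {perplexity(example_losses):.2f}")

# Reference point: a model that guesses uniformly over the 50,257-token
# vocabulary has loss ln(50257) and therefore perplexity 50257.
uniform_loss = math.log(50257)
print(f"uniform baseline perplexity = {perplexity([uniform_loss]):.0f}")
```

The uniform baseline is a useful upper reference when reading a reported perplexity: any trained model should score far below the vocabulary size.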