DiscoverLM-70M

A 69M-parameter causal language model built on the Mixture-of-Attentions (MoA) architecture: distance-based metric attention whose scores come from a proper distance function, with the triangle inequality enforced by an explicit regularizer during training rather than left to chance.

Every attention head operates in a proper metric space. The geometry is enforced, not hoped for.

What Makes This Different

Standard transformers compute attention as a dot product: QΒ·Kα΅€. This has no geometric meaning β€” it's a bilinear form, not a distance. Two tokens can be "close" by dot product while violating basic metric properties.

MoA replaces this with negative squared distance under a learned diagonal Mahalanobis metric, then enforces the triangle inequality through a regularizer over random triples sampled during training. The result: attention weights reflect actual geometric proximity in a space where d(a,c) ≀ d(a,b) + d(b,c) holds.

This isn't a constraint that fights the model. It's structure the model uses.
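A minimal sketch of how such a triangle-inequality regularizer can be implemented. The function name, shapes, and the use of a single shared point set are illustrative assumptions, not the released code; the model samples triples per head in its own Mahalanobis space:

```python
import torch

def ti_penalty(x, weights, num_samples=64):
    """Hinge penalty on the triangle inequality over random triples.

    x: (N, D) points in a head's space; weights: (D,) diagonal
    Mahalanobis scaling. Penalizes max(0, d(a,c) - d(a,b) - d(b,c)),
    which is zero exactly when a sampled triple satisfies the
    triangle inequality.
    """
    idx = torch.randint(0, x.size(0), (num_samples, 3))
    a, b, c = x[idx[:, 0]], x[idx[:, 1]], x[idx[:, 2]]

    def dist(u, v):
        # squared diagonal-Mahalanobis distance, matching the attention score
        return (weights * (u - v) ** 2).sum(-1)

    return torch.clamp(dist(a, c) - dist(a, b) - dist(b, c), min=0.0).mean()
```

During training a term like `0.01 * ti_penalty(...)` with 64 samples per batch would be added to the loss, matching the λ=0.01 / 64-samples setting reported in the hyperparameters.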

Architecture

Input β†’ Token Embedding (48K vocab, custom tokenizer)
  β”‚
  β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              MoA Block Γ— 4                       β”‚
β”‚                                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚  Local   β”‚ β”‚  Global  β”‚ β”‚Channel β”‚ β”‚  MQA   β”‚ β”‚
β”‚  β”‚  Conv    β”‚ β”‚  Metric  β”‚ β”‚  Mix   β”‚ β”‚ Metric β”‚ β”‚
β”‚  β”‚         β”‚ β”‚ (64 heads)β”‚ β”‚        β”‚ β”‚(64 Q)  β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β”‚
β”‚       β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β”‚              β–Ό                                   β”‚
β”‚     Feature Gates + Token Router (top-2)         β”‚
β”‚              β–Ό                                   β”‚
β”‚        Residual + DropPath                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                       β–Ό
         HyperFFN (SwiGLU + CausalConv + LowRank)
                       β–Ό
                   LayerNorm
                       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚            MoA Language Model Head               β”‚
β”‚  (same 4-path mixture β†’ SwiGLU β†’ tied vocab)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                       β–Ό
                 Logits (48,000)

Core Components

Metric Attention. Queries attend to keys via learned Mahalanobis distance. Each of 64 heads has an 8-dimensional head space with its own diagonal scaling, learnable ball origin, and adaptive radius for sparse pruning. Pairs outside the ball are masked before softmax.
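A simplified single-head sketch of this scoring rule, omitting the causal mask for brevity. Tensor shapes and parameter names are assumptions for illustration, not the released implementation:

```python
import torch

def metric_attention(q, k, v, log_scale, origin, log_radius):
    """Single-head metric attention: negative squared diagonal-Mahalanobis
    distance as the score, with ball pruning before softmax.

    q: (T, d) queries, k/v: (S, d) keys/values, log_scale: (d,) log of
    the diagonal metric, origin: (d,) ball center, log_radius: scalar.
    """
    w = torch.exp(log_scale)                       # positive diagonal metric
    diff = q.unsqueeze(1) - k.unsqueeze(0)         # (T, S, d)
    scores = -(w * diff * diff).sum(-1)            # closer => larger score
    # ball pruning: mask keys that fall outside the learned ball
    k_dist2 = (w * (k - origin) ** 2).sum(-1)      # (S,)
    outside = k_dist2 > torch.exp(log_radius) ** 2
    scores = scores.masked_fill(outside.unsqueeze(0), float("-inf"))
    return torch.softmax(scores, dim=-1) @ v       # (T, d)
```

Because both the scale and the radius are stored in log space, the metric weights and the pruning radius stay strictly positive under gradient descent.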

Mixture-of-Attentions Routing. Four parallel paths per token β€” local depthwise convolution, full multi-head metric attention, gated channel mixing, and multi-query metric attention. A learned router selects top-2 paths per token position. Feature gates scale each path's output before mixing.
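The routing step can be sketched as follows; the interface (stacked path outputs, a per-path gate vector) is an assumption chosen to keep the example self-contained:

```python
import torch
import torch.nn.functional as F

def route_top2(router_logits, path_outputs, gates):
    """Top-2 routing over four attention paths with feature gates.

    router_logits: (T, 4) from a learned linear router.
    path_outputs: (4, T, d) stacked outputs of the four paths.
    gates: (4, d) per-path feature gates applied before mixing.
    """
    gated = path_outputs * gates.unsqueeze(1)      # scale each path's output
    probs = F.softmax(router_logits, dim=-1)       # (T, 4)
    top_p, top_i = probs.topk(2, dim=-1)           # (T, 2)
    top_p = top_p / top_p.sum(-1, keepdim=True)    # renormalize over top-2
    T = router_logits.size(0)
    out = torch.zeros_like(gated[0])
    for s in range(2):
        # gather each token's s-th chosen path and add its weighted output
        out = out + top_p[:, s:s + 1] * gated[top_i[:, s], torch.arange(T)]
    return out                                     # (T, d)
```

Only the two selected paths contribute to each token's output, so the other two can be skipped at inference time for that position.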

BlackHoleRoPE. Rotary position encoding with learned phase perturbations from a compact Fourier basis. Q/K rotations stay unitary. V amplitudes receive bounded energy gating, clamped to [0.5, 2.0], with optional discrepancy-state modulation.
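The bounded V gating can be sketched like this. Here `gate_logits` stands in for the output of the compact Fourier-basis network (an assumption for illustration); the point is only the bounding map into [0.5, 2.0]:

```python
import torch

def bounded_v_gate(v, gate_logits, lo=0.5, hi=2.0):
    """Bounded amplitude modulation on V, as described for BlackHoleRoPE.

    A sigmoid maps the raw gate signal into (lo, hi) = (0.5, 2.0), so
    value energy can neither be annihilated nor amplified without bound.
    """
    gate = lo + (hi - lo) * torch.sigmoid(gate_logits)
    return v * gate
```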

HyperFFN. Three-branch feedforward: SwiGLU channel MLP, causal depthwise separable convolution, and gated low-rank bottleneck β€” routed per-token with top-2 sparse selection.
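A self-contained sketch of this three-branch layout with the same top-2 routing pattern. Module names, the rank of the bottleneck, and the kernel size are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU channel MLP branch: silu(x W1) * (x W2) -> W3."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden)
        self.w2 = nn.Linear(dim, hidden)
        self.w3 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

class HyperFFNSketch(nn.Module):
    """Three-branch FFN with per-token top-2 routing (illustrative)."""
    def __init__(self, dim=512, hidden=1536, rank=64, kernel=3):
        super().__init__()
        self.swiglu = SwiGLU(dim, hidden)
        # causal depthwise separable conv: depthwise then pointwise
        self.dw = nn.Conv1d(dim, dim, kernel, groups=dim)
        self.pw = nn.Conv1d(dim, dim, 1)
        self.kernel = kernel
        # gated low-rank bottleneck
        self.down = nn.Linear(dim, rank)
        self.up = nn.Linear(rank, dim)
        self.gate = nn.Linear(dim, dim)
        self.router = nn.Linear(dim, 3)

    def forward(self, x):                           # x: (T, d), one sequence
        b1 = self.swiglu(x)
        xc = F.pad(x.t().unsqueeze(0), (self.kernel - 1, 0))  # left pad => causal
        b2 = self.pw(self.dw(xc)).squeeze(0).t()
        b3 = torch.sigmoid(self.gate(x)) * self.up(self.down(x))
        branches = torch.stack([b1, b2, b3])        # (3, T, d)
        probs = F.softmax(self.router(x), dim=-1)   # (T, 3)
        top_p, top_i = probs.topk(2, dim=-1)
        top_p = top_p / top_p.sum(-1, keepdim=True)
        out = torch.zeros_like(x)
        for s in range(2):
            out = out + top_p[:, s:s + 1] * branches[top_i[:, s], torch.arange(x.size(0))]
        return out
```

Left-padding the convolution input by `kernel - 1` positions keeps the branch causal: each output position sees only current and past tokens.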

MoA LM Head. The vocabulary projection runs its own mixture-of-attentions (32 heads, head_dim=16) before projecting to logits through a SwiGLU transform. Weight-tied to the input embedding.

Parameter Budget

| Component | Parameters | % |
|---|---|---|
| Token embedding (tied) | 24.6M | 35.5% |
| MoA blocks × 4 | 28.9M | 41.8% |
| HyperFFN (shared) | 4.2M | 6.1% |
| MoA LM head | 10.8M | 15.6% |
| RoPE + norms | 0.6M | 0.9% |
| Total | 69.1M | |

vs Standard Transformers

| | Transformer | MoA |
|---|---|---|
| Attention scoring | Dot product (Q·Kᵀ) | Negative Mahalanobis distance |
| Geometric guarantee | None | Triangle inequality regularized |
| Position encoding | RoPE | BlackHoleRoPE (learned phase + bounded V energy) |
| Attention sparsity | Causal mask only | Ball pruning + top-k routing |
| Head combination | Concatenation | Per-token routed mixture of 4 path types |
| FFN | Single MLP | 3-branch routed (SwiGLU + CausalConv + LowRank) |
| LM head | Linear projection | Full MoA mixture → SwiGLU → tied projection |

Training

Data

| Dataset | Domain |
|---|---|
| Opus-4.6-Reasoning-3000x-filtered | Multi-step reasoning |
| UltraData-Math | Mathematical problem solving |
| alpaca-cleaned | General instruction following |

Hyperparameters

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 3e-4 → 0 (cosine) |
| Batch size | 4 |
| Max sequence length | 1,024 |
| Steps | 512 (64 per epoch) |
| Epochs | 8 |
| Unique tokens | 262,144 |
| Precision | fp32 |
| Hardware | NVIDIA H100 (Colab) |
| TI regularization | λ=0.01, 64 samples/batch |
| Router top-k | 2 of 4 paths |

Results

| Epoch | Avg Loss | Min Loss | σ | Token Accuracy |
|---|---|---|---|---|
| 1 | 2.887 | 2.285 | 0.291 | 59.2% |
| 2 | 2.324 | 1.651 | 0.259 | 63.4% |
| 3 | 1.931 | 1.232 | 0.211 | 68.4% |
| 4 | 1.616 | 1.012 | 0.201 | 74.4% |
| 5 | 1.432 | 0.954 | 0.169 | 77.0% |
| 6 | 1.211 | 0.677 | 0.180 | 79.0% |
| 7 | 1.075 | 0.599 | 0.151 | 80.1% |
| 8 | 1.014 | 0.718 | 0.142 | 80.8% |

Best single step: 393 β€” loss 0.599, token accuracy 88.4%

Loss variance halved across training (σ: 0.291 → 0.142), suggesting the mixture-of-attentions router settled into stable path preferences as training progressed.

Configuration

{
  "dim": 512,
  "num_layers": 4,
  "attn_heads": 64,
  "mqa_q_heads": 64,
  "lm_attn_heads": 32,
  "lm_mqa_q_heads": 32,
  "metric": "maha_diag",
  "vocab_size": 48000,
  "max_position_embeddings": 1024,
  "ffn_hidden": 1536,
  "mixer_hidden": 768,
  "n_branches": 3,
  "router_topk": 2,
  "use_balls": true,
  "radius_init": 3.5,
  "ti_reg_weight": 0.01,
  "ti_reg_samples": 64,
  "energy_amplification": 9.87,
  "theta_base": 10000.0,
  "tie_word_embeddings": true
}

Tokenizer

Custom 48K vocabulary tokenizer with structured generation tokens built in:

{
  "backend": "tokenizers",
  "model_max_length": 2048,
  "bos_token": "<|bos|>",
  "eos_token": "<|eos|>",
  "pad_token": "<|pad|>",
  "unk_token": "<|unk|>",
  "extra_special_tokens": [
    "<|system|>", "<|user|>", "<|assistant|>",
    "<|think|>", "<|reasoning|>"
  ]
}

Usage

from transformers import AutoTokenizer
from MoA import MoAMetricLM, MoAMetricConfig

tokenizer = AutoTokenizer.from_pretrained("reaperdoesntknow/DiscoverLM-70M")
model = MoAMetricLM.from_pretrained("reaperdoesntknow/DiscoverLM-70M")

inputs = tokenizer("The triangle inequality guarantees that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Chat Format

The tokenizer includes built-in special tokens for structured generation:

| Token | Role |
|---|---|
| `<\|system\|>` | System prompt boundary |
| `<\|user\|>` | User turn boundary |
| `<\|assistant\|>` | Assistant turn boundary |
| `<\|think\|>` | Internal reasoning start |
| `<\|reasoning\|>` | Reasoning chain marker |
| `<\|bos\|>` | Beginning of sequence |
| `<\|eos\|>` | End of sequence |
| `<\|pad\|>` | Padding |
# Chat-style prompting
prompt = "<|system|>You are DiscoverLM, a small language model with metric attention.<|user|>What is the triangle inequality?<|assistant|><|think|><|reasoning|>"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)

Mathematical Foundation

The metric attention mechanism is grounded in the Discrepancy Calculus (DISC), a measure-theoretic framework for singularity analysis developed by the author. The triangle inequality regularizer enforces that the learned attention geometry satisfies d(a,c) ≀ d(a,b) + d(b,c) across sampled triples, ensuring the distance function used for attention scoring is a proper metric β€” not merely a similarity function.
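In symbols, with per-head diagonal weights w_h, the pieces above are (a restatement of the text, not a new claim):

```latex
% Diagonal Mahalanobis distance in head h's space
d_h(q, k)^2 = \sum_{i=1}^{d} w_{h,i}\,(q_i - k_i)^2, \qquad w_{h,i} > 0

% Attention score and triangle-inequality regularizer (\lambda = 0.01)
\mathrm{score}_h(q, k) = -\,d_h(q, k)^2, \qquad
\mathcal{L}_{\mathrm{TI}} = \lambda\,\mathbb{E}_{(a,b,c)}
  \big[\max\big(0,\; d(a,c) - d(a,b) - d(b,c)\big)\big]
```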

The ball pruning mechanism (learnable per-head origins and radii) creates adaptive sparse attention patterns that emerge from the geometry itself rather than from fixed masking heuristics.

BlackHoleRoPE extends standard rotary position encoding with learned phase perturbations synthesized from a Fourier basis, maintaining the unitary property on Q/K while adding bounded amplitude modulation on V β€” ensuring position-dependent energy gating stays within Lyapunov-stable bounds.

Lineage

This architecture derives from research in metric-native neural computation:

  • DISC β€” Discrepancy Calculus: measure-theoretic singularity analysis (Colca, 2025)
  • MoA β€” Mixture-of-Attentions with triangle inequality enforcement
  • BlackHoleRoPE β€” Learned rotary position encoding with bounded energy gating

Limitations

  • Trained on 262K tokens β€” the architecture works, but this is a proof-of-concept scale. Generalization to unseen distributions is not yet validated.
  • No eval split was used; training metrics only.
  • 8 epochs over 64 batches means the model has seen each example multiple times. Overfitting is likely at this data scale.
  • fp32 training only β€” bf16/fp16 behavior untested.

Citation

@misc{CILLC2026discoverLM,
  author = {Convergent Intelligence LLC: Research Division},
  title = {DiscoverLM-70M: Metric-Attention Mixture of Attentions with Triangle Inequality Enforcement},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/reaperdoesntknow/DiscoverLM-70M}
}

Author

Roy Colca Jr. β€” Convergent Intelligence LLC

HuggingFace: reaperdoesntknow

