---
language: en
license: apache-2.0
tags:
- causal-lm
- research
- fp8
- attention
- normalization
- neollm
datasets:
- HuggingFaceFW/fineweb-edu
---
# NeoLLM
NeoLLM is a **113 M parameter** decoder-only language model trained from scratch on
[FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) in **FP8**
precision, completing training in approximately **1 hour 18 minutes** on a single NVIDIA RTX 5090.
It integrates a collection of recently published attention and normalization techniques
into a single architecture, with the goal of studying how they interact during
pretraining. The model is actively being developed and the current checkpoint represents
an intermediate training state.
> **Author / contact:** [@Kyokopom](https://x.com/Kyokopom) on X
> **Repository:** [KitsuVp/NeoLLM](https://huggingface.co/KitsuVp/NeoLLM)
---
## Architecture
NeoLLM is a decoder-only transformer with the following configuration:
| Parameter | Value |
|---|---|
| Hidden size | 512 |
| Layers | 12 |
| Attention heads | 8 |
| KV heads (GQA) | 4 |
| Head dim | 64 |
| Intermediate size | 1536 |
| Vocabulary | Qwen3 tokenizer (64,402 tokens) |
| Context length | 512 tokens |
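The table values imply a few derived shapes that are worth checking. The sketch below restates the configuration as a Python dataclass and derives the query width, key/value width, and GQA group size. The class name and field names are hypothetical and are not taken from the NeoLLM code; only the numbers come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeoLLMConfigSketch:
    # Hypothetical container; values are copied from the configuration table.
    hidden_size: int = 512
    num_layers: int = 12
    num_attention_heads: int = 8
    num_kv_heads: int = 4
    head_dim: int = 64
    intermediate_size: int = 1536
    vocab_size: int = 64_402
    context_length: int = 512


cfg = NeoLLMConfigSketch()

# Width of the concatenated query heads; matches the hidden size here.
q_width = cfg.num_attention_heads * cfg.head_dim
# Width of the concatenated key (or value) heads under GQA.
kv_width = cfg.num_kv_heads * cfg.head_dim
# Number of query heads that share one key/value head.
queries_per_kv = cfg.num_attention_heads // cfg.num_kv_heads
# Ratio of the MLP intermediate size to the hidden size.
mlp_ratio = cfg.intermediate_size / cfg.hidden_size
```

With these values, each KV head serves two query heads, and the K and V projections are half as wide as the Q projection.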
### Parameter breakdown
| Parameter bucket | Count |
|---|---|
| **Total parameters** | 113.07M (113,070,456) |
| **Embedding parameters** (tied) | 32.97M (32,973,824) |
| **Non-embedding parameters** | 80.10M (80,096,632) |
| **Effective trainable parameters** | 113.07M (113,070,456) |
> Weight tying is **enabled**: the input embedding matrix and the language-model head
> share the same parameters, so the shared matrix is counted only once. All 113.07M
> parameters are trainable, of which `total − embed = 80.10M` are non-embedding parameters.
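The bucket counts can be cross-checked from the vocabulary size and hidden size given in the architecture table. The sketch below is plain arithmetic. The `untied_total` figure is a hypothetical illustration of what the count would be with a separate language-model head; it is not a number reported for NeoLLM.

```python
vocab_size = 64_402
hidden_size = 512
total_params = 113_070_456  # from the parameter breakdown table

# One (vocab x hidden) matrix, shared by the input embedding and the LM head.
embedding_params = vocab_size * hidden_size

# Everything that is not the shared embedding matrix.
non_embedding_params = total_params - embedding_params

# Hypothetical: the count if the LM head had its own (vocab x hidden) matrix.
untied_total = total_params + embedding_params

print(f"embedding:     {embedding_params:,}")
print(f"non-embedding: {non_embedding_params:,}")
print(f"untied total:  {untied_total:,}")
```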
### Integrated techniques
Each layer combines the following mechanisms simultaneously.
**Normalization and residual stream**
- **SeeDNorm** ([arXiv:2510.22777](https://arxiv.org/abs/2510.22777)) — Applied to Q and K
projections. Dynamically rescales the normalization based on the input's own statistics,
making the attention geometry more stable across varying input distributions.
- **PolyNorm** ([arXiv:2602.04902](https://arxiv.org/abs/2602.04902)) — Replaces the standard
MLP activation with three branches, linear (x), quadratic (x²), and cubic (x³), each
normalized and combined with learned weights. This lets the MLP mix first-, second-, and
third-order terms of its input in a single activation.
- **GPAS** ([arXiv:2506.22049](https://arxiv.org/abs/2506.22049)) — Gradient-Preserving
Activation Scaling. Applied to residual connections between sublayers; scales down the
activations in the forward pass while leaving their gradients unchanged.
- **LayerNorm Scaling / LNS** ([arXiv:2502.05795](https://arxiv.org/abs/2502.05795)) — Each
layer's output is scaled by 1/√ℓ, where ℓ is the layer index. This directly addresses the
"Curse of Depth" in Pre-LN transformers.
**Attention mechanisms**
- **FAN** ([arXiv:2502.21309](https://arxiv.org/abs/2502.21309)) — Fourier Analysis Networks.
A portion of the input projection channels is dedicated to representing periodic patterns
(cosine/sine pairs), while the remainder handles standard linear content.
- **MEA** ([arXiv:2601.19611](https://arxiv.org/abs/2601.19611)) — Explicit Multi-head
Attention. Adds small learnable interaction matrices between attention heads for K and V.
- **LUCID** ([arXiv:2602.10410](https://arxiv.org/abs/2602.10410)) — Applies a learned
lower-triangular preconditioner to V before attention, decorrelating value representations
across positions.
- **Affine-Scaled Attention** ([arXiv:2602.23057](https://arxiv.org/abs/2602.23057)) — Adds
two learnable per-head scalars (α and β) to the softmax weights:
`[α·softmax(QKᵀ) + β]·V`.
- **XSA** ([arXiv:2603.09078](https://arxiv.org/abs/2603.09078)) — Exclusive Self Attention.
After computing attention, removes the component of the output aligned with the token's
own value vector.
- **Directional Routing** ([arXiv:2603.14923](https://arxiv.org/abs/2603.14923)) — Each head
learns K=4 directions in the output space; a learned router suppresses the attention output
along each direction per input.
- **Gated Attention** ([arXiv:2505.06708](https://arxiv.org/abs/2505.06708)) — A sigmoid gate
is applied to the attention output before the output projection, introducing non-linearity
and preventing attention sinks.
- **Momentum Attention** ([arXiv:2411.03884](https://arxiv.org/abs/2411.03884)) — Modifies Q
and K by subtracting a fraction of the previous position's Q and K values (causal
first-difference).
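Three of the attention mechanisms above are simple enough to sketch for a single head in NumPy: the causal first-difference used by Momentum Attention, the affine rescaling of the softmax weights, and the XSA projection step. This is a minimal sketch under stated assumptions, not the NeoLLM implementation. The function names are hypothetical, the `1/sqrt(d)` score scaling is assumed (the formula in the list omits it), `beta` is assumed to apply only to causally visible positions, and the order in which NeoLLM composes these steps is not specified here.

```python
import numpy as np


def causal_first_difference(x, gamma):
    # x: (seq, dim). Subtract gamma times the previous position's vector.
    # Position 0 has no predecessor, so it is left unchanged.
    prev = np.zeros_like(x)
    prev[1:] = x[:-1]
    return x - gamma * prev


def affine_scaled_attention(q, k, v, alpha, beta):
    # Computes [alpha * softmax(QK^T) + beta] V for one head with a causal mask.
    seq, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)  # assumed scaling
    mask = np.tril(np.ones((seq, seq), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    # Assumption: beta is added only where the causal mask allows attention.
    weights = (alpha * probs + beta) * mask
    return weights @ v


def exclusive_projection(y, v, eps=1e-8):
    # Remove from each output y_i the component aligned with the token's own v_i.
    coeff = np.sum(y * v, axis=-1, keepdims=True)
    coeff = coeff / (np.sum(v * v, axis=-1, keepdims=True) + eps)
    return y - coeff * v


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5, 4)) for _ in range(3))

q_m = causal_first_difference(q, gamma=0.1)
k_m = causal_first_difference(k, gamma=0.1)
attended = affine_scaled_attention(q_m, k_m, v, alpha=1.0, beta=0.0)
exclusive = exclusive_projection(attended, v)
```

With `alpha=1.0` and `beta=0.0` the affine step reduces to ordinary causal softmax attention, which makes it easy to compare against a baseline.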
**MLP**
- **Learnable Multipliers** ([arXiv:2601.04890](https://arxiv.org/abs/2601.04890)) — Adds
per-row and per-column learnable scalar parameters to each linear layer.
- **SimpleGPT** ([arXiv:2602.01212](https://arxiv.org/abs/2602.01212)) — A normalization
strategy derived from second-order geometry analysis, applied inside MLP projections to
improve optimization stability.
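Three of the simpler mechanisms in this section, PolyNorm, LayerNorm Scaling, and Learnable Multipliers, can also be sketched in NumPy. The following is an illustration under assumptions rather than the NeoLLM code: the function names are hypothetical, RMS normalization is assumed for the PolyNorm branches, the layer index is assumed to be 1-based, and the multipliers are assumed to act as `diag(row_scale) · W · diag(col_scale)`.

```python
import numpy as np


def rms_normalize(x, eps=1e-6):
    # Divide each row by its root-mean-square (assumed normalization form).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


def poly_norm(x, weights):
    # Weighted sum of the normalized linear, quadratic, and cubic branches.
    # `weights` stands in for the three learned coefficients.
    w1, w2, w3 = weights
    return (w1 * rms_normalize(x)
            + w2 * rms_normalize(x ** 2)
            + w3 * rms_normalize(x ** 3))


def lns_factor(layer_index):
    # 1/sqrt(l) for a 1-based layer index (the indexing base is an assumption).
    if layer_index < 1:
        raise ValueError("layer_index is assumed to start at 1")
    return 1.0 / np.sqrt(layer_index)


def linear_with_multipliers(x, weight, row_scale, col_scale):
    # weight: (out, in); row_scale: (out,); col_scale: (in,).
    # Same result as x @ (diag(row_scale) @ weight @ diag(col_scale)).T
    return ((x * col_scale) @ weight.T) * row_scale


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))

activated = poly_norm(x, weights=(0.5, 0.3, 0.2))
factors = [lns_factor(l) for l in range(1, 13)]  # 12 layers, as in the config

weight = rng.standard_normal((6, 8))
row_scale = rng.standard_normal(6)
col_scale = rng.standard_normal(8)
projected = linear_with_multipliers(x, weight, row_scale, col_scale)
```

Applying the column scale to the input and the row scale to the output avoids materializing the rescaled weight matrix, which is one reasonable way to keep the extra parameters cheap.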
---
## Training
| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu (sample-10BT) |
| Tokens seen | ~0.51B (15,625 steps × batch 64 × length 512) |
| Precision | FP8 native (E4M3 weights/activations, E5M2 gradients) + BF16 fallback |
| Optimizer | Conda (Column-Normalized Adam) |
| Learning rate | 6e-04 with linear warmup (10 % of steps) |
| Weight decay | 0.1 |
| Training time | ~1h 18m |
| Hardware | NVIDIA RTX 5090 (single GPU) |
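The token count and warmup length follow directly from the settings in the table. The sketch below reproduces that arithmetic and shows one possible linear warmup. Only the warmup phase is specified in the table, so holding the rate at its peak afterwards, rounding the warmup length down, and the function name `warmup_lr` are all assumptions made for illustration.

```python
steps = 15_625
batch_size = 64
seq_len = 512
peak_lr = 6e-4
warmup_fraction = 0.10

# Tokens processed over the whole run.
tokens_seen = steps * batch_size * seq_len

# Number of warmup steps; rounding down is an assumption.
warmup_steps = int(steps * warmup_fraction)


def warmup_lr(step):
    # Linear ramp from near zero to peak_lr over warmup_steps (0-based step).
    # After warmup the rate is held at peak_lr, which is an assumption:
    # the table does not state the post-warmup schedule.
    return peak_lr * min(1.0, (step + 1) / warmup_steps)


print(f"tokens seen:  {tokens_seen:,}")
print(f"warmup steps: {warmup_steps:,}")
```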
### Training curve
| Step | Train Loss | Val Loss |
|---|---|---|
| 5,000 | 3.995 | 3.957 |
| 10,000 | 3.725 | 3.699 |
| 15,000 | 3.539 | 3.501 |
| 15,625 | — | 3.488 |
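If the reported losses are mean per-token cross-entropies in nats (an assumption, since the unit is not stated), they convert to perplexity as `exp(loss)`. The sketch below applies that conversion to the validation column; the variable and function names are hypothetical.

```python
import math

# (step, train_loss, val_loss) copied from the training-curve table.
# None marks the entry that the table leaves blank.
curve = [
    (5_000, 3.995, 3.957),
    (10_000, 3.725, 3.699),
    (15_000, 3.539, 3.501),
    (15_625, None, 3.488),
]


def perplexity(loss):
    # Valid only if the loss is a natural-log cross-entropy per token.
    return math.exp(loss)


val_perplexity = {step: perplexity(val) for step, _, val in curve}

for step, ppl in val_perplexity.items():
    print(f"step {step:>6,}: val perplexity ~ {ppl:.1f}")
```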
---
## Limitations
- **Token budget** — ~0.51 B tokens seen; below the estimated optimum. Performance on
knowledge-intensive tasks is expected to improve with more training.
- **Gradient spike at step 40k** — Reorganized the attention pattern in layer 9 that
previously captured long-range token correlations. A checkpoint from ~step 38k is expected
to have better aggregate benchmark scores.
- **PolyNorm exclusivity** — The quadratic branch has become partially redundant with the
linear branch. Will be corrected in the next training run.
- **Base model only** — Not instruction-tuned or aligned; purely a next-token-prediction
base model.
---
## References
All papers whose techniques are integrated into NeoLLM's architecture:
| Technique | Paper title | arXiv |
|---|---|---|
| SeeDNorm | Self-Rescaled Dynamic Normalization | [2510.22777](https://arxiv.org/abs/2510.22777) |
| MEA | Explicit Multi-head Attention | [2601.19611](https://arxiv.org/abs/2601.19611) |
| Learnable Multipliers | Freeing the Scale of Language Model Matrix Layers | [2601.04890](https://arxiv.org/abs/2601.04890) |
| Directional Routing | Directional Routing in Transformers | [2603.14923](https://arxiv.org/abs/2603.14923) |
| XSA | Exclusive Self Attention | [2603.09078](https://arxiv.org/abs/2603.09078) |
| Gated Attention | Gated Attention for LLMs | [2505.06708](https://arxiv.org/abs/2505.06708) |
| Affine-Scaled Attention | Affine-Scaled Attention | [2602.23057](https://arxiv.org/abs/2602.23057) |
| LNS | The Curse of Depth in LLMs | [2502.05795](https://arxiv.org/abs/2502.05795) |
| LUCID | Attention with Preconditioned Representations | [2602.10410](https://arxiv.org/abs/2602.10410) |
| FAN | Fourier Analysis Networks | [2502.21309](https://arxiv.org/abs/2502.21309) |
| SimpleGPT | SimpleGPT | [2602.01212](https://arxiv.org/abs/2602.01212) |
| GPAS | Gradient-Preserving Activation Scaling | [2506.22049](https://arxiv.org/abs/2506.22049) |
| PolyNorm | PolyNorm / PolyCom | [2602.04902](https://arxiv.org/abs/2602.04902) |
| Momentum Attention | Momentum Attention | [2411.03884](https://arxiv.org/abs/2411.03884) |
| TWEO (analysis ref.) | Transformers Without Extreme Outliers | [2511.23225](https://arxiv.org/abs/2511.23225) |
---
## Citation
```bibtex
@misc{neollm2026,
title = {NeoLLM: A Research Language Model Integrating Recent Attention and Normalization Techniques},
author = {KitsuVp},
year = {2026},
url = {https://huggingface.co/KitsuVp/NeoLLM}
}
```
---
## Author
[@Kyokopom](https://x.com/Kyokopom) on X
---
## License
Apache 2.0