| --- |
| language: en |
| license: apache-2.0 |
| tags: |
| - causal-lm |
| - research |
| - fp8 |
| - attention |
| - normalization |
| - neollm |
| datasets: |
| - HuggingFaceFW/fineweb-edu |
| --- |
| |
| # NeoLLM |
|
|
| NeoLLM is a **113 M parameter** decoder-only language model trained from scratch on |
| [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) in **FP8** |
| precision, completing training in approximately **1 hour 18 minutes** on a single NVIDIA RTX 5090. |
| It integrates a collection of recently published attention and normalization techniques |
| into a single architecture, with the goal of studying how they interact during |
| pretraining. The model is actively being developed and the current checkpoint represents |
| an intermediate training state. |
|
|
| > **Author / contact:** [@Kyokopom](https://x.com/Kyokopom) on X |
| > **Repository:** [KitsuVp/NeoLLM](https://huggingface.co/KitsuVp/NeoLLM) |
|
|
| --- |
|
|
| ## Architecture |
|
|
| NeoLLM is a decoder-only transformer with the following configuration: |
|
|
| | Parameter | Value | |
| |---|---| |
| | Hidden size | 512 | |
| | Layers | 12 | |
| | Attention heads | 8 | |
| | KV heads (GQA) | 4 | |
| | Head dim | 64 | |
| | Intermediate size | 1536 | |
| | Vocabulary | Qwen3 tokenizer (64,402 tokens) | |
| | Context length | 512 tokens | |
|
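The head and width figures in the table are mutually consistent, which the short sketch below checks. The `NeoLLMConfig` dataclass and its field names are hypothetical, chosen only for illustration, and the K/V width assumes the standard GQA layout (`kv_heads × head_dim`); neither is taken from the model's actual configuration file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeoLLMConfig:
    """Hypothetical container mirroring the configuration table."""
    hidden_size: int = 512
    num_layers: int = 12
    num_attention_heads: int = 8
    num_kv_heads: int = 4
    head_dim: int = 64
    intermediate_size: int = 1536
    vocab_size: int = 64_402
    max_position: int = 512

    @property
    def q_width(self) -> int:
        # Total width of the query projection across all heads.
        return self.num_attention_heads * self.head_dim

    @property
    def kv_width(self) -> int:
        # Width of each of the K and V projections, assuming standard GQA.
        return self.num_kv_heads * self.head_dim

    @property
    def queries_per_kv_head(self) -> int:
        # Number of query heads that share one K/V head.
        return self.num_attention_heads // self.num_kv_heads


cfg = NeoLLMConfig()
print(cfg.q_width, cfg.kv_width, cfg.queries_per_kv_head)
```

With 8 query heads of dimension 64, the query width equals the hidden size, and each K/V head is shared by two query heads.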
|
| ### Parameter breakdown |
|
|
| | Parameter bucket | Count | |
| |---|---| |
| | **Total parameters** | 113.07M (113,070,456) | |
| | **Embedding parameters** (tied) | 32.97M (32,973,824) | |
| | **Non-embedding parameters** | 80.10M (80,096,632) | |
| | **Effective trainable parameters** | 113.07M (113,070,456) | |
|
|
| > Weight tying is **enabled**: the input embedding matrix and the language-model head |
| > share the same parameters, so the shared matrix is counted once in the total. All |
| > 113.07M parameters are trainable, of which `total − embed = 80.10M` are non-embedding. |
|
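The counts in the breakdown can be reproduced from the vocabulary size and hidden size alone. The sketch below is plain arithmetic on the figures reported in this card; the `untied_total` value is a hypothetical comparison that assumes an untied LM head would be a separate bias-free `vocab × hidden` matrix.

```python
VOCAB_SIZE = 64_402
HIDDEN_SIZE = 512
TOTAL_PARAMS = 113_070_456  # as reported in the breakdown table

# The tied embedding / LM-head matrix is counted once.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
non_embedding_params = TOTAL_PARAMS - embedding_params

# Hypothetical: size of the model if the LM head were a separate matrix.
untied_total = TOTAL_PARAMS + embedding_params

print(f"embedding:     {embedding_params:,}")
print(f"non-embedding: {non_embedding_params:,}")
print(f"untied total:  {untied_total:,}")
```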
|
| ### Integrated techniques |
|
|
| Each layer combines the following mechanisms simultaneously. |
|
|
| **Normalization and residual stream** |
|
|
| - **SeeDNorm** ([arXiv:2510.22777](https://arxiv.org/abs/2510.22777)) — Applied to Q and K |
| projections. Dynamically rescales the normalization based on the input's own statistics, |
| making the attention geometry more stable across varying input distributions. |
| - **PolyNorm** ([arXiv:2602.04902](https://arxiv.org/abs/2602.04902)) — Replaces the standard |
| MLP activation with three branches, linear (x), quadratic (x²), and cubic (x³), each |
| normalized and combined with learned weights. This lets the MLP express both linear |
| and non-linear relationships simultaneously. |
| - **GPAS** ([arXiv:2506.22049](https://arxiv.org/abs/2506.22049)) — Gradient-Preserving |
| Activation Scaling. Applied to residual connections between sublayers; helps gradients |
| flow more cleanly during training without distorting the residual stream. |
| - **LayerNorm Scaling / LNS** ([arXiv:2502.05795](https://arxiv.org/abs/2502.05795)) — Each |
| layer's normalization output is scaled by 1/√ℓ, where ℓ is the 1-based layer index. This |
| targets the "Curse of Depth" in Pre-LN transformers, where deeper layers contribute little. |
|
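Two of the mechanisms in the list above are simple enough to sketch in a few lines of plain Python. This is a minimal illustration under stated assumptions, not NeoLLM's implementation: the function names are hypothetical, the branch normalization is assumed to be an RMS normalization without a learned gain, and the branch weights and bias are fixed constants here, whereas in the model they would be learned.

```python
import math


def rms_normalize(values, eps=1e-6):
    """Scale a vector to unit root-mean-square (no learned gain)."""
    mean_square = sum(v * v for v in values) / len(values)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale for v in values]


def polynorm(x, weights=(1 / 3, 1 / 3, 1 / 3), bias=0.0):
    """PolyNorm-style activation: weighted sum of normalized x, x^2, x^3."""
    out = [bias] * len(x)
    for power, w in enumerate(weights, start=1):
        # Raise element-wise to the branch power, then normalize the branch.
        branch = rms_normalize([v ** power for v in x])
        out = [o + w * b for o, b in zip(out, branch)]
    return out


def lns_scale(vector, layer_index):
    """LayerNorm Scaling factor: multiply by 1/sqrt(l), layers counted from 1."""
    return [v / math.sqrt(layer_index) for v in vector]


x = [0.5, -1.0, 2.0, 0.0]
print(polynorm(x))
print(lns_scale(rms_normalize(x), layer_index=4))
```

Normalizing each branch separately keeps the quadratic and cubic terms on the same scale as the linear one, so the mixing weights decide how much each branch contributes.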
|
| **Attention mechanisms** |
|
|
| - **FAN** ([arXiv:2502.21309](https://arxiv.org/abs/2502.21309)) — Fourier Analysis Networks. |
| A portion of the input projection channels are dedicated to representing periodic patterns |
| (cosine/sine pairs), while the remainder handle standard linear content. |
| - **MEA** ([arXiv:2601.19611](https://arxiv.org/abs/2601.19611)) — Explicit Multi-head |
| Attention. Adds small learnable interaction matrices between attention heads for K and V. |
| - **LUCID** ([arXiv:2602.10410](https://arxiv.org/abs/2602.10410)) — Applies a learned |
| lower-triangular preconditioner to V before attention, decorrelating value representations |
| across positions. |
| - **Affine-Scaled Attention** ([arXiv:2602.23057](https://arxiv.org/abs/2602.23057)) — Adds |
| two learnable per-head scalars (α and β) to the softmax weights: |
| `[α·softmax(QKᵀ) + β]·V`. |
| - **XSA** ([arXiv:2603.09078](https://arxiv.org/abs/2603.09078)) — Exclusive Self Attention. |
| After computing attention, removes the component of the output aligned with the token's |
| own value vector. |
| - **Directional Routing** ([arXiv:2603.14923](https://arxiv.org/abs/2603.14923)) — Each head |
| learns K=4 directions in the output space; a learned router suppresses the attention output |
| along each direction per input. |
| - **Gated Attention** ([arXiv:2505.06708](https://arxiv.org/abs/2505.06708)) — A sigmoid gate |
| is applied to the attention output before the output projection, introducing non-linearity |
| and mitigating attention sinks. |
| - **Momentum Attention** ([arXiv:2411.03884](https://arxiv.org/abs/2411.03884)) — Modifies Q |
| and K by subtracting a fraction of the previous position's Q and K values (causal |
| first-difference). |
|
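Three of the attention modifications listed above (the momentum first-difference on Q and K, affine-scaled softmax weights, and the exclusive self-attention projection) can be sketched for a single head in plain Python. Everything here is illustrative: the function names and the `gamma` parameter are hypothetical, the `1/√d` score scaling and the choice to add β only on causally visible positions are assumptions, and the way these steps are ordered and combined in NeoLLM itself is not specified in this card.

```python
import math


def first_difference(rows, gamma):
    """Momentum-style update: row_t - gamma * row_{t-1} (row_0 unchanged)."""
    out = [list(rows[0])]
    for prev, cur in zip(rows, rows[1:]):
        out.append([c - gamma * p for c, p in zip(cur, prev)])
    return out


def causal_softmax(scores_row, t):
    """Softmax over positions 0..t only; later positions get weight 0."""
    visible = scores_row[: t + 1]
    peak = max(visible)
    exps = [math.exp(s - peak) for s in visible]
    total = sum(exps)
    return [e / total for e in exps] + [0.0] * (len(scores_row) - t - 1)


def affine_scaled_attention(q, k, v, alpha=1.0, beta=0.0):
    """Single-head causal attention: [alpha * softmax(QK^T / sqrt(d)) + beta] V."""
    d = len(q[0])
    outputs = []
    for t, q_t in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_t, k_s)) / math.sqrt(d) for k_s in k]
        probs = causal_softmax(scores, t)
        # Assumption: beta is added only on visible (causal) positions.
        weights = [alpha * p + (beta if s <= t else 0.0) for s, p in enumerate(probs)]
        outputs.append(
            [sum(w * v_s[j] for w, v_s in zip(weights, v)) for j in range(len(v[0]))]
        )
    return outputs


def exclude_self(outputs, v, eps=1e-12):
    """XSA-style step: remove each output's projection onto its own value vector."""
    result = []
    for o_t, v_t in zip(outputs, v):
        coeff = sum(a * b for a, b in zip(o_t, v_t)) / (sum(b * b for b in v_t) + eps)
        result.append([a - coeff * b for a, b in zip(o_t, v_t)])
    return result


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

attended = affine_scaled_attention(
    first_difference(q, gamma=0.5), first_difference(k, gamma=0.5), v,
    alpha=0.9, beta=0.05,
)
out = exclude_self(attended, v)
print(out)
```

After `exclude_self`, every output row is orthogonal to that token's own value vector, which is the property the XSA bullet describes.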
|
| **MLP** |
|
|
| - **Learnable Multipliers** ([arXiv:2601.04890](https://arxiv.org/abs/2601.04890)) — Adds |
| per-row and per-column learnable scalar parameters to each linear layer. |
| - **SimpleGPT** ([arXiv:2602.01212](https://arxiv.org/abs/2602.01212)) — A normalization |
| strategy derived from second-order geometry analysis, applied inside MLP projections to |
| improve optimization stability. |
|
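The per-row and per-column scalars can be read as rescaling a weight matrix from both sides. The sketch below assumes the multipliers act multiplicatively, as `diag(row_scale) · W · diag(col_scale)`; that exact form and the function name `scaled_linear` are assumptions for illustration, not details confirmed by this card.

```python
def scaled_linear(weight, row_scale, col_scale, x):
    """Compute y = diag(row_scale) @ weight @ diag(col_scale) @ x."""
    # Column multipliers rescale each input feature before the matmul.
    scaled_input = [c * xi for c, xi in zip(col_scale, x)]
    # Row multipliers rescale each output feature after the matmul.
    return [
        r * sum(w * xi for w, xi in zip(row, scaled_input))
        for r, row in zip(row_scale, weight)
    ]


weight = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]
y = scaled_linear(weight, row_scale=[2.0, 0.5], col_scale=[1.0, 1.0, 0.0], x=[1.0, 2.0, 3.0])
print(y)
```

Setting a column multiplier to zero removes that input feature entirely, while the row multipliers rescale each output independently of the weight matrix itself.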
|
| --- |
|
|
| ## Training |
|
|
| | Setting | Value | |
| |---|---| |
| | Dataset | FineWeb-Edu (sample-10BT) | |
| | Tokens seen | ~0.51B (15,625 steps × batch 64 × length 512) | |
| | Precision | FP8 native (E4M3 weights/activations, E5M2 gradients) + BF16 fallback | |
| | Optimizer | Conda (Column-Normalized Adam) | |
| | Learning rate | 6e-04 with linear warmup (10 % of steps) | |
| | Weight decay | 0.1 | |
| | Training time | ~1h 18m | |
| | Hardware | NVIDIA RTX 5090 (single GPU) | |
|
|
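The token count and warmup length follow directly from the settings in the table. In the sketch below, the step, batch, length, learning-rate, and warmup figures come from the table; the behaviour after warmup is not stated in this card, so the schedule simply holds the peak rate, which is an assumption.

```python
TOTAL_STEPS = 15_625
BATCH_SIZE = 64
SEQ_LEN = 512
PEAK_LR = 6e-4
WARMUP_FRACTION = 0.10

tokens_seen = TOTAL_STEPS * BATCH_SIZE * SEQ_LEN
warmup_steps = int(TOTAL_STEPS * WARMUP_FRACTION)


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then constant (post-warmup shape is assumed)."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    return PEAK_LR


print(f"tokens seen:  {tokens_seen:,}")
print(f"warmup steps: {warmup_steps:,}")
print(f"lr at step 0: {learning_rate(0):.2e}")
```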
| ### Training curve |
|
|
| | Step | Train Loss | Val Loss | |
| |---|---|---| |
| | 5,000 | 3.995 | 3.957 | |
| | 10,000 | 3.725 | 3.699 | |
| | 15,000 | 3.539 | 3.501 | |
| | 15,625 | — | 3.488 | |
|
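Validation loss is often easier to interpret as perplexity. The conversion below assumes the reported losses are mean per-token cross-entropy values in nats, which this card does not state explicitly.

```python
import math

# Validation losses copied from the training curve table.
val_loss_by_step = {5_000: 3.957, 10_000: 3.699, 15_000: 3.501, 15_625: 3.488}

for step, loss in val_loss_by_step.items():
    # Perplexity is the exponential of the mean cross-entropy (in nats).
    print(f"step {step:>6,}: val loss {loss:.3f} -> perplexity {math.exp(loss):.1f}")
```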
|
| --- |
|
|
| ## Limitations |
|
|
| - **Token budget** — ~0.51 B tokens seen; below estimated optimum. Knowledge-intensive tasks |
| are expected to improve with more training. |
| - **Gradient spike at step 40k** — Reorganized the attention pattern in layer 9 that |
| previously captured long-range token correlations. A checkpoint from ~step 38k is expected |
| to have better aggregate benchmark scores. |
| - **PolyNorm exclusivity** — The quadratic branch has become partially redundant with the |
| linear branch. Will be corrected in the next training run. |
| - **Base model only** — Not instruction-tuned or aligned; purely a next-token-prediction |
| base model. |
|
|
| --- |
|
|
| ## References |
|
|
| All papers whose techniques are integrated into NeoLLM's architecture: |
|
|
| | Technique | Paper title | arXiv | |
| |---|---|---| |
| | SeeDNorm | Self-Rescaled Dynamic Normalization | [2510.22777](https://arxiv.org/abs/2510.22777) | |
| | MEA | Explicit Multi-head Attention | [2601.19611](https://arxiv.org/abs/2601.19611) | |
| | Learnable Multipliers | Freeing the Scale of Language Model Matrix Layers | [2601.04890](https://arxiv.org/abs/2601.04890) | |
| | Directional Routing | Directional Routing in Transformers | [2603.14923](https://arxiv.org/abs/2603.14923) | |
| | XSA | Exclusive Self Attention | [2603.09078](https://arxiv.org/abs/2603.09078) | |
| | Gated Attention | Gated Attention for LLMs | [2505.06708](https://arxiv.org/abs/2505.06708) | |
| | Affine-Scaled Attention | Affine-Scaled Attention | [2602.23057](https://arxiv.org/abs/2602.23057) | |
| | LNS | The Curse of Depth in LLMs | [2502.05795](https://arxiv.org/abs/2502.05795) | |
| | LUCID | Attention with Preconditioned Representations | [2602.10410](https://arxiv.org/abs/2602.10410) | |
| | FAN | Fourier Analysis Networks | [2502.21309](https://arxiv.org/abs/2502.21309) | |
| | SimpleGPT | SimpleGPT | [2602.01212](https://arxiv.org/abs/2602.01212) | |
| | GPAS | Gradient-Preserving Activation Scaling | [2506.22049](https://arxiv.org/abs/2506.22049) | |
| | PolyNorm | PolyNorm / PolyCom | [2602.04902](https://arxiv.org/abs/2602.04902) | |
| | Momentum Attention | Momentum Attention | [2411.03884](https://arxiv.org/abs/2411.03884) | |
| | TWEO (analysis ref.) | Transformers Without Extreme Outliers | [2511.23225](https://arxiv.org/abs/2511.23225) | |
|
|
| --- |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{neollm2026, |
| title = {NeoLLM: A Research Language Model Integrating Recent Attention and Normalization Techniques}, |
| author = {KitsuVp}, |
| year = {2026}, |
| url = {https://huggingface.co/KitsuVp/NeoLLM} |
| } |
| ``` |
|
|
| --- |
|
|
| ## Author |
|
|
| [@Kyokopom](https://x.com/Kyokopom) on X |
|
|
| --- |
|
|
| ## License |
|
|
| Apache 2.0 |
|
|