	---
	language: en
	license: apache-2.0
	tags:
	- causal-lm
	- research
	- fp8
	- attention
	- normalization
	- neollm
	datasets:
	- HuggingFaceFW/fineweb-edu
	---

	# NeoLLM

	NeoLLM is a 113 M parameter decoder-only language model trained from scratch on
	[FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) in FP8
	precision, completing training in approximately 6 hours on a single NVIDIA RTX 5090.
	It integrates a collection of recently published attention and normalization techniques
	into a single architecture, with the goal of studying how they interact during
	pretraining. The model is actively being developed and the current checkpoint represents
	an intermediate training state.

	> Author / contact: [@Kyokopom](https://x.com/Kyokopom) on X
	> Repository: [KitsuVp/NeoLLM](https://huggingface.co/KitsuVp/NeoLLM)

	---

	## Architecture

	NeoLLM is a decoder-only transformer with the following configuration:

	| Parameter | Value |
	|---|---|
	| Hidden size | 512 |
	| Layers | 12 |
	| Attention heads | 8 |
	| KV heads (GQA) | 4 |
	| Head dim | 64 |
	| Intermediate size | 1536 |
	| Vocabulary | Qwen3 tokenizer (64,402 tokens) |
	| Context length | 512 tokens |
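
The values in the table imply a few derived quantities that are easy to check. The sketch below is only an illustration: `NeoLLMConfig`, its field names, and `kv_head_for` are hypothetical and not taken from the repository, and the contiguous grouping of query heads onto KV heads is an assumption, since the README does not state the grouping order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeoLLMConfig:
    """Hypothetical container for the values listed in the table."""
    hidden_size: int = 512
    num_layers: int = 12
    num_attention_heads: int = 8
    num_kv_heads: int = 4
    head_dim: int = 64
    intermediate_size: int = 1536
    vocab_size: int = 64_402
    context_length: int = 512


def kv_head_for(query_head: int, cfg: NeoLLMConfig) -> int:
    """Map a query head to the KV head it shares (assumes contiguous groups)."""
    group_size = cfg.num_attention_heads // cfg.num_kv_heads
    return query_head // group_size


cfg = NeoLLMConfig()

# Heads times head dim recovers the hidden size: 8 * 64 = 512.
assert cfg.num_attention_heads * cfg.head_dim == cfg.hidden_size

# With 8 query heads and 4 KV heads, each KV head serves 2 query heads.
kv_map = [kv_head_for(h, cfg) for h in range(cfg.num_attention_heads)]
print(kv_map)  # [0, 0, 1, 1, 2, 2, 3, 3]
```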

	### Parameter breakdown

	| Parameter bucket | Count |
	|---|---|
	| Total parameters | 113.07M (113,070,456) |
	| Embedding parameters (tied) | 32.97M (32,973,824) |
	| Non-embedding parameters | 80.10M (80,096,632) |
	| Effective trainable parameters | 113.07M (113,070,456) |

	> Weight tying is enabled: the input embedding matrix and the language-model head
	> share the same parameters, so the embedding matrix is counted only once in the
	> total. Excluding it leaves `total − embed = 80.10M` non-embedding parameters.
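
The counts in the breakdown can be reproduced from the vocabulary size and hidden size given in the architecture table. This is a small arithmetic check, not code from the repository; the constant names are illustrative, and `untied_total` is a hypothetical figure showing what the total would be if the language-model head had its own matrix.

```python
VOCAB_SIZE = 64_402         # Qwen3 tokenizer size from the architecture table
HIDDEN_SIZE = 512           # hidden size from the architecture table
TOTAL_PARAMS = 113_070_456  # total reported in the parameter breakdown

# With tied weights, one vocab x hidden matrix serves as both the input
# embedding and the output head.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
non_embedding_params = TOTAL_PARAMS - embedding_params

# Hypothetical comparison: an untied head would add a second copy.
untied_total = TOTAL_PARAMS + embedding_params

print(f"embedding:     {embedding_params:,}")      # 32,973,824
print(f"non-embedding: {non_embedding_params:,}")  # 80,096,632
print(f"untied total:  {untied_total:,}")          # 146,044,280
```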

	### Integrated techniques

	Each layer combines the following mechanisms simultaneously.

	Normalization and residual stream

	- SeeDNorm ([arXiv:2510.22777](https://arxiv.org/abs/2510.22777)) — Applied to Q and K
	projections. Dynamically rescales the normalization based on the input's own statistics,
	making the attention geometry more stable across varying input distributions.
	- PolyNorm ([arXiv:2602.04902](https://arxiv.org/abs/2602.04902)) — Replaces the standard
	MLP activation with three branches: linear (x), quadratic (x²), and cubic (x³) — each
	normalized and combined with learned weights. This allows the MLP to express both linear
	and non-linear relationships simultaneously.
	- GPAS ([arXiv:2506.22049](https://arxiv.org/abs/2506.22049)) — Gradient-Preserving
	Activation Scaling. Applied to residual connections between sublayers; helps gradients
	flow more cleanly during training without distorting the residual stream.
	- LayerNorm Scaling / LNS ([arXiv:2502.05795](https://arxiv.org/abs/2502.05795)) — Each
	layer's output is scaled by 1/√ℓ where ℓ is the layer index. Directly addresses the
	"Curse of Depth" in Pre-LN transformers.
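
Two of the items above reduce to short formulas. The following pure-Python sketch illustrates the PolyNorm branch structure and the LNS scale factor as they are described in this list. It is not the model's implementation: the use of RMS normalization without a gain, the equal placeholder weights, the bias term, and the 1-based layer index are all assumptions.

```python
import math


def rms_normalize(values, eps=1e-6):
    """Divide a vector by its root-mean-square (assumed normalization)."""
    mean_sq = sum(v * v for v in values) / len(values)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in values]


def polynorm(x, weights=(1 / 3, 1 / 3, 1 / 3), bias=0.0):
    """Combine normalized x, x^2 and x^3 branches with per-branch weights.

    `weights` stands in for the learned branch weights; the equal values
    used here are placeholders, not trained parameters.
    """
    branches = [rms_normalize([v ** p for v in x]) for p in (1, 2, 3)]
    return [
        sum(w * branch[i] for w, branch in zip(weights, branches)) + bias
        for i in range(len(x))
    ]


def lns_scale(layer_index):
    """Return 1/sqrt(l) for a 1-based layer index l."""
    if layer_index < 1:
        raise ValueError("layer_index is assumed to start at 1")
    return 1.0 / math.sqrt(layer_index)


activations = polynorm([0.5, -1.0, 2.0])
scales = [lns_scale(layer) for layer in range(1, 13)]  # 12 layers
print(activations)
print(scales[0], scales[3])  # 1.0 0.5
```

Under this reading the first layer is left unscaled and the twelfth is scaled by 1/√12, so deeper layers contribute progressively smaller outputs.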

	Attention mechanisms

	- FAN ([arXiv:2502.21309](https://arxiv.org/abs/2502.21309)) — Fourier Analysis Networks.
	A portion of the input projection channels are dedicated to representing periodic patterns
	(cosine/sine pairs), while the remainder handle standard linear content.
	- MEA ([arXiv:2601.19611](https://arxiv.org/abs/2601.19611)) — Explicit Multi-head
	Attention. Adds small learnable interaction matrices between attention heads for K and V.
	- LUCID ([arXiv:2602.10410](https://arxiv.org/abs/2602.10410)) — Applies a learned
	lower-triangular preconditioner to V before attention, decorrelating value representations
	across positions.
	- Affine-Scaled Attention ([arXiv:2602.23057](https://arxiv.org/abs/2602.23057)) — Adds
	two learnable per-head scalars (α and β) to the softmax weights:
	`[α·softmax(QKᵀ) + β]·V`.
	- XSA ([arXiv:2603.09078](https://arxiv.org/abs/2603.09078)) — Exclusive Self Attention.
	After computing attention, removes the component of the output aligned with the token's
	own value vector.
	- Directional Routing ([arXiv:2603.14923](https://arxiv.org/abs/2603.14923)) — Each head
	learns K=4 directions in the output space; a learned router suppresses the attention output
	along each direction per input.
	- Gated Attention ([arXiv:2505.06708](https://arxiv.org/abs/2505.06708)) — A sigmoid gate
	is applied to the attention output before the output projection, introducing non-linearity
	and preventing attention sinks.
	- Momentum Attention ([arXiv:2411.03884](https://arxiv.org/abs/2411.03884)) — Modifies Q
	and K by subtracting a fraction of the previous position's Q and K values (causal
	first-difference).
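
Three of the mechanisms above (Momentum Attention, Affine-Scaled Attention, and XSA) can be combined in a few lines for a single head. The sketch below follows the descriptions in this list only; it is not the repository's code. The order of operations, the 1/√d score scaling, the causal mask, applying β to visible positions only, and the parameter names `alpha`, `beta`, and `gamma` are assumptions. FAN, MEA, LUCID, Directional Routing, and Gated Attention are left out.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def first_difference(rows, gamma):
    """Momentum-style shift: row_t - gamma * row_{t-1}; row_0 is unchanged."""
    shifted = [list(rows[0])]
    for prev, cur in zip(rows, rows[1:]):
        shifted.append([c - gamma * p for c, p in zip(cur, prev)])
    return shifted


def causal_softmax(scores, t):
    """Softmax over positions 0..t only, so position t cannot see the future."""
    visible = scores[: t + 1]
    peak = max(visible)
    exps = [math.exp(s - peak) for s in visible]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(q, k, v, alpha=1.0, beta=0.0, gamma=0.0, exclusive=True):
    """Single-head attention over lists of per-position vectors."""
    q = first_difference(q, gamma)
    k = first_difference(k, gamma)
    scale = 1.0 / math.sqrt(len(q[0]))
    outputs = []
    for t, q_t in enumerate(q):
        scores = [dot(q_t, k_s) * scale for k_s in k]
        # Affine-scaled weights: alpha * softmax + beta on visible positions.
        weights = [alpha * w + beta for w in causal_softmax(scores, t)]
        out = [
            sum(w * v[s][j] for s, w in enumerate(weights))
            for j in range(len(v[0]))
        ]
        if exclusive:
            # XSA: remove the component of the output along the token's own value.
            vv = dot(v[t], v[t])
            if vv > 0.0:
                coeff = dot(out, v[t]) / vv
                out = [o - coeff * x for o, x in zip(out, v[t])]
        outputs.append(out)
    return outputs


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
result = attention_head(q, k, v, alpha=0.9, beta=0.05, gamma=0.1)
for t, out in enumerate(result):
    print(t, out, "overlap with own value:", dot(out, v[t]))
```

With `exclusive=True`, each output is orthogonal to the token's own value vector up to floating-point error, which is the property the XSA item describes.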

	MLP

	- Learnable Multipliers ([arXiv:2601.04890](https://arxiv.org/abs/2601.04890)) — Adds
	per-row and per-column learnable scalar parameters to each linear layer.
	- SimpleGPT ([arXiv:2602.01212](https://arxiv.org/abs/2602.01212)) — A normalization
	strategy derived from second-order geometry analysis, applied inside MLP projections to
	improve optimization stability.
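
The Learnable Multipliers item can be illustrated as a linear layer whose weight matrix is rescaled by one scalar per output row and one per input column. This is a sketch under the assumption that the scalars are applied multiplicatively, as the name suggests; `scaled_linear`, `row_scale`, and `col_scale` are illustrative names, not identifiers from the repository.

```python
def scaled_linear(x, weight, row_scale, col_scale):
    """Compute y_i = row_scale[i] * sum_j weight[i][j] * col_scale[j] * x[j]."""
    return [
        r * sum(w * c * x_j for w, c, x_j in zip(w_row, col_scale, x))
        for r, w_row in zip(row_scale, weight)
    ]


weight = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]

# With all multipliers at 1.0 the layer is an ordinary matrix-vector product.
print(scaled_linear(x, weight, [1.0, 1.0], [1.0, 1.0]))   # [3.0, 7.0]

# Non-trivial multipliers rescale output rows and input columns independently.
print(scaled_linear(x, weight, [2.0, 0.5], [1.0, 10.0]))  # [42.0, 21.5]
```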

	---

	## Training

	| Setting | Value |
	|---|---|
	| Dataset | FineWeb-Edu (sample-10BT) |
	| Tokens seen | ~0.51B (15,625 steps × batch 64 × length 512) |
	| Precision | FP8 native (E4M3 weights/activations, E5M2 gradients) + BF16 fallback |
	| Optimizer | Conda (Column-Normalized Adam) |
	| Learning rate | 6e-04 with linear warmup (10 % of steps) |
	| Weight decay | 0.1 |
	| Training time | ~1h 18m |
	| Hardware | NVIDIA RTX 5090 (single GPU) |
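
The token count and the warmup length follow directly from the settings in the table. The sketch below reproduces that arithmetic and shows one possible linear warmup. The rounding of the warmup length and the constant learning rate after warmup are assumptions, because the README does not describe the schedule beyond the warmup fraction; `learning_rate` is an illustrative helper, not the training script's function.

```python
import math

TOTAL_STEPS = 15_625
BATCH_SIZE = 64
SEQ_LEN = 512
PEAK_LR = 6e-4
WARMUP_FRACTION = 0.10

# 15,625 steps x 64 sequences x 512 tokens per sequence.
tokens_seen = TOTAL_STEPS * BATCH_SIZE * SEQ_LEN
print(f"{tokens_seen:,}")  # 512,000,000

# 10 % of 15,625 is 1,562.5; rounding up is an assumption.
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_FRACTION)


def learning_rate(step):
    """Linear warmup to PEAK_LR, then constant (post-warmup shape assumed)."""
    if step < warmup_steps:
        return PEAK_LR * ((step + 1) / warmup_steps)
    return PEAK_LR


for step in (0, warmup_steps // 2, warmup_steps - 1, TOTAL_STEPS - 1):
    print(step, learning_rate(step))
```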

	### Training curve

	| Step | Train Loss | Val Loss |
	|---|---|---|
	| 5,000 | 3.995 | 3.957 |
	| 10,000 | 3.725 | 3.699 |
	| 15,000 | 3.539 | 3.501 |
	| 15,625 | — | 3.488 |
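
Validation loss is often easier to interpret as perplexity. The conversion below assumes the reported losses are mean per-token cross-entropy in nats, which the README does not state explicitly; if they were measured in bits, the base would be 2 instead of e.

```python
import math

# Validation losses copied from the training curve table.
val_loss_by_step = {5_000: 3.957, 10_000: 3.699, 15_000: 3.501, 15_625: 3.488}


def perplexity(loss):
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(loss)


for step, loss in val_loss_by_step.items():
    print(f"step {step:>6,}: val loss {loss:.3f} -> perplexity {perplexity(loss):.1f}")
```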

	---

	## Limitations

	- Token budget — ~0.51 B tokens seen, which is below the estimated optimum for a model
	of this size. Performance on knowledge-intensive tasks should improve with more training.
	- Gradient spike at step 40k — Reorganized the attention pattern in layer 9 that
	previously captured long-range token correlations. A checkpoint from ~step 38k is expected
	to have better aggregate benchmark scores.
	- PolyNorm exclusivity — The quadratic branch has become partially redundant with the
	linear branch. Will be corrected in the next training run.
	- Base model only — Not instruction-tuned or aligned; purely a next-token-prediction
	base model.

	---

	## References

	All papers whose techniques are integrated into NeoLLM's architecture:

	| Technique | Paper title | arXiv |
	|---|---|---|
	| SeeDNorm | Self-Rescaled Dynamic Normalization | [2510.22777](https://arxiv.org/abs/2510.22777) |
	| MEA | Explicit Multi-head Attention | [2601.19611](https://arxiv.org/abs/2601.19611) |
	| Learnable Multipliers | Freeing the Scale of Language Model Matrix Layers | [2601.04890](https://arxiv.org/abs/2601.04890) |
	| Directional Routing | Directional Routing in Transformers | [2603.14923](https://arxiv.org/abs/2603.14923) |
	| XSA | Exclusive Self Attention | [2603.09078](https://arxiv.org/abs/2603.09078) |
	| Gated Attention | Gated Attention for LLMs | [2505.06708](https://arxiv.org/abs/2505.06708) |
	| Affine-Scaled Attention | Affine-Scaled Attention | [2602.23057](https://arxiv.org/abs/2602.23057) |
	| LNS | The Curse of Depth in LLMs | [2502.05795](https://arxiv.org/abs/2502.05795) |
	| LUCID | Attention with Preconditioned Representations | [2602.10410](https://arxiv.org/abs/2602.10410) |
	| FAN | Fourier Analysis Networks | [2502.21309](https://arxiv.org/abs/2502.21309) |
	| SimpleGPT | SimpleGPT | [2602.01212](https://arxiv.org/abs/2602.01212) |
	| GPAS | Gradient-Preserving Activation Scaling | [2506.22049](https://arxiv.org/abs/2506.22049) |
	| PolyNorm | PolyNorm / PolyCom | [2602.04902](https://arxiv.org/abs/2602.04902) |
	| Momentum Attention | Momentum Attention | [2411.03884](https://arxiv.org/abs/2411.03884) |
	| TWEO (analysis ref.) | Transformers Without Extreme Outliers | [2511.23225](https://arxiv.org/abs/2511.23225) |

	---

	## Citation

	```bibtex
	@misc{neollm2026,
	title = {NeoLLM: A Research Language Model Integrating Recent Attention and Normalization Techniques},
	author = {KitsuVp},
	year = {2026},
	url = {https://huggingface.co/KitsuVp/NeoLLM}
	}
	```

	---

	## Author

	[@Kyokopom](https://x.com/Kyokopom) on X

	---

	## License

	Apache 2.0