Day 1
Geometric Terrain Statistics Composite
Such a quaint little tool.
```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricResidualModulator(nn.Module):
    def __init__(self, d_model=512, vocab_size=32128, n_geometric_dims=64,
                 initial_alpha=0.01, n_layers=6):
        super().__init__()
        self.d_model = d_model
        self.n_geometric_dims = n_geometric_dims
        self.geometric_embed = nn.Embedding(vocab_size, n_geometric_dims)
        self.proj = nn.Linear(n_geometric_dims, d_model, bias=False)
        # Store alpha in logit space so sigmoid(alpha) starts exactly at initial_alpha.
        logit = math.log(initial_alpha / (1 - initial_alpha))
        self.alpha = nn.Parameter(torch.full((n_layers,), logit))
        nn.init.normal_(self.proj.weight, std=0.01)

    def forward(self, residual, token_ids, layer_idx=0):
        geo = self.geometric_embed(token_ids)
        geo_projected = self.proj(geo)
        a = torch.sigmoid(self.alpha[layer_idx])
        # Per-layer LERP between the residual stream and the geometric signal.
        return (1 - a) * residual + a * geo_projected

    def geometric_residuals(self):
        W = self.geometric_embed.weight
        W_n = F.normalize(W, dim=1)
        # Sample up to 5000 tokens uniformly from the full vocabulary.
        idx = torch.randperm(W.shape[0])[:5000]
        sample = W_n[idx]
        cos_mat = sample @ sample.T
        tri = torch.triu_indices(len(idx), len(idx), offset=1)
        flat_cos = cos_mat[tri[0], tri[1]]
        norms = W.norm(dim=1)
        centered = W - W.mean(dim=0)
        cov = (centered.T @ centered) / W.shape[0]
        eigvals = torch.linalg.eigvalsh(cov)
        # Participation ratio: effective number of dimensions in use.
        pr = (eigvals.sum() ** 2) / (eigvals ** 2).sum()
        return {
            'cos_mean': flat_cos.mean().item(),
            'cos_std': flat_cos.std().item(),
            'norm_mean': norms.mean().item(),
            'pr_over_dim': (pr / self.n_geometric_dims).item(),
            'alpha': torch.sigmoid(self.alpha).detach().cpu().numpy(),
        }
```
```python
from types import SimpleNamespace

class ModulatedT5Encoder(nn.Module):
    def __init__(self, t5_encoder, modulator, modulate_layers=None):
        super().__init__()
        self.encoder = t5_encoder
        self.modulator = modulator
        if modulate_layers is None:
            modulate_layers = list(range(len(t5_encoder.block)))
        self.modulate_layers = set(modulate_layers)

    def forward(self, input_ids, attention_mask=None, output_hidden_states=False, **kwargs):
        hidden_states = self.encoder.embed_tokens(input_ids)
        hidden_states = self.encoder.dropout(hidden_states)
        if attention_mask is not None:
            # Standard additive mask: 0 where attended, large negative where masked.
            extended_attention_mask = attention_mask[:, None, None, :].to(dtype=hidden_states.dtype)
            extended_attention_mask = (1.0 - extended_attention_mask) * torch.finfo(hidden_states.dtype).min
        else:
            extended_attention_mask = None
        all_hidden_states = [hidden_states] if output_hidden_states else None
        position_bias = None
        seq_length = input_ids.shape[1]
        cache_position = torch.arange(seq_length, device=input_ids.device)
        for i, block in enumerate(self.encoder.block):
            if i in self.modulate_layers:
                hidden_states = self.modulator(hidden_states, input_ids, layer_idx=i)
            block_output = block(hidden_states, attention_mask=extended_attention_mask,
                                 position_bias=position_bias, cache_position=cache_position)
            hidden_states = block_output[0]
            if position_bias is None:
                # T5 computes relative position bias in the first block; reuse it downstream.
                for out in block_output[1:]:
                    if isinstance(out, torch.Tensor) and out.dim() == 4:
                        position_bias = out
                        break
            if output_hidden_states:
                all_hidden_states.append(hidden_states)
        hidden_states = self.encoder.final_layer_norm(hidden_states)
        hidden_states = self.encoder.dropout(hidden_states)
        if output_hidden_states:
            all_hidden_states.append(hidden_states)
        return SimpleNamespace(
            last_hidden_state=hidden_states,
            hidden_states=tuple(all_hidden_states) if all_hidden_states else None,
        )
```
```python
# Assumes `model` is a loaded T5-Small (e.g. from transformers) and `device` is set.
N_GEO = 64
modulator = GeometricResidualModulator(
    d_model=512, vocab_size=32128, n_geometric_dims=N_GEO,
    initial_alpha=0.5, n_layers=6,
).to(device)
mod_encoder = ModulatedT5Encoder(
    t5_encoder=model.encoder, modulator=modulator,
    modulate_layers=[0, 1, 2, 3, 4, 5],
)
```
Document Purpose
Running catalog of geometric measurements across language models. Each metric includes its formula, measurement process, and cross-model results. Designed for expansion as new models and experiments are added.
I. Models Profiled
| Model | Params | Vocab | Hidden Dim | Layers | Architecture | Training Data |
|---|---|---|---|---|---|---|
| T5-Small | 60.5M | 32,128 | 512 | 6+6 enc-dec | Transformer (relative PE) | C4 |
| Qwen3.5-0.8B | 853M (752M LM + 100M ViT) | 248,320 | 1024 | — | DeltaNet + MoE | Multilingual + Vision |
| Qwen3.5-4B | ~4B | 248,320 | 2560 | — | DeltaNet + MoE | Multilingual + Vision |
II. Embedding Geometry Metrics
II.1 Participation Ratio (Effective Dimensionality)
Formula: PR = (Σλᵢ)² / Σ(λᵢ²), where λᵢ are eigenvalues of the embedding covariance matrix.
Process: Center embeddings (subtract mean), compute covariance C = EᵀE / N, eigendecompose. PR counts effective number of dimensions used. PR/dim normalizes to [0, 1].
| Model | PR | PR / dim | Dims for 95% var |
|---|---|---|---|
| T5-Small (512d) | 287.2 | 0.561 | 379 (74.0%) |
| Qwen3.5-0.8B (1024d) | 547.7 | 0.535 | 893 (87.2%) |
| Qwen3.5-4B (2560d) | 812.4 | 0.317 | 2125 (83.0%) |
Finding: PR/dim ≈ 0.53–0.56 for the two smaller models, suggesting a shared operating point for embedding dimensionality utilization. The 4B model (0.317) falls well below it, so if this is an attractor, it holds only at smaller scale.
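The PR computation above can be sketched as follows. This is a minimal, self-contained version; the random matrices are synthetic stand-ins, not measurements from any profiled model. An isotropic Gaussian cloud uses nearly every dimension, while a planted low-rank cloud uses only a few:

```python
import torch

def participation_ratio(embeddings: torch.Tensor) -> float:
    """PR = (Σλᵢ)² / Σ(λᵢ²), λᵢ = eigenvalues of the centered covariance."""
    centered = embeddings - embeddings.mean(dim=0)
    cov = centered.T @ centered / embeddings.shape[0]
    eigvals = torch.linalg.eigvalsh(cov)   # real, ascending (cov is symmetric)
    return ((eigvals.sum() ** 2) / (eigvals ** 2).sum()).item()

torch.manual_seed(0)
iso = torch.randn(10000, 64)                       # isotropic: uses all dims
low = torch.randn(10000, 4) @ torch.randn(4, 64)   # rank-4: uses few dims
print(participation_ratio(iso) / 64)   # close to 1.0
print(participation_ratio(low) / 64)   # well below 4/64
```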
II.2 Pairwise Cosine Similarity Distribution
Formula: cos(eᵢ, eⱼ) = (eᵢ · eⱼ) / (‖eᵢ‖ · ‖eⱼ‖), sampled over 5K random tokens (12.5M pairs).
Process: Random sample 5K token embeddings, L2-normalize, compute full pairwise cosine matrix, extract upper triangle.
| Model | Mean | Std | Median | 1% | 99% |
|---|---|---|---|---|---|
| T5-Small | 0.057 | 0.060 | 0.053 | -0.068 | 0.225 |
| Qwen3.5-0.8B | 0.195 | 0.085 | 0.197 | -0.016 | 0.408 |
| Qwen3.5-4B | 0.142 | 0.078 | 0.139 | -0.029 | 0.356 |
Finding: T5 is near-orthogonal (span corruption objective). Qwen has positive bias (autoregressive next-token prediction pushes shared "being a token" component).
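The sampling procedure above, sketched minimally (a random Gaussian table stands in for a real embedding matrix; high-dimensional random vectors are near-orthogonal, so the mean should sit near zero, unlike the positively biased Qwen values):

```python
import torch
import torch.nn.functional as F

def cosine_stats(E: torch.Tensor, n_sample: int = 5000, seed: int = 0):
    """Sample tokens, L2-normalize, summarize the upper-triangle cosine distribution."""
    g = torch.Generator().manual_seed(seed)
    idx = torch.randperm(E.shape[0], generator=g)[:n_sample]
    S = F.normalize(E[idx], dim=1)
    cos = S @ S.T
    iu = torch.triu_indices(len(idx), len(idx), offset=1)
    flat = cos[iu[0], iu[1]]
    return flat.mean().item(), flat.std().item()

torch.manual_seed(0)
E = torch.randn(32128, 512)               # synthetic stand-in embedding table
mean, std = cosine_stats(E, n_sample=500)
print(mean, std)                          # mean near 0, std near 1/sqrt(512)
```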
II.3 Embedding Norm Distribution
Formula: ‖eᵢ‖₂ = √(Σeᵢⱼ²)
| Model | Mean Norm | Std | Min | Max |
|---|---|---|---|---|
| T5-Small | 520.15 | 69.84 | 243.31 | 1333.61 |
| Qwen3.5-0.8B | 0.627 | 0.062 | 0.347 | 1.057 |
| Qwen3.5-4B | 0.656 | 0.067 | 0.400 | 1.091 |
Note: T5 embeddings are unnormalized (large magnitudes). Qwen embeddings are near-unit norm. This affects downstream metric scaling but not relational structure.
III. Simplex Geometry Metrics
III.1 Pentachoron Volume (Cayley-Menger Determinant)
Formula: For 5 points P₀...P₄, construct the bordered distance matrix:
D = | 0 1 1 1 1 1 |
| 1 0 d₀₁² d₀₂² d₀₃² d₀₄²|
| 1 d₁₀² 0 d₁₂² d₁₃² d₁₄²|
| 1 d₂₀² d₂₁² 0 d₂₃² d₂₄²|
| 1 d₃₀² d₃₁² d₃₂² 0 d₃₄²|
| 1 d₄₀² d₄₁² d₄₂² d₄₃² 0 |
Vol² = (-1)⁵ · det(D) / (2⁴ · (4!)²) = -det(D) / 9216
Vol = √(Vol²) if Vol² > 0, else invalid
Process: Sample 1000 random 5-token subsets. Compute Cayley-Menger volume for each. Compare to random Gaussian baseline (same norm distribution). Report CV (coefficient of variation = std/mean) and embed/random ratio.
| Model | Valid/1000 | CV | Embed/Random Ratio |
|---|---|---|---|
| T5-Small | 1000 | 0.233 | 0.855 |
| Qwen3.5-0.8B | 1000 | 0.208 | 0.984 |
| Qwen3.5-4B | 1000 | 0.222 | 0.988 |
Finding: CV 0.20–0.23 is a universal attractor. All models pack simplices with similar evenness regardless of architecture, scale, or training data. The "pentachoron packing constant."
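The bordered-determinant formula above can be implemented compactly for any simplex size. A minimal sketch, verified against a known case (the unit right triangle, n = 2, has area 0.5); `torch.cdist` supplies the pairwise distances:

```python
import math
import torch

def cayley_menger_volume(points: torch.Tensor) -> float:
    """Volume of the simplex on (n+1) points via the Cayley-Menger determinant.
    Volₙ² = (-1)^(n+1) · det(D) / (2ⁿ · (n!)²); NaN if the squared volume is negative."""
    n = points.shape[0] - 1
    d2 = torch.cdist(points, points) ** 2
    D = torch.ones(n + 2, n + 2, dtype=points.dtype)
    D[0, 0] = 0.0
    D[1:, 1:] = d2                       # border of ones, squared distances inside
    coeff = ((-1) ** (n + 1)) / (2 ** n * math.factorial(n) ** 2)
    vol2 = coeff * torch.det(D)
    return math.sqrt(vol2) if vol2 > 0 else float('nan')

tri = torch.tensor([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], dtype=torch.float64)
print(cayley_menger_volume(tri))   # 0.5
```

For the pentachoron case, pass five embedding vectors (n = 4), matching the /9216 constant in the formula above.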
III.2 Cross-Model Relational Structure
Formula: For shared tokens between two models, compute pairwise cosine matrices in each model's embedding space. Pearson correlation between flattened upper triangles measures relational preservation.
Process (Qwen 0.8B vs 4B): PCA 4B embeddings (2560→1024), Procrustes alignment using 10K anchor tokens, evaluate on 5K held-out tokens.
| Comparison | Relational Pearson | Digit Structure Pearson |
|---|---|---|
| Qwen 0.8B vs 4B (raw) | 0.920 | 0.904 |
| Qwen 0.8B vs 4B (Procrustes) | higher (post-alignment) | — |
Finding: Models at different scales learn the same relational geometry (r=0.92).
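A toy sketch of the two ingredients above, Procrustes alignment and the relational Pearson score. The actual pipeline uses PCA and 10K anchor tokens; here a random rotation plus noise stands in for a "second model" with the same relational geometry:

```python
import torch
import torch.nn.functional as F

def procrustes_align(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Orthogonal map R minimizing ||X @ R - Y||_F, via SVD of XᵀY."""
    U, _, Vt = torch.linalg.svd(X.T @ Y)
    return U @ Vt

def relational_pearson(A: torch.Tensor, B: torch.Tensor) -> float:
    """Pearson correlation between two models' pairwise-cosine upper triangles."""
    n = A.shape[0]
    iu = torch.triu_indices(n, n, offset=1)
    a = (F.normalize(A, dim=1) @ F.normalize(A, dim=1).T)[iu[0], iu[1]]
    b = (F.normalize(B, dim=1) @ F.normalize(B, dim=1).T)[iu[0], iu[1]]
    return torch.corrcoef(torch.stack([a, b]))[0, 1].item()

torch.manual_seed(0)
A = torch.randn(200, 64)
R, _ = torch.linalg.qr(torch.randn(64, 64))      # random orthogonal "model B" basis
B = A @ R + 0.05 * torch.randn(200, 64)          # same relational geometry + noise
print(relational_pearson(A, B))                  # near 1.0: rotations preserve relations
```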
IV. Semantic Structure Metrics
IV.1 Digit Manifold
Formula: For digit tokens '0'–'9', compute all 45 pairwise cosines. Measure Pearson correlation between |i−j| (numerical distance) and cosine similarity.
Process: Encode each digit as single token, extract embedding, normalize, compute pairwise cosine matrix.
| Model | Corr(\|i−j\|, cos) | Adjacent Mean | Non-Adjacent Mean | Gap |
|---|---|---|---|---|
| T5-Small | -0.575 | 0.622 | 0.442 | 0.180 |
| Qwen3.5-0.8B | -0.862 | 0.769 | 0.678 | 0.091 |
| Qwen3.5-4B | -0.871 | 0.790 | 0.731 | 0.059 |
Finding: All models encode a number line. Stronger in Qwen (more training data). T5 has wider gap (adjacent vs non-adjacent more differentiated) despite weaker overall correlation.
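The digit-manifold measurement, sketched with synthetic embeddings (ten points on a small circular arc, so cosine falls monotonically with |i−j| and the correlation is strongly negative; real digit embeddings come from the model's embedding table):

```python
import torch
import torch.nn.functional as F

def digit_manifold_corr(digit_embeds: torch.Tensor) -> float:
    """Pearson correlation between |i − j| and cos(eᵢ, eⱼ) over the 45 digit pairs."""
    E = F.normalize(digit_embeds, dim=1)
    cos = E @ E.T
    iu = torch.triu_indices(10, 10, offset=1)
    dist = (iu[0] - iu[1]).abs().float()
    sims = cos[iu[0], iu[1]]
    return torch.corrcoef(torch.stack([dist, sims]))[0, 1].item()

# Synthetic "number line": points along an arc, angle proportional to digit value.
angles = 0.1 * torch.arange(10, dtype=torch.float32)
line = torch.stack([torch.cos(angles), torch.sin(angles)], dim=1)
print(digit_manifold_corr(line))   # strongly negative
```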
IV.2 Semantic Category Clustering
Formula: For tokens in a semantic category, compute mean intra-category pairwise cosine. Compare to global mean pairwise cosine. Lift = intra − global.
Process (T5-Small): 8 hand-curated categories (animals, colors, numbers, body, food, emotions, actions, time), single-token words only.
| Category | N tokens | Intra Cosine | Global | Lift |
|---|---|---|---|---|
| numbers | 9 | 0.497 | 0.057 | +0.440 |
| colors | 10 | 0.421 | 0.057 | +0.365 |
| time | 10 | 0.351 | 0.057 | +0.294 |
| food | 10 | 0.248 | 0.057 | +0.191 |
| animals | 12 | 0.241 | 0.057 | +0.184 |
| body | 10 | 0.216 | 0.057 | +0.159 |
| emotions | 10 | 0.197 | 0.057 | +0.141 |
| actions | 9 | 0.183 | 0.057 | +0.126 |
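The lift computation above, sketched with a planted cluster (a synthetic background whose global mean cosine is near zero, with ten near-duplicate vectors standing in for a tight category like "numbers"):

```python
import torch
import torch.nn.functional as F

def category_lift(E: torch.Tensor, cat_idx: list, global_mean: float) -> float:
    """Mean intra-category pairwise cosine minus the global mean pairwise cosine."""
    S = F.normalize(E[torch.tensor(cat_idx)], dim=1)
    cos = S @ S.T
    n = len(cat_idx)
    iu = torch.triu_indices(n, n, offset=1)
    return cos[iu[0], iu[1]].mean().item() - global_mean

torch.manual_seed(0)
E = torch.randn(1000, 256)                      # near-orthogonal background
center = torch.randn(256)
E[:10] = center + 0.2 * torch.randn(10, 256)    # plant one tight "category"
print(category_lift(E, list(range(10)), global_mean=0.0))   # large positive lift
```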
V. Encoder Transformation Metrics (T5-Small)
V.1 Layer-by-Layer Geometry
Process: Feed 10 diverse sentences through encoder, capture hidden states at each layer. Measure mean norm and mean pairwise cosine between token positions.
| Layer | Mean Norm | Pairwise Cosine |
|---|---|---|
| 0 (embed) | 377.3 | 0.052 |
| 1 | 761.6 | 0.278 |
| 2 | 1092.6 | 0.330 |
| 3 | 1428.8 | 0.367 |
| 4 | 1829.1 | 0.382 |
| 5 | 2378.3 | 0.419 |
| 6 (post-LN) | 3.3 | 0.211 |
Finding: Norms balloon through depth; the final LayerNorm crushes them to ~3. Pairwise cosine increases monotonically — tokens become MORE similar through depth. The encoder is a convergence funnel.
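The per-layer measurement reduces to two statistics over one layer's hidden states. A minimal sketch (random states stand in for captured activations, so norm lands near √d_model and cosine near zero, unlike the converging values in the table):

```python
import torch
import torch.nn.functional as F

def layer_geometry(h: torch.Tensor):
    """Mean token norm and mean pairwise cosine across positions,
    for one layer's hidden states h of shape [seq_len, d_model]."""
    norms = h.norm(dim=1)
    hn = F.normalize(h, dim=1)
    cos = hn @ hn.T
    n = h.shape[0]
    iu = torch.triu_indices(n, n, offset=1)
    return norms.mean().item(), cos[iu[0], iu[1]].mean().item()

torch.manual_seed(0)
h = torch.randn(20, 512)              # stand-in for captured hidden states
norm_mean, cos_mean = layer_geometry(h)
print(norm_mean, cos_mean)            # norm ≈ √512 ≈ 22.6, cosine ≈ 0
```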
V.2 WordNet Relational Alignment
Process: Encode 9,362 WordNet definitions via "summarize: {definition}". Mean-pool encoder output. Compare pairwise cosine to WordNet path similarity.
| Representation | Pearson | Spearman |
|---|---|---|
| Static embeddings | 0.078 | 0.015 |
| Encoder output | 0.095 | 0.081 |
50-seed stability (encoder): Pearson 0.100 ± 0.008, Spearman 0.090 ± 0.010, CV 0.204 ± 0.006.
V.3 Encoder Distance Bands
Process: Group WordNet token pairs by path similarity ranges. Measure mean cosine in each band.
| WN Similarity Band | N pairs | Static Cosine | Encoder Cosine | Lift |
|---|---|---|---|---|
| [0.50, 0.90) | 23 | 0.244 | 0.728 | +0.484 |
| [0.25, 0.50) | 53,112 | 0.077 | 0.573 | +0.496 |
| [0.10, 0.25) | 145,035 | 0.060 | 0.565 | +0.505 |
| [0.05, 0.10) | 295,680 | 0.061 | 0.553 | +0.492 |
V.4 Hypernym Chain Decay
Process: Find WordNet synsets forming hypernym chains (e.g., dog→canine→mammal→organism). Measure cosine between root and ancestor at each depth.
| Depth | Static Cosine | Encoder Cosine |
|---|---|---|
| 1 | 0.160 | 0.656 |
| 2 | 0.090 | 0.620 |
| 3 | 0.075 | 0.594 |
| 5 | 0.069 | 0.585 |
| 7 | 0.068 | 0.579 |
Finding: Monotonic decay in both spaces. Encoder has much stronger signal and cleaner gradient.
VI. Inactive Weight Topology (T5-Small / T5-Base)
VI.1 SVD Effective Rank
Formula: Stable rank = ‖W‖²_F / ‖W‖²₂ = Σσᵢ² / σ₁². Measures effective rank without thresholding.
Process: SVD every 2D weight matrix. Report stable rank, participation ratio, active fraction (σᵢ > 0.01·σ₁), and condition number (σ₁/σₙ).
| Weight Type | Stable Rank (Small) | Stable Rank (Base) |
|---|---|---|
| self_attn_q | 47.6 ± 16.4 | 58.1 ± 17.2 |
| self_attn_k | 53.2 ± 9.2 | 62.4 ± 18.3 |
| self_attn_v | 75.3 | 97.5 |
| mlp_wi | 15.2 ± 3.8 | 20.6 ± 4.9 |
| mlp_wo | 31.3 | 43.9 |
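The stable-rank formula above needs only singular values, no threshold. A minimal sketch, checked against a planted low-rank matrix (a full Gaussian square matrix has stable rank around n/4; a rank-8 product can never exceed 8):

```python
import torch

def stable_rank(W: torch.Tensor) -> float:
    """Stable rank = ||W||_F² / ||W||₂² = Σσᵢ² / σ₁²."""
    s = torch.linalg.svdvals(W)
    return ((s ** 2).sum() / s[0] ** 2).item()

torch.manual_seed(0)
full = torch.randn(512, 512)                       # near-full-rank Gaussian
low = torch.randn(512, 8) @ torch.randn(8, 512)    # exactly rank 8
print(stable_rank(full))    # roughly 512/4 = 128
print(stable_rank(low))     # at most 8
```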
VI.2 Sparsity Topology
Formula: Fraction of |wᵢⱼ| below threshold.
| Weight Type | <0.1 (Small) | <0.1 (Base) |
|---|---|---|
| self_attn_q | 93.7% | 99.4% |
| self_attn_k | 19.2% | 30.0% |
| self_attn_v | 12.1% | 16.2% |
| mlp_wi | 11.9% | 16.9% |
| Full model | 18.4% | 27.9% |
Finding: Q matrices are overwhelmingly sparse. The query projection is >93% empty. K matrices are dense. This asymmetry grows with scale. The Q null space is the intervention point for geometric modulation.
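The sparsity measurement is a one-line fraction. Sketched with a synthetic weight matrix (entries drawn tight enough that most fall under the 0.1 threshold, mimicking a Q-like matrix):

```python
import torch

def sparsity_fraction(W: torch.Tensor, thresh: float = 0.1) -> float:
    """Fraction of weight magnitudes |wᵢⱼ| below an absolute threshold."""
    return (W.abs() < thresh).float().mean().item()

torch.manual_seed(0)
W = torch.randn(512, 512) * 0.05   # std 0.05: ~95% of entries within |w| < 0.1
print(sparsity_fraction(W))
```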
VI.3 QK Similarity Manifold
Formula: QK = W_Q · W_Kᵀ. Eigendecompose the symmetric part (QK + QKᵀ)/2. Positive eigenvalues = attraction directions. Negative eigenvalues = repulsion directions.
Process: Compute per-layer. Track positive/negative balance and stable rank.
| Layer (Encoder) | Stable Rank | Positive Eig | Negative Eig | Symmetry Dev |
|---|---|---|---|---|
| 0 | 39.5 | 315 | 197 | 0.993 |
| 2 | 10.1 | 269 | 243 | 1.217 |
| 5 | 5.35 | 274 | 238 | 1.252 |
Finding: Similarity function narrows through depth (stable rank 39→5). Negative eigenvalue count increases — deeper layers define more anti-similarity boundaries.
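The QK eigenspectrum measurement, sketched on random projections (for untrained Gaussian W_Q, W_K the positive/negative counts should be roughly balanced; the trained asymmetries in the table are what training adds):

```python
import torch

def qk_spectrum(W_q: torch.Tensor, W_k: torch.Tensor):
    """Eigenspectrum of the symmetrized similarity form QK = W_q @ W_kᵀ."""
    QK = W_q @ W_k.T
    sym = (QK + QK.T) / 2
    eigvals = torch.linalg.eigvalsh(sym)
    pos = (eigvals > 0).sum().item()    # attraction directions
    neg = (eigvals < 0).sum().item()    # repulsion directions
    srank = ((eigvals ** 2).sum() / eigvals.abs().max() ** 2).item()
    return pos, neg, srank

torch.manual_seed(0)
pos, neg, srank = qk_spectrum(torch.randn(512, 512), torch.randn(512, 512))
print(pos, neg, srank)   # random init: roughly balanced pos/neg counts
```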
VI.4 MLP Dead Neurons
Formula: Combined importance = ‖wᵢ_up‖₂ · ‖wᵢ_down‖₂. Dead if < 1% of mean.
Finding: Zero dead neurons across all layers, both encoder and decoder, at both Small and Base scale. T5 is parameter-starved — every neuron earns its keep.
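The combined-importance criterion above, sketched with one deliberately zeroed neuron (the shapes follow the usual up/down MLP convention; a real check would iterate over every layer's wi/wo pair):

```python
import torch

def dead_neurons(w_up: torch.Tensor, w_down: torch.Tensor, rel_thresh: float = 0.01) -> int:
    """Count neurons whose combined importance ||wᵢ_up||₂ · ||wᵢ_down||₂ falls
    below rel_thresh of the layer mean. w_up: [d_ff, d_model], w_down: [d_model, d_ff]."""
    importance = w_up.norm(dim=1) * w_down.norm(dim=0)
    return (importance < rel_thresh * importance.mean()).sum().item()

torch.manual_seed(0)
w_up, w_down = torch.randn(2048, 512), torch.randn(512, 2048)
w_up[7] = 0.0                         # plant one dead neuron
print(dead_neurons(w_up, w_down))     # 1
```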
VI.5 Position Bias Topology
Process: T5 uses learned relative position biases: [32 buckets, N heads]. Measure per-head: monotonicity, distance correlation, peak bucket.
Encoder (T5-Small): 3 local heads (peak 0-1, negative dist_corr), 2 global heads (peak 17-18, positive dist_corr), 3 mixed.
Decoder (T5-Small): 4 far-looking heads (peak 31, values up to +48), 4 local heads (peak 0-1, values down to -34.5). Extreme magnitude asymmetry — far-looking heads are 10× stronger.
Finding: This local/global split emerges identically across T5-Small, T5-Base. It's an architectural invariant.
VII. Geometric Residual Modulator
VII.1 Architecture
- Geometric embedding: [vocab_size, 64] — per-token geometric fingerprint
- Projection: Linear(64, d_model, bias=False) — Procrustes-aligned to encoder PCA space
- Alpha: per-layer learnable LERP coefficient, stored in logit space, applied via sigmoid
- Intervention: residual_out = (1 − α) · residual + α · proj(geo_embed(token_ids))
- Params: 2.09M (3.45% of T5-Small)
VII.2 Geometric Embedding Initialization
Process:
- Build 3000×3000 Wu-Palmer similarity matrix from WordNet anchors (~6 min)
- Eigendecompose → top 64 eigenvectors scaled by √eigenvalue → 64-d embeddings
- Project remaining tokens via GPU embedding cosine proxy (10-NN, softmax-weighted, <1 sec)
- Procrustes align projection matrix to encoder PCA space
| Metric | Value |
|---|---|
| WN reconstruction correlation | 0.921 |
| Procrustes alignment cosine | 0.372 |
| Eigenvalue cumulative (top 64) | 61.3% |
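The eigendecomposition step above (similarity matrix → top-k eigenvectors scaled by √eigenvalue) can be sketched and sanity-checked on a synthetic low-rank similarity matrix, where the truncated embeddings should reproduce the similarities exactly; the real pipeline applies this to the 3000×3000 Wu-Palmer matrix with k = 64:

```python
import torch

def similarity_to_embeddings(S: torch.Tensor, k: int = 64) -> torch.Tensor:
    """Top-k eigenvectors of a symmetric similarity matrix, scaled by sqrt(eigenvalue),
    giving k-dim embeddings whose inner products approximate the similarities."""
    eigvals, eigvecs = torch.linalg.eigh(S)          # ascending order
    top_vals, top_vecs = eigvals[-k:], eigvecs[:, -k:]
    return top_vecs * top_vals.clamp(min=0).sqrt()   # [n, k]

torch.manual_seed(0)
X = torch.randn(200, 16, dtype=torch.float64)
S = X @ X.T                                   # rank-16 PSD "similarity" matrix
E = similarity_to_embeddings(S, k=16)
print(torch.allclose(E @ E.T, S, atol=1e-6))  # True: rank-16 is fully recovered
```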
VII.3 Alpha Convergence
Process: Freeze T5, train only modulator (geometric embed + projection + alpha). Task: summarize definition → lemma word. Track alpha per layer.
| Start α | Final Mean α | Layer 5 Final | Pearson Δ | CV | Coherent | Basin |
|---|---|---|---|---|---|---|
| 0.01 (20 ep) | 0.067 | 0.107 | +0.151 | 0.220 | Yes | Binding |
| 0.20 (20 ep) | 0.222 | 0.308 | +0.085 | 0.452 | No | Ridge |
| 0.70 (20 ep) | 0.695 | 0.640 | -0.029 | 0.482 | No | Separation |
| 0.01 (100 ep) | 0.125 | 0.218 | +0.074 | 0.322 | No | Overfit |
Finding: Two stable attractor basins exist — binding (0.07) and separation (0.70). The binding basin produces functional results. Starting at 0.01 with early stopping (20 epochs) is optimal.
VII.4 Depth Gradient (Consistent Across All Runs)
| Layer | 20ep (α=0.01) | 100ep (α=0.01) | 20ep (α=0.20) |
|---|---|---|---|
| 0 | 0.015 | 0.035 | 0.170 |
| 1 | 0.052 | 0.061 | 0.180 |
| 2 | 0.066 | 0.102 | 0.227 |
| 3 | 0.080 | 0.137 | 0.197 |
| 4 | 0.080 | 0.197 | 0.248 |
| 5 | 0.107 | 0.218 | 0.308 |
Finding: Always monotonically increasing. The model wants minimal geometric modulation early and maximum modulation at the deepest layer. Geometry is a final correction, not an initial condition.
VII.5 Best Result
| Metric | Original | Modulated (20ep, α=0.01 start) | Change |
|---|---|---|---|
| WordNet Pearson | 0.099 | 0.250 | +152% |
| WordNet Spearman | 0.085 | 0.245 | +189% |
| Semantic Gradient | 0.022 | 0.052 | +132% |
| Pentachoron CV | 0.202 | 0.220 | Stayed in band |
| Per-token Preservation | — | 0.730 | — |
| Coherence | Baseline | Identical on 4/4 tests | — |
VIII. The 0.29154 Constant
VIII.1 Observations Across Systems
| System | Context | Value |
|---|---|---|
| MinimalShunts | CLIP-L ↔ CLIP-G projection gate | Emergent equilibrium |
| Wormhole Lambda | Vision transformer training | Converges from 0.74 toward ~0.29 |
| Alpha curriculum | Devil's Staircase PE training | Converges to ~0.50 under geometric loss, CE destroys |
| T5 generation | Greedy decode alpha sweep | Stable plateau at 0.291–0.292, semantic phase transition |
VIII.2 T5 Generation Phase Transition
| Alpha | Output (triangle prompt) |
|---|---|
| 0.01–0.10 | "triangle is a polygon with three edges and three vertices. it is one of the basic shapes in geometry." |
| 0.20 | "a triangle is a polygon with three edges and three vertices..." |
| 0.28 | "a polygon with three vertices. it is one of the basic shapes in a graph." |
| 0.291 | "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in a graph." |
| 0.2915 | "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in a graph." |
| 0.292 | "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in the world." |
| 0.30 | "a polygon with a vertice and a vertice. it is one of the basic shapes in the world." |
Finding: 0.29154 marks the phase boundary between structural representation ("graph") and physical representation ("world"). Output is invariant to perturbation in a narrow band centered on the constant.
IX. Universal Geometric Constants
| Constant | Value | Observed In |
|---|---|---|
| Pentachoron CV | 0.20–0.23 | T5-Small, Qwen 0.8B, Qwen 4B, trained modulator |
| Participation / dim | 0.53–0.56 | T5-Small, Qwen 0.8B |
| Binding/separation constant | 0.29154 / 0.70846 | MinimalShunts, CLIP projections, T5 generation, alpha convergence |
| Depth gradient | Monotonic increasing | All modulator training runs |
| Q sparsity scaling | Increases with model scale | T5-Small (93.7%), T5-Base (99.4%) |
X. Measurement Toolkit Reference
| Tool | Input | Output | Requires Inference |
|---|---|---|---|
| Participation Ratio | Embedding matrix | Effective dimensionality | No |
| Cayley-Menger Volume | 5-point subsets of embeddings | Simplex volume + CV | No |
| Pairwise Cosine | Embedding matrix (sampled) | Similarity distribution | No |
| Digit Manifold | 10 digit token embeddings | \|i−j\| vs cosine correlation | No |
| SVD Effective Rank | Any 2D weight matrix | Stable rank, condition number | No |
| QK Manifold | W_Q, W_K matrices | Eigenspectrum, pos/neg balance | No |
| Dead Neuron Count | MLP wi, wo matrices | Combined importance distribution | No |
| WordNet Relational | Encoder output (mean-pooled) | Pearson/Spearman vs path similarity | Yes |
| Alpha Convergence | Modulator training loop | Per-layer equilibrium values | Yes (training) |
Last updated: 2026-03-05
Models profiled: 3 (T5-Small, Qwen3.5-0.8B, Qwen3.5-4B)
Modulator experiments: 4 configurations