Day 1

Geometric Terrain Statistics Composite

Such a quaint little tool.

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class GeometricResidualModulator(nn.Module):
    def __init__(self, d_model=512, vocab_size=32128, n_geometric_dims=64,
                 initial_alpha=0.01, n_layers=6):
        super().__init__()
        self.d_model = d_model
        self.n_geometric_dims = n_geometric_dims
        self.geometric_embed = nn.Embedding(vocab_size, n_geometric_dims)
        self.proj = nn.Linear(n_geometric_dims, d_model, bias=False)
        logit = math.log(initial_alpha / (1 - initial_alpha))
        self.alpha = nn.Parameter(torch.full((n_layers,), logit))
        nn.init.normal_(self.proj.weight, std=0.01)

    def forward(self, residual, token_ids, layer_idx=0):
        geo = self.geometric_embed(token_ids)
        geo_projected = self.proj(geo)
        a = torch.sigmoid(self.alpha[layer_idx])
        return (1 - a) * residual + a * geo_projected

    def geometric_residuals(self):
        W = self.geometric_embed.weight
        W_n = F.normalize(W, dim=1)
        # sample up to 5000 rows uniformly from the full vocabulary
        # (permuting only the first 5000 indices would bias the sample)
        n_sample = min(W.shape[0], 5000)
        idx = torch.randperm(W.shape[0])[:n_sample]
        sample = W_n[idx]
        cos_mat = sample @ sample.T
        tri = torch.triu_indices(len(idx), len(idx), offset=1)
        flat_cos = cos_mat[tri[0], tri[1]]
        norms = W.norm(dim=1)
        centered = W - W.mean(dim=0)
        cov = (centered.T @ centered) / W.shape[0]
        eigvals = torch.linalg.eigvalsh(cov)
        pr = (eigvals.sum() ** 2) / (eigvals ** 2).sum()
        return {
            'cos_mean': flat_cos.mean().item(),
            'cos_std': flat_cos.std().item(),
            'norm_mean': norms.mean().item(),
            'pr_over_dim': (pr / self.n_geometric_dims).item(),
            'alpha': torch.sigmoid(self.alpha).detach().cpu().numpy(),
        }
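The logit-space storage of alpha can be checked in isolation: sigmoid inverts the log-odds, so training starts exactly at `initial_alpha`. A minimal sketch:

```python
import math

import torch

initial_alpha = 0.01
# alpha is stored as a log-odds so the optimizer works in an unconstrained
# space; sigmoid maps it back into (0, 1)
logit = math.log(initial_alpha / (1 - initial_alpha))
alpha = torch.sigmoid(torch.tensor(logit))   # recovers 0.01
```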


class ModulatedT5Encoder(nn.Module):
    def __init__(self, t5_encoder, modulator, modulate_layers=None):
        super().__init__()
        self.encoder = t5_encoder
        self.modulator = modulator
        if modulate_layers is None:
            modulate_layers = list(range(len(t5_encoder.block)))
        self.modulate_layers = set(modulate_layers)

    def forward(self, input_ids, attention_mask=None, output_hidden_states=False, **kwargs):
        hidden_states = self.encoder.embed_tokens(input_ids)
        hidden_states = self.encoder.dropout(hidden_states)

        if attention_mask is not None:
            extended_attention_mask = attention_mask[:, None, None, :].to(dtype=hidden_states.dtype)
            extended_attention_mask = (1.0 - extended_attention_mask) * torch.finfo(hidden_states.dtype).min
        else:
            extended_attention_mask = None

        all_hidden_states = [hidden_states] if output_hidden_states else None
        position_bias = None
        seq_length = input_ids.shape[1]
        cache_position = torch.arange(seq_length, device=input_ids.device)

        for i, block in enumerate(self.encoder.block):
            if i in self.modulate_layers:
                hidden_states = self.modulator(hidden_states, input_ids, layer_idx=i)

            block_output = block(hidden_states, attention_mask=extended_attention_mask,
                                 position_bias=position_bias, cache_position=cache_position)
            hidden_states = block_output[0]

            if position_bias is None:
                for out in block_output[1:]:
                    if isinstance(out, torch.Tensor) and out.dim() == 4:
                        position_bias = out
                        break

            if output_hidden_states:
                all_hidden_states.append(hidden_states)

        hidden_states = self.encoder.final_layer_norm(hidden_states)
        hidden_states = self.encoder.dropout(hidden_states)

        if output_hidden_states:
            all_hidden_states.append(hidden_states)

        # lightweight stand-in for a transformers BaseModelOutput
        return type('Output', (), {
            'last_hidden_state': hidden_states,
            'hidden_states': tuple(all_hidden_states) if all_hidden_states else None,
        })()


# assumes `model` is a loaded T5-Small and `device` is set as elsewhere
N_GEO = 64
modulator = GeometricResidualModulator(
    d_model=512, vocab_size=32128, n_geometric_dims=N_GEO,
    initial_alpha=0.5, n_layers=6,
).to(device)

mod_encoder = ModulatedT5Encoder(
    t5_encoder=model.encoder, modulator=modulator,
    modulate_layers=[0, 1, 2, 3, 4, 5],
)

Document Purpose

Running catalog of geometric measurements across language models. Each metric includes its formula, measurement process, and cross-model results. Designed for expansion as new models and experiments are added.


I. Models Profiled

Model Params Vocab Hidden Dim Layers Architecture Training Data
T5-Small 60.5M 32,128 512 6+6 enc-dec Transformer (relative PE) C4
Qwen3.5-0.8B 853M (752M LM + 100M ViT) 248,320 1024 - DeltaNet + MoE Multilingual + Vision
Qwen3.5-4B ~4B 248,320 2560 - DeltaNet + MoE Multilingual + Vision

II. Embedding Geometry Metrics

II.1 Participation Ratio (Effective Dimensionality)

Formula: PR = (Σλᵢ)² / Σ(λᵢ²), where λᵢ are eigenvalues of the embedding covariance matrix.

Process: Center embeddings (subtract mean), compute covariance C = EᵀE / N, eigendecompose. PR counts effective number of dimensions used. PR/dim normalizes to [0, 1].
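A minimal sketch of this process, run on a synthetic isotropic matrix standing in for a real embedding table; isotropic Gaussian rows use nearly every dimension, so PR should sit close to the full dimensionality:

```python
import torch

def participation_ratio(E: torch.Tensor) -> float:
    """PR = (sum lambda_i)^2 / sum(lambda_i^2), over eigenvalues of the
    covariance of the mean-centered embedding rows."""
    centered = E - E.mean(dim=0)
    cov = centered.T @ centered / E.shape[0]
    eigvals = torch.linalg.eigvalsh(cov)
    return ((eigvals.sum() ** 2) / (eigvals ** 2).sum()).item()

# synthetic stand-in: 20000 isotropic rows in 64 dims gives PR close to 64
torch.manual_seed(0)
E = torch.randn(20000, 64)
pr = participation_ratio(E)
```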

Model PR PR / dim Dims for 95% var
T5-Small (512d) 287.2 0.561 379 (74.0%)
Qwen3.5-0.8B (1024d) 547.7 0.535 893 (87.2%)
Qwen3.5-4B (2560d) 812.4 0.317 2125 (83.0%)

Finding: PR/dim ≈ 0.53–0.56 for the two smaller models, while Qwen3.5-4B sits at 0.32. The ratio looks like an attractor for embedding dimensionality utilization at small scale, but the 4B result shows it is not universal across scale.

II.2 Pairwise Cosine Similarity Distribution

Formula: cos(eᵢ, eⱼ) = (eᵢ · eⱼ) / (‖eᵢ‖ · ‖eⱼ‖), sampled over 5K random tokens (12.5M pairs).

Process: Random sample 5K token embeddings, L2-normalize, compute full pairwise cosine matrix, extract upper triangle.

Model Mean Std Median 1% 99%
T5-Small 0.057 0.060 0.053 -0.068 0.225
Qwen3.5-0.8B 0.195 0.085 0.197 -0.016 0.408
Qwen3.5-4B 0.142 0.078 0.139 -0.029 0.356

Finding: T5 is near-orthogonal (span corruption objective). Qwen has positive bias (autoregressive next-token prediction pushes shared "being a token" component).

II.3 Embedding Norm Distribution

Formula: ‖eᵢ‖₂ = √(Σeᵢⱼ²)

Model Mean Norm Std Min Max
T5-Small 520.15 69.84 243.31 1333.61
Qwen3.5-0.8B 0.627 0.062 0.347 1.057
Qwen3.5-4B 0.656 0.067 0.400 1.091

Note: T5 embeddings are unnormalized (large magnitudes). Qwen embeddings are near-unit norm. This affects downstream metric scaling but not relational structure.


III. Simplex Geometry Metrics

III.1 Pentachoron Volume (Cayley-Menger Determinant)

Formula: For 5 points P₀...P₄, construct the bordered distance matrix:

D = | 0  1    1    1    1    1   |
    | 1  0    d₀₁² d₀₂² d₀₃² d₀₄²|
    | 1  d₁₀² 0    d₁₂² d₁₃² d₁₄²|
    | 1  d₂₀² d₂₁² 0    d₂₃² d₂₄²|
    | 1  d₃₀² d₃₁² d₃₂² 0    d₃₄²|
    | 1  d₄₀² d₄₁² d₄₂² d₄₃² 0   |

Vol² = (-1)⁵ · det(D) / (2⁴ · (4!)²) = -det(D) / 9216
Vol = √(Vol²) if Vol² > 0, else invalid

Process: Sample 1000 random 5-token subsets. Compute Cayley-Menger volume for each. Compare to random Gaussian baseline (same norm distribution). Report CV (coefficient of variation = std/mean) and embed/random ratio.
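The bordered-determinant computation can be sketched and sanity-checked against a regular 4-simplex, whose unit-edge volume has the closed form √5/96 (the point set `P` below is an illustrative construction, not model embeddings):

```python
import math

import torch

def pentachoron_volume(P: torch.Tensor) -> float:
    """Cayley-Menger volume of 5 points (rows of P), any ambient dimension."""
    d2 = torch.cdist(P, P) ** 2               # squared pairwise distances
    D = torch.ones(6, 6, dtype=P.dtype)
    D[0, 0] = 0.0
    D[1:, 1:] = d2
    vol_sq = (-torch.linalg.det(D) / 9216.0).item()  # (-1)^5 det / (2^4 (4!)^2)
    return math.sqrt(vol_sq) if vol_sq > 0 else float('nan')

# scaled standard basis vectors in R^5 are mutually distance 1 apart,
# i.e. a regular 4-simplex with unit edges
P = torch.eye(5, dtype=torch.float64) / math.sqrt(2)
vol = pentachoron_volume(P)
```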

Model Valid/1000 CV Embed/Random Ratio
T5-Small 1000 0.233 0.855
Qwen3.5-0.8B 1000 0.208 0.984
Qwen3.5-4B 1000 0.222 0.988

Finding: CV 0.20–0.23 is a universal attractor. All models pack simplices with similar evenness regardless of architecture, scale, or training data. The "pentachoron packing constant."

III.2 Cross-Model Relational Structure

Formula: For shared tokens between two models, compute pairwise cosine matrices in each model's embedding space. Pearson correlation between flattened upper triangles measures relational preservation.

Process (Qwen 0.8B vs 4B): PCA 4B embeddings (2560→1024), Procrustes alignment using 10K anchor tokens, evaluate on 5K held-out tokens.

Comparison Relational Pearson Digit Structure Pearson
Qwen 0.8B vs 4B (raw) 0.920 0.904
Qwen 0.8B vs 4B (Procrustes) higher (post-alignment)

Finding: Models at different scales learn the same relational geometry (r=0.92).


IV. Semantic Structure Metrics

IV.1 Digit Manifold

Formula: For digit tokens '0'–'9', compute all 45 pairwise cosines. Measure Pearson correlation between |i−j| (numerical distance) and cosine similarity.

Process: Encode each digit as single token, extract embedding, normalize, compute pairwise cosine matrix.
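A sketch of the measurement (the helper name `digit_line_correlation` is ours), checked on a synthetic arc of points standing in for real digit embeddings:

```python
import torch
import torch.nn.functional as F

def digit_line_correlation(E: torch.Tensor) -> float:
    """Pearson r between numerical distance |i - j| and cosine(e_i, e_j)
    over the 45 unordered digit pairs. E is [10, d]."""
    En = F.normalize(E, dim=1)
    cos = En @ En.T
    iu = torch.triu_indices(10, 10, offset=1)
    dist = (iu[1] - iu[0]).float()            # |i - j|, since i < j
    vals = cos[iu[0], iu[1]]
    return torch.corrcoef(torch.stack([dist, vals]))[0, 1].item()

# synthetic sanity check (not real digit embeddings): points spaced along
# a short arc, so cosine falls off monotonically with |i - j|
theta = torch.linspace(0, 1.0, 10)
E = torch.stack([theta.cos(), theta.sin()], dim=1)
r = digit_line_correlation(E)   # strongly negative
```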

Model |i−j| Correlation Adjacent Mean Non-Adjacent Mean Gap
T5-Small -0.575 0.622 0.442 0.180
Qwen3.5-0.8B -0.862 0.769 0.678 0.091
Qwen3.5-4B -0.871 0.790 0.731 0.059

Finding: All models encode a number line. Stronger in Qwen (more training data). T5 has wider gap (adjacent vs non-adjacent more differentiated) despite weaker overall correlation.

IV.2 Semantic Category Clustering

Formula: For tokens in a semantic category, compute mean intra-category pairwise cosine. Compare to global mean pairwise cosine. Lift = intra − global.

Process (T5-Small): 8 hand-curated categories (animals, colors, numbers, body, food, emotions, actions, time), single-token words only.

Category N tokens Intra Cosine Global Lift
numbers 9 0.497 0.057 +0.440
colors 10 0.421 0.057 +0.365
time 10 0.351 0.057 +0.294
food 10 0.248 0.057 +0.191
animals 12 0.241 0.057 +0.184
body 10 0.216 0.057 +0.159
emotions 10 0.197 0.057 +0.141
actions 9 0.183 0.057 +0.126

V. Encoder Transformation Metrics (T5-Small)

V.1 Layer-by-Layer Geometry

Process: Feed 10 diverse sentences through encoder, capture hidden states at each layer. Measure mean norm and mean pairwise cosine between token positions.

Layer Mean Norm Pairwise Cosine
0 (embed) 377.3 0.052
1 761.6 0.278
2 1092.6 0.330
3 1428.8 0.367
4 1829.1 0.382
5 2378.3 0.419
6 (post-LN) 3.3 0.211

Finding: Norms balloon through depth, final LayerNorm crushes to ~3. Pairwise cosine increases monotonically — tokens become MORE similar through depth. The encoder is a convergence funnel.

V.2 WordNet Relational Alignment

Process: Encode 9,362 WordNet definitions via "summarize: {definition}". Mean-pool encoder output. Compare pairwise cosine to WordNet path similarity.

Representation Pearson Spearman
Static embeddings 0.078 0.015
Encoder output 0.095 0.081

50-seed stability (encoder): Pearson 0.100 ± 0.008, Spearman 0.090 ± 0.010, CV 0.204 ± 0.006.

V.3 Encoder Distance Bands

Process: Group WordNet token pairs by path similarity ranges. Measure mean cosine in each band.

WN Similarity Band N pairs Static Cosine Encoder Cosine Lift
[0.50, 0.90) 23 0.244 0.728 +0.484
[0.25, 0.50) 53,112 0.077 0.573 +0.496
[0.10, 0.25) 145,035 0.060 0.565 +0.505
[0.05, 0.10) 295,680 0.061 0.553 +0.492

V.4 Hypernym Chain Decay

Process: Find WordNet synsets forming hypernym chains (e.g., dog→canine→mammal→organism). Measure cosine between root and ancestor at each depth.

Depth Static Cosine Encoder Cosine
1 0.160 0.656
2 0.090 0.620
3 0.075 0.594
5 0.069 0.585
7 0.068 0.579

Finding: Monotonic decay in both spaces. Encoder has much stronger signal and cleaner gradient.


VI. Inactive Weight Topology (T5-Small / T5-Base)

VI.1 SVD Effective Rank

Formula: Stable rank = ‖W‖²_F / ‖W‖²₂ = Σσᵢ² / σ₁². Measures effective rank without thresholding.

Process: SVD every 2D weight matrix. Report stable rank, participation ratio, active fraction (σᵢ > 0.01·σ₁), and condition number (σ₁/σₙ).
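A minimal sketch of the stable-rank and active-fraction measurements (helper names are ours); the identity matrix and a rank-1 product make the two limiting cases concrete:

```python
import torch

def stable_rank(W: torch.Tensor) -> float:
    """||W||_F^2 / ||W||_2^2 = sum(sigma_i^2) / sigma_1^2; no threshold."""
    s = torch.linalg.svdvals(W)
    return ((s ** 2).sum() / s[0] ** 2).item()

def active_fraction(W: torch.Tensor, rel_thresh: float = 0.01) -> float:
    """Fraction of singular values above rel_thresh * sigma_1."""
    s = torch.linalg.svdvals(W)
    return (s > rel_thresh * s[0]).float().mean().item()

sr_full = stable_rank(torch.eye(100))                    # full rank: 100
sr_one = stable_rank(torch.randn(50, 1) @ torch.randn(1, 30))  # rank 1
```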

Weight Type Stable Rank (Small) Stable Rank (Base)
self_attn_q 47.6 ± 16.4 58.1 ± 17.2
self_attn_k 53.2 ± 9.2 62.4 ± 18.3
self_attn_v 75.3 97.5
mlp_wi 15.2 ± 3.8 20.6 ± 4.9
mlp_wo 31.3 43.9

VI.2 Sparsity Topology

Formula: Fraction of |wᵢⱼ| below threshold.
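The measurement is a one-liner; a sketch (the helper name is ours):

```python
import torch

def small_weight_fraction(W: torch.Tensor, thresh: float = 0.1) -> float:
    """Fraction of entries with |w| below an absolute threshold."""
    return (W.abs() < thresh).float().mean().item()

W = torch.tensor([[0.05, 0.50], [0.20, 0.01]])
frac = small_weight_fraction(W)   # 2 of 4 entries below 0.1
```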

Weight Type <0.1 (Small) <0.1 (Base)
self_attn_q 93.7% 99.4%
self_attn_k 19.2% 30.0%
self_attn_v 12.1% 16.2%
mlp_wi 11.9% 16.9%
Full model 18.4% 27.9%

Finding: Q matrices are overwhelmingly sparse: more than 93% of query-projection entries have magnitude below 0.1, while K matrices are dense. This asymmetry grows with scale. The Q null space is the intervention point for geometric modulation.

VI.3 QK Similarity Manifold

Formula: QK = W_Q · W_Kᵀ. Eigendecompose the symmetric part (QK + QKᵀ)/2. Positive eigenvalues = attraction directions. Negative eigenvalues = repulsion directions.

Process: Compute per-layer. Track positive/negative balance and stable rank.
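A sketch of the per-layer computation (the helper name is ours). For a symmetric matrix the singular values are |λᵢ|, so the stable rank can be read off the eigenvalues directly:

```python
import torch

def qk_eigenspectrum(W_q: torch.Tensor, W_k: torch.Tensor):
    """Eigen-analysis of the symmetric part of QK = W_q @ W_k.T.
    Positive eigenvalues mark attraction directions, negative repulsion.
    Stable rank = sum(lambda^2) / max|lambda|^2."""
    qk = W_q @ W_k.T
    sym = 0.5 * (qk + qk.T)
    eig = torch.linalg.eigvalsh(sym)
    sr = ((eig ** 2).sum() / eig.abs().max() ** 2).item()
    return (eig > 0).sum().item(), (eig < 0).sum().item(), sr

# limiting case: identity projections give a pure-attraction spectrum
n_pos, n_neg, sr = qk_eigenspectrum(torch.eye(4), torch.eye(4))
```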

Layer (Encoder) Stable Rank Positive Eig Negative Eig Symmetry Dev
0 39.5 315 197 0.993
2 10.1 269 243 1.217
5 5.35 274 238 1.252

Finding: Similarity function narrows through depth (stable rank 39→5). Negative eigenvalue count increases — deeper layers define more anti-similarity boundaries.

VI.4 MLP Dead Neurons

Formula: Combined importance = ‖wᵢ_up‖₂ · ‖wᵢ_down‖₂. Dead if < 1% of mean.
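A sketch of the count, assuming T5's wi/wo layout where neuron i is row i of wi and column i of wo:

```python
import torch

def dead_neuron_count(wi: torch.Tensor, wo: torch.Tensor) -> int:
    """Count neurons whose combined importance ||wi_row|| * ||wo_col||
    falls below 1% of the mean importance. Assumes wi is [d_ff, d_model]
    and wo is [d_model, d_ff] (T5's layout)."""
    importance = wi.norm(dim=1) * wo.norm(dim=0)
    return int((importance < 0.01 * importance.mean()).sum())

# zeroing one up-projection row should register exactly one dead neuron
wi = torch.ones(8, 4)
wi[3] = 0.0
n_dead = dead_neuron_count(wi, torch.ones(4, 8))
```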

Finding: Zero dead neurons across all layers, both encoder and decoder, at both Small and Base scale. T5 is parameter-starved — every neuron earns its keep.

VI.5 Position Bias Topology

Process: T5 uses learned relative position biases: [32 buckets, N heads]. Measure per-head: monotonicity, distance correlation, peak bucket.

Encoder (T5-Small): 3 local heads (peak 0-1, negative dist_corr), 2 global heads (peak 17-18, positive dist_corr), 3 mixed.

Decoder (T5-Small): 4 far-looking heads (peak 31, values up to +48), 4 local heads (peak 0-1, values down to -34.5). Extreme magnitude asymmetry — far-looking heads are 10× stronger.

Finding: This local/global split emerges identically across T5-Small, T5-Base. It's an architectural invariant.


VII. Geometric Residual Modulator

VII.1 Architecture

  • Geometric embedding: [vocab_size, 64] — per-token geometric fingerprint
  • Projection: Linear(64, d_model, bias=False) — Procrustes-aligned to encoder PCA space
  • Alpha: per-layer learnable LERP coefficient, stored in logit space, applied via sigmoid
  • Intervention: residual_out = (1 − α) · residual + α · proj(geo_embed(token_ids))
  • Params: 2.09M (3.45% of T5-Small)

VII.2 Geometric Embedding Initialization

Process:

  1. Build 3000×3000 Wu-Palmer similarity matrix from WordNet anchors (~6 min)
  2. Eigendecompose → top 64 eigenvectors scaled by √eigenvalue → 64-d embeddings
  3. Project remaining tokens via GPU embedding cosine proxy (10-NN, softmax-weighted, <1 sec)
  4. Procrustes align projection matrix to encoder PCA space
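Step 2 can be sketched on a synthetic low-rank similarity matrix standing in for the 3000×3000 Wu-Palmer matrix (smaller here so it runs instantly):

```python
import torch

# synthetic symmetric PSD "similarity" matrix of rank <= 64
torch.manual_seed(0)
n, k = 300, 64
A = torch.randn(n, k)
S = A @ A.T / k

evals, evecs = torch.linalg.eigh(S)          # ascending eigenvalues
top = torch.argsort(evals, descending=True)[:k]
emb = evecs[:, top] * evals[top].clamp(min=0).sqrt()   # [n, 64] embeddings

# since rank(S) <= 64, the top-64 factorization reconstructs S
recon_err = (emb @ emb.T - S).abs().max().item()
```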
Metric Value
WN reconstruction correlation 0.921
Procrustes alignment cosine 0.372
Eigenvalue cumulative (top 64) 61.3%

VII.3 Alpha Convergence

Process: Freeze T5, train only modulator (geometric embed + projection + alpha). Task: summarize definition → lemma word. Track alpha per layer.

Start α Final Mean α Layer 5 Final Pearson Δ CV Coherent Basin
0.01 (20 ep) 0.067 0.107 +0.151 0.220 Yes Binding
0.20 (20 ep) 0.222 0.308 +0.085 0.452 No Ridge
0.70 (20 ep) 0.695 0.640 -0.029 0.482 No Separation
0.01 (100 ep) 0.125 0.218 +0.074 0.322 No Overfit

Finding: Two stable attractor basins exist — binding (0.07) and separation (0.70). The binding basin produces functional results. Starting at 0.01 with early stopping (20 epochs) is optimal.

VII.4 Depth Gradient (Consistent Across All Runs)

Layer 20ep (α=0.01) 100ep (α=0.01) 20ep (α=0.20)
0 0.015 0.035 0.170
1 0.052 0.061 0.180
2 0.066 0.102 0.227
3 0.080 0.137 0.197
4 0.080 0.197 0.248
5 0.107 0.218 0.308

Finding: Always monotonically increasing. The model wants minimal geometric modulation early and maximum modulation at the deepest layer. Geometry is a final correction, not an initial condition.

VII.5 Best Result

Metric Original Modulated (20ep, α=0.01 start) Change
WordNet Pearson 0.099 0.250 +152%
WordNet Spearman 0.085 0.245 +189%
Semantic Gradient 0.022 0.052 +132%
Pentachoron CV 0.202 0.220 Stayed in band
Per-token Preservation 0.730
Coherence Baseline Identical on 4/4 tests

VIII. The 0.29154 Constant

VIII.1 Observations Across Systems

System Context Value
MinimalShunts CLIP-L ↔ CLIP-G projection gate Emergent equilibrium
Wormhole Lambda Vision transformer training Converges from 0.74 toward ~0.29
Alpha curriculum Devil's Staircase PE training Converges to ~0.50 under geometric loss, CE destroys
T5 generation Greedy decode alpha sweep Stable plateau at 0.291–0.292, semantic phase transition

VIII.2 T5 Generation Phase Transition

Alpha Output (triangle prompt)
0.01–0.10 "triangle is a polygon with three edges and three vertices. it is one of the basic shapes in geometry."
0.20 "a triangle is a polygon with three edges and three vertices..."
0.28 "a polygon with three vertices. it is one of the basic shapes in a graph."
0.291 "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in a graph."
0.2915 "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in a graph."
0.292 "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in the world."
0.30 "a polygon with a vertice and a vertice. it is one of the basic shapes in the world."

Finding: 0.29154 marks the phase boundary between structural representation ("graph") and physical representation ("world"). Output is invariant to perturbation in a narrow band centered on the constant.


IX. Universal Geometric Constants

Constant Value Observed In
Pentachoron CV 0.20–0.23 T5-Small, Qwen 0.8B, Qwen 4B, trained modulator
Participation / dim 0.53–0.56 T5-Small, Qwen 0.8B
Binding/separation constant 0.29154 / 0.70846 MinimalShunts, CLIP projections, T5 generation, alpha convergence
Depth gradient Monotonic increasing All modulator training runs
Q sparsity scaling Increases with model scale T5-Small (93.7%), T5-Base (99.4%)

X. Measurement Toolkit Reference

Tool Input Output Requires Inference
Participation Ratio Embedding matrix Effective dimensionality No
Cayley-Menger Volume 5-point subsets of embeddings Simplex volume + CV No
Pairwise Cosine Embedding matrix (sampled) Similarity distribution No
Digit Manifold 10 digit token embeddings |i−j| correlation No
SVD Effective Rank Any 2D weight matrix Stable rank, condition number No
QK Manifold W_Q, W_K matrices Eigenspectrum, pos/neg balance No
Dead Neuron Count MLP wi, wo matrices Combined importance distribution No
WordNet Relational Encoder output (mean-pooled) Pearson/Spearman vs path similarity Yes
Alpha Convergence Modulator training loop Per-layer equilibrium values Yes (training)

Last updated: 2026-03-05
Models profiled: 3 (T5-Small, Qwen3.5-0.8B, Qwen3.5-4B)
Modulator experiments: 4 configurations
