state as a matrix.

Modern language models work around the medium they compute on — flat vectors for state, and O(h) sequential steps for anything that composes. I'm building the version that doesn't.

The program has one representational commitment: model state as a d × d matrix, never a flat vector, and never flatten it back. A matrix has rank, so "how much has this model actually memorized" is a number you can compute from one forward pass instead of estimating from behavior. It's a fast weight — it can be written to and read from in-context, at inference time, with no gradient step. Four results now hang off that one commitment: a fixed 32,768-byte state holds K=32 recall at acc ≥ 0.998 across an 8× growing context while a matched transformer with an uncapped KV cache reads chance the whole time (no. 11); an in-context-written relation composes exactly to depth 61 (median per-seed failure front 253) via O(log h) reads — a separate timing probe (reads only, not exactness) clocks those reads 20.9× faster than the best O(h) baseline at depth 1,021, widening to ~13,000–25,000× at depth 2²⁰ (no. 14); across five finite groups with provable minimum representation dimensions, the rank SGD recruits tracks that minimum exactly (the tie-capped maximum), and forcing it one dimension short drives recovery to zero in all five (no. 07). None of that would be trustworthy without the fourth thread: three separate times, a result that looked like a capability win turned out to be an instrument bug, and the fix — not the cover-up — is published alongside the number (no. 15).

ICML 2026 MI workshop paper published · capability-separation program active · 15 research notes, 3 self-caught instrument bugs, 0 swept under the rug

arc

One representational commitment, four stages.

The program started as three parallel threads — bytes, matrices, and matrix-tracked reasoning — in February 2026. Two of those threads resolved. Byte-level input is on hold (Stage 01, below): a real cost worth paying eventually, never bundled with the matrix question. The bolt-on version of matrix-tracked reasoning was tested and falsified in April, cleanly enough to publish. What survived, and is now the whole program, is narrower and stronger than the original pitch: model state as a matrix, never flattened, trained from scratch rather than bolted onto a vector-pretrained model. Stages 03 and 04 below are where the last four months of work actually happened.

01 bytes

Replace the tokenizer.

Subword tokenization is a learned compression layer between text and the model. It carries assumptions — language-shaped, English-shaped, web-shaped — and rules out raw audio waveforms, raw pixels, and arbitrary binary formats without a separate encoder per modality. Byte-level models drop that layer. BLT (Meta, 2024) replaced Chameleon's VQ tokenizer with byte patches at 8B parameters; EvaByte and MambaByte demonstrated byte-level processing at smaller scale.

Bytes introduce three implementation costs: sequence length grows ~3.5× over subword tokens for the same text and orders of magnitude more for audio and video, standard O(L²) attention does not scale to byte-length contexts, and a single byte carries near-zero signal so the model must reconstruct relational structure at every layer. The third cost is what motivates Stage 02.

02 matrices

Replace flat vectors with d × d matrices.

A matrix has rank, condition number, and singular values; a flat vector of the same dimension has none of these as differentiable observables. A d²-dim vector and a d × d matrix carry the same number of scalars, but only the matrix exposes structure that downstream operations can preserve or destroy. The architectural commitment: every primitive in the model — embedding, attention, FFN, output head — operates on matrices without flattening. Composition is matrix multiplication, attention scores are Frobenius inner products, the output head is a bilinear probe.

Two findings make this load-bearing. Outer-product matrix embeddings beat flat-vector embeddings by 26% lower BPB at 2.2× fewer parameters in a param-matched ablation, and by 175× in perplexity on a larger reasoning corpus (Finding no. 01). A matrix transformer layer at d = 16 uses 128× fewer parameters than the flattened d²×d² linear equivalent — a per-layer parameter-count claim, not a matched-capacity or matched-compute claim (Finding no. 04), shifting the parameter budget from embeddings to thinking layers.

03 capability

Bolt-on superposition was falsified. Trained-from-scratch capability separation is what replaced it.

Continuous chain-of-thought models — COCONUT (Meta, 2024), CODI (Meta, 2025) — feed back continuous latent vectors instead of discrete tokens, letting the model perform inference-time compute internally. The theoretical claim (Reasoning by Superposition, CoT2, 2025) is that these latents hold multiple reasoning paths in superposition, exploring a search tree in parallel. A d × d thought Z that holds k linearly independent patterns has rank k by construction — computable from a single sample via SVD, unlike a flat vector. The matrix-CODI experiment (Finding no. 05, ICML 2026 MI workshop) built that probe and ran the test on a bolt-on matrix bottleneck inside a vector-pretrained, vector-teacher CODI pipeline. Result: rank-k ablation curves are flat across two tasks and four readout designs, and best ProsQA accuracy decouples from final effective rank across seeds (81.5 ± 1.2pp at ranks {4, 12, 13}) — the CODI training objective produces rank-indifferent gradients through the full chain rule. Published, and does not decide the broader question.

The broader question moved off bolt-on distillation entirely: train a matrix-state model from scratch, on tasks where rank is a provable, not assumed, requirement. Across five finite groups with known minimum representation dimensions, recruited rank tracks the minimum exactly (Spearman ρ = 0.9747, the tie-capped maximum), and forcing rank one dimension short drives recovery to exactly zero in all five (no. 07). The same fast-weight family separates cleanly from param-matched and FLOP-matched baselines on high-load recall (no. 08), holds that recall on a state fixed at 32 KB out to 8× the binding window while a growing KV cache does not help a matched transformer at all (no. 11), and composes an in-context-written relation exactly at O(log h) cost against O(h) baselines that must apply their step h times (no. 14) — a result that so far is a clean WIN at K=8, a partial win at K=12, and, extended to a bank of three relations, converges only after a targeted trainability fix and remains seed-bimodal at far depth; the same fix, tested directly on the K axis, moves the trainability wall exactly one rung (K=15 scales, K=16/K=24 don't) before re-forming (no. 16). Compositional-depth generalization on a different architecture (a recurrent composer over the same finite groups) holds everywhere the model actually converges but does not yet clear as a general capability claim (no. 13) — the honest state of that thread, not a result to round up.

Supporting framing: Fedorenko et al. (Nature, 2024) dissociates the human language network from reasoning, math, and theory of mind — evidence that reasoning can run on representations that are not language-shaped.

04 measurement

Every capability claim above survived an attempt to break the instrument that measured it.

Three separate times this program built a readout, got a result that looked like a capability win or a clean null, and then found the instrument itself was wrong before publishing. A 62-cell compositional-depth sweep read as an outright failure until its own pre-registered crosscheck contradicted the primary readout by a full 0-vs-1.0 margin on exactly the cells that had actually converged — the primary lens's identity-gauge assumption, validated on a different architecture, silently failed on this one (no. 12). The diagnosis that followed traced every remaining miss to a named trainability mechanism rather than an architecture limit, and the promotion call landed as a qualified positive, not a capability win rounded up from a null (no. 13). Most strikingly, a published "recovery reads zero at every scale" null — the reasoning-link result, closed once already — turned out to rest on the same class of state-layout transpose bug three separate times across the codebase; fixing it reopened a real, nonzero signal, and then two independent correspondence-null controls, pre-registered before either ran, showed that signal itself was trivial (a collapsed prediction direction against near-collinear values). The corrected, now doubly-validated claim is "null-indistinguishable," never "reads zero" — a more precise finding than either the original null or the naive reopening (no. 15). None of this is a caveat bolted on after the fact. It is the actual discipline the capability numbers in Stage 03 were produced under, and it is published with the same care as the wins.

active now

What's actually running.

This replaces an older, more speculative roadmap (byte-level JEPA, cross-modality transfer at scale) that got superseded by the capability-separation pivot — that speculative material is preserved in the repo but is not what the GPUs are doing. Everything below is a live campaign with its own pre-registered decision gate.

done →

Matrix-native from scratch on a rank-K task

The matrix-CODI negative result localized the failure to the linear-in-Z readout. This experiment removed that readout entirely: fast-weight matrix-state models trained from scratch on tasks with provable rank requirements. Result: SGD recruits exactly the provably-necessary rank, and the recruitment is causal — recovery is exactly zero one dimension below the group-theoretic minimum (Finding no. 07). The same model family separates from param-matched flat-vector and FLOP-matched transformer baselines on high-load recall (Finding no. 08, confirmed at n=3), holds that recall on a fixed 32 KB state out to 8× the binding window (Finding no. 11), and composes exactly at O(log h) read cost (Finding no. 14). The full M×H fanout behind no. 11 (90 cells) has also completed: 0 of 90 clear the demonstration bar at any KV-cache cap or horizon — the transformer baseline never engages, at any setting, so the "no memory-multiplier number claimed" verdict holds down to the seed level.

~85 H100-hours realized

done →

Operator-bank seed replication and far-depth precision

Finding no. 16 recovered convergence for a 3-relation operator bank at n=1 and left two things open: whether the fix (annealed inter-hop LayerNorm) holds across seeds, and why in-distribution-perfect operators still miss the strict recovery bar at far depth. The replication has run to n=9 total: convergence holds 9/9 (0/9 dead seeds, vs the K=12 single-relation precedent's 2/10), and far depth at h*=61 resolves as seed-bimodal rather than uniformly open — 2/9 seeds clear the strict 0.9 bar, split cleanly by an observed (not shown causal) pairing with each seed's converged phase residual.

3.94 H100-hours realized

done →

Stage-2 compositional depth: the full 62-cell grid

The 11-cell calibration gate plus its 51-cell remainder — together the full compositional-depth grid — completed and reached a ratified verdict: INCONCLUSIVE-TRAINABILITY-LIMITED (findings no. 12–13). Exact composition holds wherever training converges; no general capability win is claimed. The mechanical primary-lens readout said FALSIFY, but instrument-health adjudication found it was reading a broken instrument on exactly the perfectly-converged cells (finding no. 12) — the corrected diagnosis (finding no. 13) named every remaining miss as a trainability event, not an architecture limit.

complete

done →

NCR earlyln K-scaling: does the bank's trainability fix extend to the K axis?

Finding no. 16's §06 write-capacity probe left an explicit next step: re-run the plain-recipe-DEAD K=15/16 rungs with the same annealed-LayerNorm fix (earlyln) that recovered the R=3 bank, at ≥4 seeds, plus a new untested rung. That 16-cell run (K∈{14,15,16,24} × 4 seeds) has completed and reached a pre-registered, worst-of-scored verdict: TRAINABILITY-STILL-LIMITED — the fix does not unlock the K axis wholesale. It does move the wall exactly one rung: K=15 goes from plain-recipe DEAD to 4/4 converged and holds far-depth composition (median 0.9929 at h*=117); the wall re-forms at K=16 (1/4 converged) and K=24 (0/4, a distinct partial-formation failure profile, not the plain recipe's flat-loss collapse). A bonus finding: earlyln improves write quality, not just convergence, at every rung it converges. Full record in finding no. 16 §07.

7.06 H100-hours realized (of the ≤8 hard cap)

done →

NCR next-lever wave: WHY does the K-scaling wall re-form?

§07 stopped with two priced probes and a live question: is the K=16/K=24 wall a budget limit, or something else? A 2× budget/anneal probe (8 cells) moved K=16's convergence rate 1/4 → 3/4 but left far depth at exactly 0.0, and made K=24 flat-to-worse — evidence against a simple budget law. A pre-registered design round then set up a genuine three-story discriminator, attack-audited before any cell launched (1 CRITICAL conceded and restructured, 2 SERIOUS, all dispositioned): does tight-spare state sizing help or hurt (two opposing pinned mechanisms), or is the wall an absolute-K ceiling regardless of state size? 20 cells (≈13.61 GPU-h) answered mechanically off the pre-registered map: a 4× budget probe at K=16 came back NO-LAW — write residual got worse in 3/4 seeds from 2× to 4×, and a 2×-converged seed's failure front regressed. Retraining at the tight-spare d=K+1 mapping instead of the conventional d=2K CONFIRMED at both K=16 and K=24: 4/4 converged on 1× budget alone at both K, where d=2K needed 4× budget (K=16) or never converged at all (K=24) — and K=16 under tight-spare reached genuine far-depth composition (0.80-0.99 recovered@0.9 in 2/4 seeds at h*=125) that no d=2K cell ever reached at any budget or anneal-shape tried. A parallel anneal-shape probe partially confirmed at K=16, falsified at K=24. Verdict: the K-scaling wall was substantially an artifact of the d=2K state-sizing convention at the two K values tested (K=16, K=24), not absolute K and not raw compute — within that range. The correct d(K) mapping law within K≤24 and the mechanism behind d=2K's failure remain open; K>24 behavior does NOT remain open — it has since been tested, see the next phase. Full record in finding no. 16 §08.

13.61 H100-hours realized (of the 14.05 nominal, ≤20 cap)

done →

NCR mapping-law grid: does the tight-spare fix generalize past K=24?

§08's own disclosed gap named this test directly: the tight-spare d=K+1 mapping confirmed at K=16 and K=24 — does it keep working at K=32? 12 cells (K=32 × d∈{K+1, 1.25K, 1.5K} × 4 seeds, queue jobs 009–020, ≈6.37 GPU-h) answered NO: every cell is dead at every d tried, including Gate-1 in-distribution convergence itself (0/12 CONVERGED; best cell 0.871 recovered@0.9, under the 0.9 bar; write-residual range 0.40–1.31; failure front pinned at 29, =K−3, the trivial train-residue rung, in all 12 cells; far-depth recovery 0.0 everywhere). This is a stronger negative than "converges but doesn't compose far" — basic convergence itself fails at K=32 even at the tightest spare mapping. The K≤24 tight-spare CONFIRM (finding no. 16 §08) stands unchanged; it does not generalize past K=24 — a K-dependent component of the wall re-emerges beyond it. Full record in finding no. 16 §09.

6.37 H100-hours realized

closed →

NCR K-axis: budget-rescue probe closes the book at K=32

The final pre-registered probe on this axis: does extra training budget rescue K=32's tight-spare (d=K+1) arm, the "least dead" cell class in the mapping-law grid? 8 cells (K=32/d=33, budgets {2×, 4×} × 4 seeds, ≈12.08 GPU-h) answered no — Gate-1 convergence improves directionally (0/4 → 1/4 → 2/4 CONVERGED) but never reaches the 3/4 CONVERGED-ROBUST bar, and the failure front is pinned at the trivial K−3 rung in all 12 cells now on record for this arm across all three budgets tested, zero far-depth recovery anywhere. The probe's own pre-registered ANOMALY check fired (write residual non-monotonic across budget in 3/4 seeds — the same signature seen once before at K=16's 4× probe), pre-empting a clean four-way verdict label by the design's own binding rule; reported plainly as the pre-registration working, not a dodge. The practical question resolves anyway: the sole outcome that could re-open the ladder needs both a robust convergence rate AND a far-depth front at or past h*=253, and neither is met at any budget — so K=48's own grid and the rest of the parked K≥48 backlog (~144 GPU-h) stay blocked. Verdict: K-AXIS CLOSED AT K=32. The d=K+1 convention fix is real and mechanistically explained (7–14× lower write-leakage) — bounded to K≤24; an unidentified K-dependent factor blocks K=32 at every mapping tried, mechanism unknown. One genuine positive banked alongside the closure: seed-extending K=24's tight-spare cells to n=12 found far-depth reliability is predictable from the write residual alone (Spearman ρ≈−0.877/−0.873) — a free, validated instrument. Full record in finding no. 16 §10.

16.12 H100-hours realized (12.08 budget-rescue + 4.03 seed extension)

on hold →

Byte-level input (Thread 01)

Raw-byte processing without a subword tokenizer remains a real, motivated direction — BLT (Meta, 2024) and MambaByte both show it works at scale — but it is deliberately not bundled with the matrix-representation question. Revisiting it means testing tokenization as its own axis, on its own evidence, after the current capability-separation campaigns reach a verdict.

not scheduled

dead ends

Directions the evidence has ruled out.

Negative results are data. These are the specific hypotheses I tested and ruled out, which narrowed the current direction. Each one was a pre-registered experiment with a clear falsification criterion.

PHM learned algebraic structure

Parameterized Hypercomplex Multiplication layers were supposed to learn quaternion-like algebra. Instead they converge to nilpotent structure — the optimizer treats PHM as a low-rank factorization rather than a learned algebra. CliffordNet (2026) confirmed the same thing from the other direction: algebraic structure works when fixed, not when learned.

PonderNet halting at small scale

Adaptive halting mechanisms collapse to "always stop at step 1" below ~10M parameters. Expected steps converges to 1.0. Use fixed iteration counts or LoopFormer-style consistency training instead.

3D matrix attention

Computing per-pair matrix products as attention scores drives representations to collapse to rank 1 and produces worse predictions. Dead end confirmed across multiple runs and scales.

Cross-domain transfer (original framing)

The original "matrix structure enables cross-domain transfer" hypothesis was ruled out by six fatal attacks before any experiment ran. The question survives as a research direction but needed a sharper formulation and larger scale to be testable.

Thought interleaving at toy scale

CoCoMix-style thought interleaving mechanisms don't produce benefits below ~10M parameters. The mechanism works but the scale is insufficient to measure its benefit.

Learned byte segmentation at small scale

Learned segmentation boundaries consistently underperform fixed-stride segmentation at the scales I tested. The learning signal isn't strong enough below ~100M parameters. BLT works at 8B, so this may reverse at scale — but not within this compute regime.

Frozen-bias fixes for the write-geometry attractor

Key-write geometry collapses more with scale, not less (span_frac 0.248 at 14M → 0.455 at 1.31B). A pre-registered fix-at-scale wave tested frozen-bias constructions at 98M and 392M — the one that stabilized the attractor at 14M decays to near-zero and sign-flips by 392M; the deployed variant stays destabilizing at every scale tested. No tested fix survives scale-up. Val-loss stays neutral throughout, which is the only half of the story that transfers.

Key-anchoring above K/d = 0.5

An explicit entity-anchor table matches learned behavior at low load, but a frozen, never-trained random anchor table performs identically — the effect is arithmetic constancy in the key blend, not learned entity alignment. It does not transplant to higher load: recovery falls off a cliff from ≈1.00 to ≈0.02 between K/d = 0.5 and K/d = 0.75.

Learned depth selection for exact composition

Killed analytically before any GPU time, not empirically: a soft (differentiable) mixture over composition depths is a matrix polynomial, not the exact matrix power the whole exact-composition claim depends on, and a hard/discrete selector reproduces this project's own documented PonderNet-halting collapse. Every composition finding on this site supplies depth as an explicit input signal for this reason.

Modern language models work around the medium they compute on — flat vectors for state, and O(h) sequential steps for anything that composes. I'm building the version that doesn't.

One representational commitment, four stages.

Replace the tokenizer.

Replace flat vectors with d × d matrices.

Bolt-on superposition was falsified. Trained-from-scratch capability separation is what replaced it.

Every capability claim above survived an attempt to break the instrument that measured it.

Findings.

Operator bank: architecture proven, exact contender convergence-recovered, far depth still open

Null that survives is the one you can trust

O(log h) reads: 20.9× faster at depth 1,021 — and it composes exactly

Composes exactly at 8× training depth, where it converges

When the instrument is the failure: a 0-vs-1.0 readout contradiction on perfectly converged models

Constant-memory recall: a fixed 32 KB state holds at 8× the binding window

Fast-weight memory capacity is super-linear in state dimension

Write-geometry attractor grows with scale, it is not a qk-norm artifact — and the tested fixes do not survive scale

Fast-weight recall separation at matched budgets, and where the binding actually lives

Rank law: recruited state dimension tracks the group-theoretic minimum, causally

The gradient does not see rank: rank-indifference in matrix-CODI on ProsQA

Matrix operations are 128× more parameter-efficient per layer

Output mechanism shapes rank trajectory in matrix-valued refinement

Rank enrichment is an emergent, novel phenomenon

Outer-product matrix embeddings outperform flat vectors per parameter

A reading of the field.

What's actually running.

Matrix-native from scratch on a rank-K task

Operator-bank seed replication and far-depth precision

Stage-2 compositional depth: the full 62-cell grid

NCR earlyln K-scaling: does the bank's trainability fix extend to the K axis?

NCR next-lever wave: WHY does the K-scaling wall re-form?

NCR mapping-law grid: does the tight-spare fix generalize past K=24?

NCR K-axis: budget-rescue probe closes the book at K=32

Byte-level input (Thread 01)

Directions the evidence has ruled out.

PHM learned algebraic structure

PonderNet halting at small scale

3D matrix attention

Cross-domain transfer (original framing)

Thought interleaving at toy scale

Learned byte segmentation at small scale

Frozen-bias fixes for the write-geometry attractor

Key-anchoring above K/d = 0.5

Learned depth selection for exact composition

Get in touch.