let them eat bytes.
Three parallel threads. Bytes replace the tokenizer, so the model sees the same stream the hardware does. Matrix-valued tokens replace flat vectors, giving every representation differentiable structural observables — rank, singular values — that a d-dimensional vector cannot expose. Continuous reasoning with matrix-tracked superposition replaces discrete chain-of-thought, so the number of reasoning paths a model holds at inference is measurable from a single sample rather than estimated across an ensemble. Early evidence: outer-product matrix embeddings cut per-parameter perplexity 11× against flat vectors at matched parameters; a matrix transformer layer at d = 16 uses 130× fewer parameters than vector attention at matched capacity. The first matrix-CODI experiment falsified one operationalization of rank-tracked superposition under a distillation objective and shipped a mechanism explaining why; the next runs the test on a matrix-native model trained from scratch.
matrix-CODI complete · findings being published
Each thread is independently testable and has its own near-term experiment. The compound claim is that bytes give a substrate, matrices give a representation with single-sample structural observables, and continuous reasoning over those matrices yields inference-time compute whose internals are directly measurable rather than inferred.
Subword tokenization is a learned compression layer between text and the model. It carries language-shaped, English-shaped, and web-shaped assumptions, and it rules out raw audio waveforms, raw pixels, and arbitrary binary formats unless each modality gets its own encoder. Byte-level models drop that layer. BLT (Meta, 2024) replaced the subword tokenizer with dynamically sized byte patches at 8B parameters; EvaByte and MambaByte demonstrated byte-level processing at smaller scale.
Bytes introduce three implementation costs. First, sequence length grows ~3.5× over subword tokens for the same text, and by orders of magnitude for audio and video. Second, standard O(L²) attention does not scale to byte-length contexts. Third, a single byte carries near-zero signal, so the model must reconstruct relational structure at every layer. The third cost is what motivates Stage 02.
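A rough sketch of the first two costs, assuming the ~3.5× length ratio quoted above and plain quadratic attention. The sample string and both helper names are illustrative and not part of the project code.

```python
def to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 and return one integer in [0, 255] per byte."""
    return list(text.encode("utf-8"))


def attention_cost_ratio(length_ratio: float) -> float:
    """Relative cost of O(L^2) attention when sequence length grows by length_ratio."""
    return length_ratio ** 2


# Illustrative string with two non-ASCII characters (written as escapes).
sample = "na\u00efve caf\u00e9"
ids = to_byte_ids(sample)

print(len(sample), len(ids))      # 10 characters become 12 bytes
print(max(ids) < 256)             # the vocabulary is fixed at 256 symbols
print(attention_cost_ratio(3.5))  # 12.25
```

Under these assumptions a 3.5× longer sequence costs roughly 12× more in attention, which is why the byte path needs something other than full quadratic attention.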
A matrix has rank, condition number, and singular values; a flat vector of the same dimension has none of these as differentiable observables. A d²-dim vector and a d × d matrix carry the same number of scalars, but only the matrix exposes structure that downstream operations can preserve or destroy. The architectural commitment: every primitive in the model — embedding, attention, FFN, output head — operates on matrices without flattening. Composition is matrix multiplication, attention scores are Frobenius inner products, the output head is a bilinear probe.
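The three primitives named above can be sketched in a few lines of NumPy. This is a minimal illustration under my own assumptions, not the project's implementation: the sizes, the left-multiplication weights `Wq`, `Wk`, `Wv`, the `1/d` score scaling, and the probe vectors `U`, `W` are all hypothetical, and the attention is unmasked for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, vocab = 5, 16, 256            # illustrative: 5 tokens, 16x16 matrices, byte vocab

X = rng.standard_normal((n, d, d))  # a sequence of matrix tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

# Composition is matrix multiplication: each token is left-multiplied by a d x d weight.
Q, K, V = Wq @ X, Wk @ X, Wv @ X    # each has shape (n, d, d)

# Attention scores are Frobenius inner products <Q_i, K_j>_F = sum_ab Q_i[a,b] * K_j[a,b].
scores = np.einsum("iab,jab->ij", Q, K) / d
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# Mixing values keeps every output a d x d matrix; nothing is flattened.
out = np.einsum("ij,jab->iab", weights, V)

# Bilinear probe: one logit per vocabulary entry, logit_v = u_v^T Z w_v.
U = rng.standard_normal((vocab, d))
W = rng.standard_normal((vocab, d))
logits = np.einsum("va,iab,vb->iv", U, out, W)

print(out.shape, logits.shape)      # (5, 16, 16) (5, 256)
```

Because every intermediate stays a d × d matrix, rank and singular values can be read off at any point in the forward pass.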
Two findings make this load-bearing. Outer-product matrix embeddings beat flat-vector embeddings by 11× in per-parameter perplexity at matched parameters (Finding no. 01). A matrix transformer layer at d = 16 uses 130× fewer parameters than a standard vector attention layer at matched capacity (Finding no. 04), shifting the parameter budget from embeddings to thinking layers.
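To give a feel for where a ratio of this size can come from, here is a back-of-envelope count. It is a hypothetical accounting, not the one used in Finding no. 04: it assumes "matched capacity" means a flat vector carrying the same d² = 256 scalars, that biases are ignored, and that each matrix projection is a two-sided map X → A X B with two d × d factors.

```python
def vector_attention_params(width: int) -> int:
    """Q, K, V and output projections, each a width x width matrix (biases ignored)."""
    return 4 * width * width


def matrix_attention_params(d: int, sides: int = 2) -> int:
    """Q, K, V and output maps on a d x d token, each built from `sides` d x d factors.

    sides=2 is the assumed two-sided map X -> A @ X @ B; sides=1 is left-multiplication only.
    """
    return 4 * sides * d * d


d = 16
flat = vector_attention_params(d * d)      # a flat vector carrying the same 256 scalars
two_sided = matrix_attention_params(d, 2)
one_sided = matrix_attention_params(d, 1)

print(flat, two_sided, flat // two_sided)  # 262144 2048 128
print(flat, one_sided, flat // one_sided)  # 262144 1024 256
```

Under the two-sided assumption the ratio is 128×, the same order as the reported 130×; a different parameterization would shift the exact figure.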
Continuous chain-of-thought models such as COCONUT (Meta, 2024) and CODI (2025) feed back continuous latent vectors instead of discrete tokens, letting the model perform inference-time compute internally. The theoretical claim (Reasoning by Superposition, CoT2, 2025) is that these latents hold multiple reasoning paths in superposition, exploring a search tree in parallel. A 2026 rebuttal (Illusion of Superposition, arXiv 2604.06374) showed that on ProsQA the latent feedback is not doing measurable work in some fine-tuned settings.
The dispute is empirical and the obstacle is measurement: superposition is a property of the representation, but a flat vector exposes no single-sample structure to probe. A matrix does. If a d × d thought Z holds k linearly independent patterns, its rank is k by construction, and rank is computable from a single sample via SVD. The matrix-CODI experiment built that probe and ran the falsification. The result: rank-k ablation curves are flat across two tasks and four readout designs, and best ProsQA accuracy decouples from final effective rank across seeds (81.51 ± 1.2pp at ranks {4, 12, 13}). The CODI training objective produces rank-indifferent gradients through the full chain rule. The next experiment moves the test off CODI distillation onto a matrix-native model trained from scratch on a synthetic task with provably K-path structure.
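A minimal sketch of the single-sample probe described above, assuming NumPy. The function names, the tolerance, and the entropy-based definition of effective rank are my assumptions for illustration; the matrix-CODI code may define them differently.

```python
import numpy as np


def numerical_rank(Z: np.ndarray, tol: float = 1e-8) -> int:
    """Count singular values above tol times the largest one."""
    s = np.linalg.svd(Z, compute_uv=False)
    return int(np.sum(s > tol * s[0]))


def effective_rank(Z: np.ndarray) -> float:
    """Exponential of the entropy of the normalized singular values."""
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))


def rank_k_projection(Z: np.ndarray, k: int) -> np.ndarray:
    """Best rank-k approximation of Z: keep the top k singular directions."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]


rng = np.random.default_rng(0)
d, k = 16, 4
# A thought built from k random rank-1 patterns (linearly independent with probability 1).
Z = sum(np.outer(rng.standard_normal(d), rng.standard_normal(d)) for _ in range(k))

print(numerical_rank(Z))                        # 4
print(numerical_rank(rank_k_projection(Z, 2)))  # 2
print(effective_rank(Z))                        # a value between 1 and 4
```

A rank-k ablation curve is then task accuracy measured after replacing Z with `rank_k_projection(Z, k)` for each k; a flat curve means the readout does not care which singular directions survive.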
Supporting framing: Fedorenko et al. (Nature, 2024) dissociates the human language network from reasoning, math, and theory of mind — evidence that reasoning can run on representations that are not language-shaped.
A fork of CODI with a 16×16 matrix bottleneck at each latent reasoning step. Four flat rank-k projection curves across two tasks and two distillation regimes; a three-seed replication shows accuracy tight at 81.51 ± 1.2pp while the final effective rank of Z spans {4, 12, 13}. A four-readout positive control (bilinear, bilinear+GELU, SVD-augmented, quadratic) was designed to falsify the hypothesis that readout linearity causes the flat curves; all four readouts produce flat curves regardless. A negative control on vanilla GPT-2 SFT (no matrix bottleneck, no Z) reproduces a flat curve under the same intervention paradigm, demonstrating that the rank-k probe alone conflates rank-blindness with position-irrelevance. The model-level distinguisher is the seed-decoupling result, which the negative control cannot produce by construction.
At d = 16, a matrix transformer layer uses 130× fewer parameters than a standard vector attention layer at matched capacity. This changes the parameter allocation of the whole model: where embeddings used to dominate, thinking layers can.
Across three runs that share the Matrix Thinker backbone but vary in output mechanism, the direction of the rank trajectory tracks the output mechanism. Bilinear probes see rank rise across 8 iterations; vector-collapse and 3D matrix-product runs see rank fall. The runs are not FLOPs-matched and this is an observational single-seed finding, not a causal claim.
With a bilinear output head, the effective rank of matrix token representations rises during iterative refinement (5.02 → 6.12). This runs counter to the prior literature's assumption that depth drives rank collapse, and to my knowledge it has not been reported in prior work.
At single-step processing, a matrix token produced by outer-product embedding beats a flat-vector baseline by 11× in per-parameter perplexity at matched parameters. Reproduced across every configuration tested.
A multi-agent literature synthesis across continuous reasoning, JEPA, structured representations, byte-level models, and the neuroscience of language — with one counter-theme against the project's direction. The context the findings above should be read against.
Each item is hypothesis-driven, compute-estimated, and falsifiable. Compute estimates are conservative.
The matrix-CODI negative result traced the flat rank curves to the CODI distillation objective, which produces rank-indifferent gradients, rather than to any single readout design. The next experiment removes that objective entirely. A small (~10M parameter) fully matrix-native transformer (matrix Q/K/V with true matrix composition, no flatten anywhere in the forward pass, matrix LM head) is trained from scratch on a synthetic task whose ground truth provably requires K independent scalars of state. If matrix structure is functional, this is the experiment that demonstrates it.
Test four embedding designs: k-bigram outer products, conv encoder, local attention contextualization, and pairwise interaction matrices. The current rank-1 outer-product embedding is information-equivalent to storing two vectors. Higher-rank starting embeddings give the matrix meaningful structure from the input layer.
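The rank-1 limitation is easy to see numerically. The sketch below assumes NumPy; `embed_kgram` is a hypothetical higher-rank variant that sums the rank-1 embeddings of neighboring bytes, shown only to illustrate the idea and not necessarily the k-bigram design listed above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 256, 16                      # byte vocabulary, 16x16 matrix tokens

# Rank-1 design: two d-dim vectors per byte, so 2*d stored scalars fill d*d matrix entries.
left = rng.standard_normal((vocab, d))
right = rng.standard_normal((vocab, d))


def embed_rank1(byte_id: int) -> np.ndarray:
    """Outer product of the byte's two vectors; always rank 1."""
    return np.outer(left[byte_id], right[byte_id])


def embed_kgram(byte_ids: list[int]) -> np.ndarray:
    """Hypothetical higher-rank start: sum the rank-1 embeddings of k neighboring bytes."""
    return sum(embed_rank1(b) for b in byte_ids)


one = embed_rank1(104)
three = embed_kgram([104, 105, 33])     # the bytes of "hi!"

print(np.linalg.matrix_rank(one))       # 1
print(np.linalg.matrix_rank(three))     # 3
```

A rank-1 token gives the first layer nothing to measure beyond two vectors, while a rank-k start carries k independent directions from the input onward.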
Machine-native representations from raw bytes, no tokenizer, no language scaffolding. Translates WavJEPA (raw audio waveforms) to byte streams with a conv byte patcher, using LeJEPA's sketched isotropic Gaussian regularization to prevent collapse. No published byte-level JEPA exists as of April 2026.
Combines MBLM-style hierarchy (Mamba for long-range patch processing, transformer for local byte processing) with matrix-valued patch representations. Extends the matrix architecture to 1M+ byte context windows. Required infrastructure for multi-domain training.
Train a single model on raw bytes from text, code, raw pixel images, and raw audio samples. Measure cross-domain transfer coefficients and representation alignment. Test whether matrix structure specifically enables transfer that flat-vector baselines can't match. Requires 10M+ parameters and a mixed-modality byte corpus, which has to be constructed first.
Matrix-CODI at 10-50M parameters on real benchmarks (GSM8K, MATH, ProsQA, MNNS). The toy-scale rank experiment tests whether the phenomenon exists; this one measures how much it matters once the model is actually competent. Results will be published as a comparison against CODI, CoT2, CoLaR, and MarCos.
Generate a calibrated benchmark suite where the number of reasoning paths at each step is analytically computable (subset-sum variants, graph reachability, rule composition). This lets rank correlations be tested against ground truth rather than annotator-inferred step counts. Datasets released publicly.
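For graph reachability, one candidate ground-truth count is the size of the breadth-first frontier at each step. The sketch below assumes that definition; the example graph and the `frontier_sizes` helper are illustrative, and the released benchmark may define path counts differently.

```python
def frontier_sizes(edges: dict[int, list[int]], start: int, steps: int) -> list[int]:
    """Number of distinct nodes reachable in exactly t hops, for t = 0..steps."""
    frontier = {start}
    sizes = [len(frontier)]
    for _ in range(steps):
        frontier = {nxt for node in frontier for nxt in edges.get(node, [])}
        sizes.append(len(frontier))
    return sizes


# Illustrative DAG: node 0 fans out to three nodes, which then converge on node 6.
graph = {0: [1, 2, 3], 1: [4], 2: [4, 5], 3: [5], 4: [6], 5: [6]}

print(frontier_sizes(graph, start=0, steps=3))   # [1, 3, 2, 1]
```

If rank tracks superposition, the measured rank of the latent at step t should correlate with these counts, here rising to 3 and falling back to 1.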
HELM showed a fully hyperbolic billion-parameter LLM can match Euclidean baselines when every operation commits to the structured space. Replicate the approach with matrix-valued tokens: matrix attention, matrix FFN, matrix normalization, no flatten anywhere. Compute-intensive, only justified if the smaller experiments produce strong signal.
Negative results are data. These are the specific hypotheses I tested and ruled out, which narrowed the current direction. Each one was a pre-registered experiment with a clear falsification criterion.
Parameterized Hypercomplex Multiplication layers were supposed to learn quaternion-like algebra. Instead they converge to nilpotent structure — the optimizer treats PHM as a low-rank factorization rather than a learned algebra. CliffordNet (2026) confirmed the same thing from the other direction: algebraic structure works when fixed, not when learned.
Adaptive halting mechanisms collapse to "always stop at step 1" below ~10M parameters. The expected step count converges to 1.0. Use fixed iteration counts or LoopFormer-style consistency training instead.
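The collapse shows up directly in the expected step count. The sketch below assumes a per-step conditional halting probability in the style of ACT or PonderNet, with a forced halt at the last step; the specific mechanism tested is not stated here, so treat the formula as an assumption.

```python
def expected_steps(halt_probs: list[float]) -> float:
    """Expected step count when step n halts with probability halt_probs[n-1],
    conditional on not having halted earlier. The final step always halts."""
    expected, still_running = 0.0, 1.0
    for n, lam in enumerate(halt_probs, start=1):
        is_last = n == len(halt_probs)
        p_stop_here = still_running if is_last else still_running * lam
        expected += n * p_stop_here
        still_running -= p_stop_here
    return expected


print(expected_steps([0.25, 0.25, 0.25, 0.25]))  # 2.734375, a policy that uses its budget
print(expected_steps([1.0, 0.5, 0.5, 0.5]))      # 1.0, the collapsed "always stop at step 1" policy
```

Logging this quantity during training makes the collapse visible early, before it shows up as a loss plateau.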
Computing per-pair matrix products as attention scores drives representations to collapse to rank 1 and produces worse predictions. Dead end confirmed across multiple runs and scales.
The original "matrix structure enables cross-domain transfer" hypothesis was ruled out by six fatal attacks before any experiment ran. The question survives as a research direction but needed a sharper formulation and larger scale to be testable.
CoCoMix-style thought interleaving mechanisms don't produce measurable benefits below ~10M parameters. The mechanism runs as designed, but the scale is too small for any benefit to show up.
Learned segmentation boundaries consistently underperform fixed-stride segmentation at the scales I tested. The learning signal isn't strong enough below ~100M parameters. BLT works at 8B, so this may reverse at scale — but not within this compute regime.