let them eat bytes.

Modern language models work around the medium they compute on — subword tokens for input, flat vectors for representation, and discrete-token chain-of-thought for reasoning. I'm replacing all three.

Three parallel threads. Bytes replace the tokenizer, so the model sees the same stream the hardware does. Matrix-valued tokens replace flat vectors, giving every representation differentiable structural observables — rank, singular values — that a d-dimensional vector cannot expose. Continuous reasoning with matrix-tracked superposition replaces discrete chain-of-thought, so the number of reasoning paths a model holds at inference is measurable from a single sample rather than estimated across an ensemble. Early evidence: outer-product matrix embeddings cut per-parameter perplexity 11× against flat vectors at matched parameters; a matrix transformer layer at d = 16 uses 130× fewer parameters than vector attention at matched capacity. The first matrix-CODI experiment falsified one operationalization of rank-tracked superposition under a distillation objective and shipped a mechanism explaining why; the next runs the test on a matrix-native model trained from scratch.

matrix-CODI complete  ·  findings being published

the three threads

Three parallel threads. The lab is built on the claim that they compose.

Each thread is independently testable and has its own near-term experiment. The compound claim is that bytes give a substrate, matrices give a representation with single-sample structural observables, and continuous reasoning over those matrices yields inference-time compute whose internals are directly measurable rather than inferred.

01 bytes

Replace the tokenizer.

Subword tokenization is a learned compression layer between text and the model. It carries assumptions (language-shaped, English-shaped, web-shaped) and rules out raw audio waveforms, raw pixels, and arbitrary binary formats without a separate encoder per modality. Byte-level models drop that layer. BLT (Meta, 2024) replaced subword tokenization with dynamically sized byte patches at up to 8B parameters; EvaByte and MambaByte demonstrated byte-level processing at smaller scale.

Bytes introduce three implementation costs. First, sequence length grows ~3.5× over subword tokens for the same text, and by orders of magnitude more for audio and video. Second, standard O(L²) attention does not scale to byte-length contexts. Third, a single byte carries near-zero signal, so the model must reconstruct relational structure at every layer. The third cost is what motivates thread 02.
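The first two costs can be made concrete with a few lines of Python. This is a minimal sketch, not the lab's pipeline: the sample string is made up, and BYTES_PER_SUBWORD simply reuses the ~3.5× figure quoted above as a constant rather than measuring it against a real tokenizer.

```python
# Sketch: byte-level input and the quadratic attention cost it implies.
# The sample text and the constant below are illustrative assumptions.
text = "Let them eat bytes: na\u00efve caf\u00e9 \u2615"

# UTF-8 bytes are the whole "vocabulary": every id falls in 0..255.
byte_ids = list(text.encode("utf-8"))

BYTES_PER_SUBWORD = 3.5  # the ~3.5x length ratio quoted in the text


def attention_cost_ratio(length_ratio: float) -> float:
    """Relative cost of O(L^2) attention when sequence length grows by length_ratio."""
    return length_ratio ** 2


print(len(text), len(byte_ids))              # characters vs. bytes
print(attention_cost_ratio(BYTES_PER_SUBWORD))  # quadratic blow-up factor
```

Non-ASCII characters take more than one byte, so the byte sequence is longer than the character count, and a 3.5× longer sequence costs about 12× more under quadratic attention.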

02 matrices

Replace flat vectors with d × d matrices.

A matrix has rank, condition number, and singular values; a flat vector with the same number of entries exposes none of these as differentiable observables. A d²-dimensional vector and a d × d matrix carry the same number of scalars, but only the matrix exposes structure that downstream operations can preserve or destroy. The architectural commitment: every primitive in the model (embedding, attention, FFN, output head) operates on matrices without flattening. Composition is matrix multiplication, attention scores are Frobenius inner products, the output head is a bilinear probe.
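The three primitives named above can be sketched in NumPy. This is an illustrative sketch under my own assumptions, not the Matrix Thinker code: the function names and the (classes, d) probe tables A and B are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
Z = rng.standard_normal((d, d))  # one matrix-valued token

# Structural observables available from a single sample.
singular_values = np.linalg.svd(Z, compute_uv=False)
rank = np.linalg.matrix_rank(Z)
condition_number = np.linalg.cond(Z)  # largest / smallest singular value


def compose(A, B):
    """Composition of two matrix tokens is a matrix product."""
    return A @ B


def frobenius_score(Q, K):
    """Attention score <Q, K>_F = trace(Q^T K) = sum of elementwise products."""
    return float(np.sum(Q * K))


def bilinear_logits(Z, A, B):
    """Logit for class c is a_c^T Z b_c.

    A and B are hypothetical probe tables of shape (classes, d).
    """
    return np.einsum("ci,ij,cj->c", A, Z, B)
```

Taken alone, the Frobenius score equals the dot product of the two flattened matrices. In this sketch the matrix structure enters through composition and the bilinear readout, neither of which has a flat-vector equivalent of the same form.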

Two findings make this load-bearing. Outer-product matrix embeddings beat flat-vector embeddings by 11× in per-parameter perplexity at matched parameters (Finding no. 01). A matrix transformer layer at d = 16 uses 130× fewer parameters than a standard vector attention layer at matched capacity (Finding no. 04), shifting the parameter budget from embeddings to thinking layers.

03 superposition

Make the number of reasoning paths a single-sample observable.

Continuous chain-of-thought models such as COCONUT (Meta, 2024) and CODI (2025) feed back continuous latent vectors instead of discrete tokens, letting the model perform inference-time compute internally. The theoretical claim (Reasoning by Superposition, CoT2, 2025) is that these latents hold multiple reasoning paths in superposition, exploring a search tree in parallel. A 2026 rebuttal (Illusion of Superposition, arXiv 2604.06374) showed that on ProsQA the latent feedback is not doing measurable work in some fine-tuned settings.

The dispute is empirical and the obstacle is measurement: superposition is a property of the representation, but a flat vector exposes no single-sample structure to probe. A matrix does. If a d × d thought Z holds k linearly independent patterns, its rank is k by construction, and rank is computable from a single sample via SVD. The matrix-CODI experiment built that probe and ran the falsification. The result: rank-k ablation curves are flat across two tasks and four readout designs, and best ProsQA accuracy decouples from final effective rank across seeds (81.51 ± 1.2pp at ranks {4, 12, 13}). The CODI training objective produces rank-indifferent gradients through the full chain rule. The next experiment moves the test off CODI distillation onto a matrix-native model trained from scratch on a synthetic task with provably K-path structure.
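The probe described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the matrix-CODI code: build_thought is a hypothetical stand-in for a thought holding k patterns, effective_rank uses the entropy-based definition (the research note may define it differently), and metric stands in for whatever task score the real experiment evaluates after the intervention.

```python
import numpy as np


def build_thought(d, k, rng):
    """Hypothetical d x d thought: a sum of k random rank-1 patterns u_i v_i^T."""
    U = rng.standard_normal((d, k))
    V = rng.standard_normal((d, k))
    return U @ V.T  # rank k with probability 1 when k <= d


def numerical_rank(Z, rel_tol=1e-8):
    """Count singular values above rel_tol times the largest one."""
    s = np.linalg.svd(Z, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))


def effective_rank(Z, eps=1e-12):
    """Entropy-based effective rank: exp(H(p)), p = normalized singular values.

    Assumption: the definition used in the research note may differ.
    """
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]
    return float(np.exp(-np.sum(p * np.log(p))))


def project_to_rank(Z, k):
    """Best rank-k approximation of Z in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


def rank_k_curve(Z, metric):
    """Evaluate metric on the rank-k projection of Z for k = 1..d."""
    return [metric(project_to_rank(Z, k)) for k in range(1, Z.shape[0] + 1)]


rng = np.random.default_rng(0)
Z = build_thought(d=16, k=3, rng=rng)
print(numerical_rank(Z), effective_rank(Z))
```

With reconstruction error as the metric, the curve drops to zero at k = 3 for this Z. A flat curve under a task metric means the score does not change as rank is removed.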

Supporting framing: Fedorenko et al. (Nature, 2024) dissociate the human language network from reasoning, math, and theory of mind, which is evidence that reasoning can run on representations that are not language-shaped.

research notes

Findings.

Click any title for the full note.

no. 05  ·  april 21, 2026

The gradient does not see rank: rank-indifference in matrix-CODI on ProsQA

A fork of CODI with a 16×16 matrix bottleneck at each latent reasoning step. All four rank-k projection curves, spanning two tasks and two distillation regimes, are flat; a three-seed replication shows accuracy tight at 81.51 ± 1.2pp while the final effective rank of Z spans {4, 12, 13}. A four-readout positive control (bilinear, bilinear+GELU, SVD-augmented, quadratic) was designed to test the hypothesis that readout linearity causes the flat curves; all four readouts produce flat curves, which rules that hypothesis out. A negative control on vanilla GPT-2 SFT (no matrix bottleneck, no Z) reproduces a flat curve under the same intervention paradigm and demonstrates that the rank-k probe alone conflates rank-blindness with position-irrelevance. The model-level distinguisher is the seed-decoupling result, which the negative control cannot produce by construction. Read the research note →

negative · structural
no. 04  ·  april 7, 2026

Matrix operations are 130× more parameter-efficient per layer

At d = 16, a matrix transformer layer uses 130× fewer parameters than a standard vector attention layer at matched capacity. This changes the parameter allocation of the whole model: where embeddings used to dominate, thinking layers can. Read the research note →

proven
no. 03  ·  april 5, 2026

Output mechanism shapes rank trajectory in matrix-valued refinement

Across three runs that share the Matrix Thinker backbone but vary in output mechanism, the direction of the rank trajectory tracks the output mechanism. Bilinear probes see rank rise across 8 iterations; vector-collapse and 3D matrix-product runs see rank fall. The runs are not FLOPs-matched and this is an observational single-seed finding, not a causal claim. Read the research note →

observational
no. 02  ·  april 2, 2026

Rank enrichment is an emergent, novel phenomenon

With a bilinear output head, the effective rank of matrix token representations rises during iterative refinement (5.02 → 6.12). This runs counter to the prior literature's assumption that depth drives rank collapse. I have not found it reported in prior work. Read the research note →

novel
no. 01  ·  february 28, 2026

Outer-product matrix embeddings outperform flat vectors per parameter

At single-step processing, a matrix token produced by outer-product embedding beats a flat-vector baseline by 11× in perplexity at matched parameters. Reproduced across every configuration tested. Read the research note →

proven

context

A reading of the field.

A multi-agent literature synthesis across continuous reasoning, JEPA, structured representations, byte-level models, and the neuroscience of language — with one counter-theme against the project's direction. The context the findings above should be read against.

Read the synthesis →

roadmap

Where this goes after matrix-CODI.

Each item is hypothesis-driven, compute-estimated, and falsifiable. Compute estimates are conservative.

now →

Matrix-native from scratch on a rank-K task

The matrix-CODI negative result localized the failure to the CODI distillation objective, whose gradients are rank-indifferent, rather than to the readout design. The next experiment removes that objective entirely. A small (~10M parameter) fully matrix-native transformer (matrix Q/K/V with true matrix composition, no flatten anywhere in the forward pass, matrix LM head) is trained from scratch on a synthetic task whose ground truth provably requires K independent scalars of state. If matrix structure is functional, this is the experiment that demonstrates it.

~30 H100-hours
next →

Contextualized matrix embeddings

Test four embedding designs: k-bigram outer products, conv encoder, local attention contextualization, and pairwise interaction matrices. The current rank-1 outer-product embedding is information-equivalent to storing two vectors. Higher-rank starting embeddings give the matrix meaningful structure from the input layer.

~30 H100-hours
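The rank-1 limitation of the current embedding is easy to see in code. This is a minimal sketch with made-up sizes: the table names left and right are hypothetical, and windowed_embed is only one possible reading of "k-bigram outer products", not the design under test.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 256, 16

# Two (vocab, d) tables: 2*d scalars stored per token, d*d entries produced.
left = rng.standard_normal((vocab, d))
right = rng.standard_normal((vocab, d))


def outer_product_embed(token_id):
    """Rank-1 matrix token u v^T; carries exactly the information in (u, v)."""
    return np.outer(left[token_id], right[token_id])


def windowed_embed(token_ids):
    """Hypothetical higher-rank variant: sum of outer products over a window.

    A window of k distinct tokens gives rank up to k.
    """
    return sum(np.outer(left[t], right[t]) for t in token_ids)


print(np.linalg.matrix_rank(outer_product_embed(5)))
print(np.linalg.matrix_rank(windowed_embed([1, 2, 3])))
```

The first matrix has rank 1 regardless of d; the windowed variant starts with as many independent directions as there are tokens in the window.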
next →

Byte-level JEPA with LeJEPA SIGReg

Machine-native representations from raw bytes, no tokenizer, no language scaffolding. Translates WavJEPA (raw audio waveforms) to byte streams with a conv byte patcher, using LeJEPA's sketched isotropic Gaussian regularization to prevent collapse. No published byte-level JEPA exists as of April 2026.

~80 H100-hours
next →

Hierarchical long-context byte model

Combines MBLM-style hierarchy (Mamba for long-range patch processing, transformer for local byte processing) with matrix-valued patch representations. Extends the matrix architecture to 1M+ byte context windows. Required infrastructure for multi-domain training.

~250 H100-hours
scale →

Transfer learning across byte-level modalities

Train a single model on raw bytes from text, code, raw pixel images, and raw audio samples. Measure cross-domain transfer coefficients and representation alignment. Test whether matrix structure specifically enables transfer that flat-vector baselines can't match. Requires 10M+ parameters and a mixed-modality byte corpus, which has to be constructed first.

~1,000 H100-hours
scale →

Inference-time matrix reasoning at competitive scale

Matrix-CODI at 10-50M parameters on real benchmarks (GSM8K, MATH, ProsQA, MNNS). The toy-scale rank experiment tests whether the phenomenon exists; this one measures how much it matters when the model is actually competent. Includes a published comparison against CODI, CoT2, CoLaR, and MarCos.

~1,500 H100-hours
scale →

Synthetic reasoning datasets with exact frontier measurement

Generate a calibrated benchmark suite where the number of reasoning paths at each step is analytically computable (subset-sum variants, graph reachability, rule composition). This lets rank correlations be tested against ground truth rather than annotator-inferred step counts. Datasets released publicly.

~200 H100-hours (generation + validation)
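For graph reachability, the per-step path count has a closed form that a few lines of Python can compute. This is a sketch under my own assumption that "number of reasoning paths at step t" means the number of distinct nodes reachable in exactly t edge steps; the benchmark may define it differently, and the example graph is made up.

```python
def frontier_sizes(adjacency, start, steps):
    """Sizes of the set of nodes reachable in exactly t steps, for t = 0..steps.

    adjacency maps each node to an iterable of successor nodes.
    """
    frontier = {start}
    sizes = [len(frontier)]
    for _ in range(steps):
        # Expand every node on the frontier by one edge.
        frontier = {nbr for node in frontier for nbr in adjacency.get(node, ())}
        sizes.append(len(frontier))
    return sizes


# Small illustrative DAG.
graph = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}
print(frontier_sizes(graph, start=0, steps=4))  # [1, 2, 2, 1, 0]
```

These sizes are the ground truth a measured rank at each reasoning step would be correlated against.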
stretch →

Fully matrix-native architecture at scale

HELM showed a fully hyperbolic billion-parameter LLM can match Euclidean baselines when every operation commits to the structured space. Replicate the approach with matrix-valued tokens: matrix attention, matrix FFN, matrix normalization, no flatten anywhere. Compute-intensive, only justified if the smaller experiments produce strong signal.

~5,000 H100-hours

dead ends

Directions the evidence has ruled out.

Negative results are data. These are the specific hypotheses I tested and ruled out, which narrowed the current direction. Each one was a pre-registered experiment with a clear falsification criterion.

PHM learned algebraic structure

Parameterized Hypercomplex Multiplication layers were supposed to learn quaternion-like algebra. Instead they converge to nilpotent structure — the optimizer treats PHM as a low-rank factorization rather than a learned algebra. CliffordNet (2026) confirmed the same thing from the other direction: algebraic structure works when fixed, not when learned.

PonderNet halting at small scale

Adaptive halting mechanisms collapse to "always stop at step 1" below ~10M parameters: the expected number of steps converges to 1.0. Use fixed iteration counts or LoopFormer-style consistency training instead.
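The collapse shows up directly in the halting distribution. This is a minimal sketch of the standard PonderNet quantity, where the probability of halting at step n is lambda_n times the probability of not having halted earlier; the function names are mine, and assigning the leftover mass to the final step is an assumption made so the distribution sums to 1.

```python
import numpy as np


def halting_distribution(lambdas):
    """p_n = lambda_n * prod_{j<n} (1 - lambda_j); leftover mass goes to the last step."""
    lambdas = np.asarray(lambdas, dtype=float)
    # survive[n] = probability of not having halted before step n.
    survive = np.concatenate(([1.0], np.cumprod(1.0 - lambdas[:-1])))
    p = lambdas * survive
    p[-1] = survive[-1]  # force a halt at the final step
    return p


def expected_steps(lambdas):
    """Expected number of pondering steps under the halting distribution."""
    p = halting_distribution(lambdas)
    return float(np.sum(p * np.arange(1, len(p) + 1)))


print(expected_steps([0.5, 0.5, 0.5]))    # healthy: mass spread over steps
print(expected_steps([0.999, 0.5, 0.5]))  # collapsed: close to 1.0
```

Logging expected_steps during training is a cheap way to spot the collapse early.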

3D matrix attention

Computing per-pair matrix products as attention scores drives representations to collapse to rank 1 and produces worse predictions. Dead end confirmed across multiple runs and scales.

Cross-domain transfer (original framing)

The original "matrix structure enables cross-domain transfer" hypothesis was ruled out by six fatal attacks before any experiment ran. The question survives as a research direction but needed a sharper formulation and larger scale to be testable.

Thought interleaving at toy scale

CoCoMix-style thought interleaving mechanisms don't produce measurable benefits below ~10M parameters. The mechanism trains without issue, but the scale is too small for any benefit to show up.

Learned byte segmentation at small scale

Learned segmentation boundaries consistently underperform fixed-stride segmentation at the scales I tested. The learning signal isn't strong enough below ~100M parameters. BLT works at 8B, so this may reverse at scale — but not within this compute regime.
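For reference, the fixed-stride baseline is only a few lines. This is a sketch with a hypothetical function name and stride, not the patcher used in the experiments.

```python
def fixed_stride_patches(data: bytes, stride: int) -> list[bytes]:
    """Split a byte stream into consecutive patches of `stride` bytes.

    The final patch may be shorter when len(data) is not a multiple of stride.
    """
    if stride <= 0:
        raise ValueError("stride must be positive")
    return [data[i:i + stride] for i in range(0, len(data), stride)]


patches = fixed_stride_patches("let them eat bytes".encode("utf-8"), stride=4)
print(patches)
```

A learned segmenter has to beat this baseline, which needs no parameters and no training signal.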

contact

Get in touch.