let them eat bytes.

Modern language models work around the medium they compute on — subword tokens for input, flat vectors for representation, and discrete-token chain-of-thought for reasoning. I'm replacing all three.

Three parallel threads. Bytes replace the tokenizer, so the model sees the same stream the hardware does. Matrix-valued tokens replace flat vectors, giving every representation differentiable structural observables — rank, singular values — that a d-dimensional vector cannot expose. Continuous reasoning with matrix-tracked superposition replaces discrete chain-of-thought, so the number of reasoning paths a model holds at inference is measurable from a single sample rather than estimated across an ensemble. Early evidence: outer-product matrix embeddings cut per-parameter perplexity 11× against flat vectors at matched parameters; a matrix transformer layer at d = 16 uses 130× fewer parameters than vector attention at matched capacity. The first matrix-CODI experiment falsified one operationalization of rank-tracked superposition under a distillation objective and shipped a mechanism explaining why; the next runs the test on a matrix-native model trained from scratch.

matrix-CODI complete · findings being published

the three threads

Three parallel threads. The lab is built on the claim that they compose.

Each thread is independently testable and has its own near-term experiment. The compound claim is that bytes give a substrate, matrices give a representation with single-sample structural observables, and continuous reasoning over those matrices yields inference-time compute whose internals are directly measurable rather than inferred.

01 bytes

Replace the tokenizer.

Subword tokenization is a learned compression layer between text and the model. It carries assumptions (language-shaped, English-shaped, web-shaped) and rules out raw audio waveforms, raw pixels, and arbitrary binary formats without a separate encoder per modality. Byte-level models drop that layer. BLT (Meta, 2024) replaced the subword tokenizer with dynamically sized byte patches and scaled the approach to 8B parameters; EvaByte and MambaByte demonstrated byte-level processing at smaller scale.

Bytes introduce three implementation costs. First, sequence length grows ~3.5× over subword tokens for the same text, and by orders of magnitude more for audio and video. Second, standard O(L²) attention does not scale to byte-length contexts. Third, a single byte carries near-zero signal, so the model must reconstruct relational structure at every layer. The third cost is what motivates thread 02.
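The first two costs can be made concrete with a minimal sketch. It counts the UTF-8 bytes a model would see for a string and shows how quadratic attention cost scales with a length ratio such as the ~3.5× figure above. The helper names `byte_length` and `attention_cost_ratio` are hypothetical, introduced only for illustration.

```python
def byte_length(text: str) -> int:
    """Number of bytes a byte-level model sees for this text under UTF-8."""
    return len(text.encode("utf-8"))


def attention_cost_ratio(length_ratio: float) -> float:
    """Relative cost of O(L^2) attention when sequence length grows by length_ratio."""
    return length_ratio ** 2


# Escapes keep the sample unambiguous: two precomposed non-ASCII characters.
sample = "na\u00efve caf\u00e9"
n_chars = len(sample)           # Unicode code points
n_bytes = byte_length(sample)   # each non-ASCII character here takes 2 bytes

# A ~3.5x longer sequence costs ~12x more under full quadratic attention.
quadratic_penalty = attention_cost_ratio(3.5)
```

The squared penalty is why byte-level designs tend to patch or compress the sequence before full attention is applied.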

02 matrices

Replace flat vectors with d × d matrices.

A matrix has rank, condition number, and singular values; a flat vector of the same dimension has none of these as differentiable observables. A d²-dim vector and a d × d matrix carry the same number of scalars, but only the matrix exposes structure that downstream operations can preserve or destroy. The architectural commitment: every primitive in the model — embedding, attention, FFN, output head — operates on matrices without flattening. Composition is matrix multiplication, attention scores are Frobenius inner products, the output head is a bilinear probe.
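A minimal NumPy sketch of the three primitives named above, assuming d = 4 and random Gaussian matrices purely for illustration. The function names are hypothetical and this is not the lab's implementation; it only shows what each operation computes.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # illustrative size, not a value from the experiments


def frobenius_score(q: np.ndarray, k: np.ndarray) -> float:
    """Attention score as the Frobenius inner product <Q, K>_F = sum_ij Q_ij K_ij."""
    return float(np.sum(q * k))


def bilinear_probe(z: np.ndarray, u: np.ndarray, v: np.ndarray) -> float:
    """Scalar readout u^T Z v; an output head would hold one (u, v) pair per logit."""
    return float(u @ z @ v)


a = rng.standard_normal((d, d))
b = rng.standard_normal((d, d))

composed = a @ b                 # composition is matrix multiplication
score = frobenius_score(a, b)    # identical to trace(a.T @ b)

# Structural observables read directly off a single matrix.
singular_values = np.linalg.svd(composed, compute_uv=False)
rank = int(np.linalg.matrix_rank(composed))
```

Note that the bilinear probe is itself a Frobenius inner product against the rank-1 matrix u vᵀ, so the readout and the attention score share one primitive.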

Two findings make this load-bearing. Outer-product matrix embeddings beat flat-vector embeddings by 11× in per-parameter perplexity at matched parameters (Finding no. 01). A matrix transformer layer at d = 16 uses 130× fewer parameters than a standard vector attention layer at matched capacity (Finding no. 04), shifting the parameter budget from embeddings to thinking layers.

03 superposition

Make the number of reasoning paths a single-sample observable.

Continuous chain-of-thought models such as COCONUT (Meta, 2024) and CODI (Shen et al., 2025) feed back continuous latent vectors instead of discrete tokens, letting the model perform inference-time compute internally. The theoretical claim (Reasoning by Superposition, CoT2, 2025) is that these latents hold multiple reasoning paths in superposition, exploring a search tree in parallel. A 2026 rebuttal (Illusion of Superposition, arXiv 2604.06374) showed that on ProsQA the latent feedback is not doing measurable work in some fine-tuned settings.

The dispute is empirical and the obstacle is measurement: superposition is a property of the representation, but a flat vector exposes no single-sample structure to probe. A matrix does. If a d × d thought Z holds k linearly independent patterns, its rank is k by construction, and rank is computable from a single sample via SVD. The matrix-CODI experiment built that probe and ran the falsification. The result: rank-k ablation curves are flat across two tasks and four readout designs, and best ProsQA accuracy decouples from final effective rank across seeds (81.51 ± 1.2pp at ranks {4, 12, 13}). The CODI training objective produces rank-indifferent gradients through the full chain rule. The next experiment moves the test off CODI distillation onto a matrix-native model trained from scratch on a synthetic task with provably K-path structure.
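A minimal sketch of the single-sample rank probe and a rank-k ablation, assuming a synthetic thought Z built as a sum of k random rank-1 patterns. `build_thought` and `rank_k_ablation` are hypothetical names, and the truncated SVD used here is one standard way to ablate rank, not necessarily the procedure used in the matrix-CODI experiment.

```python
import numpy as np


def build_thought(d: int, k: int, rng: np.random.Generator) -> np.ndarray:
    """Sum of k rank-1 patterns u_i v_i^T with Gaussian u_i, v_i (rank k almost surely)."""
    u = rng.standard_normal((d, k))
    v = rng.standard_normal((d, k))
    return u @ v.T


def rank_k_ablation(z: np.ndarray, k: int) -> np.ndarray:
    """Best rank-k approximation of z (Eckart-Young): keep the top-k singular values."""
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]


rng = np.random.default_rng(0)
z = build_thought(d=16, k=5, rng=rng)

# Rank is read off one sample; the explicit tolerance guards against float noise.
measured_rank = int(np.linalg.matrix_rank(z, tol=1e-8))

# Ablate to rank 3; a downstream readout can then be re-evaluated on `ablated`.
ablated = rank_k_ablation(z, 3)
```

A flat ablation curve means that task accuracy evaluated on `ablated` barely moves as k varies, which is the pattern reported above.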

Supporting framing: Fedorenko et al. (Nature, 2024) dissociates the human language network from reasoning, math, and theory of mind — evidence that reasoning can run on representations that are not language-shaped.

roadmap

Where this goes after matrix-CODI.

Each item is hypothesis-driven, compute-estimated, and falsifiable. Compute estimates are conservative.

now →

Matrix-native from scratch on a rank-K task

The matrix-CODI negative result localized the failure to the linear-in-Z readout. The next experiment removes that readout entirely. A small (~10M parameter) fully matrix-native transformer — matrix Q/K/V with true matrix composition, no flatten anywhere in the forward pass, matrix LM head — trained from scratch on a synthetic task whose ground truth provably requires K independent scalars of state. If matrix structure is functional, this is the experiment that demonstrates it.

~30 H100-hours

Contextualized matrix embeddings

Test four embedding designs: k-bigram outer products, conv encoder, local attention contextualization, and pairwise interaction matrices. The current rank-1 outer-product embedding is information-equivalent to storing two vectors. Higher-rank starting embeddings give the matrix meaningful structure from the input layer.
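A minimal sketch of why a rank-1 outer-product embedding carries only two vectors of information, and how summing outer products over adjacent pairs raises the starting rank. The bigram-sum construction is an assumed reading of "k-bigram outer products", not a confirmed design; the table sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 256, 8   # 256 = one row per byte value; d is illustrative

# Two vector tables: the only parameters behind every rank-1 embedding.
left = rng.standard_normal((vocab, d))
right = rng.standard_normal((vocab, d))


def rank1_embedding(token: int) -> np.ndarray:
    """d x d embedding u v^T, fully determined by the two vectors u and v."""
    return np.outer(left[token], right[token])


def bigram_sum_embedding(tokens: list) -> np.ndarray:
    """Hypothetical higher-rank variant: sum of outer products over adjacent pairs."""
    pairs = zip(tokens[:-1], tokens[1:])
    return sum(np.outer(left[a], right[b]) for a, b in pairs)


single = rank1_embedding(65)
window = bigram_sum_embedding([72, 105, 33, 10])   # three adjacent pairs

params_outer = 2 * vocab * d    # two vectors per token
params_dense = vocab * d * d    # a free d x d matrix per token
```

With these sizes the outer-product tables use d/2 = 4 times fewer parameters than free matrices, at the price of starting every token at rank 1.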

~30 H100-hours

Byte-level JEPA with LeJEPA SIGReg

Machine-native representations from raw bytes, no tokenizer, no language scaffolding. Translates WavJEPA (raw audio waveforms) to byte streams with a conv byte patcher, using LeJEPA's sketched isotropic Gaussian regularization to prevent collapse. No published byte-level JEPA exists as of April 2026.

~80 H100-hours

Hierarchical long-context byte model

Combines MBLM-style hierarchy (Mamba for long-range patch processing, transformer for local byte processing) with matrix-valued patch representations. Extends the matrix architecture to 1M+ byte context windows. Required infrastructure for multi-domain training.

~250 H100-hours

scale →

Transfer learning across byte-level modalities

Train a single model on raw bytes from text, code, raw pixel images, and raw audio samples. Measure cross-domain transfer coefficients and representation alignment. Test whether matrix structure specifically enables transfer that flat-vector baselines can't match. Requires 10M+ parameters and a mixed-modality byte corpus, which has to be constructed first.

~1,000 H100-hours

scale →

Inference-time matrix reasoning at competitive scale

Matrix-CODI at 10–50M parameters on real benchmarks (GSM8K, MATH, ProsQA, MNNS). The toy-scale rank experiment shows whether the phenomenon exists; this one shows how much it matters when the model is actually competent. Includes a published comparison against CODI, CoT2, CoLaR, and MarCos.

~1,500 H100-hours

scale →

Synthetic reasoning datasets with exact frontier measurement

Generate a calibrated benchmark suite where the number of reasoning paths at each step is analytically computable (subset-sum variants, graph reachability, rule composition). This lets rank correlations be tested against ground truth rather than annotator-inferred step counts. Datasets released publicly.
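A minimal sketch of an analytically computable path count for the graph-reachability case. It assumes that "number of reasoning paths at each step" is operationalized as the breadth-first frontier size, which is one possible definition and not necessarily the one the benchmark will use. The toy graph and the function name are illustrative.

```python
def frontier_sizes(adjacency: dict, start: str) -> list:
    """Size of the BFS frontier at each depth from `start`.

    Each entry is the number of distinct nodes a breadth-first reasoner
    must hold simultaneously at that step.
    """
    visited = {start}
    frontier = [start]
    sizes = []
    while frontier:
        sizes.append(len(frontier))
        next_frontier = []
        for node in frontier:
            for neighbor in adjacency.get(node, ()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return sizes


# Toy directed graph: a branches to b and c, which reconverge on d.
graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": ["f"],
    "e": [],
    "f": [],
}
sizes = frontier_sizes(graph, "a")
```

Because the frontier is computed exactly from the graph, a measured rank trajectory can be correlated against it step by step.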

~200 H100-hours (generation + validation)

stretch →

Fully matrix-native architecture at scale

HELM showed a fully hyperbolic billion-parameter LLM can match Euclidean baselines when every operation commits to the structured space. Replicate the approach with matrix-valued tokens: matrix attention, matrix FFN, matrix normalization, no flatten anywhere. Compute-intensive, only justified if the smaller experiments produce strong signal.

~5,000 H100-hours

dead ends

Directions the evidence has ruled out.

Negative results are data. These are the specific hypotheses I tested and ruled out, which narrowed the current direction. Each one was a pre-registered experiment with a clear falsification criterion.

PHM learned algebraic structure

Parameterized Hypercomplex Multiplication layers were supposed to learn quaternion-like algebra. Instead they converge to nilpotent structure — the optimizer treats PHM as a low-rank factorization rather than a learned algebra. CliffordNet (2026) confirmed the same thing from the other direction: algebraic structure works when fixed, not when learned.

PonderNet halting at small scale

Adaptive halting mechanisms collapse to "always stop at step 1" below ~10M parameters: the expected number of steps converges to 1.0. Use fixed iteration counts or LoopFormer-style consistency training instead.
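A minimal sketch of the quantity that collapses, using the standard PonderNet halting distribution p_n = λ_n ∏_{j<n} (1 − λ_j) with leftover probability assigned to the final step. The λ values below are made up to illustrate a healthy and a collapsed profile; they are not measurements.

```python
def halting_distribution(lambdas: list) -> list:
    """p_n = lambda_n * prod_{j<n}(1 - lambda_j); leftover mass goes to the last step."""
    probs = []
    still_running = 1.0
    for lam in lambdas:
        probs.append(still_running * lam)
        still_running *= 1.0 - lam
    probs[-1] += still_running   # force halting at the maximum step
    return probs


def expected_steps(lambdas: list) -> float:
    """E[N] = sum_n n * p_n, with steps counted from 1."""
    probs = halting_distribution(lambdas)
    return sum(n * p for n, p in enumerate(probs, start=1))


# Illustrative halting probabilities per step (assumed values).
healthy = expected_steps([0.25, 0.25, 0.25, 0.25])
collapsed = expected_steps([0.999, 0.5, 0.5, 0.5])
```

Once the first halting probability saturates near 1, later steps receive almost no probability mass and therefore almost no gradient, which is consistent with the collapse being stable.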

3D matrix attention

Computing per-pair matrix products as attention scores drives representations to collapse to rank 1 and produces worse predictions. Dead end confirmed across multiple runs and scales.

Cross-domain transfer (original framing)

The original "matrix structure enables cross-domain transfer" hypothesis was ruled out by six fatal attacks before any experiment ran. The question survives as a research direction but needed a sharper formulation and larger scale to be testable.

Thought interleaving at toy scale

CoCoMix-style thought interleaving mechanisms don't produce a measurable benefit below ~10M parameters. The mechanism runs as designed, but the scale is too small for any benefit to show up.

Learned byte segmentation at small scale

Learned segmentation boundaries consistently underperform fixed-stride segmentation at the scales I tested. The learning signal isn't strong enough below ~100M parameters. BLT works at 8B, so this may reverse at scale — but not within this compute regime.

Findings.

The gradient does not see rank: rank-indifference in matrix-CODI on ProsQA

Matrix operations are 130× more parameter-efficient per layer

Output mechanism shapes rank trajectory in matrix-valued refinement

Rank enrichment is an emergent, novel phenomenon

Outer-product matrix embeddings outperform flat vectors per parameter

A reading of the field.
