This note is a literature synthesis, not an experiment. In early April 2026 the author ran a multi-agent research session across eleven topics in parallel — continuous reasoning, JEPA, pure-sensor SSL, structured representations, byte-level models, long-context methods, the neuroscience of language, and several adjacent threads — and collected the findings into a reading of the field's recent direction. Five supporting themes converged: non-language, non-autoregressive representation learning is gaining momentum at scale (JEPA plus pure-sensor SSL); discrete vocabularies have lost the text race while winning in vision; the structure-versus-scale debate is unresolved but pervasive structure retains a compute-efficiency advantage in the measurements that exist; the neuroscience case for non-linguistic cognition is now mainstream; and continuous reasoning research is maturing around a small set of published baselines without measuring any structural property of the thoughts it produces. A sixth, counter-theme cuts the other way: at language-modeling compute parity, vector transformers still beat matrix-valued models in this project's own logs, and the synthesis does not escape that constraint. The five supporting themes point to a specific gap the matrix-CODI experiment targets — no published work has measured the rank of continuous thoughts as a structural correlate of reasoning capacity — while the counter-theme sets a hard boundary on how much the reading can justify on its own. This is a single reader's reading of a large literature on a fixed date; the confirmation-bias risk is flagged directly in Section 05.
01 Background
A one-person lab has a compute budget measured in H100-hours and a time budget measured in weeks per experiment. Direction-setting (what to build next, what to drop, what to defer) has a disproportionate effect on the final output compared with the same decision inside a well-resourced group, because a single wrong direction consumes a larger fraction of the total work. Reading the field carefully is the cheapest way to reduce that risk at this scale.
The four experimental notes on this site (finding 01 through finding 04) report single-seed observations from the Matrix Thinker experiments. They are load-bearing for the project's empirical claims, but they were produced under a specific research framing that took shape in mid-2025 and has been under continuous attack since. The framing has narrowed over time. In its earliest form the project argued that matrix-valued tokens would produce cross-domain generalization through structural geometry; an attack pass in April 2026 showed that framing was wrong on six independent grounds (reshape equivalence, distribution mismatch, parameter confound, scale, reviewer failure mode, domain alignment). What survived is a narrower question about rank as a structural correlate of reasoning capacity in continuous-thought models. This note reports the literature reading that informed the narrowing and that the matrix-CODI experiment now sits inside.
The reason to publish a synthesis rather than keep it internal is that the notes in this series are internally cross-referenced and the reading of the field is the context against which the single-seed observations should be interpreted. Without it, a reader of the other four notes is seeing a sequence of small experimental results with no account of why we ran them in that order and not another. With it, the sequence becomes a bet against a specific open question in the continuous-reasoning literature, placed against a reading of where the field's most interesting threads are pointing.
02 Method
In April 2026, the author ran a multi-agent research session using Claude Code subagents. Each subagent was dispatched to survey a specific topic (JEPA, pure-sensor SSL, discrete vocabularies, and so on) by reading arXiv, major conference proceedings, and lab blog posts. The subagents returned structured reports that the author then synthesized into this note. This is not a systematic review in the clinical-research sense — there was no pre-registered search protocol, no blinding, and no independent validation of the returned paper lists. The agents may have missed relevant papers or hallucinated references; where possible, citations were cross-checked against the project bibliography (references.md). ML and AI topics carried a recency constraint of the last six to twelve months; mature fields like neuroscience and information theory had no recency cap. The eleven topics were:
- Continuous-reasoning models after COCONUT
- JEPA family and LeCun's post-Meta direction
- Pure-sensor / language-free representation learning
- Structured representations beyond flat vectors (hyperbolic, Clifford, TPR, PHM)
- Byte-level language models and the post-tokenizer world
- Long-context and memory-sparse attention methods
- Rank and dimensionality measurement in transformers
- Discrete vocabularies and modern VQ
- Auxiliary-loss and multi-token prediction methods
- Neuroscience of language and non-linguistic cognition
- Test-time compute and energy-based refinement
For each topic the agents returned published papers (with arXiv or venue identifiers), code availability where relevant, the central empirical or theoretical claim, the strongest published critique, and what the topic does not answer. The last item is the part that matters for a synthesis whose purpose is finding gaps. The outputs are archived under research/ in the project repository; the session transcripts are internal.
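The five fields each report carried can be pictured as a small record type. The sketch below is illustrative only: the class names, field names, and placeholder strings are assumptions made here, not the schema the subagents actually used.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PaperRef:
    """One paper returned by a subagent (hypothetical schema)."""
    title: str
    identifier: str                  # arXiv ID or venue identifier
    code_url: Optional[str] = None   # None when no public code was found


@dataclass
class TopicReport:
    """Per-topic report holding the five items listed in the Method section."""
    topic: str
    papers: List[PaperRef] = field(default_factory=list)
    central_claim: str = ""
    strongest_critique: str = ""
    # "What the topic does not answer": the field a gap-finding synthesis reads first.
    open_questions: List[str] = field(default_factory=list)

    def has_gap(self) -> bool:
        """True when the report names at least one unanswered question."""
        return bool(self.open_questions)


report = TopicReport(
    topic="Continuous-reasoning models after COCONUT",
    papers=[PaperRef(title="CODI", identifier="EMNLP 2025",
                     code_url="github.com/zhenyi4/codi")],
    central_claim="placeholder claim text",
    strongest_critique="placeholder critique text",
    open_questions=["Is the rank of continuous thoughts ever measured?"],
)
print(report.has_gap())
```

Keeping the open questions as their own field makes it easy to list every gap across the eleven topics without re-reading the full reports.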
Two caveats up front. The sampling frame is not exhaustive — the agents searched arXiv, major venue proceedings, and a handful of lab and author pages, with a bias toward English-language work and toward topics the project was already thinking about. The reader of the outputs is one person. There is no independent validation of the synthesis below, and the reader is the same person who chose the topics, framed the queries, and is now building the experiment the synthesis conveniently supports. Section 05 (Limitations) returns to this directly.
03 Themes
The six themes below are five supporting themes plus one counter-theme. The supporting themes are the ones that six or more cited papers from the April 2026 reading converged on. The counter-theme (3.6) is a result from this project's own experiment log that cuts against the supporting reading and is included so the synthesis is not a one-sided selection. Several other threads were investigated and did not reach the bar for either category (energy-based refinement, test-time training, long-context sparse attention, hypercomplex multi-modal fusion); they are noted in Section 05.
3.1 Non-language, non-autoregressive representation learning at scale
Two adjacent threads — JEPA family methods and pure-sensor self-supervised learning — tell the same story from different angles: large-scale representation learning without language supervision and without next-token prediction is becoming a serious alternative to the autoregressive default rather than a fringe research program. The merge is intentional; the two sub-threads reinforce each other at the theme level.
3.1.a JEPA momentum
Joint-embedding predictive architectures moved from a within-Meta research program to a broader direction over the last twelve months. LeJEPA [1] (November 2025) addressed JEPA's long-standing representation-collapse problem via Sketched Isotropic Gaussian Regularization — a single hyperparameter, no stop-gradient, no EMA, proven optimal target distribution under linear, k-NN, and kernel probes. Twenty lines of code, drops in. This removes the "JEPA is brittle" objection that was the standard reason not to build on top of it. LLM-JEPA [2] (September 2025) added JEPA auxiliary losses to standard LLM training and reports modest but statistically significant gains on a small set of benchmarks including GSM8K, showing that JEPA composes with the standard autoregressive recipe rather than replacing it.
V-JEPA 2 [3] (June 2025) scaled video JEPA to 1.2B parameters and reported 77.3 top-1 on Something-Something-v2, with robot pick-and-place scores that put it close to end-to-end trained policies on novel objects. V-JEPA 2.1 [4] (March 2026) added dense features and hierarchical self-supervision. VL-JEPA [5] (December 2025) predicts continuous text embeddings in place of autoregressive tokens and matches CLIP and SigLIP with roughly half the trainable parameters. LeWorldModel [6] (March 2026) is the first end-to-end JEPA from raw pixels with no stop-grad, no EMA, and no frozen encoders — 15M parameters on a single GPU, using LeJEPA's SIGReg unchanged. WavJEPA [7] extends the recipe to raw audio waveforms and is the closest existing template for a byte-level JEPA, which this reading confirmed does not yet exist in the public literature as of April 2026.
3.1.b Pure-sensor SSL matches language-supervised SSL at scale
For several years the default reading of visual representation learning was that language supervision (CLIP-style contrastive training on image-text pairs) gave a decisive advantage over pure self-supervised methods trained on images alone. Three results from 2025 shift that reading.
DINOv3 [8] (August 2025) is a ViT-7B trained on 1.7B images with no text and no labels. It is the first SSL model to beat weakly-supervised peers on the benchmarks it reports, including ImageNet classification, and its Gram anchoring technique addresses the dense-feature degradation that had limited earlier SSL models at scale. Web-SSL [9] (April 2025, Meta) is the controlled comparison: SSL and CLIP trained on identical MetaCLIP data, scaled to 7B parameters, and the visual SSL model matches CLIP on VQA and classic vision benchmarks. The authors' framing — that CLIP's prior advantage came from data rather than from language supervision — is the clean version of the result. The follow-up comparison paper [10] (October 2025, EMNLP Findings) found that CLIP wins text-intensive and fine-grained tasks while DINO wins vision-centric and low-level tasks, framing language supervision as a narrow bias toward nameable categories rather than a general advantage.
The finding that matters most for a project that wants to build machine-native representations comes from an unexpected direction. An October 2025 paper on object binding in vision transformers [11] probed patch embeddings with an IsSameObject classifier and found above 90% accuracy in DINOv2, DINO, and MAE, and near-chance accuracy in supervised ViTs. Object binding — a property the interpretability literature had argued is necessary for compositional reasoning — arises from self-supervision specifically, not from the backbone. If binding is not downstream of language supervision, then the claim that language is a prerequisite for compositional representations in neural networks is harder to defend.
3.2 Discrete vocabularies have lost the text race
The single cleanest data point in the synthesis is Meta's public research trajectory, which moved from Chameleon [12] (a VQ-tokenized multimodal model released in 2024) to BLT [13] (byte-level and tokenizer-free, first released in late 2024, with entropy-based dynamic patching that matches Llama 3 at 8B scale with up to 50% fewer inference FLOPs). Both papers are public; no formal deprecation has been announced. Bolmo [14] (AI2, December 2025) retrofits existing tokenized models to byte-level and reports a +16.5% absolute STEM improvement over BLT 7B, suggesting the advantage is not a one-shot BLT-specific trick.
The picture is different in vision and multimodal work. Emu3.5 [15] (34.1B, October 2025) is the largest native-discrete-token multimodal model and uses separate text BPE and visual VQ codebooks. Modern VQ methods (FSQ [16], BSQ [17], MAGVIT-v2 [18], SimVQ, UniTok [19]) continue to improve on codebook utilization and representation quality. The cross-modal picture is "VQ won vision, is contested for audio, and lost text." For a project that cares about text and language-based reasoning, the text column is the one that matters, and the arrow is pointing away from learned discrete vocabularies. MBLM [20], EvaByte [21], MambaByte [22], MEGABYTE [23], and bGPT [24] populate the byte-level side of the same move.
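The entropy-based dynamic patching mentioned for BLT can be illustrated with a minimal sketch. It assumes per-byte next-byte entropies are already available (in BLT they come from a small byte-level language model) and uses a single global threshold; the function name, the threshold value, and the entropy numbers are made up for illustration and are not taken from the paper's code.

```python
def patch_by_entropy(data: bytes, entropies, threshold: float):
    """Split `data` into patches, opening a new patch at byte i
    whenever the predicted entropy for byte i exceeds `threshold`."""
    if len(data) != len(entropies):
        raise ValueError("need one entropy value per byte")
    patches = []
    start = 0
    for i, h in enumerate(entropies):
        # A high-entropy byte is hard to predict, so it starts a new patch.
        if i > start and h > threshold:
            patches.append(data[start:i])
            start = i
    if start < len(data):
        patches.append(data[start:])
    return patches


# Hypothetical entropies (bits): high at word starts, low inside words.
text = b"the cat"
fake_entropies = [3.0, 0.2, 0.1, 2.5, 2.8, 0.3, 0.2]
patches = patch_by_entropy(text, fake_entropies, threshold=2.0)
print(patches)
```

Predictable stretches end up in long patches and unpredictable bytes open new ones, which is how compute gets concentrated where the byte stream is hardest to model.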
3.3 Structure-vs-scale is unresolved, but structure keeps winning when it is pervasive
HELM [25] (He et al., NeurIPS 2025) is the first billion-parameter fully hyperbolic LLM. All operations live on the Lorentz manifold via space-like-only operations plus constraint reconstruction. It reports gains of 0.5 to 2.3 points over Euclidean baselines on MMLU and ARC. These are small absolute deltas on near-chance benchmarks, where a 2-point move from 23% to 25% on MMLU (random is 25%) is a much weaker signal than a 2-point move at 70% accuracy. The headline is therefore not the margin, and not that it wins the benchmark, but the architectural commitment: no half-measures, no Euclidean escape hatch, no fallback layer. It is an existence proof that pervasive structured architecture trains stably at the scale where people care about scaling arguments.
Brehmer et al. [26] (TMLR 2025) is the companion result on the structure side. The paper is often misread as a pro-scaling result because it finds that the gap between equivariant and non-equivariant models shrinks at some budgets, but the careful reading is that equivariant models maintain roughly a 2× compute-efficiency advantage at every budget tested across 10^16 to 10^19 FLOPs. The advantage does not close; the claim that scale alone absorbs structure does not survive the measurement.
The counterweight on the other side is the Redhardt, Akram, and Schug NeurIPS 2025 Spotlight [27], which shows that synthetic MLPs achieve compositional generalization given full task-space coverage; it is the strongest scaling-side result in the literature at the moment. The limitation is that the synthetic setup does not transfer cleanly to language. Wilson's "Deep Learning is Not So Mysterious" [28] gives the PAC-Bayes framing that overparameterization plus a soft simplicity prior suffices without architectural restriction; HELM and Brehmer are the existence proofs that the argument does not carry all the way when the restriction runs through the whole stack and the evaluation is careful.
The pattern the synthesis extracts: structure wins when it is pervasive across the stack (HELM), loses or appears to lose when it is bolted on to a mostly unstructured backbone, and the Brehmer measurement shows the advantage is stable at scale rather than washing out. Half-commitments do not work. This matters for pebble's direction because the Matrix Thinker stack is structured at the token level and the attention level but not, currently, at the embedding table level or the output head in the byte-level configurations — which is one of the things the narrowed hypothesis forces us to examine.
3.4 The neuroscience case for non-linguistic cognition is mainstream
The Fedorenko, Piantadosi, and Gibson paper "Language is primarily a tool for communication rather than thought" [29] appeared in Nature in 2024 and, more than any single result in this synthesis, changed the ambient assumption a project like this operates under. fMRI dissociates the language network from reasoning, math, theory of mind, and music. The underlying dissociation has been reported for years in the cognitive-neuroscience literature; what changed in 2024 was the venue and the review-level framing. "Language is the medium of thought" is a contested claim rather than a default.
Zaslavsky et al. [30] (PNAS 2018) is an information-theoretic complement, though a narrower one than it is sometimes taken to be. The paper's result is specific to color naming: languages' color-term systems cluster near the information bottleneck optimum for the task of communicating color categories, which is evidence that languages evolve toward a communication-efficient coding for the task they are being used on. The paper is not a general theorem about inter-brain channel capacity, and it does not license the broader "language is optimized for a ~50 bit/s brain-to-brain channel" framing that appears in some readings. Taken at what the paper shows, it supports a narrower point: human language evolved under communication pressure and reflects that pressure at the lexicon level. Machines face a different optimization landscape, so the lexicon-level efficiencies do not automatically transfer.
Below the language-layer claim there is a set of now-well-established neural primitives: grid cells as a general substrate for abstract reasoning [31] [32]; attractor manifolds and integrator networks [33] [34]; sparse coding in visual cortex [35]; predictive coding [36] and the free-energy framework [37] [38]. None of these primitives are linguistic. The Trends in Cognitive Sciences 2024 review on the dimensionality of cognition [39] and the Journal of Neuroscience February 2025 paper on PFC dimensionality and cognitive control [40] report that the prefrontal cortex deploys a gradient of representational dimensionality across regions, and that dimensionality collapses on error trials. The brain has an observable rank-like quantity that tracks task difficulty and cognitive success. We cite this as motivation, not as mechanism: the point is that the "reasoning has a measurable structural correlate" framing has precedent on the biological side.
3.5 Continuous reasoning research is maturing without measuring structure
The continuous reasoning literature has a clear lineage. COCONUT [41] (Hao et al., Meta, ICLR 2025) established the setup: feed the last hidden state back as the next input embedding, use a curriculum to replace text reasoning with continuous thoughts, measure on GSM8K and ProsQA. COCONUT reported GSM8K 34.1% and ProsQA 97.0%. CODI [42] (Shen et al., EMNLP 2025) is the strongest successor: joint teacher-student via shared weights, L1 distillation at the ":" token across all layers, and simpler training with no curriculum. CODI reports GSM8K 43.7% against CoT 44.1%, effectively closing the gap to explicit chain-of-thought. Public code is available at github.com/zhenyi4/codi, the codebase pebble's matrix-CODI experiment will fork from.
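The feedback step in the COCONUT setup (last hidden state becomes the next input embedding) can be sketched with a toy stand-in for the backbone. Everything below is an illustrative assumption: `toy_backbone` is a hypothetical causal map, not a transformer, and the shapes are arbitrary. Only the loop structure mirrors the description above.

```python
import numpy as np


def toy_backbone(embeds: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Stand-in for a causal model: position t only sees inputs 0..t.
    Returns one hidden state per position, shape (T, d)."""
    counts = np.arange(1, len(embeds) + 1)[:, None]
    prefix_mean = np.cumsum(embeds, axis=0) / counts
    return np.tanh(prefix_mean @ W)


def continuous_thoughts(prompt_embeds: np.ndarray, W: np.ndarray, n_thoughts: int):
    """Append `n_thoughts` continuous thoughts to the input sequence.
    Each thought is the last hidden state, fed back as an input embedding
    without being decoded to a token."""
    seq = prompt_embeds
    thoughts = []
    for _ in range(n_thoughts):
        hidden = toy_backbone(seq, W)
        thought = hidden[-1]                      # last hidden state
        thoughts.append(thought)
        seq = np.vstack([seq, thought[None, :]])  # becomes the next input
    return seq, np.stack(thoughts)


rng = np.random.default_rng(0)
d = 8
prompt = rng.standard_normal((4, d))   # 4 prompt positions
W = rng.standard_normal((d, d)) / np.sqrt(d)
seq, thoughts = continuous_thoughts(prompt, W, n_thoughts=3)
print(seq.shape, thoughts.shape)
```

The thoughts never pass through a vocabulary, so any structure they carry has to be measured on the vectors themselves.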
CoT2 [43] (Gozeten et al., ICLR 2026) is the closest published prior art for the rank-superposition argument. It proves that parallelism in continuous chain-of-thought scales with embedding dimension via an existence construction (a 1-layer transformer with an MoE MLP solving MNNS), and tests dimension thresholds empirically through accuracy curves. The paper never measures rank or any structural property. Its claim about parallelism is validated indirectly through task-success curves. CoLaR [44], MarCos [45], SIM-CoT [46], PCCoT [47], and the survey [48] populate the field around it. Emergence of Superposition [49] (September 2025) studies training dynamics of continuous CoT on graph reachability and proves that an index-matching logit grows then saturates, but again does not measure rank. Reasoning by Superposition [50] (Zhu et al., May 2025) is the theoretical paper whose hand-designed continuous thoughts define the superposition hypothesis — it sets t_c = (1/√|V_c|) Σ u_v over BFS frontier vertices, so rank equals frontier size by construction — but never measures whether trained models realize this rank.
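The hand-designed thought t_c = (1/√|V_c|) Σ u_v can be checked numerically in a few lines. The sketch assumes orthonormal vertex embeddings u_v and an arbitrary example vertex set; both are choices made here for illustration. It reads "rank equals frontier size" as the number of vertex directions carrying nonzero weight, which is an interpretation, not a claim taken from the paper.

```python
import numpy as np


def superposition(vertex_ids, U: np.ndarray) -> np.ndarray:
    """t_c = (1/sqrt(|V_c|)) * sum of u_v over v in V_c.
    Rows of U are the vertex embeddings u_v."""
    ids = sorted(vertex_ids)
    return U[ids].sum(axis=0) / np.sqrt(len(ids))


rng = np.random.default_rng(0)
# Assumption: orthonormal vertex embeddings (rows of a square orthogonal matrix).
U, _ = np.linalg.qr(rng.standard_normal((6, 6)))

V_c = {1, 3, 4}                 # example vertex set, chosen arbitrarily
t_c = superposition(V_c, U)

overlaps = U @ t_c              # weight of t_c on each vertex direction
support = int(np.sum(np.abs(overlaps) > 1e-9))
print(np.linalg.norm(t_c), support)
```

With orthonormal embeddings the thought has unit norm and an equal weight of 1/√|V_c| on each member vertex, so the support size matches |V_c| by construction. Whether a trained model produces thoughts with this property is exactly what the construction leaves unmeasured.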
The 2026 rebuttal, The Illusion of Superposition [51], argues that fine-tuned COCONUT reaches 96.6% without any latent tokens and that entity-probes show no stepwise computation. The superposition claim is contested. Neither side of the dispute has produced a direct structural measurement of the thoughts themselves. The theorists define superposition and never measure it. The dimensional-collapse work (Dong et al. [52], the 2025 work on attention-output rank collapse [53], rank-structure in time-series transformers [54]) measures rank in transformer backbones but not in reasoning models. The interpretability work on chain-of-thought uses feature dictionaries and entity probes, not geometric rank. This is the gap the matrix-CODI experiment targets.
3.6 Counter-theme: scaling continues to win on text, and matched-FLOP comparisons still favor vector transformers
The five themes above are written as evidence for the project's direction. A synthesis that only collects support functions as an argument in literature-review clothing. The clearest result cutting the other way belongs in the same section as the supporting ones, where the reader can see it, rather than in a footnote.
Scaling continues to win on text. Structured-architecture gains in the recent literature are either at small scale or on narrow benchmarks. At language-modeling compute parity — matched FLOPs rather than matched parameters — vector transformers still beat matrix-valued models in the measurements this project has run. The relevant internal data point is Run 14 in EXPERIMENT_LOG.md: LoopFormer (vector baseline) reaches 0.87 BPB, Matrix Thinker reaches 1.67 BPB, on the same dataset, at matched FLOPs. The parameter-efficiency note (finding 04) reports the same ordering from the other angle. On the scaling side of the literature the strongest result is Redhardt, Akram, and Schug's NeurIPS 2025 Spotlight [27], which shows synthetic MLPs achieving compositional generalization from full task-space coverage without architectural constraints. Brehmer et al. [26] is often misread as pro-scaling; a careful reading places it on the pro-structure side at the efficiency margin, but it does not overturn the matched-FLOP text result: the structured model wins per-FLOP, the unstructured model still wins on the downstream loss at the FLOP budgets being compared.
The synthesis does not escape this constraint. Themes 3.1 through 3.5 are a reading of directions the field is exploring; theme 3.6 is a reading of what the current matched-FLOP comparisons show. The matrix-CODI experiment is designed to answer a structural question about continuous thoughts, not to close the matched-FLOP gap. If the narrowed hypothesis (rank as a structural correlate of reasoning capacity) holds, it is evidence for a specific interpretability claim; it does not by itself overturn the LoopFormer result. A reader who weights theme 3.6 heavily should expect the matrix direction to remain a research bet rather than a production recipe, and should expect pebble to iterate the byte-level JEPA path and the auxiliary-loss path as hedges against the matrix direction failing.
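For readers less used to the BPB unit in the Run 14 figures, the conversion from summed negative log-likelihood to bits per byte is short. This is a generic sketch of the standard definition, not the project's evaluation code, and the function name is an assumption.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over a corpus
    into bits per byte of the underlying text."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


# Sanity check: 1 bit of loss per byte over 100 bytes.
one_bit = bits_per_byte(math.log(2) * 100, 100)

# The gap between the two BPB figures quoted for Run 14.
gap = 1.67 - 0.87
print(one_bit, round(gap, 2))
```

Because the denominator is bytes of text, not tokens, BPB stays comparable across models with different tokenizations.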
04 Synthesis
The five supporting themes point in one direction; the counter-theme sets a boundary on how far that direction can travel unaided.
The narrowed hypothesis has three parts, stated in the precise form used in STATE.md:
- H1 (correlation). In a continuous-reasoning language model with matrix-valued thoughts, the effective rank of a thought matrix at reasoning step t correlates monotonically with the number of distinct reasoning paths held in superposition at that step. For tasks with known frontier sizes, Spearman correlation ρ > 0.3 at p < 0.01.
- H2 (capacity bound). Accuracy degrades when the number of required reasoning paths exceeds the matrix dimension d. A d × d matrix can correctly hold at most d linearly independent reasoning paths.
- H3 (causation). Forced low-rank projection of the thought matrix to rank k < |frontier| causes accuracy degradation in proportion to |frontier| − k. This rules out the alternative explanation that rank is a side effect rather than the mechanism.
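The three measurements the hypotheses call for can be sketched on synthetic data. The code below is a minimal illustration under stated assumptions: effective rank is taken to be the exponential of the entropy of the normalized singular values (one common definition; the definition used in STATE.md may differ), the "thought" matrices are synthetic equal-weight sums of orthonormal outer products and not outputs of a trained model, and the Spearman helper assumes no ties and returns no p-value.

```python
import numpy as np


def effective_rank(M: np.ndarray, eps: float = 1e-12) -> float:
    """exp(entropy) of the normalized singular-value distribution."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]                      # drop numerically zero directions
    return float(np.exp(-(p * np.log(p)).sum()))


def project_to_rank(M: np.ndarray, k: int) -> np.ndarray:
    """Best rank-k approximation via truncated SVD (the H3 intervention)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]


def spearman_rho(x, y) -> float:
    """Spearman correlation as Pearson correlation of ranks (assumes no ties)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])


def synthetic_thought(frontier_size: int, d: int, rng) -> np.ndarray:
    """Toy d x d thought: equal-weight sum of orthonormal outer products."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    cols = Q[:, :frontier_size]
    return cols @ cols.T / frontier_size


rng = np.random.default_rng(0)
frontier_sizes = [1, 2, 3, 4, 5]
ranks = [effective_rank(synthetic_thought(f, 8, rng)) for f in frontier_sizes]
rho = spearman_rho(frontier_sizes, ranks)                  # H1-style statistic
clipped = project_to_rank(synthetic_thought(4, 8, rng), k=2)  # H3-style ablation
print(ranks, rho, np.linalg.matrix_rank(clipped))
```

On this toy data the correlation is perfect by construction, so it only shows that the pipeline is wired correctly. The experiment's question is whether thoughts from a trained model show anything similar, and whether clipping rank below the frontier size hurts accuracy.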
The position the hypothesis sits in is narrow and specific. CoT2 defines a capacity bound in dimensional terms without measuring structure. Reasoning by Superposition hand-designs thoughts whose rank equals frontier size by construction and never tests trained models. The Illusion of Superposition argues the superposition story is empirically empty. Dimensional-collapse work measures rank in transformer backbones, not in reasoning loops. No published work simultaneously (a) trains a reasoning model, (b) measures the rank of its continuous thoughts, and (c) tests whether that rank causally tracks reasoning capacity. This is a single-experiment gap. Two days of code work, roughly two hours of compute on 8×H100, and the gap is closed in one direction or the other. Either the correlation and the causal ablation hold, in which case Reasoning by Superposition has a structural correlate and the dispute with The Illusion of Superposition has a falsifiable resolution, or they do not hold, in which case the Illusion rebuttal wins and pebble reframes. Both outcomes are publishable, and neither overturns theme 3.6 on its own — the structural measurement is orthogonal to the matched-FLOP loss comparison.
The second-order reading is that the four other notes in this series (rank enrichment, output-head dynamics, outer-product embedding, parameter efficiency) are all load-bearing for whether the matrix-CODI experiment is worth running. Rank enrichment and output-head dynamics establish that rank as measured on matrix tokens in iterative refinement is a non-trivial observable whose trajectory depends on the read-out. The outer-product embedding note establishes that the rank-1 starting point gives a low T=1 BPB at matched parameters in a direction that has held across three comparisons. The parameter-efficiency note establishes that the matrix projections the experiment uses are cheap enough in parameter terms to allocate a meaningful thinking block to a small model. None of the four notes alone justifies the matrix-CODI direction; the synthesis of the four with the field reading is what puts the experiment on the critical path.
The project's other avenues — byte-level JEPA with SIGReg, contextualized matrix embeddings, scaling to 10M+ parameters on standard benchmarks — are downstream of this first test. If the rank measurement on matrix-CODI produces signal, they become the scaling story. If it does not, at least one of them (likely the byte-level JEPA path, which draws directly on LeJEPA and WavJEPA) becomes the new lead and the matrix framing is retired to an auxiliary loss inside a JEPA-style training loop.
05 Limitations
- Date-bounded reading. The synthesis reflects the literature as of April 9, 2026. Several of the cited papers are weeks or months old; the field will have moved by the time this note is read. The JEPA direction and the continuous-reasoning direction in particular have been moving fast enough that a six-month-later synthesis would likely reorder several of the themes.
- Single reader, no independent validation. The synthesis was produced by one person running multi-agent research sessions and collating the outputs. There is no second reader, no adversarial review, no structured consensus procedure. Where the literature is contested (the superposition dispute between CoT2/Reasoning-by-Superposition and Illusion-of-Superposition is the clearest example), the reading is the author's judgment call rather than a community consensus.
- Confirmation bias is the limitation that worries us most. The author chose the eleven topics, framed the subagent queries, read the subagent outputs, and wrote the thematic groupings — and is the same person building the experiment the synthesis supports. The selection is not blind to the project's direction. Concretely, a reading that led to "build something other than matrix-CODI" is not reachable from the workflow that produced this note, and that should be priced in. Three specific counter-observations name the shape of the bias in the current literature: (a) Brehmer et al. TMLR 2025 [26] shows that data augmentation substantially closes the equivariance gap at scale, which weakens the "pervasive structure wins" reading of theme 3.3 if the gap is closed by simpler means; (b) Redhardt, Akram, and Schug's NeurIPS 2025 Spotlight [27] shows compositional generalization emerging from scale alone in synthetic MLPs, which weakens the "structure is necessary for compositionality" framing that animates several of the themes; and (c) this project's own Run 14 (LoopFormer 0.87 BPB vs Matrix Thinker 1.67 BPB at matched FLOPs, in EXPERIMENT_LOG.md) is an empirical loss against the matrix direction on the metric that matters most for text language modeling. The mitigation is partial: the attack pass in April 2026 (research/hypothesis-attack-april2026.md) killed the earlier and broader version of the hypothesis on six grounds, the narrowed hypothesis is the residue of that attack, and the matrix-CODI experiment is structured so that either direction of the result is informative. If the correlation and causal ablation both fail, the reading of theme 3.5 is wrong and we reframe. None of this eliminates the bias. It limits the damage it can do.
- Sampling frame. The agents searched arXiv, major venue proceedings, and a handful of author and lab pages. They did not systematically search workshop papers, non-English publications, or pre-2024 work outside the named reference lists. A result we missed that contradicts the synthesis would most likely come from one of those sources.
- Threads investigated and excluded. Energy-based transformers [55], test-time training [56], long-context sparse attention (Native Sparse Attention [57], DeepSeek Sparse Attention [58], Kimi Linear [59]), and hypercomplex multi-modal fusion were investigated and did not reach the bar for a theme. This is not the same as unimportance — several of them may matter a lot for the project's later stages. They are flagged so the absence is deliberate rather than ignorance.
- Not an experiment. This note reports a reading of a literature, not new empirical results. It should be weighed as such. The other four notes in this series report measurements the author made on models the author trained; this note reports the author's reading of papers written by other people. The epistemic status is different and the reader should treat it as different.
- Paper provenance. Several of the 2026 papers cited here (Illusion of Superposition, LeWorldModel, V-JEPA 2.1, HyperET, ByteFlow, and others) came out within the weeks prior to this synthesis and have not had time to accumulate replications or citations. The reading of them is based on the papers themselves rather than on community assessment. Any of them could look different in six months.
Papers and results that would falsify this reading
To make the confirmation-bias section operational rather than ritual, three specific results already in the literature (or in this project's own logs) would falsify the synthesis's direction if they hold up under the scrutiny they deserve. They are listed here so that a reader can check whether each one has strengthened or weakened since April 2026.
- Brehmer et al., TMLR 2025 [26]. The pro-structure reading used in theme 3.3 depends on equivariant models holding a roughly 2× compute-efficiency advantage across the FLOP range measured. A careful reading of the paper that concludes data augmentation closes the equivariance gap at scale, rather than structured architecture being the cause, would directly undercut theme 3.3. The same paper is cited both ways in the current literature; the reader should check which reading has won by the time they see this note.
- Redhardt, Akram, and Schug, NeurIPS 2025 Spotlight [27]. The paper shows compositional generalization emerging from scale alone in synthetic MLPs given full task-space coverage. If the result transfers to language at realistic scale — which the paper does not claim and the current synthesis treats as not yet demonstrated — then several supporting themes become weaker. The "structure is the mechanism for compositional reasoning" framing stops being load-bearing if scale alone is sufficient.
- Run 14 in the project's own EXPERIMENT_LOG.md. LoopFormer at 0.87 BPB vs Matrix Thinker at 1.67 BPB at matched FLOPs on the same dataset. If this ordering persists at 10M and 100M parameters under the matched-FLOP protocol, the matrix direction cannot justify itself on language-modeling loss alone, and the matrix-CODI experiment's survival depends entirely on the structural measurement being informative in a way that downstream users care about. If the structural measurement also fails to find signal, the matrix direction should be retired.
The synthesis is falsifiable at the per-theme level because of this list. A reader who wants to disagree can point at one of these three and explain which theme it undercuts.
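The Run 14 criterion can be made mechanical. The following is a minimal sketch, not the project's logging code: the function names and the list-of-pairs layout are hypothetical, and only the 0.87 and 1.67 BPB figures come from the log entry quoted above. It assumes each matched-FLOP comparison is recorded as a (vector BPB, matrix BPB) pair.

```python
import math


def nats_to_bpb(loss_nats_per_byte: float) -> float:
    """Convert a per-byte cross-entropy in nats to bits per byte."""
    return loss_nats_per_byte / math.log(2)


def bpb_gap(vector_bpb: float, matrix_bpb: float) -> dict:
    """Gap between two matched-FLOP runs; lower BPB is better."""
    return {
        "absolute": matrix_bpb - vector_bpb,
        "ratio": matrix_bpb / vector_bpb,
        "vector_wins": vector_bpb < matrix_bpb,
    }


def ordering_persists(runs: list[tuple[float, float]]) -> bool:
    """True if the vector model wins every recorded (vector, matrix) pair.

    An empty list returns False, so a missing measurement never counts
    as confirmation.
    """
    return bool(runs) and all(v < m for v, m in runs)


# Run 14 as quoted in the note. The 10M and 100M pairs would be appended
# here once they exist; none are invented in this sketch.
run14 = bpb_gap(0.87, 1.67)
persists = ordering_persists([(0.87, 1.67)])
```

The falsification condition in the list is then a single boolean over whatever scales have been run, which keeps the criterion fixed before the larger runs come in.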
06Future work
Two directions improve this synthesis.
The first is the matrix-CODI experiment itself. If it closes the rank-measurement gap in theme 3.5, the synthesis above stops being a literature reading and becomes the framing for an empirical result with a specific published position. If it fails, one of the themes was misread (most likely 3.5, possibly 3.3) and the next synthesis pass has a concrete anchor to correct against. Either outcome constrains the reading. Theme 3.6 remains a separate constraint either way: the matched-FLOP LoopFormer result is not affected by the rank measurement's outcome.
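The note does not specify which rank measure the experiment would use. As one illustration only, the sketch below computes two common candidates for a single matrix-valued thought: a thresholded numerical rank and an entropy-based effective rank. It assumes a thought is available as a 2-D NumPy array; the function names and the tolerance values are hypothetical choices, not the experiment's protocol.

```python
import numpy as np


def numerical_rank(thought: np.ndarray, rel_tol: float = 1e-6) -> int:
    """Count singular values above rel_tol times the largest one."""
    s = np.linalg.svd(thought, compute_uv=False)
    if s.size == 0 or s[0] == 0.0:
        return 0
    return int((s > rel_tol * s[0]).sum())


def effective_rank(thought: np.ndarray, eps: float = 1e-12) -> float:
    """Exponential of the Shannon entropy of the normalized singular values.

    Ranges from 1 (one dominant direction) up to min(rows, cols)
    (all singular values equal).
    """
    s = np.linalg.svd(thought, compute_uv=False)
    total = s.sum()
    if total <= eps:
        return 0.0
    p = s / total
    p = p[p > eps]  # drop numerically zero entries before taking logs
    return float(np.exp(-(p * np.log(p)).sum()))


# Two synthetic extremes, used only to sanity-check the measures.
full = np.eye(4)
rank_one = np.outer(np.arange(1.0, 5.0), np.arange(1.0, 5.0))
```

The effective rank varies smoothly with the singular-value spectrum, which makes it easier to correlate with task accuracy than an integer rank that depends on a threshold.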
The second is a structural improvement to the synthesis process. The current version has a single reader and a single collation pass; an adversarial second-reader pass would be the cheapest upgrade. The most valuable version would be a second researcher running the same eleven topics against the same sampling frame and writing an independent synthesis, with the two readings then compared. This is not affordable at the current project scale, but it is flagged here as the structural weakness the current format carries.
Three smaller follow-ups are practical at the current scale: (a) repeat the multi-agent session on a six-month cadence and track which themes stay stable and which reorder; (b) build a running bibliography of papers that contradict the synthesis, rather than only papers that support it; (c) when the matrix-CODI result is in, write a short follow-up note comparing the pre-experiment reading to the post-experiment reading and reporting where the two diverged. Follow-up (c) is the one that would most directly address the confirmation-bias limitation.
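Follow-up (a) needs a way to say which themes stayed put between sessions. The sketch below is one possible bookkeeping step, assuming each session ends with an ordered list of theme labels (strongest first). That ordering convention, the function name, and the later ordering in the example are all hypothetical; no second session has been run.

```python
def compare_theme_orders(previous: list[str], current: list[str]) -> dict:
    """Classify themes as stable, moved, dropped, or new between two sessions."""
    prev_pos = {theme: i for i, theme in enumerate(previous)}
    curr_pos = {theme: i for i, theme in enumerate(current)}
    shared = [theme for theme in previous if theme in curr_pos]
    return {
        "stable": [t for t in shared if prev_pos[t] == curr_pos[t]],
        "moved": {
            t: (prev_pos[t], curr_pos[t])
            for t in shared
            if prev_pos[t] != curr_pos[t]
        },
        "dropped": [t for t in previous if t not in curr_pos],
        "new": [t for t in current if t not in prev_pos],
    }


# Illustrative input only: the second list is an invented later ordering,
# not a result from any session.
first_session = ["3.1", "3.2", "3.3", "3.4", "3.5"]
later_session = ["3.1", "3.3", "3.2", "3.5"]
report = compare_theme_orders(first_session, later_session)
```

A dropped or newly appearing theme is reported separately from a reordering, since the two say different things about how the reading has shifted.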
07Reproducibility
The multi-agent session transcripts are internal and not archived in a public form. The outputs of the session (the per-topic research notes) are in the project repository under research/, including hypothesis-attack-april2026.md, project-analysis-march2026.md, thought-generation-feasibility-april2026.md, long-context-byte-models-april2026.md, and several others. The consolidated bibliography is references.md at the repository root, organized by topic. Every paper cited in this note is in that file. The reverse is not true: the bibliography contains many papers the synthesis did not use, and some of them are more important for individual threads than the ones cited here. The reading of the field above is the author's reading on a fixed date and is not independently replicated.
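The claim that every cited paper appears in references.md can be partly checked by machine. The sketch below compares arXiv identifiers only, so entries cited by venue alone (for example the Nature papers) still need a manual pass. The function names are hypothetical, and the sample strings stand in for the note and bibliography files, which would be read from disk in practice.

```python
import re

# Matches identifiers written as "arXiv:2505.23648" (4 digits, dot, 4-5 digits).
ARXIV_ID = re.compile(r"arXiv:(\d{4}\.\d{4,5})")


def arxiv_ids(text: str) -> set[str]:
    """Return the set of arXiv identifiers mentioned in a block of text."""
    return set(ARXIV_ID.findall(text))


def missing_from_bibliography(note_text: str, bibliography_text: str) -> set[str]:
    """Identifiers cited in the note that the bibliography does not contain."""
    return arxiv_ids(note_text) - arxiv_ids(bibliography_text)


# Stand-in strings; the identifiers are taken from this note's reference list.
sample_note = "CODI (arXiv:2502.21074) and COCONUT (arXiv:2412.06769)."
sample_bibliography = "Shen, Z., et al. CODI. arXiv:2502.21074."
missing = missing_from_bibliography(sample_note, sample_bibliography)
```

An empty result supports the one-directional claim; the reverse difference (bibliography minus note) is expected to be non-empty by design.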
The cited papers themselves are all public and accessible through the arXiv or venue identifiers in the bibliography. A reader who wants to check the synthesis can start from the reference list, read the papers in whichever theme is most load-bearing for their question, and form their own reading. That independent reading does not require the synthesis to be correct: the themes are organized so that disagreement is legible at the per-theme level.
References
JEPA family
- Balestriero, R., & LeCun, Y. (Nov 2025). LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics. arXiv:2511.08544.
- Huang, H., LeCun, Y., et al. (Sep 2025). LLM-JEPA: LLMs Meet Joint Embedding Predictive Architectures. arXiv:2509.14252.
- Meta FAIR (Jun 2025). V-JEPA 2. arXiv:2506.09985.
- Meta FAIR (Mar 2026). V-JEPA 2.1. arXiv:2603.14482.
- Meta (Dec 2025). VL-JEPA. arXiv:2512.10942.
- Maes, Q., LeCun, Y., et al. (Mar 2026). LeWorldModel (LeWM). arXiv:2603.19312.
- (2025). WavJEPA. arXiv:2509.23238.
Pure-sensor / language-free SSL
- Siméoni, O., Vo, H., Seitzer, M., et al. (Aug 2025). DINOv3. Meta AI. arXiv:2508.10104.
- Fan, L., et al. (Apr 2025). Web-SSL: Scaling Language-Free Visual Representation Learning. Meta. arXiv:2504.01017.
- Liu, Y., et al. (Oct 2025). Data or Language Supervision: What Makes CLIP Better than DINO? EMNLP Findings 2025. arXiv:2510.11835.
- (Oct 2025). Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers? arXiv:2510.24709.
Discrete vocabularies and byte-level models
- Meta (May 2024). Chameleon: Mixed-Modal Early-Fusion Foundation Models. arXiv:2405.09818.
- Meta (2025). BLT: Byte Latent Transformer. ACL 2025. arXiv:2412.09871.
- AI2 (Dec 2025). Bolmo: Byteification of OLMo 3. arXiv:2512.15586.
- BAAI (Oct 2025). Emu3.5: Native Multimodal Models are World Learners. arXiv:2510.26583.
- Mentzer, F., et al. (2024). FSQ: VQ-VAE Made Simple. ICLR 2024. arXiv:2309.15505.
- Zhao, Y., et al. (2025). BSQ: Image and Video Tokenization with Binary Spherical Quantization. ICLR 2025. arXiv:2406.07548.
- Yu, L., et al. (2024). MAGVIT-v2 / LFQ: Language Model Beats Diffusion. ICLR 2024. arXiv:2310.05737.
- Ma, C., Jiang, J., et al. (2025). UniTok. NeurIPS 2025 Spotlight. arXiv:2502.20321.
- IBM (2025). MBLM: Multiscale Byte Language Models. ICML 2025. arXiv:2502.14553.
- HKU + SambaNova (2025). EvaByte. hkunlp.github.io/blog/2025/evabyte/.
- Cornell (2024). MambaByte. COLM 2024. arXiv:2401.13660.
- Meta (2023). MEGABYTE. NeurIPS 2023. arXiv:2305.07185.
- Microsoft Research Asia (Feb 2024). bGPT. arXiv:2402.19155.
Structure vs scale
- He, Y., et al. (2025). HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts. NeurIPS 2025. arXiv:2505.24722.
- Brehmer, J., et al. (Jul 2025). Does equivariance matter at scale? TMLR. arXiv:2410.23179.
- Redhardt, F., Akram, Y., & Schug, S. (2025). Scaling can lead to compositional generalization. NeurIPS 2025 Spotlight. arXiv:2507.07207.
- Wilson, A. G. (2025). Deep Learning is Not So Mysterious or Different. ICML 2025. arXiv:2503.02113.
Neuroscience of cognition and language
- Fedorenko, E., Piantadosi, S. T., & Gibson, E. (2024). Language is primarily a tool for communication rather than thought. Nature.
- Zaslavsky, N., Kemp, C., Regier, T., & Tishby, N. (2018). Efficient compression in color naming and its evolution. PNAS.
- Moser, E. I., et al. (2024). Grid Cells in Cognition: Mechanisms and Function. Annual Review of Neuroscience.
- (2024). Grid Cells and Abstract Reasoning. bioRxiv 2024.11.20.624569. biorxiv.org/content/10.1101/2024.11.20.624569v1.full.
- Chaudhuri, R., Gerçek, B., Pandey, B., Peyrache, A., & Fiete, I. (2019). Intrinsic attractor manifold and population dynamics. Nature Neuroscience.
- Khona, M., & Fiete, I. (2022). Attractor and integrator networks in the brain. Nature Reviews Neuroscience.
- Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by sparse coding. Nature.
- Rao, R. P. N., & Ballard, D. H. (1999). Predictive coding in the visual cortex. Nature Neuroscience.
- Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience.
- Parr, T., Pezzulo, G., & Friston, K. (2022). Active Inference: The Free Energy Principle in Mind, Brain, and Behavior. MIT Press.
- (2024). Dimensionality of Cognition. Trends in Cognitive Sciences. cell.com/trends/cognitive-sciences/fulltext/S1364-6613(24)00189-X.
- (Feb 2025). PFC Dimensionality and Cognitive Control. Journal of Neuroscience, 45(6):e0233242024. jneurosci.org/content/45/6/e0233242024.
Continuous reasoning
- Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., & Tian, Y. (2024). Training Large Language Models to Reason in a Continuous Latent Space (COCONUT). Meta, ICLR 2025. arXiv:2412.06769.
- Shen, Z., et al. (2025). CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation. EMNLP 2025. arXiv:2502.21074.
- Gozeten, A., Ildiz, M. E., Zhang, Y., Harutyunyan, H., Rawat, A. S., & Oymak, S. (2025). Continuous Chain of Thought Enables Parallel Exploration and Reasoning (CoT2). ICLR 2026. arXiv:2505.23648.
- Xiaomi (2025). CoLaR: Dynamic Latent Compression of Reasoning Chains. NeurIPS 2025. arXiv:2505.16552.
- (Sep 2025). MarCos: Markov Chain of Continuous Thoughts. arXiv:2509.25020.
- (Sep 2025). SIM-CoT: Supervised Implicit Chain of Thought. arXiv:2509.20317.
- Wu, et al. (Jun 2025). PCCoT: Parallel Continuous Chain-of-Thought with Jacobi Iteration. arXiv:2506.18582.
- Zhu, H., et al. (2025). A Survey on Latent Reasoning. arXiv:2505.16782.
- (Sep 2025). Emergence of Superposition. arXiv:2509.23365.
- Zhu, H., et al. (May 2025). Reasoning by Superposition. arXiv:2505.12514.
- (2026). The Illusion of Superposition. arXiv:2604.06374.
Rank and dimensionality measurement
- Dong, Y., Cordonnier, J.-B., & Loukas, A. (2021). Attention is not all you need: Pure attention loses rank doubly exponentially with depth. ICML 2021. arXiv:2103.03404.
- (Aug 2025). Dimensional Collapse in Transformer Attention Outputs. arXiv:2508.16929.
- (Oct 2025). Understanding Transformers for Time Series: Rank Structure. arXiv:2510.03358.
Other threads investigated but not themed
- (Jul 2025). Energy-Based Transformers. arXiv:2507.02092.
- Stanford (2024). TTT-Linear / TTT-MLP. arXiv:2407.04620.
- DeepSeek (2025). Native Sparse Attention (NSA). ACL 2025 Best Paper. arXiv:2502.11089.
- DeepSeek (2025). DeepSeek Sparse Attention (DSA). DeepSeek-V3.2. arXiv:2512.02556.
- Moonshot AI (Oct 2025). Kimi Linear. arXiv:2510.26692.