The Gradient Does Not See Rank

abstract

Continuous chain-of-thought models compress reasoning into latent tokens. Matrix-valued variants introduce rank as a single-sample structural observable on the latent matrix Z. If matrix latents carry parallel reasoning paths via superposition, rank should track them, and truncating Z to low rank should hurt accuracy on tasks whose solutions plausibly require multiple components. Across four training regimes of a matrix-CODI model (three on ProsQA, one on GSM8K-Aug below the learning threshold), the rank-k projection ablation curve is flat to within 0.6 percentage points. A three-seed replication yields 81.5 ± 1.2pp accuracy while the final effective rank of Z spans {4, 12, 13}; the loss does not reward any particular rank. To test whether rank-blindness arises from the flatten-then-project readout alone, we trained four readouts: a bilinear reparametrization, a bilinear-plus-GELU readout nonlinear in Z, an SVD-augmented readout feeding singular values through an MLP, and a quadratic readout in ZZ^⊤. All four rank-k curves remain flat (Spearman p-values 0.63, 0.14, 0.82, 0.46). The flat curves persist for readouts nonlinear in Z. A linear probe on Z underperforms a raw pretrained hidden state at target prediction (AUC 0.673 vs. 0.846). A negative control on vanilla GPT-2 SFT (no matrix bottleneck, no Z, three seeds, n=500) reproduces a flat rank-k curve under the same intervention paradigm with pooled-mean range 0.20pp, and a random-h sensitivity floor lands at the same accuracy: the rank-k ablation alone conflates rank-blindness with position-irrelevance.

01 Introduction

Continuous chain-of-thought (CoT) models replace explicit textual reasoning steps with continuous latent tokens fed back into the transformer's residual stream. COCONUT [2] and CODI [5] are representative instances: both compress an explicit rationale into a small number of continuous latent positions and decode the answer from the resulting state. Theoretical work [3, 4] argues that these latents can hold multiple reasoning paths in superposition, so continuous CoT could explore a search tree in parallel.

Rizvi-Martel et al. [8] pushed back: a fine-tuned COCONUT model reaches 96.6% on ProsQA without feeding back any latent tokens, against 99.0% with latents and 85.3% for explicit CoT. They named this the Illusion of Superposition.

Matrix-valued latents [5] make the measurement concrete: a d × d thought Z has a computable rank via its SVD. If each singular direction encodes a separate reasoning path, truncating Z to rank k at inference should degrade accuracy when the task needs more than k paths. The rank-k ablation curve is the natural probe.

We report five results on a matrix-CODI bottleneck (GPT-2 small, d = 16, six latent positions, ProsQA):

Rank-k ablation is flat at two distillation weights, with the multiplicative thinker on or off, and on GSM8K-Aug as well as ProsQA. Range across k ∈ {1, 2, 4, 8, 16} is ≤ 0.6 pp.
Three seeds at otherwise identical hyperparameters land at effective ranks {4, 12, 13} and accuracies 81.51 ± 1.2pp.
Four alternative readouts (a bilinear reparametrization that is still linear in Z, plus bilinear+GELU, SVD-augmented, and quadratic in ZZ^⊤, all three nonlinear in Z) also give flat curves; Spearman p in [0.14, 0.82].
A linear probe on Z (1536 features across six positions) reaches AUC 0.673 on ProsQA target prediction; a pretrained GPT-2 hidden state at 768 features reaches 0.846 on the same task.
Rank-k on vanilla GPT-2 SFT (no Z, three seeds) reproduces a flat curve under the same intervention paradigm; pooled range 0.20 pp.

The seed-level rank spread is what distinguishes the rank-blindness reading from position-irrelevance.

02 Background and setup

CODI distillation. CODI [5] trains a student to compress an explicit chain-of-thought into a fixed number of continuous latent positions by matching a hidden-state target from a teacher pass. The teacher pass consumes prompt + CoT + answer and produces a reference hidden state at a designated colon-token position. The student pass consumes prompt + n latent positions + answer, where each latent position is produced by feeding the previous step's hidden state back as the next input embedding. A hidden-state L1 loss (the distillation loss) aligns the student's state at the answer colon to the teacher's, and a standard next-token cross-entropy loss trains the answer prediction. The total loss is ℒ = γ·ℒ_kd + ℒ_ce.
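The combined objective can be sketched in a few lines of NumPy. This is a minimal illustration, not the released training code: the function names are hypothetical, the L1 term is assumed to use a mean reduction (the text does not state the reduction), and the teacher state is treated as a fixed target.

```python
import numpy as np


def l1_distill_loss(student_h, teacher_h):
    """L_kd: mean absolute difference between student and teacher states
    at the answer colon (mean reduction is an assumption)."""
    return float(np.mean(np.abs(student_h - teacher_h)))


def cross_entropy(logits, target_ids):
    """L_ce: mean next-token cross-entropy; logits has shape (T, V)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(target_ids)), target_ids]))


def codi_loss(student_h, teacher_h, answer_logits, answer_ids, gamma=1.0):
    """Total loss L = gamma * L_kd + L_ce, as written in the text."""
    return gamma * l1_distill_loss(student_h, teacher_h) + cross_entropy(
        answer_logits, answer_ids
    )
```

Setting `gamma=0.0` drops the distillation term entirely, which corresponds to the γ = 0 runs reported later.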

Why matrix latents. Vector representations are flat: a hidden state of dimension D has D scalar entries with no native notion of how many independent components it superposes. Matrix-valued representations have an additional structural observable: rank, the number of independent directions used. Matrix-valued working memory has a long lineage in neural networks — fast-weight networks (Schmidhuber 1992; Schlag, Irie, Schmidhuber 2021), linear attention (Katharopoulos et al. 2020), the xLSTM family (Beck et al. 2024), and the Mamba/SSM family (Gu & Dao 2023) all use matrix-shaped state — and recent work measures the rank of these states as a diagnostic of how the model uses its memory [9, 10]. In all of those, the matrix lives inside a layer (an attention head's accumulator, an SSM's recurrent state) and serves as recurrent memory. Matrix-CODI extends the lineage to a different placement: the matrix sits on the explicit chain-of-thought feedback path, where each latent reasoning position is a d × d matrix rather than a vector. The rank of that matrix is a candidate measure of how many reasoning paths the model holds in superposition at that step. The hypothesis we test is whether rank, so defined, behaves as that measure under CODI-style training.

Matrix bottleneck. A matrix-CODI model forks CODI and inserts a d × d matrix bottleneck on each latent position's feedback path. Given the previous latent hidden state h ∈ ℝ^D (here D = 768 for GPT-2 small), the bottleneck maps it up to a d × d matrix Z, optionally applies a multiplicative thinking step Z ← (I + Δ(Z)) Z (I + Γ(Z)), and projects back to the residual dimension via a readout φ : ℝ^d×d → ℝ^D. The default readout is φ(Z) = W_down · vec(Z) — a flatten-then-project. We use d = 16 throughout.
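A minimal NumPy sketch of the bottleneck pipeline described above follows. The weights are random stand-ins, and the forms chosen for Δ(Z) and Γ(Z) (a tanh of a linear map, with hypothetical parameters `A` and `B`) are assumptions made only so the example runs; the text does not specify them.

```python
import numpy as np

D, d = 768, 16  # residual width and matrix side, as in the text
rng = np.random.default_rng(0)

# Randomly initialised stand-ins for trained weights (illustrative only).
W_up = rng.normal(scale=D ** -0.5, size=(d * d, D))
W_down = rng.normal(scale=(d * d) ** -0.5, size=(D, d * d))
# Hypothetical thinker parameters; the form of Delta and Gamma is assumed.
A = rng.normal(scale=0.01, size=(d, d))
B = rng.normal(scale=0.01, size=(d, d))


def bottleneck(h, use_thinker=True):
    """Map h in R^D up to a d x d matrix Z, optionally apply the
    multiplicative thinker, then read out back to R^D."""
    Z = (W_up @ h).reshape(d, d)
    if use_thinker:
        I = np.eye(d)
        delta = np.tanh(Z @ A)  # assumed form of Delta(Z)
        gamma = np.tanh(B @ Z)  # assumed form of Gamma(Z)
        Z = (I + delta) @ Z @ (I + gamma)
    h_next = W_down @ Z.reshape(-1)  # flatten-then-project readout phi(Z)
    return h_next, Z


h = rng.normal(size=D)
h_next, Z = bottleneck(h)
```

With `use_thinker=False` the whole map collapses to `W_down @ W_up @ h`, which is the linear path the Jacobian argument in §04 relies on.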

Rank-k ablation and effective rank. At inference, compute the SVD Z = UΣV^⊤ and replace Z by its rank-k truncation before the readout. If rank is functional, accuracy should drop as k decreases to 1; if rank is vestigial, the curve is flat. The numerical rank of Z is always d since it is a dense trained matrix, so we report effective rank exp(−Σ_i σ̃_i log σ̃_i) with σ̃_i = σ_i / Σ_j σ_j for training curves; the ablation itself uses hard top-k truncation. The question is whether the structural capacity offered by rank > 1 is functionally used.
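The two quantities defined above can be computed as in the following sketch. The function names are illustrative and are not taken from the released rank_eval.py; the small `eps` guard is an added assumption to avoid log(0).

```python
import numpy as np


def truncate_rank(Z, k):
    """Hard top-k truncation: keep the k largest singular directions of Z."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


def effective_rank(Z, eps=1e-12):
    """exp of the entropy of the normalised singular values of Z."""
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically zero entries before taking logs
    return float(np.exp(-(p * np.log(p)).sum()))


rng = np.random.default_rng(0)
Z = rng.normal(size=(16, 16))
Z4 = truncate_rank(Z, 4)  # rank-4 version of Z, as used in the ablation
```

A matrix with equal singular values has effective rank equal to its dimension, and a rank-1 outer product has effective rank 1, which gives two quick sanity checks on the definition.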

ProsQA. ProsQA [2] is a synthetic entailment task: a diamond-shaped DAG of entities with a question of the form which leaf entity has property P?. Each problem has a unique positive answer and a single distractor. We use the training split from facebookresearch/coconut; the test set has 128 problems, matching the original CODI release.

03 Four flat rank-k curves

We ran the matrix bottleneck under four training conditions that vary the task, the CODI distillation weight γ, and the multiplicative thinker.

run	task	γ	thinker	Z rank	k=1	k=16	Spearman r
R1	GSM8K-Aug	1.0	on	~5.5	6.00%	6.12%	−0.023
R2	ProsQA	1.0	on	~10.2	78.4%	78.4%	+0.026
R3a	ProsQA	0.0	on	~12.7	76.8%	76.6%	−0.105
R3b	ProsQA	0.0	off	~12.8	72.6%	72.4%	+0.095

table 1 Rank-k projection ablation across four training conditions. Z rank is mean effective rank at eval time (exp of singular-value entropy). Range across k ∈ {1, 2, 4, 8, 16} is ≤ 0.6 pp in every row. R1 (GSM8K-Aug) is at a 6% operating point below the learning threshold and is not interpretable on its own.

Varying the task (arithmetic vs. logical), the distillation weight (removing the L1-at-colon loss raises effective rank from ~10 to ~13 but leaves the curve shape unchanged), or the multiplicative thinker (turning it off drops accuracy by ~4 pp, R3a vs. R3b in Table 1; the curve is still flat) does not bend any of the four curves. A vanilla GPT-2 SFT model with no Z at all also produces a flat curve under the same probe (§06, negative control). The three-seed decoupling result in §05 is the model-level evidence the negative control cannot reproduce by construction.

04 The readout Jacobian is constant in Z

The flatten-then-project readout φ(Z) = W_down · vec(Z) is linear in Z. Its Jacobian

∂φ(Z)/∂Z[i, j] = W_down[:, ij]     (constant in Z)

does not depend on Z. The loss gradient with respect to Z is therefore a vector contracted with a constant tensor: the contraction preserves the row-space of W_down but carries no information about the singular structure of Z, and ℒ itself provides no gradient signal preferring one rank of Z over another. This is local to ℒ; implicit bias from the optimizer (Adam + weight decay) and upstream parameter regularization may still shape rank through channels outside ℒ (see the implicit low-rank bias discussion in §09 and the limitations in §10).
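The constancy claim can be checked numerically. The sketch below uses a random stand-in for W_down and central finite differences; it compares the readout Jacobian at a full-rank Z and at a rank-1 Z. The vector `g` is a placeholder for the upstream gradient ∂ℒ/∂φ, which in a real model may itself depend on Z through the downstream transformer; the check concerns only the readout factor.

```python
import numpy as np

D, d = 768, 16
rng = np.random.default_rng(0)
W_down = rng.normal(size=(D, d * d))  # random stand-in for the trained readout


def phi(Z):
    """Flatten-then-project readout."""
    return W_down @ Z.reshape(-1)


def numerical_jacobian(f, Z, eps=1e-6):
    """Central-difference Jacobian of f with respect to vec(Z)."""
    J = np.zeros((D, d * d))
    for idx in range(d * d):
        E = np.zeros(d * d)
        E[idx] = eps
        E = E.reshape(d, d)
        J[:, idx] = (f(Z + E) - f(Z - E)) / (2 * eps)
    return J


Z_full = rng.normal(size=(d, d))                              # generic full-rank Z
Z_rank1 = np.outer(rng.normal(size=d), rng.normal(size=d))    # rank-1 Z

J_full = numerical_jacobian(phi, Z_full)
J_rank1 = numerical_jacobian(phi, Z_rank1)

# Gradient reaching Z for a given upstream gradient g = dL/dphi (placeholder).
g = rng.normal(size=D)
grad_Z = (W_down.T @ g).reshape(d, d)
```

Both Jacobians equal `W_down` up to floating-point error, so for a fixed `g` the gradient reaching Z is the same matrix whatever the spectrum of Z.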

The four training conditions in Table 1 are consistent with this prediction. §06 tests what happens when the sufficient condition is violated.

05 Three-seed accuracy-rank decoupling

If the loss does not reward rank, two training seeds of the same configuration should be allowed to land at different ranks while reaching comparable accuracy. We ran three seeds (1337, 42, 7) of the flatten-readout configuration on ProsQA with γ = 0, 25 epochs, batch 16, AdamW at lr = 10⁻⁴.

figure 1 Three seeds of the same flatten-readout configuration. Accuracy is tight (81.51 ± 1.2pp, shaded band is mean ± std), but the final effective rank of Z varies 3× (seed 42 converges at rank ~4; seeds 7 and 1337 converge near rank 12). The loss does not push Z toward any particular rank.

The same loss lands at effective ranks spanning a 3× range, at matching accuracy. This is direct evidence that the matrix-bottleneck training objective is rank-agnostic, and it is stronger evidence than the flat rank-k curve alone because it indicates that the loss landscape is flat along rank-changing directions of Z.

result 1 Three training seeds of the same configuration yield ProsQA accuracies 80.47%, 81.25%, 82.81% (81.51 ± 1.2pp) while converging to final effective ranks 12.9, 4, 12 — a 3× spread in rank at matching accuracy. Rank is a free direction in the loss landscape.
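The summary statistics in Result 1 can be reproduced directly from the three reported values. The sketch assumes the ± figure is a sample standard deviation (ddof=1); the text does not say which estimator was used, but the population form (ddof=0) gives about 1.0 rather than 1.2.

```python
import numpy as np

# Per-seed values as listed in Result 1 (accuracy in %, final effective rank).
acc = np.array([80.47, 81.25, 82.81])
erank = np.array([12.9, 4.0, 12.0])

mean_acc = acc.mean()
std_acc = acc.std(ddof=1)                   # sample std, assumed estimator
rank_spread = erank.max() / erank.min()     # ratio of largest to smallest rank

print(f"accuracy {mean_acc:.2f} +/- {std_acc:.1f} pp, rank spread {rank_spread:.1f}x")
```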

06 Positive control: four readouts nonlinear in Z

The Jacobian argument in §04 gives a sufficient condition for rank-blind gradients: a readout Jacobian that is constant in Z. The natural prediction is that a readout with a non-constant Jacobian, one explicitly nonlinear in Z, should bend the rank-k curve. We tested four variants, replacing only the readout φ and leaving everything else (γ=0, d = 16, six latent positions, seed 1337, 25 epochs, batch 16) identical. Because all four are trained at γ=0 (no CODI L1-at-colon term), the flat curves below concern the cross-entropy loss through the matrix bottleneck, not specifically the CODI distillation loss.

Bilinear. φ(Z) = W · [u_k^⊤ Z v_k]_{k=1..d²}. Reparametrization of flatten. Still linear in Z. Serves as a control.
Bilinear + GELU. φ(Z) = W · GELU([u_k^⊤ Z v_k]). Nonlinear in Z.
SVD-augmented. φ(Z) = W_down · vec(Z) + MLP(σ(Z)) where σ(Z) are the singular values of Z. Explicitly exposes rank.
Quadratic. φ(Z) = W_down · vec(concat(Z Z^⊤, Z^⊤ Z)). Second-moment readout. Quadratic in Z.
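The four variants listed above, together with the flatten baseline, can be sketched as follows. All weights are random stand-ins. The MLP width (`HIDDEN = 32`), the tanh approximation of GELU, and the separate weight `W_quad` of shape D × 2d² for the quadratic readout are assumptions introduced so that the shapes work out; the released code may differ.

```python
import numpy as np

D, d, HIDDEN = 768, 16, 32  # HIDDEN is a hypothetical MLP width
rng = np.random.default_rng(0)
W_down = rng.normal(scale=0.05, size=(D, d * d))
W = rng.normal(scale=0.05, size=(D, d * d))
W_quad = rng.normal(scale=0.05, size=(D, 2 * d * d))  # assumed shape
U = rng.normal(scale=d ** -0.5, size=(d * d, d))      # row k is u_k
V = rng.normal(scale=d ** -0.5, size=(d * d, d))      # row k is v_k
W1 = rng.normal(scale=0.1, size=(HIDDEN, d))
W2 = rng.normal(scale=0.1, size=(D, HIDDEN))


def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def bilinear_features(Z):
    """The d^2 scalars u_k^T Z v_k."""
    return np.einsum("ki,ij,kj->k", U, Z, V)


def flatten(Z):
    return W_down @ Z.reshape(-1)


def bilinear(Z):
    return W @ bilinear_features(Z)          # linear in Z


def bilinear_gelu(Z):
    return W @ gelu(bilinear_features(Z))    # nonlinear in Z


def svd_augmented(Z):
    s = np.linalg.svd(Z, compute_uv=False)   # singular values of Z
    return W_down @ Z.reshape(-1) + W2 @ gelu(W1 @ s)


def quadratic(Z):
    feats = np.concatenate([(Z @ Z.T).reshape(-1), (Z.T @ Z).reshape(-1)])
    return W_quad @ feats                    # degree-2 in Z
```

The bilinear readout is additive in Z and the quadratic readout scales as the square of Z, which matches the "linear control" and "second-moment" descriptions in the list.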

figure 2 Rank-k ablation curves for four readouts on 128 ProsQA test problems. The four positive-control readouts all produce curves flat to within ~0.8 pp: bilinear reparametrization, bilinear+GELU (explicitly nonlinear in Z), SVD-augmented (feeds singular values through an MLP), and quadratic (second moment). Quadratic is perfectly flat at 79.69% across all five k. The dashed reference line is the flatten baseline (79.0%, from Table 2). Data: experiment-runs/2026-04-17_round_pc/rank_evals/{bilinear,bilinear_gelu,svd_aug,quadratic}_rankeval.json, md5-verified, accuracy recomputed from per-problem records. Script: assets/plots/generate_matrix_codi_rank_ablation.py.

readout	k=1	k=2	k=4	k=8	k=16	Spearman r	p
flatten	79.0%	79.0%	79.0%	79.0%	79.0%	~0	flat
bilinear	78.12%	78.91%	78.91%	78.12%	78.12%	+0.04	0.63
bilinear + GELU	78.91%	79.69%	79.69%	79.69%	79.69%	−0.13	0.14
SVD-augmented	77.34%	78.12%	78.12%	77.34%	78.12%	+0.02	0.82
quadratic	79.69%	79.69%	79.69%	79.69%	79.69%	+0.07	0.46

table 2 Rank-k ablation on 128 ProsQA problems, by readout. All four positive-control Spearman p-values are at or above 0.14. Quadratic is identical across all five k. The bilinear, bilinear + GELU, SVD-augmented, and quadratic rows are independently reproduced from raw per-problem records at experiment-runs/2026-04-17_round_pc/rank_evals/ (md5-verified; see fig 2). The flatten row is carried as originally reported. A 2026-07 site audit could not locate a raw 128-problem eval archive reproducing it exactly, and 79.0% × 128 = 101.12 is not an integer number of problems, unlike the four verified rows. The value is very likely a rounded restatement of an eval run under slightly different conditions (the archived rankeval.json files log epochs=10, while this section's prose states 25) rather than an error, but it has not been re-verified bit-for-bit and is flagged here rather than silently treated as equally certain.
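The integer-count check mentioned in the Table 2 caption can be reproduced with a few lines. The helper name and the 0.01 pp tolerance are illustrative choices, not part of the release.

```python
N_PROBLEMS = 128


def implied_count(pct, n=N_PROBLEMS, tol=0.01):
    """Return (count, ok): the nearest integer number of correct problems
    and whether count / n reproduces the reported percentage within tol."""
    count = round(pct / 100 * n)
    return count, abs(100 * count / n - pct) <= tol


# Values appearing in Table 2; only the flatten value 79.0 fails the check.
for pct in (77.34, 78.12, 78.91, 79.69, 79.0):
    print(pct, implied_count(pct))
```

The four verified rows map to 99, 100, 101, and 102 correct answers out of 128, while 79.0% falls between 101/128 and 102/128.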

result 2 All four positive-control p-values are at or above 0.14; the quadratic readout is identical across all five k. The SVD-augmented readout, which exposes singular values directly to the optimizer, does not produce a rank-dependent curve either. Readouts with non-constant Jacobians still produce flat rank-k curves.

A plausible refinement: the trained readout's Jacobian at test inputs has an effectively rank-1 active subspace in Z, regardless of whether the readout is in-principle nonlinear. In the absence of an objective term that rewards rank, every readout family tested admits a rank-1 shortcut and the optimizer takes it. We have not yet measured erank(J(Z)) on these checkpoints to test this directly. Combined with the three-seed decoupling (effective ranks {4, 12, 13} at matched accuracy), the rank of Z in matrix-CODI is a free direction in the loss landscape, and the four readouts above do not constrain it.

Negative control: rank-k ablation on vanilla GPT-2 SFT

If the flat rank-k curves were specific to the matrix-bottleneck objective, running the same probe on a model with no bottleneck should bend them. We ran that test on a vanilla GPT-2 small fine-tuned for ProsQA via standard supervised fine-tuning (no latent tokens, no Z, no distillation). Three seeds {1337, 42, 7} were trained to ~79% ProsQA accuracy, matching the paper's vanilla baseline. We then construct a fake Z at inference by reshaping the first 256 dimensions of h into a 16×16 matrix at the six token positions immediately preceding the answer-prefix colon (the analog of matrix-CODI's six latent positions), apply rank-k truncation to the fake Z via SVD, and propagate the modified residual through the remaining transformer blocks. Decoding uses no KV cache, so the intervention re-fires at every step.
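The fake-Z intervention can be sketched on a plain array of hidden states. This is an illustration under stated assumptions: `hidden` is a (T, D) array at whichever layer the intervention is applied (the text does not name the layer), `colon_pos` is the index of the answer-prefix colon, and the function names are hypothetical.

```python
import numpy as np


def truncate_fake_z(h, k, d=16):
    """Rank-k truncate the first d*d entries of one hidden state, reshaped
    to d x d; the remaining entries are left untouched."""
    out = h.copy()
    Z = h[: d * d].reshape(d, d)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    out[: d * d] = ((U[:, :k] * s[:k]) @ Vt[:k, :]).reshape(-1)
    return out


def ablate_positions(hidden, colon_pos, k, n_positions=6, d=16):
    """Apply the truncation at the n_positions tokens immediately before
    colon_pos. Assumes colon_pos >= n_positions."""
    out = hidden.copy()
    for t in range(colon_pos - n_positions, colon_pos):
        out[t] = truncate_fake_z(hidden[t], k, d)
    return out
```

In an actual run the returned array would replace the residual stream before the remaining transformer blocks; that wiring depends on the model code and is not shown here.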

seed	k=1	k=2	k=4	k=8	k=16
1337	79.80	80.20	80.00	80.20	80.00
42	79.00	78.80	78.60	78.60	79.00
7	78.00	78.00	78.00	78.00	78.20
pooled	78.93	79.00	78.87	78.93	79.07

table 3 Negative control: rank-k ablation on vanilla GPT-2 SFT (no matrix bottleneck, no Z). Fake Z built from h[:256] reshaped to 16×16 at the six analog-latent positions. Pooled-mean range across k is 0.20 pp. Per-seed Spearman r_s: +0.32, −0.16, +0.71 (n=500 test problems each).
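The pooled row and the 0.20 pp range follow from the per-seed rows of Table 3, as this short check shows. It assumes the pooled value is the unweighted mean over the three seeds, which is consistent with the reported numbers.

```python
import numpy as np

ks = [1, 2, 4, 8, 16]
# Per-seed accuracies (%) from Table 3; rows are seeds 1337, 42, 7.
acc = np.array([
    [79.80, 80.20, 80.00, 80.20, 80.00],
    [79.00, 78.80, 78.60, 78.60, 79.00],
    [78.00, 78.00, 78.00, 78.00, 78.20],
])

pooled = acc.mean(axis=0)                        # unweighted mean across seeds
pooled_range = pooled.max() - pooled.min()       # spread of the pooled curve over k
per_seed_range = acc.max(axis=1) - acc.min(axis=1)

for k, p in zip(ks, pooled):
    print(f"k={k:2d}  pooled={p:.2f}")
```

The seed-to-seed offsets (about 2 pp between seeds 1337 and 7) are an order of magnitude larger than the variation across k within any one seed.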

As a sensitivity floor, replacing h at the same six positions with i.i.d. Gaussian noise matched in mean and standard deviation produces seed accuracies {79.6, 79.2, 78.2}%, statistically indistinguishable from the unablated {80.0, 78.8, 78.0}%. The intervention paradigm is uninformative on this model.
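One way to read "matched in mean and standard deviation" is sketched below: each replaced hidden state is drawn i.i.d. from a Gaussian whose mean and standard deviation equal those of the vector it replaces. Matching per vector is an assumption; the statistics could also be matched per dimension or per position across the dataset, and the text does not say which.

```python
import numpy as np


def random_h(h, rng):
    """Draw a replacement for h with i.i.d. Gaussian entries whose mean and
    standard deviation match those of h (per-vector matching is assumed)."""
    return rng.normal(loc=h.mean(), scale=h.std(), size=h.shape)


rng = np.random.default_rng(0)
h = rng.normal(loc=0.3, scale=2.0, size=768)
h_noise = random_h(h, rng)
```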

A flat rank-k curve is consistent with two states: a rank-blind objective, or positions that do not carry the task's information. Vanilla SFT is in the second by construction. The rank-k ablation alone cannot distinguish them. In matrix-CODI, the bottleneck forces information through Z at those positions during training, and the trained Z reaches effective rank 12–13 at γ=0. The seed-decoupling result in §05 is a model-level property a position-irrelevance reading would not predict.

07 Depth and scale do not rescue matrix-CODI

Two natural follow-up questions are whether the failure is specific to six latent positions and whether it persists at larger scale. We ran a depth sweep (vanilla CODI with iterative refinement) and a scale sweep (vanilla SFT and matrix-CODI at gpt2-small, gpt2-medium, gpt2-large).

figure 3 Scale sweep on ProsQA (series: vanilla SFT, matrix-CODI). Matrix-CODI tracks below vanilla SFT at every tested scale. Vanilla SFT itself degrades at gpt2-large; ProsQA's 17,886 training examples appear to be too few for the larger backbone at default learning rate. Matrix-large is pending (batch 4 and batch 2 both OOM).

figure 4 Depth sweep (vanilla CODI, no matrix bottleneck). At n = 6 latent refinement steps the model reaches 78.91%, below the no-refinement vanilla SFT baseline (81.77%). The n = 16, 32, 64 points re-run at smaller batches are pending and shown as translucent placeholders.

Two observations. First, adding iterative latent refinement at n = 6 hurts ProsQA accuracy — the "more latent thought means better reasoning" story fails at this scale independently of the matrix question. Second, matrix-CODI tracks below vanilla SFT at both gpt2-small and gpt2-medium; we do not have the large matrix point yet. Vanilla SFT itself degrades at gpt2-large on ProsQA — we read that as a data-size artefact of ProsQA (17,886 training examples) under default AdamW at the larger backbone, not as a finding about superposition. Matrix-CODI does not rescue the regression.

08 What does Z encode, if not rank?

If rank is not functional, we should ask what is encoded in the matrix thought. We ran a 5-fold cross-validated multi-class logistic regression to predict the ProsQA target class from a flattened Z on 500 held-out test problems. Controls: the same prompt's hidden state from a pretrained GPT-2 with no ProsQA fine-tuning, and the same model at random initialization.

figure 5 Linear probe AUC for ProsQA target class prediction. Pre-registered threshold for a positive result was max(vanilla, random) + 0.05 = 0.896. The matrix Z concat AUC of 0.673 does not exceed it. Vanilla GPT-2, never trained on ProsQA, predicts the target class better than the trained matrix-CODI bottleneck.

result 3 The matrix-CODI bottleneck's Z concatenated across 6 positions (1536 features) reaches AUC 0.673 at predicting the ProsQA target class. Vanilla pretrained GPT-2, never trained on ProsQA, reaches 0.846 at 768 features. Despite having more features, the matrix bottleneck carries less target-predictive information than the raw pretrained hidden state. A dimension-matched comparison (probe on the post-bottleneck reconstructed 768-dim hidden state that the downstream transformer consumes) is pending. A binary target-vs-distractor probe on the same Z tensors is at chance (AUC 0.50–0.56) across all conditions.

09 Related work

Latent CoT and the Illusion of Superposition. COCONUT [2] replaces explicit rationales with continuous latent tokens; CODI [5] compresses CoT into latents via self-distillation. Rizvi-Martel et al. [8] report that a fine-tuned COCONUT reaches 96.6% on ProsQA without latents, 99.0% with them, and 85.3% with explicit CoT — the latent machinery is not doing measurable work in their fine-tuned setting. Our vanilla SFT baseline at GPT-2 small reaches 81.77%, roughly 15 pp below their 96.6%; we did not close this gap. The qualitative phenomenon they report — fine-tuned latent CoT models reaching comparable accuracy without their latents — replicates at our operating point: matrix-CODI at 82.03% vs. pure SFT at 81.77% (gap 0.26 pp, within three-seed noise). Our contribution relative to that work is a structural argument about the training objective that does not depend on matching their accuracy: the matrix-bottleneck objective produces rank-indifferent gradients, and four readouts nonlinear in Z are unable to bend the rank-k curve.

SIM-CoT. Shen et al. [6] diagnose latent CoT instability as insufficient step-level supervision and propose injecting per-step targets. Our diagnosis is at a different layer (the matrix bottleneck's objective produces rank-indifferent gradients, adjudicated by four positive-control readouts).

Reasoning by Superposition and CoT2. Zhu et al. [3] prove that a two-layer transformer with D steps of continuous thought can solve directed graph reachability, with each thought encoding a parallel BFS frontier. Gozeten et al. [4] show similar parallel-exploration behavior under a GRPO-style training regime. Both are theoretical capacity results with small empirical demonstrations. Capacity and what-gets-learned are distinct; our result is about what CODI distillation shapes, not whether transformers can in principle encode superposition.

February 2026 rank measurements (direct adjacency). Nazari & Rusch [9] measure the effective rank of linear-attention hidden states and propose post-training rank pruning of K and Q matrices. State Rank Dynamics [10] reports "state-rank stratification" during pretraining: linear-attention heads bifurcate into persistently low-rank and high-rank groups. Both papers measure rank in the fast-weight memory inside an attention layer (a d × d accumulator), and both make descriptive claims. Our object of study is different — the explicit per-position matrix latents Z on the matrix-CODI feedback path, not a fast-weight memory inside attention — and the claim is a mechanism claim about the training objective that we test by constructing four nonlinear-in-Z positive controls.

Dynamics within latent CoT. Anonymous [7] run multiple intervention protocols on latent CoT hidden states (zero, mean, step-wise mean, Gaussian noise) and an early-stop decoding that truncates latent computation after step k. Their early-stop decoding is a cousin of our rank-k ablation on a different axis — step depth vs. spectral truncation.

Rank decay from depth. Dong et al. [1] showed that pure attention loses rank doubly exponentially with depth. Their subject is the rank of activations across a stack of attention layers; ours is the rank of an explicit matrix latent on a feedback path. The two phenomena are distinct.

Implicit low-rank bias. Gradient descent on matrix-factorization losses has an implicit bias toward low-rank solutions [11, 12, 13], and Kobayashi et al. [14] show that weight decay specifically induces low-rank attention products via an equivalence with nuclear-norm regularization. These concern bias through the parameter space; the Jacobian argument in §04 concerns the loss gradient through the readout. Our three-seed effective rank spread {4, 12, 13} under AdamW at weight decay 0.01 is consistent with an implicit low-rank attractor plus seed-dependent convergence, and inconsistent with strong low-rank collapse (which would predict all three seeds at the same low rank).

Alternative substrates and probe critique. Wang et al. [15] argue that latent reasoning lives in the vocabulary column space, not in the SVD directions of the hidden state; if so, rank-k truncation on Z is the wrong observable. Li & Janson [17] show that zero/resample ablations (of which rank-k truncation is an instance) overestimate component importance relative to optimal ablation; that cuts in our favor (optimal ablation would make the curves flatter, not bumpier). The negative control in §06 is the empirical analog: running rank-k on a model with no Z reproduces a flat curve, so the probe alone cannot isolate rank-blindness from position-irrelevance.

10 Discussion and limitations

Single task and architecture family. All core experiments run on ProsQA on GPT-2 small/medium/large. The GSM8K-Aug result in Table 1 is at a 6% operating point where the model is barely learning the task; it is not strong evidence on its own. The Jacobian argument in §04 is stated in terms of the readout φ and is therefore architecture-agnostic, but the empirical evidence covers only this scale family. Cross-dataset replication on GSM8K at a higher-accuracy operating point is pending.

Seed-dependent Z rank. The three-seed decoupling is a separate finding: the same configuration, varying only the seed, produces models at effective ranks {4, 12, 13} with accuracies {81.25, 82.81, 80.47}. Three seeds do not give statistical power to claim the rank distribution is flat or uniform; the narrower claim is that seeds at otherwise identical hyperparameters converge to materially different effective ranks, inconsistent with a strong loss-side preference for a specific rank. Implicit regularization from the optimizer (Adam + weight decay) may still shape rank through channels outside ℒ. An n=10 replication is pending.

One seed per positive control. The four positive-control variants in §06 were each trained once (compute-bounded). Spearman p-values are computed on 128 test problems per checkpoint, which limits power for small effects. A three-seed replication per variant (~42 H100-hours) and re-running the four positive-control rank-k evaluations on the full 500-problem ProsQA test set are both pending; the 500-problem eval raises power to detect |r_s| ≥ 0.15 from ~40% to ~80% at α=0.05.

Alternative explanation: the task is rank-1-solvable. ProsQA has a unique positive answer and a single distractor. If all answer-predictive information lives in one singular direction of Z, every architecture would converge to a rank-1 functional solution and rank-k truncation would be flat. Our data are consistent with that. What sits awkwardly with the strong reading is that the trained Z reaches effective rank 12–13 at γ=0 instead of collapsing to 1, and three seeds spread to {4, 12, 13} rather than concentrating. The model builds rank it does not functionally use, and the rank it builds is seed-dependent. A reasoning task whose ground truth provably requires k > 1 independent quantities at the answer position would disambiguate; we do not have one at this scale.

A result that would revise our reading: a readout that bends the rank-k curve on ProsQA (or a comparable structured task) under a matrix-bottleneck objective. The four readouts in §06 were chosen to maximize the chance of seeing one. A different training objective that explicitly rewards rank is a separate direction we do not address.

11 Conclusion

The matrix-bottleneck training objective in CODI does not reward rank: the readout Jacobian carries no rank information through the chain rule, the flat rank-k curves are insensitive to nonlinear readouts that escape the linear-Jacobian shortcut, and three seeds under matched hyperparameters land at effective ranks {4, 12, 13} with statistically indistinguishable accuracy. The rank-k probe alone could not distinguish rank-blindness from position-irrelevance; the seed-level rank spread does.

12 Reproducibility

The paper as published at the ICML 2026 Mechanistic Interpretability Workshop: the-gradient-does-not-see-rank.pdf (213 KB).

All training, evaluation, and probe code is released at github.com/saml212/matrix-states. The release includes:

run_matrix_codi.py: the matrix-CODI training script. The MatrixBottleneck class implements the w_up → reshape → thinker → flatten → w_down pipeline. All five readouts in §06 are selectable via the --readout flag.
probe_z.py: the linear probe pipeline that produces the AUC numbers in Figure 5.
rank_eval.py: the rank-k projection evaluator. Given a checkpoint, it computes the accuracy-vs-k curve and the Spearman correlation between per-sample effective rank and correctness.
Raw rank-k evaluation JSONs (Figure 2 and Table 2): experiment-runs/2026-04-17_round_pc/rank_evals/.
Experiment log with per-run hyperparameters, rank trajectories, and wall-clock times: EXPERIMENT_LOG.md.
ProsQA from the facebookresearch/coconut release. GSM8K-Aug from the CODI release (whynlp/gsm8k-aug). Backbones: gpt2, gpt2-medium, gpt2-large.

Training hardware: a single NVIDIA H100 (80GB HBM3) per run. All reported numerical results in the main body trace to a specific checkpoint and evaluator run in the release.

References

[1] Dong, Y., Cordonnier, J.-B., & Loukas, A. (2021). Attention is not all you need: Pure attention loses rank doubly exponentially with depth. ICML 2021. arXiv:2103.03404
[2] Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., & Tian, Y. (2024). Training large language models to reason in a continuous latent space (COCONUT). arXiv:2412.06769
[3] Zhu, H., Hao, S., Hu, Z., Jiao, J., Russell, S., & Tian, Y. (2025). Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought. NeurIPS 2025. arXiv:2505.12514
[4] Gozeten, A., Ildiz, M. E., Zhang, Y., Harutyunyan, H., Rawat, A. S., & Oymak, S. (2025). Continuous Chain of Thought Enables Parallel Exploration and Reasoning (CoT2). arXiv:2505.23648
[5] Shen, Z., Yan, H., Zhang, L., Hu, Z., Du, Y., & He, Y. (2025). CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation. EMNLP 2025. arXiv:2502.21074
[6] Shen, Z., et al. (2026). SIM-CoT: Step-level Implicit Supervision for Continuous Chain of Thought. ICLR 2026. arXiv:2509.20317
[7] Anonymous (2026). Dynamics Within Latent Chain of Thought. arXiv:2602.08783
[8] Rizvi-Martel et al. (2026). The Illusion of Superposition? A Principled Analysis of Latent Thinking in Language Models. arXiv:2604.06374
[9] Nazari, P., & Rusch, T. K. (2026). The Key to State Reduction in Linear Attention: A Rank-based Perspective. arXiv:2602.04852
[10] (2026). State Rank Dynamics in Linear Attention LLMs. arXiv:2602.02195
[11] Gunasekar, S., et al. (2017). Implicit Regularization in Matrix Factorization. NeurIPS 2017. arXiv:1705.09280
[12] Arora, S., Cohen, N., Hu, W., & Luo, Y. (2019). Implicit Regularization in Deep Matrix Factorization. NeurIPS 2019. arXiv:1905.13655
[13] Razin, N., & Cohen, N. (2020). Implicit Regularization in Deep Learning May Not Be Explainable by Norms. NeurIPS 2020. arXiv:2005.06398
[14] Kobayashi, S., Akram, Y., & von Oswald, J. (2024). Weight Decay Induces Low-Rank Attention Layers. NeurIPS 2024. arXiv:2410.23819
[15] Anonymous (2025). Latent Reasoning in LLMs as a Vocabulary-Space Superposition. arXiv:2510.15522
[16] Fan, Q., Huang, H., & He, R. (2025). Breaking the Low-Rank Dilemma of Linear Attention. CVPR 2025. arXiv:2411.07635
[17] Li, M., & Janson, L. (2024). Optimal Ablation for Interpretability. NeurIPS 2024. arXiv:2409.09951
