Per-layer parameter efficiency of matrix projections

abstract

At matrix dimension d = 16, a RowThenCol bilinear projection, silu(A · M) · B with A, B ∈ ℝ^(d×d), uses 2 · d² = 512 parameters. The flattened equivalent, which reshapes M to a d²-dim vector and applies a d² × d² Linear layer, uses d⁴ = 65,536 parameters. The ratio is 128×. The same structural argument gives 32× at Kronecker K = 4 and 16× at Kronecker K = 8. This is a per-layer claim at one matrix dimension. It is not a claim that matrix operations are better at matched compute. The RowThenCol projection costs O(d³) FLOPs per step against O(d⁴) for the flattened Linear, which at d = 16 is an 8× FLOP gap rather than 128×, and we expect memory bandwidth and kernel launch overhead to eat much of that gap on an H100; we have not measured the realized speedup (see Limitations). The allocation consequence at our scale is that in an 8-layer Matrix Thinker at d = 32 the 8 thinking layers occupy ~4% of the parameter budget while embeddings and output head occupy the rest; at byte-level d = 16 (Run 19, 218K total parameters, 12 thinking layers) the thinking layers expand to ~33% of the budget as the embedding table shrinks. The structural efficiency is useful where parameters are scarce and compute is not the binding constraint.

01 Background

Per-layer parameter efficiency has a long history in structured linear algebra. Tensor decompositions cut parameters by factoring a weight tensor into a product of smaller tensors. Low-rank adaptation approaches restrict gradient updates to a rank-r subspace of a full weight matrix. Block-diagonal and other structured matrix families cut parameters by imposing structural sparsity. Parameterized Hypercomplex Multiplication (PHM) from Zhang et al. 2021 [3] expresses a linear layer as a sum of n Kronecker products and reduces parameters by roughly a factor of n, at the cost of computing the n-way Kronecker sum. Bilinear pooling from the fine-grained visual recognition literature uses outer products to produce second-order features with far fewer parameters than the full quadratic expansion would need.

The common structure across these methods is that a linear map from ℝ^m to ℝ^n that would cost m · n parameters is replaced with a factored form whose parameter count scales sub-linearly in m · n. The catch is that the factoring constrains the map: not every m × n matrix can be represented as a low-rank product or a sum of Kronecker terms, so there is a capacity ceiling the full dense layer does not have. The useful regime is the one where the constrained class is expressive enough for the task and the parameter budget is the binding constraint.

Matrix-valued token representations change the unit of computation from a d-dim vector to a d × d matrix. The natural "linear layer" in this setting is a map from one d × d matrix to another. If the matrix is immediately flattened, the layer becomes a standard d² × d² dense Linear with d⁴ parameters. If the matrix is kept as a matrix and the layer is built from left- and right-multiplications, the layer uses two d × d parameter tensors for 2 · d² total parameters per projection. The two are not interchangeable maps — the second is a constrained subspace of the first — but the second is what the rest of the Matrix Thinker stack is already wired to consume through row-wise and column-wise attention, RowThenCol projections, and bilinear output heads. The question this note addresses is how large the per-layer parameter gap is and what the allocation consequence looks like when the gap is applied to an 8-layer stack.

02 Setup

Parameter and FLOP counts in this note come from two internal project research notes (research/matrix-native-projections.md and research/matrix-native-operations-code.md), which are not external peer-reviewed sources. FLOP counts use the 2-FLOP-per-fused-multiply-add convention from research/matrix-native-projections.md; the alternative 1-FLOP-per-FMA convention (used in research/matrix-native-operations-code.md) gives half the absolute numbers but preserves the 8× ratio.

The two projections

We compare two ways to realize a learned map from a d × d input matrix M to a d × d output matrix, at d = 16:

Flattened Linear. Reshape M to a d²-dim vector, apply a dense Linear of shape (d², d²), reshape back. Parameter count: d² · d² = d⁴ = 65,536. This is the map that can realize any linear function from ℝ^d² to ℝ^d². It is also what a standard transformer's (d_model, d_model) Linear layer becomes if we take d_model = d² and treat the d²-dim vector as a flattened matrix.
RowThenCol bilinear. Keep M as a matrix. Apply two learned d × d parameters A and B:
```
M_out = silu(A @ M) @ B
```
A @ M is a learned row mixing: each row of the output is a learned linear combination of the rows of M, with weights given by the corresponding row of A. silu is applied pointwise. silu(A @ M) @ B is a learned column mixing: each column of the output is a linear combination of the columns of silu(A @ M), with weights given by the corresponding column of B. The composition is a nonlinear sandwich with parameter count 2 · d² = 512. This is the projection we use for Q, K, V, output, and the data-dependent Δ and Γ inside the multiplicative thinking layer. Source: matrix-thinking/src/matrix_thinker.py, class RowThenColProjection.
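To make the row and column roles concrete, here is a minimal pure-Python reference for silu(A @ M) @ B on a single matrix. It is a sketch for illustration, not the project's module: the helper names (`silu`, `matmul`, `row_then_col`) and the toy 3 × 3 inputs are assumptions introduced here. Using a permutation for A or B shows which axis each side acts on.

```python
import math


def silu(x):
    """SiLU (swish): x * sigmoid(x), applied to one scalar."""
    return x / (1.0 + math.exp(-x))


def matmul(X, Y):
    """Nested-list matrix product X @ Y."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def row_then_col(A, M, B):
    """Reference forward pass silu(A @ M) @ B for a single d x d matrix M."""
    row_mixed = matmul(A, M)  # row i of the result = sum_k A[i][k] * (row k of M)
    activated = [[silu(v) for v in row] for row in row_mixed]
    return matmul(activated, B)  # column j of the result mixes columns via B[:, j]


# Toy inputs (hypothetical values, chosen only to make the mixing visible).
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
SWAP01 = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # swaps index 0 and 1
M = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

rows_swapped = row_then_col(SWAP01, M, I3)  # A permutes rows, B is the identity
cols_swapped = row_then_col(I3, M, SWAP01)  # A is the identity, B permutes columns
```

With A set to the swap matrix the first two rows of silu(M) trade places; with B set to the swap matrix the first two columns do.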

For completeness we also include two Kronecker variants from the project's earlier research on matrix-native projections:

Kronecker K=4. M_out = Σ_k=1..4 A_k · M · B_k. Four sum terms, two matrices per term (A_k, B_k), d² params per matrix: 2·K·d² = 2·4·256 = 2,048. Source: [1].
Kronecker K=8. Same form with eight terms: 2·K·d² = 2·8·256 = 4,096 at d = 16.

Parameter counts

The four numbers are fixed by the projection form and the matrix dimension. They are not run-dependent:

projection	form	params at d=16	vs flattened
Flattened Linear	d² × d² dense	65,536	1.0×
Kronecker K=8	Σ_k=1..8 A_k · M · B_k	4,096	16× fewer
Kronecker K=4	Σ_k=1..4 A_k · M · B_k	2,048	32× fewer
RowThenCol bilinear	silu(A · M) · B	512	128× fewer

fig 1. Parameter count per projection at d = 16 for four projection families. Counts from research/matrix-native-projections.md and research/matrix-native-operations-code.md. The "vs flattened" column is 65,536 / params. The headline ratio for RowThenCol is 128×.
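The four counts in the table follow directly from the projection forms, so they can be recomputed in a few lines. The function names below are hypothetical, introduced only for this sketch.

```python
def flattened_linear_params(d):
    """Dense (d^2, d^2) weight, bias ignored."""
    return d ** 4


def kronecker_params(d, K):
    """K sum terms, two d x d matrices (A_k, B_k) per term."""
    return 2 * K * d * d


def row_then_col_params(d):
    """Two d x d matrices, A and B."""
    return 2 * d * d


d = 16
counts = {
    "Flattened Linear": flattened_linear_params(d),
    "Kronecker K=8": kronecker_params(d, 8),
    "Kronecker K=4": kronecker_params(d, 4),
    "RowThenCol bilinear": row_then_col_params(d),
}
# "vs flattened" column: flattened count divided by each family's count.
ratios = {name: counts["Flattened Linear"] // n for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name:20s} {n:>7,d} params  {ratios[name]:>3d}x vs flattened")
```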

Per-step FLOPs, so nothing is hidden

The parameter efficiency does not come for free. At d = 16 the per-step FLOP counts for the same four projections are:

projection	params	flops (fwd)	flops / param
Flattened Linear	65,536	131,072	2
Kronecker K=8	4,096	131,072	32
Kronecker K=4	2,048	65,536	32
RowThenCol bilinear	512	16,384	32

fig 2. FLOPs per projection per token at d = 16. Counts from research/matrix-native-projections.md. A flattened Linear has FLOPs ≈ 2 × params (one multiply-add per parameter per token). Matrix-native projections apply each parameter O(d) times in the inner loop, so FLOPs / param rises from 2 to 32 at d = 16. The RowThenCol bilinear uses 8× fewer FLOPs than the flattened Linear per step, but that is a weaker speedup than the 128× parameter gap would suggest.

The "FLOPs per parameter" column is the important one to read. A flattened Linear spends each parameter once per token per step: one multiply-add (two FLOPs) per scalar weight. A matrix-native projection uses each scalar parameter d times during the sandwich computation, because each entry of A takes part in one multiply-add for every column of M, and each entry of B in one for every row of silu(A · M). At d = 16 that is a 16× reuse factor, which cancels a factor of 16 from the 128× parameter gap. The remaining factor of 8 survives as a genuine per-step FLOP reduction. RowThenCol at d = 16 therefore costs 8× fewer FLOPs per step, not 128× fewer. The parameter efficiency and the compute efficiency have the same sign but different magnitudes, and the compute efficiency is the smaller of the two.
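The FLOP table can be reproduced from the matmul shapes under the 2-FLOP-per-FMA convention. This is a sketch with hypothetical function names; it counts only the two d × d by d × d matrix products per sandwich term and ignores the pointwise silu and the additions that sum Kronecker terms, which is an assumption of the sketch.

```python
FLOPS_PER_FMA = 2  # convention used for the table


def flattened_linear_flops(d):
    """One FMA per weight of the (d^2, d^2) dense layer."""
    return FLOPS_PER_FMA * d ** 4


def sandwich_params(d, K=1):
    """K terms, two d x d matrices per term (K=1 is RowThenCol)."""
    return 2 * K * d * d


def sandwich_flops(d, K=1):
    """K terms, each with two d x d by d x d matmuls of d^3 FMAs apiece."""
    return FLOPS_PER_FMA * K * 2 * d ** 3


d = 16
flat_flops = flattened_linear_flops(d)
rtc_flops = sandwich_flops(d, K=1)
rtc_params = sandwich_params(d, K=1)

flops_per_param = rtc_flops // rtc_params   # each parameter is reused d times
flop_gap = flat_flops // rtc_flops          # per-step compute gap
param_gap = d ** 4 // rtc_params            # per-layer parameter gap

print(f"RowThenCol: {rtc_flops:,} FLOPs, {flops_per_param} FLOPs/param")
print(f"parameter gap {param_gap}x, FLOP gap {flop_gap}x, reuse factor {param_gap // flop_gap}x")
```

The last line makes the bookkeeping explicit: the parameter gap divided by the FLOP gap is the reuse factor d.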

03 Results

The parameter gap at d = 16 is 128× between the RowThenCol bilinear projection and the flattened Linear equivalent. A bar chart on log scale makes the four families visible on one axis:

fig 3. Parameters per projection at d = 16, log scale, for four projection families (bar chart). Flattened Linear at 65,536, Kronecker K=8 at 4,096, Kronecker K=4 at 2,048, RowThenCol bilinear at 512. The ratio between the extremes is 128×. Counts from research/matrix-native-projections.md and research/matrix-native-operations-code.md (parameters only; the per-step FLOP gap at d = 16 is 8×, see fig 2).

Parameter allocation at the whole-model level

The per-layer efficiency compounds across the 8 thinking layers in a Matrix Thinker backbone. At d = 32 with a 50K vocab BPE tokenizer, the parameter budget breaks down roughly as:

component	params (mat_dim=32, BPE vocab)	share
Embedding tables (outer-product, 2 · V · d)	~3.2M	~63%
8 thinking layers (attention + multiplicative)	~0.2M	~4%
Output head (MultiProbeHead, K=32)	~1.6M	~31%

fig 4. Parameter allocation in an 8-layer Matrix Thinker at d = 32 with a ~50K BPE vocabulary. Breakdown from matrix-thinking/H100_SETUP.md. The 8 thinking layers are ~4% of the total parameter budget. Embeddings and output head dominate because the per-layer cost is O(d²) while the embedding tables scale with vocab size.

At byte-level d = 16 the embedding table shrinks by a factor of ~400 (from 2 · 50,257 · 32 ≈ 3.2M parameters to 2 · 256 · 16 = 8,192, or about 8K parameters), and the thinking layers claim a much larger share of the total:

run	total params	thinking-layer share	notes
Run 12 (d=32, 50K BPE)	5.16M	~4%	8 layers; 2.19B OpenR1-Math tokens
Run 19 (d=16, byte-level)	218K	~33%	12 layers; 539M bytes

fig 5. Two parameter allocations at different scales. Run 12 is the Round 2 Matrix Thinker reported in finding 02. Run 19 is the byte-level Matrix Thinker described in EXPERIMENT_LOG.md. The thinking-layer share rises from ~4% to ~33% as the vocabulary shrinks from 50K BPE to 256 bytes, because the per-layer cost is independent of vocab and the embedding cost scales with vocab.
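The embedding arithmetic behind the two allocations can be checked directly from the 2 · V · d form. The sketch below uses a hypothetical helper name; the thinking-layer and total counts for Run 12 are the rounded figures quoted in the tables, so the resulting share is approximate.

```python
def embedding_params(vocab_size, d):
    """Outer-product embedding tables: 2 * V * d parameters."""
    return 2 * vocab_size * d


bpe_embed = embedding_params(50_257, 32)   # d = 32, ~50K BPE vocabulary
byte_embed = embedding_params(256, 16)     # d = 16, byte-level vocabulary
shrink_factor = bpe_embed / byte_embed

# Rounded figures quoted for Run 12 in the allocation tables.
RUN12_THINKING_PARAMS = 0.2e6
RUN12_TOTAL_PARAMS = 5.16e6
run12_thinking_share = RUN12_THINKING_PARAMS / RUN12_TOTAL_PARAMS

print(f"BPE embedding tables:  {bpe_embed:,}")
print(f"byte embedding tables: {byte_embed:,}")
print(f"shrink factor:         {shrink_factor:.0f}x")
print(f"Run 12 thinking share: {run12_thinking_share:.1%}")
```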

The allocation argument says thinking layers get cheaper to add; it doesn't say adding them helps. A real training sweep at byte-level d = 16 tested that directly, holding total parameters roughly fixed while varying how many are thinking layers versus thought-interleaving iterations:

fig 6. Eval BPB vs total params (scatter plot) for the 5 real configs (A–E) of the byte-level thought-interleaving sweep (8×H100, 3000 steps each, mat_dim = 16). Vermillion points are eval BPB with thinking enabled; sky points are the matched no-thought baseline BPB for the same checkpoint. Config E (N = 0 thoughts, 48 layers, no thinking) reaches the lowest absolute BPB (3.524) of the five: depth alone beats thought interleaving at this scale, consistent with the "just add layers" result (EXPERIMENT_LOG.md, Run 25). Source: experiment-runs/8xh100-session1/sweep_all_results.json, cross-checked against sweep_all_summaries.txt in the same directory.

key observation: A RowThenCol bilinear projection uses 2 · d² parameters, and the flattened Linear equivalent uses d⁴. At d = 16 the ratio is 128×. The per-step FLOP gap is 8× at the same dimension, not 128×, because each matrix-native parameter is used d times per application. The efficiency is structural at the parameter level and partially realized at the compute level.

04 Discussion

What this unlocks at small scale is that the thinking backbone can be cheap enough to disappear from the parameter budget entirely when the embedding table is large. At d = 32 with a 50K BPE vocab, the 8 thinking layers of a Matrix Thinker cost ~0.2M parameters out of ~5.16M total. Adding a ninth thinking layer costs ~25K parameters: a 0.5% increase in the total model size. The iterative-refinement loop in the Matrix Thinker further reuses the same thinking layers T = 8 times per forward pass, which amortizes the per-layer parameter cost across 8 applications for free — an iterated backbone at d = 32 with 8 shared layers effectively has 64 layer-applications at the cost of 8 layers' worth of parameters. The per-layer parameter gap is what makes that reuse pattern feasible without growing the model into the hundreds of millions of parameters.

The byte-level regime is where the gap starts to bind directly. At a byte vocabulary of 256, the embedding table stops dominating, and the per-layer budget for the backbone is the binding constraint on how deep a model can be at a given total parameter count. Run 19 is a 218K-parameter byte-level model with 12 thinking layers, and the thinking layers are ~33% of the model. That share would be impossible at the flattened-Linear cost of d⁴ = 65,536 parameters per projection. A 12-layer flattened-Linear backbone at d = 16 with four projections per layer would cost 12 · 4 · 65,536 ≈ 3.1M parameters for the backbone alone, about 14 times the total Run 19 model. The efficiency is what makes byte-level matrix thinking feasible at under 1M total parameters.
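The backbone comparison in this paragraph is plain arithmetic and can be reproduced as follows. The four-projections-per-layer figure is the paragraph's own hypothetical, the Run 19 total is the rounded 218K, and the RowThenCol line applies the same hypothetical rather than reporting Run 19's actual layer composition.

```python
D = 16
LAYERS = 12
PROJECTIONS_PER_LAYER = 4        # hypothetical figure used in the paragraph
RUN19_TOTAL_PARAMS = 218_000     # "218K", rounded

# Backbone cost if every projection were a flattened (d^2, d^2) Linear.
flattened_backbone = LAYERS * PROJECTIONS_PER_LAYER * D ** 4

# Same layout with RowThenCol projections (2 * d^2 parameters each).
row_then_col_backbone = LAYERS * PROJECTIONS_PER_LAYER * 2 * D ** 2

multiple_of_run19 = flattened_backbone / RUN19_TOTAL_PARAMS

print(f"flattened backbone:  {flattened_backbone:,} params")
print(f"RowThenCol backbone: {row_then_col_backbone:,} params")
print(f"flattened backbone is {multiple_of_run19:.1f}x the whole Run 19 model")
```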

The counterpoint is that none of this argues matrix operations are better at matched compute. The per-step FLOP gap at d = 16 is 8×, and we expect memory bandwidth and kernel launch overhead to eat much of that on an H100; we have not measured the realized speedup (see Limitations). At matched FLOPs, the matrix approach loses: a later, properly FLOPs-accounted comparison (Stage G, matrix-thinking/STAGE_G_DESIGN.md) found matrix BPB 3.5552 vs vector BPB 3.2511 at genuinely matched compute, with the gap widening rather than closing at extended budget. (Run 14's original "653K TFLOPs" comparison, cited in an earlier version of this note, was retracted 2026-07-02 — that figure came from an idealized throughput-based estimate, not real analytic FLOPs accounting, and the two runs as actually executed were never FLOPs-matched in either direction; see EXPERIMENT_LOG.md's correction on the Run 14 entry. Stage G is the number that supersedes it.) The sibling notes on output head dynamics and the outer-product embedding describe where matrix operations hold up against vector baselines and where they do not. The honest statement is that matrix operations are a parameter-efficient way to express transformations on structured representations. They are useful where parameters are the scarce resource and compute is not, such as small models trained at modest scale, byte-level vocabularies, and deployment environments with a tight memory budget but ample FLOPs per parameter.

A related point about depth. If the parameter gap per layer is 128×, then at a fixed total parameter budget a matrix-thinking backbone can afford to be roughly 128× deeper than a flattened-Linear backbone at the same d. At small scale this is the trade that opens up: more layers or more iterative-refinement steps at matched total parameters. Whether the extra depth translates into downstream quality is a separate question — one the "just add layers" lesson in CLAUDE.md warns us about — and the parameter efficiency result does not settle it.

05 Limitations

One matrix dimension tested. All per-projection parameter and FLOP counts in this note are at d = 16. The parameter ratio scales as d⁴ / (2 · d²) = d² / 2, so at d = 32 the ratio would be 512× and at d = 8 it would be 32×. The 128× headline is specific to d = 16. We report it at that dimension because that is where we have the matching FLOP count from the project's earlier matrix-native-projections research. The general statement is d² / 2.
Per-step FLOP savings may not fully materialize on H100. The theoretical FLOP gap at d = 16 is 8×, but we expect the RowThenCol projection to be memory-bandwidth bound rather than compute-bound in practice, and we expect kernel launch overhead for the small per-layer tensors to eat much of the savings. We have not measured the realized speedup. A torch.compile-based fused kernel for the sandwich operation is a natural follow-up that might close the gap between theoretical and realized FLOP savings.
Capacity ceiling. A RowThenCol projection is a constrained family of maps, not a reparameterization of the flattened Linear. Without the nonlinearity, the sandwich A · M · B is equivalent to the linear map (Bᵀ ⊗ A) · vec(M), with vec taken column-major: a Kronecker-structured operator with 2 · d² parameters inside the full d⁴-parameter Linear space, so not every d² × d² linear map can be written this way. The SiLU nonlinearity between A · M and the right multiplication by B is what breaks the pure-linear structure and lets the composition approximate a larger function class; we have not quantified how close the version with SiLU gets to the full Linear. The parameter efficiency comes at an expressivity cost that the numbers in this note do not quantify. Whether the constrained family is sufficient for language modeling at scale is what the downstream experiments in the sibling notes attempt to measure, and the current evidence is mixed: RowThenCol backbones lose to flat-vector baselines at matched FLOPs but win at the embedding layer under specific output-head conditions.
Parameter counts assume the thinking layers are shared across iterations. The ~4% thinking-layer share at d = 32 is for 8 thinking layers applied T = 8 times with weight sharing, which is how the Matrix Thinker uses them. An unshared backbone with separate weights for each of the 64 layer-applications would carry 8× the thinking-layer parameters (~1.6M instead of ~0.2M), raising the share to roughly a quarter of the model; a shared backbone with T = 16 iterations would keep it at ~4%. The allocation number is specific to the Matrix Thinker architecture and should not be read as a general statement about 8-layer transformers.
Single-seed downstream claims. The allocation numbers cited from Run 12 and Run 19 come from single training runs (n = 1 each). The parameter counts themselves are deterministic functions of the architecture hyperparameters and do not depend on the seed, but the comparative framing in Discussion that appeals to Stage G and the sibling notes rests on single-seed downstream evidence. We inherit the single-seed caveat from those notes.
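The Kronecker identity cited under "Capacity ceiling" can be checked numerically on a small case. The sketch below is pure Python with hypothetical helper names and arbitrary 2 × 2 integer matrices; vec stacks columns (column-major), which is the convention under which vec(A · M · B) = (Bᵀ ⊗ A) · vec(M) holds.

```python
def matmul(X, Y):
    """Nested-list matrix product X @ Y."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def kron(X, Y):
    """Kronecker product: block (i, j) of the result is X[i][j] * Y."""
    return [[X[i][j] * Y[k][l]
             for j in range(len(X[0])) for l in range(len(Y[0]))]
            for i in range(len(X)) for k in range(len(Y))]


def vec(X):
    """Column-major vectorization: stack the columns of X."""
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]


def matvec(K, v):
    return [sum(a * b for a, b in zip(row, v)) for row in K]


# Arbitrary small integer matrices (d = 2) so the comparison is exact.
A = [[1, 2], [3, 4]]
M = [[0, 1], [1, 1]]
B = [[2, 0], [1, 3]]

lhs = vec(matmul(matmul(A, M), B))     # vec(A M B)
operator = kron(transpose(B), A)       # d^2 x d^2 matrix built from 2 * d^2 numbers
rhs = matvec(operator, vec(M))         # (B^T kron A) vec(M)

print(lhs == rhs)
```

The operator has d⁴ entries but is fully determined by the 2 · d² entries of A and B, which is the constraint the limitation describes.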

06 Future work

Three tests that would strengthen or qualify this result:

Matched-capacity comparison against structured-linear baselines. The current comparison is RowThenCol bilinear vs fully flattened Linear. The more honest baseline is a matched-capacity structured linear layer: a block-diagonal or other structured matrix family, a low-rank bottleneck of rank r with 2 · r · d² parameters (two factors of shape d² × r and r × d²), or PHM with n = 4. If RowThenCol beats those at matched parameters on a downstream metric, the structural case is stronger. If it loses, the "128× fewer than flattened Linear" framing overstates the practical efficiency gain.
Compile-time optimization of the matrix sandwich. The per-step FLOP savings at d = 16 are 8× in theory, and we expect the realized savings to be much smaller (unmeasured; see Limitations). A fused kernel for silu(A · M) · B that holds A and B in registers and streams M through might close the gap between the theoretical and realized speedup. torch.compile with fullgraph=True and fixed shapes is the cheapest way to try this, and the H100 setup already supports it.
Scale sweep on d. The parameter ratio is d² / 2, which grows quadratically with the matrix dimension. At d = 64 the ratio is 2048× per layer. At small d the parameter savings might be outweighed by the expressivity cost of the constrained subspace; at large d the savings dominate. Running the same comparison across d ∈ {8, 16, 32, 64} with the downstream task held fixed would tell us where the trade flips.
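For the scale sweep, the per-projection parameter ratio at each candidate dimension follows from d⁴ / (2 · d²) = d² / 2. A small sketch (the function name is hypothetical) tabulates it for the proposed values of d:

```python
def param_ratio(d):
    """Flattened Linear params (d^4) over RowThenCol params (2 * d^2)."""
    return d ** 4 // (2 * d ** 2)


sweep = {d: param_ratio(d) for d in (8, 16, 32, 64)}

for d, ratio in sweep.items():
    print(f"d = {d:2d}: flattened {d ** 4:>10,d}  RowThenCol {2 * d * d:>5,d}  ratio {ratio}x")
```

This only covers the parameter side of the trade; the expressivity cost at each d is what the proposed sweep would have to measure.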

07 Reproducibility

The RowThenCol projection module lives at matrix-thinking/src/matrix_thinker.py, class RowThenColProjection (lines 39–49):

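```
# Imports added so the excerpt runs standalone.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RowThenColProjection(nn.Module):
    """silu(A @ M) @ B: nonlinearity between left and right multiply."""
    def __init__(self, d):
        super().__init__()
        # Near-identity init: identity plus small Gaussian noise.
        self.A = nn.Parameter(torch.eye(d) + 0.02 * torch.randn(d, d))
        self.B = nn.Parameter(torch.eye(d) + 0.02 * torch.randn(d, d))

    def forward(self, M):
        # M has shape (batch, seq, d, d). Inner einsum is A @ M (row mixing);
        # outer einsum right-multiplies by B (column mixing).
        return torch.einsum('bsij,jk->bsik',
                            F.silu(torch.einsum('ij,bsjk->bsik', self.A, M)),
                            self.B)
```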

Two d × d parameter tensors, 2 · d² = 512 trainable scalars at d = 16. A standard nn.Linear(256, 256) for comparison has 256 · 256 = 65,536 weight parameters (ignoring bias), giving the 128× ratio.

Source files for the parameter and FLOP counts used in this note:

research/matrix-native-projections.md: parameter and FLOP table at d = 16 for Flatten→Linear, Bilinear K=1, Kronecker K=4, Kronecker K=8. Internal project research note, not an external peer-reviewed source.
research/matrix-native-operations-code.md: extended parameter/FLOPs table including Householder and multi-head bilinear variants. Internal project research note, not an external peer-reviewed source.
matrix-thinking/H100_SETUP.md: parameter allocation at d = 32 in an 8-layer Matrix Thinker — embeddings ~63%, 8 thinking layers ~4%, output head ~31%.
EXPERIMENT_LOG.md: Run 19 (byte-level d = 16, 218K parameters, 12 thinking layers, ~33% thinking-layer share) and the Run 14 entry's 2026-07-02 correction (retracting the original "653K TFLOPs" comparison as unsupported).
matrix-thinking/STAGE_G_DESIGN.md §14: the properly FLOPs-accounted matrix-vs-vector comparison (matrix BPB 3.5552 vs vector BPB 3.2511) that supersedes Run 14 for the matched-FLOPs claim.
pebble-ai-site/assets/plots/generate_parameter_efficiency.py: the script that produced the figure in this note. No training involved — the script renders fixed constants from the source files listed above.

References

[1] pebble project (2026). Matrix-native projections: research results. Internal project research note at research/matrix-native-projections.md, not an external peer-reviewed source. Parameter and FLOP comparison at d = 16.
[2] pebble project (2026). Matrix-native operations: working code and repo findings. Internal project research note at research/matrix-native-operations-code.md, not an external peer-reviewed source. Extended parameter/FLOPs table.
[3] Zhang, A., Tay, Y., Zhang, S., Chan, A., Luu, A. T., Hui, S. C., & Fu, J. (2021). Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with 1/n Parameters. ICLR 2021. arXiv:2102.08597