research note  ·  finding 04

Per-layer parameter efficiency of matrix projections

structural observation · one matrix dimension tested

Sam Larson

pebble, San Francisco

April 7, 2026  ·  sam@pebbleml.com

abstract

At matrix dimension d = 16, a RowThenCol bilinear projection, silu(A · M) · B with A, B ∈ ℝ^{d×d}, uses 2 · d² = 512 parameters. The flattened equivalent, which reshapes M to a d²-dim vector and applies a d² × d² Linear layer, uses d⁴ = 65,536 parameters. The ratio is 128×. The same structural argument gives 32× at Kronecker K = 4 and 16× at Kronecker K = 8. This is a per-layer claim at one matrix dimension. It is not a claim that matrix operations are better at matched compute: the RowThenCol projection is O(d³) per step while the flattened Linear is O(d⁴) only because it carries d⁴ parameters, so the per-step FLOP gap at d = 16 is 8× rather than 128×, and we expect memory bandwidth and kernel launch overhead to eat much of that on an H100; we have not measured the realized speedup (see Limitations). The allocation consequence at our scale is that in an 8-layer Matrix Thinker at d = 32 the 8 thinking layers occupy ~4% of the parameter budget while embeddings and output head occupy the rest; at byte-level d = 16 (Run 19, 218K total parameters) the thinking layers expand to ~33% of the budget as the embedding table shrinks. The structural efficiency is useful where parameters are scarce and compute is not the binding constraint.

01 Background

Per-layer parameter efficiency has a long history in structured linear algebra. Tensor decompositions cut parameters by factoring a weight tensor into a product of smaller tensors. Low-rank adaptation approaches restrict gradient updates to a rank-r subspace of a full weight matrix. Block-diagonal and other structured matrix families cut parameters by imposing structural sparsity. Parameterized Hypercomplex Multiplication (PHM) from Zhang et al. 2021 [3] expresses a linear layer as a sum of Kronecker products and reduces parameters by a factor of n at cost of an n-way Kronecker sum. Bilinear pooling from the fine-grained visual recognition literature uses outer products to produce second-order features with far fewer parameters than the full quadratic expansion would need.

The common structure across these methods is that a linear map from m to n that would cost m · n parameters is replaced with a factored form whose parameter count scales sub-linearly in m · n. The catch is that the factoring constrains the map: not every m × n matrix can be represented as a low-rank product or a sum of Kronecker terms, so there is a capacity ceiling the full dense layer does not have. The useful regime is the one where the constrained class is expressive enough for the task and the parameter budget is the binding constraint.

Matrix-valued token representations change the unit of computation from a d-dim vector to a d × d matrix. The natural "linear layer" in this setting is a map from one d × d matrix to another. If the matrix is immediately flattened, the layer becomes a standard d² × d² dense Linear with d⁴ parameters. If the matrix is kept as a matrix and the layer is built from left- and right-multiplications, the layer uses two d × d parameter tensors for 2 · d² total parameters per projection. The two are not interchangeable maps — the second is a constrained subspace of the first — but the second is what the rest of the Matrix Thinker stack is already wired to consume through row-wise and column-wise attention, RowThenCol projections, and bilinear output heads. The question this note addresses is how large the per-layer parameter gap is and what the allocation consequence looks like when the gap is applied to an 8-layer stack.
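To make the "constrained subspace" point concrete, here is a minimal pure-Python sketch. It assumes a linear sandwich A · M · B with the silu left out (so it illustrates the Kronecker-style terms rather than RowThenCol itself) and row-major flattening; the helper names are hypothetical. It builds the dense d² × d² weight that reproduces the sandwich, showing that d⁴ dense entries are fully determined by the 2 · d² entries of A and B.

```python
# Sketch: a linear sandwich A @ M @ B is one specific d^2 x d^2 dense map.
# Pure Python, row-major flattening; the silu nonlinearity is left out.

def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def flatten(M):
    """Row-major flatten of a d x d matrix to a d^2 vector."""
    return [x for row in M for x in row]

def dense_equivalent(A, B):
    """Dense weight W with W[i*d + l][j*d + k] = A[i][j] * B[k][l]."""
    d = len(A)
    W = [[0] * (d * d) for _ in range(d * d)]
    for i in range(d):
        for l in range(d):
            for j in range(d):
                for k in range(d):
                    W[i * d + l][j * d + k] = A[i][j] * B[k][l]
    return W

d = 3
A = [[1, 2, 0], [0, 1, 3], [2, 0, 1]]
B = [[1, 0, 2], [3, 1, 0], [0, 2, 1]]
M = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Matrix-native path: two d x d multiplications.
sandwich = flatten(matmul(matmul(A, M), B))

# Flattened path: one d^2 x d^2 matrix-vector product.
W = dense_equivalent(A, B)
m = flatten(M)
dense = [sum(W[r][c] * m[c] for c in range(d * d)) for r in range(d * d)]

sandwich_params = 2 * d * d        # entries of A and B
dense_params = len(W) * len(W[0])  # entries of W
print(sandwich == dense, sandwich_params, dense_params)
```

The two paths agree exactly on integer inputs, but only dense weights of this product form are reachable from A and B, which is the capacity ceiling described above.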

02 Setup

Parameter and FLOP counts in this note come from two internal project research notes (research/matrix-native-projections.md and research/matrix-native-operations-code.md), not from external peer-reviewed sources. FLOP counts use the 2-FLOP-per-fused-multiply-add convention from research/matrix-native-projections.md; the alternative 1-FLOP-per-FMA convention (used in research/matrix-native-operations-code.md) gives half the absolute numbers but preserves the 8× ratio.

The two projections

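We compare two ways to realize a learned map from a d × d input matrix M to a d × d output matrix, at d = 16: a flattened Linear, which reshapes M to a d²-dim vector and applies a dense d² × d² layer, and a RowThenCol bilinear projection, silu(A · M) · B with A, B ∈ ℝ^{d×d}.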

For completeness we also include two Kronecker variants from the project's earlier research on matrix-native projections: the sum Σ_{k=1..K} A_k · M · B_k at K = 4 and at K = 8.

Parameter counts

The four numbers are fixed by the projection form and the matrix dimension. They are not run-dependent:

projection | form | params at d=16 | vs flattened
--- | --- | --- | ---
Flattened Linear | d² × d² dense | 65,536 | 1.0×
Kronecker K=8 | Σ_{k=1..8} A_k · M · B_k | 4,096 | 16× fewer
Kronecker K=4 | Σ_{k=1..4} A_k · M · B_k | 2,048 | 32× fewer
RowThenCol bilinear | silu(A · M) · B | 512 | 128× fewer

fig 1 · Parameter count per projection at d = 16 for four projection families. Counts from research/matrix-native-projections.md and research/matrix-native-operations-code.md. The "vs flattened" column is 65,536 / params. The headline ratio for RowThenCol is 128×.
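The counts in fig 1 can be reproduced from closed forms. The sketch below assumes each Kronecker term carries its own pair of d × d matrices (so K terms cost K · 2 · d², which matches the tabulated 4,096 and 2,048) and ignores biases; the function names are hypothetical.

```python
def flattened_params(d):
    """Dense d^2 x d^2 Linear weight, bias ignored."""
    return (d * d) ** 2

def kronecker_params(d, terms):
    """Sum of `terms` sandwiches A_k @ M @ B_k, two d x d matrices each."""
    return terms * 2 * d * d

def row_then_col_params(d):
    """silu(A @ M) @ B: one A and one B."""
    return 2 * d * d

d = 16
counts = {
    "Flattened Linear": flattened_params(d),
    "Kronecker K=8": kronecker_params(d, 8),
    "Kronecker K=4": kronecker_params(d, 4),
    "RowThenCol bilinear": row_then_col_params(d),
}

# Print each count and how many times smaller it is than the flattened layer.
baseline = counts["Flattened Linear"]
for name, params in counts.items():
    print(f"{name:<20} {params:>7,} {baseline // params:>4}x")
```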

Per-step FLOPs, so nothing is hidden

The parameter efficiency does not come for free. At d = 16 the per-step FLOP counts for the same four projections are:

projection | params | flops (fwd) | flops / param
--- | --- | --- | ---
Flattened Linear | 65,536 | 131,072 | 2
Kronecker K=8 | 4,096 | 131,072 | 32
Kronecker K=4 | 2,048 | 65,536 | 32
RowThenCol bilinear | 512 | 16,384 | 32

fig 2 · FLOPs per projection per token at d = 16. Counts from research/matrix-native-projections.md. A flattened Linear has FLOPs ≈ 2 × params (one multiply-add per parameter per token). Matrix-native projections apply each parameter O(d) times in the inner loop, so FLOPs / param rises from 2 to 32 at d = 16. The RowThenCol bilinear uses 8× fewer FLOPs than the flattened Linear per step, but that is a weaker speedup than the 128× parameter gap would suggest.

The "FLOPs per parameter" column is the important one to read. A flattened Linear spends each parameter once per token per step: one multiply-add (two FLOPs) per scalar weight. A matrix-native projection uses each scalar parameter d times during the sandwich computation, because in A · M each row of A is dotted against every column of M. At d = 16 that is a 16× reuse factor, which cancels a factor of 16 from the 128× parameter gap. The remaining factor of 8 (= d / 2) survives as a genuine per-step FLOP reduction. RowThenCol at d = 16 therefore costs 8× fewer FLOPs per step, not 128× fewer FLOPs per step. The parameter efficiency and the compute efficiency are the same sign but different magnitudes, and the compute efficiency is the smaller of the two.
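The decomposition of the 128× parameter gap into a reuse factor and a FLOP gap can be checked with a few lines of arithmetic. This sketch uses the 2-FLOP-per-multiply-add convention and assumes the elementwise silu cost is negligible and left out; the function names are hypothetical.

```python
# Sketch: FLOPs per projection per token, 2 FLOPs per multiply-add.
# The elementwise silu cost is ignored (an assumption of this sketch).
FLOPS_PER_MAC = 2

def flattened_flops(d):
    """Dense d^2 x d^2 layer: one multiply-add per weight."""
    return FLOPS_PER_MAC * d ** 4

def sandwich_flops(d, terms=1):
    """Each A @ M @ B term is two d x d matmuls of d^3 multiply-adds each."""
    return terms * 2 * FLOPS_PER_MAC * d ** 3

d = 16
flat_params, rtc_params = d ** 4, 2 * d * d
flat_flops, rtc_flops = flattened_flops(d), sandwich_flops(d)

param_ratio = flat_params // rtc_params   # d^2 / 2
flop_ratio = flat_flops // rtc_flops      # d / 2
# How many more FLOPs each matrix-native parameter does than a dense weight.
reuse = (rtc_flops // rtc_params) // (flat_flops // flat_params)  # d

print(param_ratio, flop_ratio, reuse)
```

The parameter ratio equals the reuse factor times the FLOP ratio, which is why the compute gap is the smaller of the two.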

03 Results

The parameter gap at d = 16 is 128× between the RowThenCol bilinear projection and the flattened Linear equivalent. A bar chart on log scale makes the four families visible on one axis:

[figure: bar chart of parameter count per projection at d = 16 for four projection families, log scale]
fig 3 · Parameters per projection at d = 16, log scale, for four projection families. Flattened Linear at 65,536, Kronecker K=8 at 4,096, Kronecker K=4 at 2,048, RowThenCol bilinear at 512. The ratio between the extremes is 128×. Counts from research/matrix-native-projections.md and research/matrix-native-operations-code.md (parameters only; the per-step FLOPs gap at d = 16 is 8×, see fig 2).

Parameter allocation at the whole-model level

The per-layer efficiency compounds across the 8 thinking layers in a Matrix Thinker backbone. At d = 32 with a 50K vocab BPE tokenizer, the parameter budget breaks down roughly as:

component | params (mat_dim=32, BPE vocab) | share
--- | --- | ---
Embedding tables (outer-product, 2 · V · d) | ~3.2M | ~63%
8 thinking layers (attention + multiplicative) | ~0.2M | ~4%
Output head (MultiProbeHead, K=32) | ~1.6M | ~31%

fig 4 · Parameter allocation in an 8-layer Matrix Thinker at d = 32 with a ~50K BPE vocabulary. Breakdown from matrix-thinking/H100_SETUP.md. The 8 thinking layers are ~4% of the total parameter budget. Embeddings and output head dominate because the per-layer cost is O(d²) while the embedding tables scale with vocab size.

At byte-level d = 16 the embedding table shrinks by a factor of ~400 (from 2 · 50,257 · 32 ≈ 3.2M to 2 · 256 · 16, or about 8K parameters), and the thinking layers claim a much larger share of the total:

run | total params | thinking-layer share | notes
--- | --- | --- | ---
Run 12 (d=32, 50K BPE) | 5.16M | ~4% | 8 layers; 2.19B OpenR1-Math tokens
Run 19 (d=16, byte-level) | 218K | ~33% | 12 layers; 539M bytes

fig 5 · Two parameter allocations at different scales. Run 12 is the Round 2 Matrix Thinker reported in finding 02. Run 19 is the byte-level Matrix Thinker described in EXPERIMENT_LOG.md. The thinking-layer share rises from ~4% to ~33% as the vocabulary shrinks from 50K BPE to 256 bytes, because the per-layer cost is independent of vocab and the embedding cost scales with vocab.
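The embedding-table sizes behind these two allocations follow from the 2 · V · d outer-product form given in fig 4. A minimal sketch of that arithmetic, assuming the 50,257-entry BPE vocabulary and the 256-entry byte vocabulary quoted in the text (the function name is hypothetical):

```python
def embedding_params(vocab_size, d):
    """Outer-product embedding: two tables of shape (vocab_size, d)."""
    return 2 * vocab_size * d

bpe_tables = embedding_params(50_257, 32)   # Run 12 style: BPE vocab, d = 32
byte_tables = embedding_params(256, 16)     # Run 19 style: byte vocab, d = 16

# How much smaller the byte-level tables are than the BPE tables.
shrink = bpe_tables / byte_tables
print(f"{bpe_tables:,} -> {byte_tables:,} ({shrink:.0f}x smaller)")
```

The vocabulary accounts for most of the shrink (about 196×) and the halved matrix dimension for the remaining factor of 2.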
key observation

A RowThenCol bilinear projection uses 2 · d² parameters, and the flattened Linear equivalent uses d⁴. At d = 16 the ratio is 128×. The per-step FLOP gap is 8× at the same dimension, not 128×, because each matrix-native parameter is used d times per application. The efficiency is structural at the parameter level and partially realized at the compute level.

04 Discussion

What this unlocks at small scale is that the thinking backbone can be cheap enough to nearly disappear from the parameter budget when the embedding table is large. At d = 32 with a 50K BPE vocab, the 8 thinking layers of a Matrix Thinker cost ~0.2M parameters out of ~5.16M total. Adding a ninth thinking layer costs ~25K parameters: a 0.5% increase in the total model size. The iterative-refinement loop in the Matrix Thinker further reuses the same thinking layers T = 8 times per forward pass, which amortizes the per-layer parameter cost across 8 applications at no extra parameter cost (the compute cost still scales with T): an iterated backbone at d = 32 with 8 shared layers effectively has 64 layer-applications at the cost of 8 layers' worth of parameters. The per-layer parameter gap is what makes that reuse pattern feasible without growing the model into the hundreds of millions of parameters.

The byte-level regime is where the gap starts to bind directly. At a byte vocabulary of 256, the embedding table stops dominating, and the per-layer budget for the backbone is the binding constraint on how deep a model can be at a given total parameter count. Run 19 is a 218K-parameter byte-level model with 12 thinking layers, and the thinking layers are ~33% of the model, a share that would be impossible with a flattened-Linear cost of d⁴ = 65,536 parameters per projection. A 12-layer flattened-Linear backbone at d = 16 with four projections per layer would cost 12 · 4 · 65,536 ≈ 3.1M parameters for the backbone alone, roughly fourteen times the total Run 19 model. The efficiency is what makes byte-level matrix thinking feasible at under 1M total parameters.
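The hypothetical flattened backbone in this paragraph is a short calculation. The sketch below takes the four-projections-per-layer figure and the 218K total from the text as given, and counts projection weights only (attention and other per-layer parameters are left out, which is an assumption of this sketch).

```python
d = 16
layers = 12
projections_per_layer = 4   # figure used in the text for the hypothetical backbone
run19_total = 218_000       # approximate total parameters of Run 19

# Projection weights only; other per-layer parameters are not counted here.
flattened_backbone = layers * projections_per_layer * d ** 4
row_then_col_backbone = layers * projections_per_layer * 2 * d * d

multiple_of_run19 = flattened_backbone / run19_total
print(f"flattened: {flattened_backbone:,}")
print(f"RowThenCol: {row_then_col_backbone:,}")
print(f"flattened backbone is {multiple_of_run19:.1f}x the Run 19 total")
```

Under these assumptions the flattened backbone alone comes to about 3.1M parameters, against roughly 25K for the same number of RowThenCol projections.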

The counterpoint is that none of this argues matrix operations are better at matched compute. The per-step FLOP gap at d = 16 is 8×, and we expect memory bandwidth and kernel launch overhead to eat much of that on an H100; we have not measured the realized speedup (see Limitations). At matched FLOPs, the matrix approach loses by a large margin (EXPERIMENT_LOG.md Run 14: Matrix Thinker BPB 1.67 vs LoopFormer BPB 0.87 at the same 653K TFLOPs budget). The sibling notes on output head dynamics and the outer-product embedding describe where matrix operations hold up against vector baselines and where they do not. The honest statement is that matrix operations are a parameter-efficient way to express transformations on structured representations. They are useful where parameters are the scarce resource and compute is not, such as small models trained at modest scale, byte-level vocabularies, and deployment environments with a tight memory budget but ample FLOPs per parameter.

A related point about depth. If the parameter gap per layer is 128×, then at a fixed total parameter budget a matrix-thinking backbone can afford to be roughly 128× deeper than a flattened-Linear backbone at the same d. At small scale this is the trade that opens up: more layers or more iterative-refinement steps at matched total parameters. Whether the extra depth translates into downstream quality is a separate question — one the "just add layers" lesson in CLAUDE.md warns us about — and the parameter efficiency result does not settle it.
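As a rough illustration of the depth trade, the sketch below asks how many layers fit in a fixed backbone budget. The budget of 2²⁰ parameters and the four projections per layer are hypothetical values chosen so the division is exact; only projection weights are counted.

```python
def affordable_layers(budget, params_per_projection, projections_per_layer=4):
    """Whole layers that fit in `budget` parameters (projection weights only)."""
    return budget // (projections_per_layer * params_per_projection)

d = 16
budget = 2 ** 20  # hypothetical backbone budget of about 1M parameters

flattened_depth = affordable_layers(budget, d ** 4)         # d^4 per projection
row_then_col_depth = affordable_layers(budget, 2 * d * d)   # 2 * d^2 per projection

print(flattened_depth, row_then_col_depth,
      row_then_col_depth // flattened_depth)
```

This only shows what the budget permits; it says nothing about whether the extra layers improve quality.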

05 Limitations

06 Future work

Three tests that would strengthen or qualify this result:

  1. Matched-capacity comparison against structured-linear baselines. The current comparison is RowThenCol bilinear vs fully flattened Linear. The more honest baseline is a matched-capacity structured linear layer: a block-diagonal or other structured matrix family, low-rank bottlenecks of rank r with 2 · r · d² parameters, or PHM with n = 4. If RowThenCol beats those at matched parameters on a downstream metric, the structural case is stronger. If it loses, the "128× fewer than flattened Linear" framing overstates the practical efficiency gain.
  2. Compile-time optimization of the matrix sandwich. The per-step FLOP savings at d = 16 are 8× in theory but are expected to be much smaller in practice. A fused kernel for silu(A · M) · B that holds A and B in registers and streams M through would close the gap between the theoretical and realized speedup. torch.compile with fullgraph=True and fixed shapes is the cheapest way to try this, and the H100 setup already supports it.
  3. Scale sweep on d. The parameter ratio is d² / 2, which grows quadratically with the matrix dimension. At d = 64 the ratio is 2048× per layer. At small d the parameter savings might be outweighed by the expressivity cost of the constrained subspace; at large d the savings dominate. Running the same comparison across d ∈ {8, 16, 32, 64} with the downstream task held fixed would tell us where the trade flips.
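For the scale sweep in item 3, the structural side of the trade can be tabulated before any training run. This sketch assumes the same counting conventions as the rest of the note (2 FLOPs per multiply-add, silu cost ignored) and covers only parameter and FLOP ratios, not expressivity or downstream quality; the function name is hypothetical.

```python
def ratios(d):
    """(parameter ratio, FLOP ratio) of flattened Linear over RowThenCol."""
    param_ratio = d ** 4 // (2 * d * d)        # d^2 / 2
    flop_ratio = (2 * d ** 4) // (4 * d ** 3)  # d / 2
    return param_ratio, flop_ratio

sweep = {d: ratios(d) for d in (8, 16, 32, 64)}
for d, (param_ratio, flop_ratio) in sweep.items():
    print(f"d={d:<3} params {param_ratio:>5}x  flops {flop_ratio:>3}x")
```

The parameter ratio grows quadratically in d while the FLOP ratio grows only linearly, so the two diverge further as d increases.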

07 Reproducibility

The RowThenCol projection module lives at matrix-thinking/src/matrix_thinker.py, class RowThenColProjection (lines 39–49):

class RowThenColProjection(nn.Module):
    """silu(A @ M) @ B — nonlinearity between left and right multiply."""
    def __init__(self, d):
        super().__init__()
        self.A = nn.Parameter(torch.eye(d) + 0.02 * torch.randn(d, d))
        self.B = nn.Parameter(torch.eye(d) + 0.02 * torch.randn(d, d))

    def forward(self, M):
        return torch.einsum('bsij,jk->bsik',
                            F.silu(torch.einsum('ij,bsjk->bsik', self.A, M)),
                            self.B)

Two d × d parameter tensors, 2 · d² = 512 trainable scalars at d = 16. A standard nn.Linear(256, 256) for comparison has 256 · 256 = 65,536 weight parameters (ignoring bias), giving the 128× ratio.

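Source files for the parameter and FLOP counts used in this note: research/matrix-native-projections.md and research/matrix-native-operations-code.md (references 1 and 2 below).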

References

  1. pebble project (2026). Matrix-native projections — research results. Internal project research note at research/matrix-native-projections.md, not an external peer-reviewed source. Parameter and FLOP comparison at d = 16.
  2. pebble project (2026). Matrix-native operations — working code and repo findings. Internal project research note at research/matrix-native-operations-code.md, not an external peer-reviewed source. Extended parameter/FLOPs table.
  3. Zhang, A., Tay, Y., Zhang, S., Chan, A., Luu, A. T., Hui, S. C., & Fu, J. (2021). Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with 1/n Parameters. ICLR 2021. arXiv:2102.08597