A real engineering day: 2.69× VRAM reduction, why MoE compresses tighter, and what a refuted cure looks like
Three results from one Tuesday at Sipsa Labs. Two are real wins. One is the 30th honest negative we have published. All three are visible right now in github.com/sipsalabs/ultracompress. This is the kind of day-in-the-life we wish more deep-tech teams were honest about.
Win 1: 2.69× VRAM reduction at zero compute overhead
A new streamed weight-loading path shipped today.
The headline number: Qwen3-8B inference now needs 6.67 GB VRAM instead of 17.95 GB. Same compute speed. PPL match to six decimal places against the in-VRAM baseline (11.033340 either way, on n=200 FineWeb-edu held-out tail, seq_len=1024, seed=42).
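The headline ratio follows directly from the two footprints quoted above. A quick arithmetic check, with a helper name that is ours and purely illustrative:

```python
def reduction_factor(baseline_gb: float, streamed_gb: float) -> float:
    """Return how many times smaller the streamed footprint is."""
    if streamed_gb <= 0:
        raise ValueError("streamed_gb must be positive")
    return baseline_gb / streamed_gb


# VRAM figures quoted above for Qwen3-8B.
factor = reduction_factor(17.95, 6.67)
saved_gb = 17.95 - 6.67
print(f"{factor:.2f}x smaller, {saved_gb:.2f} GB freed")  # 2.69x smaller, 11.28 GB freed
```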
How: we switched the layer load path from torch.load to memory-mapped safetensors with pinned host buffers and non-blocking H2D transfers. The streaming loader evicts each layer after use; the next layer’s host→device copy is issued asynchronously on a side stream while the current layer runs its forward pass on the main stream.
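The scheduling pattern described above (prefetch the next layer while the current one computes, then evict) can be sketched without any GPU dependency. This is a minimal stand-in, not the shipped loader: a single worker thread plays the role of the side stream, plain Python lists stand in for weight tensors, and run_streamed, load_layer and forward are hypothetical names. In a PyTorch version the background step would be a copy from a pinned host buffer with non_blocking=True issued under a separate torch.cuda.Stream.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Layer = List[float]  # stand-in for one layer's weight tensors


def run_streamed(
    num_layers: int,
    load_layer: Callable[[int], Layer],
    forward: Callable[[Layer, float], float],
    x: float,
) -> Tuple[float, int]:
    """Run layers in order, keeping at most two resident (current + prefetched)."""
    if num_layers <= 0:
        return x, 0
    peak_resident = 0
    # A single worker thread plays the role of the side (copy) stream.
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(load_layer, 0)
        for i in range(num_layers):
            layer = pending.result()  # block until this layer's copy is done
            resident = 1
            if i + 1 < num_layers:
                # Start copying the next layer while this one computes.
                pending = copier.submit(load_layer, i + 1)
                resident = 2
            peak_resident = max(peak_resident, resident)
            x = forward(layer, x)
            del layer  # evict: drop the reference so the buffer can be reused
    return x, peak_resident


# Toy "model": each layer multiplies the activation by one weight.
weights = [[2.0], [3.0], [0.5]]
out, peak = run_streamed(
    3, lambda i: list(weights[i]), lambda layer, v: layer[0] * v, 1.0
)
print(out, peak)  # 3.0 2
```

The point of the sketch is the residency bound: no matter how many layers the model has, only the running layer and the one in flight are held at once.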
The interesting measurement is not the wall time. It is that the layer-streaming compute ratio is 1.00× baseline — once you peel off the fixed startup cost (embedding + first-layer init), the per-layer compute time is identical to having the layer pre-resident in VRAM. The streaming loader adds zero compute overhead.
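One way to compute that ratio is to drop the startup entries from both timing traces and compare mean per-layer times. The sketch below assumes the fixed startup cost is confined to the first timing entry; the timings are placeholder values for illustration, not our measurements, and per_layer_compute_ratio is a hypothetical helper.

```python
from statistics import mean
from typing import Sequence


def per_layer_compute_ratio(
    streamed_s: Sequence[float],
    resident_s: Sequence[float],
    startup_steps: int = 1,
) -> float:
    """Mean per-layer time, streamed over resident, ignoring startup entries."""
    s = list(streamed_s[startup_steps:])
    r = list(resident_s[startup_steps:])
    if not s or len(s) != len(r):
        raise ValueError("need equal-length, non-empty traces after startup")
    return mean(s) / mean(r)


# Placeholder timings in seconds: entry 0 carries the fixed startup cost.
streamed = [0.90, 0.010, 0.010, 0.010]
resident = [0.05, 0.010, 0.010, 0.010]
ratio = per_layer_compute_ratio(streamed, resident)
print(f"{ratio:.2f}x")  # 1.00x
```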
For 8B that is a 2.69× headroom unlock. For 405B (which is what the same code path enables on a single 32 GB consumer GPU at 1.0066× PPL ratio against bf16), it is the difference between “this is impossible” and “this is what we publish.”
The implementation lives in our internal evaluation path; additional provisional filings are in preparation for deployment-architecture extensions.
Win 2: why MoE compresses tighter than dense (the mechanism)
The observation has been visible in our public registry for weeks: MoE architectures consistently land in the 1.001–1.006× PPL ratio band, dense in 1.004–1.012×. Phi-3.5-MoE at 1.00129× is the tightest single result we have across 22 verified architectures. Mistral-7B (dense) at 1.00548× and Mixtral-8x7B (MoE, same hidden dim and intermediate dim) at 1.00368× is the cleanest controlled pair.
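For readers new to the metric, here is a minimal sketch of how these numbers are read. It assumes the PPL ratio is compressed-model perplexity divided by bf16 perplexity, and that perplexity is the exponential of the mean per-token negative log-likelihood; the function names are illustrative.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def ppl_ratio(ppl_compressed: float, ppl_reference: float) -> float:
    """Assumed definition: compressed PPL over bf16 reference PPL."""
    return ppl_compressed / ppl_reference


def drift_percent(ratio: float) -> float:
    """Express a PPL ratio as percent degradation over the reference."""
    return (ratio - 1.0) * 100.0


# Ratios quoted above from the registry.
for name, ratio in [
    ("Mistral-7B (dense)", 1.00548),
    ("Mixtral-8x7B (MoE)", 1.00368),
    ("Phi-3.5-MoE", 1.00129),
]:
    print(f"{name}: {drift_percent(ratio):.3f}% PPL drift")
```

Read this way, the controlled pair differs by 0.548% versus 0.368% drift.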
The question was: is this real, or is it an accounting artifact? The MoE has 47B total parameters but only 13B “active” per token, so perhaps the rarely routed experts are barely exercised and their compression error does not show up in PPL.
Today’s mechanism diagnosis ruled out the artifact and identified the actual driver.
The mechanism is structural: MoE expert FFN weights have measurably different statistical properties than dimension-matched dense FFN weights. The specifics of the property and our analysis stay private as part of the unfiled IP class.
We ruled out two common alternative explanations during the diagnosis (lower effective rank, active-parameter dilution); the actual driver is a different structural feature of the routed-token training regime.
The downstream engineering implication — and this is what makes the finding immediately useful rather than just academically interesting — is that an internal structural signal we can compute from a pretrained checkpoint is informative for ordering compression difficulty within an architecture family. The cross-family generalization is weaker than we initially expected (the across-family correlation is non-significant); the within-family signal still pays for itself in calibration. The specifics of the signal stay private; the prediction it generates is published in the registry every time we ship a new pack.
Loss 1: Llama-3.1-8B 1.0125× refuses to bend
For context: across our 22 verified architectures, the typical PPL ratio at 5 bits per weight is between 1.001 and 1.008×. Llama-3.1-8B sits at 1.0125× — the worst drift in the published set. We have been trying for weeks to understand why and to find a cure.
Today’s attempt was the 6th. It was refuted.
The diagnostic: Llama-3.1-8B’s residual-stream weight distribution is dramatically more heterogeneous than that of the architectures that compress tightly. That much heterogeneity is hard for the downstream compensation step to cover under a uniform-capacity allocation. We identified a per-architecture structural property that explains the within-family ordering.
We tried six variants of correction strategies. Today’s attempt was killed at layer 4 of 32 by an early-stopping criterion: the reconstruction error against the bf16 reference was compounding rapidly past layer 4 and accelerating.
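An early-stopping rule of this kind can be sketched as follows. The exact criterion we use is not spelled out here, so treat this as one plausible form under stated assumptions: abort once the per-layer reconstruction error has grown by at least min_growth for patience consecutive layers, with each growth factor no smaller than the last. The thresholds, the function name and the error values are all hypothetical.

```python
from typing import Optional, Sequence


def first_abort_layer(
    errors: Sequence[float],
    min_growth: float = 1.5,
    patience: int = 2,
) -> Optional[int]:
    """Return the 1-based layer at which to abort, or None if the run survives.

    errors[i] is the reconstruction error against the reference after
    layer i + 1. A step counts toward the streak when the error grew by at
    least `min_growth` and no more slowly than on the previous step.
    """
    streak = 0
    prev_growth = 0.0
    for i in range(1, len(errors)):
        if errors[i - 1] <= 0:
            streak, prev_growth = 0, 0.0
            continue
        growth = errors[i] / errors[i - 1]
        if growth >= min_growth and growth >= prev_growth:
            streak += 1  # compounding and accelerating
        else:
            streak = 0
        prev_growth = growth
        if streak >= patience:
            return i + 1
    return None


# Hypothetical error traces, not measured values.
runaway = [0.010, 0.012, 0.024, 0.072]  # growth factors: 1.2, 2.0, 3.0
stable = [0.010, 0.011, 0.012, 0.012]
print(first_abort_layer(runaway), first_abort_layer(stable))  # 4 None
```

Stopping on acceleration rather than on an absolute error threshold lets a doomed run be killed within a few layers instead of after all 32.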
Mechanism: the structural heterogeneity in this architecture interacts unfavorably with the cure family we tried. The cure family we explored is structurally limited; it is plausible that a fundamentally different calibration pipeline does break this floor (an earlier calibration approach achieved a tighter ratio on the same architecture and we are still characterizing what generalizes).
The 1.0125× floor on Llama-3.1-8B is now empirically robust across six perturbations spanning three cure families. We are publishing it as a likely architectural limit, not as “more work needed.”
Why surface this? Because the published honest-negatives-to-wins ratio is the moat. Any team that claims their 5-bit compression beats AWQ on every architecture is either lying or has not tested enough architectures. We tested 22 and shipped them; one of them refuses to compress below 1.0125×. We say so out loud.
What ties the three results together
The MoE compression-tightness mechanism and the Llama-3.1-8B refutation point at the same underlying observation: weight-distribution structure drives compressibility. Today’s cross-track synthesis proposed two convergent cures we have not tried yet.
The two cures start from different priors and are queued for tonight’s run. Both are informed by per-layer signals we will describe once they ship, and both differ from the six already-refuted attempts. If either works on Llama-3.1-8B, the same approach should tighten Phi-4 (currently 1.005×) and possibly Mistral-7B (currently 1.00548×).
The streamed-loading unlock interacts with the mechanism finding in a different way: with the lower memory footprint, n=200 calibration on 8B now runs in ~3–4 minutes rather than being limited by the memory budget. Bigger calibration sets change the optimization landscape on the hardest architectures. The six-cure refutation on Llama-3.1-8B was done at n=30. We have not yet tested whether n=200 changes the answer.
Both of those experiments fire on GPU1 tonight or tomorrow. Results will be in the public repo when they land. Negatives included.
What this means for buyers reading this
Run uc verify, then run your own eval harness: the numbers should reproduce the public registry exactly. If they do not, that is a bug report we want, because every number in docs/benchmarks.json has reproducible provenance.