Architecture-specific floors: why we publish negative results

2026-05-12 · Sipsa Labs · ~6 min read

Sipsa Labs has now verified our lossless 5-bit transformer compression substrate across 22 architectures spanning dense (0.6B–405B), Mixture-of-Experts (47B–235B active), and state-space (Mamba-2.8B). Twenty-one of those land at sub-1.013× perplexity ratio against the bf16 reference. The remaining one — Llama-3.1-8B — plateaus at 1.0125×.

That single non-record number is more useful to our customers than any of the records.

The result: 22 architectures verified, one architecture-specific floor

Here is the matrix that ships in our public benchmark page (full machine-readable JSON at docs/BENCHMARKS_2026_05_10.json):

Architecture               Params   Class   PPL ratio (vs bf16)   Status
Hermes-3-Llama-3.1-405B    405B     dense   1.0066×               verified
Mixtral-8x7B               47B      MoE     1.00368×              verified
Mixtral-8x22B              141B     MoE     ~1.005×               verified
Qwen3-14B                  14B      dense   1.00403×              verified
Qwen3-8B                   8B       dense   1.00440×              verified
Mistral-7B-v0.3            7B       dense   1.00548×              verified
Phi-3-mini-4k-instruct     3.8B     dense   1.0062×               verified
Qwen3-1.7B-Base            1.7B     dense   1.0040×               verified
Qwen3-0.6B                 0.6B     dense   1.0076×               verified
Mamba-2.8B                 2.8B     SSM     sub-1.013×            verified
...11 more architectures                    all sub-1.013×        verified
Llama-3.1-8B               8B       dense   1.0125×               floor (publishable)

Why one number gets a different label

Most quantization libraries publish only the best per-architecture result. We do too — but for Llama-3.1-8B we ran three independent perturbation experiments from the production substrate baseline, in three completely different directions, to test whether the 1.0125× was a real architecture-specific floor or just a local minimum we hadn't escaped:

  1. Capacity perturbation: increase the per-Linear correction overlay rank by ~33% (48 → 64). Result: 1.0137×. Worse.
  2. Schedule perturbation: increase per-layer training steps by ~67% (300 → 500). Result: 1.0135×. Worse.
  3. Objective perturbation: add a held-out output-distribution agreement regularizer at lambda=0.1. Result: 1.0698×. Catastrophically worse, despite the regularizer's training-time signal converging cleanly.

Three independent perturbations, three independent failures, three different mechanisms. That converges on a real claim: 1.0125× is the substrate floor for Llama-3.1-8B at 5 bits per weight, not a tunable parameter. Further reduction requires a substrate-level change (a different bit-budget, a different codec family) — not a knob inside this substrate.
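
To make the shape of that sweep concrete, here is a minimal sketch of the three perturbation directions expressed as configuration deltas against the baseline, with the measured ratios from above. The key names (overlay_rank, steps_per_layer, agreement_lambda) and the dict structure are illustrative assumptions, not the production substrate's schema.

# Hedged sketch: the three perturbation directions as config deltas.
# Measured ratios are the ones reported above; the config key names are
# illustrative assumptions, not the production schema.
BASELINE = {
    "overlay_rank": 48,       # per-Linear correction overlay rank
    "steps_per_layer": 300,   # per-layer training steps
    "agreement_lambda": 0.0,  # held-out output-distribution regularizer weight
}

EXPERIMENTS = {
    "baseline":  ({}, 1.0125),
    "capacity":  ({"overlay_rank": 64}, 1.0137),
    "schedule":  ({"steps_per_layer": 500}, 1.0135),
    "objective": ({"agreement_lambda": 0.1}, 1.0698),
}

for name, (delta, measured_ratio) in EXPERIMENTS.items():
    cfg = {**BASELINE, **delta}
    verdict = "floor" if measured_ratio <= 1.0125 else "worse"
    print(f"{name:<9} rank={cfg['overlay_rank']:<2} steps={cfg['steps_per_layer']:<3} "
          f"lambda={cfg['agreement_lambda']:<4} -> {measured_ratio:.4f}x ({verdict})")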

Why this matters more than another <1.005× record

Three reasons.

1. Customer trust comes from honest negative results.

Every quantization paper publishes the architecture where their method shines. The customer who tries to apply that method to their architecture finds out the hard way which architectures don't work. We publish all 22 architectures with their measured ratios, and we publish the one where the substrate plateaus, because that is what enterprise customers under SOC 2 / SR-11-7 / FDA review actually need to know.

2. Empirically-bounded floors are stronger than asymptotic claims.

"Our method is asymptotically optimal" is unfalsifiable. "We perturbed the substrate three independent ways and the PPL ratio got worse every time, so 1.0125× is the empirical floor" is falsifiable in a single run by anyone with a 32 GB GPU. The reproducibility floor is the trust floor.

3. Architecture-specific floors are a real engineering signal.

Llama-3.1-8B is in the same bf16-loss regime as our other dense architectures, but its 5-bit substrate floor is well over twice as high as the 7B-class Mistral floor when measured as excess degradation over 1.0 (1.0125× vs 1.00548×). That is a real architecture-specific property, likely tied to GQA head density, MLP intermediate width, or pretraining-data distribution. Knowing the floor lets us choose substrate parameters per architecture instead of globally.
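
A quick sanity check of that comparison against the table values, reading "higher" as excess degradation over 1.0:

# Excess perplexity degradation (ratio minus 1.0), values from the table above.
llama_floor = 1.0125     # Llama-3.1-8B
mistral_floor = 1.00548  # Mistral-7B-v0.3
print((llama_floor - 1.0) / (mistral_floor - 1.0))  # ~2.3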

What's verifiable, today, in 5 minutes

Everything in the table above is reproducible from the customer side. Pick any verified row:

pip install ultracompress
uc pull SipsaLabs/mistral-7b-v0.3-uc-v3-bpw5
uc verify _packed_mistral-7b-v0.3-uc-v3-bpw5
uc bench _packed_mistral-7b-v0.3-uc-v3-bpw5

The uc verify command pins SHA-256 over the reconstructed model state and confirms bit-identical reconstruction against the published manifest. This differs from AWQ / GPTQ in that the codec state and the per-Linear correction overlay together pin the output distribution to a verifiable upper bound.
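
For readers who want to see the shape of that check outside the CLI, here is a hedged sketch of hashing a reconstructed state dict against a published digest. The manifest layout, hashing order, and field names below are assumptions for illustration; the authoritative check is uc verify itself.

import hashlib
import torch

def state_dict_digest(state_dict: dict) -> str:
    # SHA-256 over raw tensor bytes in sorted parameter-name order.
    # Illustrative only: the real manifest layout and hashing order are
    # defined by uc verify, not by this snippet.
    h = hashlib.sha256()
    for name in sorted(state_dict):
        t = state_dict[name].detach().cpu().contiguous().reshape(-1)
        h.update(name.encode("utf-8"))
        h.update(t.view(torch.uint8).numpy().tobytes())
    return h.hexdigest()

# Hypothetical usage against a published manifest digest:
#   expected = json.load(open("manifest.json"))["sha256"]
#   assert state_dict_digest(model.state_dict()) == expected, "reconstruction mismatch"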

What we do next

The 21 architectures already in production stay there. The Llama-3.1-8B floor at 1.0125× is what ships, marked as the verified architecture-specific minimum. Future research targets sit at the substrate level (mixed-precision allocation, trellis codec experimentation), not at parameter-tuning the existing substrate against Llama.

If you are evaluating compressed-model substrates against your own architecture stack, our benchmark page shows the verified PPL ratios per architecture, and Phase 0 POCs are 1 week, $5K–$25K, on the architecture of your choice.
