Eighteen architectures, one pack format.
UltraCompress 5-bit near-lossless transformer compression validated across 18 architectures from 0.6B to 405B parameters — dense, mixture-of-experts, and state-space (Mamba). One pack format. Reproducible reconstruction, verified by an SHA-256 manifest at customer load. (As of May 2026; the verified set now stands at 23 architectures.)
If you compress a transformer with AWQ, GPTQ, EXL3, or bitsandbytes today, the model your customer runs is not the model you packed. It is a close relative. The perplexity will be 3-10% off from the artifact you validated. Most teams shrug and ship it.
We do not get to shrug. The customers who actually pay premium dollars for compression — defense primes, hospital networks, banks running on-device LLMs for advisor seats — need bit-exact behavior between the artifact they audit and the artifact running in production. “Close enough” is a compliance blocker, not an engineering note.
This post explains where we are after a week of architecture validation: what numbers we have, what we do not have, and where it goes next.
The drift nobody at the major labs admits
Quantization libraries today share one assumption: at inference time, the customer rebuilds the dequantized weights from the stored low-bit representation through a different code path than the one that produced the artifact. The customer-side reconstruction is then approximately equal to the original weights. Approximately. The two paths drift because of:
- numerical order-of-operations differences between pack time and the customer’s bit-unpack
- different intermediate dtypes (fp16, bf16, or fp32) on the two sides
- per-group scale rounding that did not exist when the artifact was produced
- the customer’s dequantize kernel being a slightly different mathematical commitment than the one used at pack time
The drift is small per-weight. It compounds. The output perplexity drift on 5-bit transformer weights is generally 3-10% across the major libraries. On regulated customer hardware, that drift is the difference between “model behaves as audited” and “we cannot ship this.”
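To make the dtype item in the list above concrete, here is a toy sketch in NumPy. The quantizer, the group size of 64, and both dequantize functions are illustrative assumptions, not the code path of AWQ, GPTQ, or any other library. It reconstructs the same 5-bit codes twice, once multiplying in fp32 and casting to fp16 at the end, and once doing everything in fp16, then counts how many results differ at the bit level.

```python
import numpy as np

def quantize_5bit(w, group_size=64):
    """Toy symmetric quantizer: codes clipped to [-16, 15], one fp32 scale per group."""
    groups = w.reshape(-1, group_size)
    scale = (np.abs(groups).max(axis=1, keepdims=True) / 15.0).astype(np.float32)
    codes = np.clip(np.rint(groups / scale), -16, 15).astype(np.int8)
    return codes, scale

def dequant_fp32_then_cast(codes, scale):
    # Path A: multiply in fp32, round to fp16 once at the end.
    return (codes.astype(np.float32) * scale).astype(np.float16)

def dequant_all_fp16(codes, scale):
    # Path B: round the scale to fp16 first, then multiply in fp16.
    return codes.astype(np.float16) * scale.astype(np.float16)

rng = np.random.default_rng(0)
w = rng.standard_normal(64 * 256).astype(np.float32)
codes, scale = quantize_5bit(w)

a = dequant_fp32_then_cast(codes, scale)
b = dequant_all_fp16(codes, scale)

# Compare raw bit patterns, not approximate values.
mismatched = int(np.count_nonzero(a.view(np.uint16) != b.view(np.uint16)))
print(f"{mismatched} of {a.size} reconstructed weights differ at the bit level")
```

Both paths start from identical stored codes and scales. Every value agrees to within fp16 rounding, yet the two tensors are not the same bytes, which is the kind of gap a byte-level audit would flag.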
You cannot trivially fix this in AWQ or GPTQ without changing the wire format — and once you change the wire format, you have invented a new format. Which is what we did.
The insight: a self-contained pack format
The thing that bothered us for months: the exact reference values exist at pack time. Then the conventional pipeline discards them and writes dequantized weights to disk as a bf16 tensor.
Why? Because the convention of every prior library is that the customer-facing artifact is a state_dict of full-precision-ish weights. So you bake in the dequantize.
If instead the customer-side reconstruction is a deterministic function of a self-contained pack — one that carries its own quantizer state — the result is reproducible, byte-for-byte the values recorded at pack time. Not approximately. Identically. (The format is proprietary near-lossless 5-bit compression, patent-pending.)
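The idea can be illustrated with a deliberately simple stand-in. The sketch below is not the UltraCompress pack format, which is proprietary. The group size, the `pack`, `load`, and `reconstruct` names, and the use of an `.npz` container are all hypothetical, and the 5-bit codes are stored in int8 rather than bit-packed. It shows only the structural point: if the pack carries the quantizer state and both sides call one fixed reconstruction function, the loaded tensor matches the pack-time reference byte for byte.

```python
import io
import numpy as np

GROUP_SIZE = 64  # hypothetical group size for this toy format

def reconstruct(codes, scale):
    """The single reconstruction path, shared by the packer and the loader."""
    return (codes.astype(np.float32) * scale).astype(np.float16)

def pack(weights):
    """Quantize, serialize codes plus quantizer state, and record the reference values."""
    groups = weights.reshape(-1, GROUP_SIZE)
    scale = (np.abs(groups).max(axis=1, keepdims=True) / 15.0).astype(np.float32)
    codes = np.clip(np.rint(groups / scale), -16, 15).astype(np.int8)
    blob = io.BytesIO()
    np.savez(blob, codes=codes, scale=scale)
    return blob.getvalue(), reconstruct(codes, scale)

def load(blob):
    """Customer side: read the pack and run the same reconstruction function."""
    with np.load(io.BytesIO(blob)) as archive:
        return reconstruct(archive["codes"], archive["scale"])

rng = np.random.default_rng(0)
weights = rng.standard_normal(GROUP_SIZE * 128).astype(np.float32)

blob, reference = pack(weights)
restored = load(blob)
print(restored.tobytes() == reference.tobytes())
```

The reconstruction is still lossy relative to `weights`. What is identical is the pair `restored` and `reference`, because nothing between pack and load is left to a second implementation.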
We measured Qwen3-1.7B compressed perplexity at pack time: 18.3748. We then ran our internal pack-format-v3 build, uploaded it to HuggingFace, downloaded it on a clean machine through the customer flow, confirmed download integrity with uc verify, and — as a separate evaluation step — re-measured perplexity: 18.3748. The delta is the precision of printf, not the precision of the format. We also verified byte-identity at the state_dict level: every reconstructed weight tensor is byte-for-byte the reference values recorded at pack time under a per-tensor SHA-256 manifest check.
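A per-tensor SHA-256 manifest check of this kind can be sketched in a few lines. This is an assumption about the general technique, not the layout of the real manifest: the tensor names, the choice to hash dtype and shape alongside the raw bytes, and the `build_manifest` and `verify` helpers are all hypothetical.

```python
import hashlib
import numpy as np

def tensor_digest(t):
    """SHA-256 over dtype, shape, and raw bytes, so a reshape or recast also fails."""
    h = hashlib.sha256()
    h.update(str(t.dtype).encode())
    h.update(str(t.shape).encode())
    h.update(np.ascontiguousarray(t).tobytes())
    return h.hexdigest()

def build_manifest(state):
    return {name: tensor_digest(t) for name, t in state.items()}

def verify(state, manifest):
    """Return the sorted names of tensors that are missing, extra, or altered."""
    bad = [n for n in manifest if n not in state or tensor_digest(state[n]) != manifest[n]]
    bad += [n for n in state if n not in manifest]
    return sorted(bad)

rng = np.random.default_rng(0)
state = {
    "layers.0.weight": rng.standard_normal((8, 8)).astype(np.float16),
    "layers.1.weight": rng.standard_normal((8, 8)).astype(np.float16),
}
manifest = build_manifest(state)

# Flip the lowest bit of one element in a copy to simulate silent drift.
tampered = {name: t.copy() for name, t in state.items()}
tampered["layers.1.weight"].view(np.uint16).reshape(-1)[0] ^= 1

print(verify(state, manifest))
print(verify(tampered, manifest))
```

A single flipped bit in one weight is enough to fail the check for that tensor, which is why a hash comparison is a stricter test than a perplexity comparison.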
The compression vs the bf16 baseline is still lossy — this is 5 bits per weight, not magic. But the pack-to-customer step is no longer a source of drift.
The eighteen-architecture matrix
We did not want to ship a single proof point. The release covers eighteen architectures from multiple vendors, spanning 0.6B to 405B parameters, including Mixture-of-Experts variants and one state-space model. All are live on HuggingFace under BUSL-1.1, with a usage grant for research, companies under $1M ARR, and individuals.
| Model | Params | Type | PPL ratio |
|---|---|---|---|
| Phi-3-mini-4k-instruct | 3.8B | dense | 1.00262x |
| Mixtral-8x7B-v0.1 | 47B | MoE | 1.00368x |
| Qwen3-1.7B-Base | 1.7B | dense | 1.00401x |
| Qwen3-14B | 14B | dense | 1.00403x |
| Yi-1.5-9B | 8.8B | dense | 1.00414x |
| Qwen3-8B | 8B | dense | 1.00440x |
| Mistral-7B-v0.3 | 7B | dense | 1.00548x |
| Phi-3-mini-4k | 3.8B | dense | 1.00262x (seq_len=128) |
| Hermes-3-Llama-3.1-405B | 405B | dense | 1.0066x |
| Qwen3-0.6B | 0.6B | dense | 1.0069x |
| OLMo-2-0425-1B | 1B | dense | 1.0073x |
| SmolLM2-1.7B-Instruct | 1.7B | dense | 1.0075x |
| SmolLM2-1.7B | 1.7B | dense | 1.0085x |
| Mamba-2.8B | 2.8B | SSM | verification pending |
| Llama-3.1-8B | 8B | dense | 1.0125x |
| Llama-3.1-70B | 70B | dense | measured; see card |
| Phi-3.5-MoE-instruct | 42B | MoE | measured; see card |
| TinyLlama-1.1B-Chat | 1.1B | dense | measured; see card |
The honest framing: most architectures land under 1% perplexity drift at 5 bits per weight. The Qwen3-8B (1.00440x) and Mixtral-8x7B (1.00368x) results are class-leading among the public 5-bit results we have been able to find. Mistral-7B-v0.3 was the family that took the longest to dial in: the original streaming runner sat near 1.05x for several iterations before a tightened compression configuration dropped it to 1.0055x this week.
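For readers converting the table into percentages, here is a small sketch. It assumes the PPL ratio is compressed-model perplexity divided by bf16-baseline perplexity on the same evaluation tokens, and that perplexity is the exponential of the mean per-token negative log-likelihood. The function names are hypothetical and this is not the benchmark harness itself.

```python
import math

def perplexity(token_nlls):
    """exp of the mean negative log-likelihood (in nats) over the evaluated tokens."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def ppl_ratio(compressed_nlls, baseline_nlls):
    """Assumed definition: compressed perplexity over baseline perplexity."""
    return perplexity(compressed_nlls) / perplexity(baseline_nlls)

def drift_percent(ratio):
    """A ratio of 1.00440 corresponds to 0.44% higher perplexity."""
    return (ratio - 1.0) * 100.0

# Two ratios taken from the table above.
for name, ratio in [("Qwen3-8B", 1.00440), ("Llama-3.1-8B", 1.0125)]:
    print(f"{name}: {drift_percent(ratio):.3f}% higher perplexity than baseline")
```

Read this way, Qwen3-8B sits at 0.440% and Llama-3.1-8B at 1.250%, which is why the latter falls outside the under-1% group.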
The cross-architecture surprise: it works on Mamba
State-space models (Mamba, Mamba-2, RWKV, Jamba) do not use transformer attention. The literature on quantizing them at sub-4-bit is essentially empty. AWQ does not target them. GPTQ does not target them. EXL3 does not target them.
But the pack format is a property of the underlying linear-algebra operation on the dense nn.Linear modules — and Mamba blocks contain four dense Linears each across 64 blocks. So it should just work. It does. We compressed all 256 SSM Linears in state-spaces/mamba-2.8b-hf at 5 bits per weight with the same pack format used for transformers, and confirmed byte-identical reconstruction per Linear via the SHA-256 manifest. The end-to-end perplexity ratio is still pending — Mamba-2.8B is compression-validated but not yet in the PPL-verified set, and we do not publish a ratio until it round-trips through our benchmark harness.
This is an early ultra-low-bit compression result on a state-space architecture, and it suggests the same approach extends to Mamba-2, RWKV, and Jamba hybrid transformer/SSM. Those are queued for the next runner adapter, and the Mamba PPL number publishes when the eval clears — not before.
The streaming pipeline: 70B on a single 32GB GPU
The reason all eighteen models compressed on the same hardware is the per-layer streaming design. AWQ and GPTQ both want the entire model resident on GPU during compression. That is why most public AWQ checkpoints stop somewhere around 13B for consumer-class compression rigs.
UltraCompress processes one transformer block at a time. Peak GPU memory during compression is bounded by approximately one transformer block plus calibration activations, regardless of total model size. For Llama-3.1-70B that is roughly 2-3 GB during the compress step, comfortably inside a single 32 GB consumer card.
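A back-of-envelope check of that bound, as a sketch: the dimensions below are the publicly listed Llama-3.1-70B configuration values as best we can state them here (verify against the model's config.json before relying on them). The calculation counts only the seven Linear weight matrices in one block and ignores norms, calibration activations, and any compressor workspace.

```python
# Assumed Llama-3.1-70B dimensions; check against the published config.json.
HIDDEN = 8192
INTERMEDIATE = 28672
N_HEADS = 64
N_KV_HEADS = 8
N_LAYERS = 80
BYTES_BF16 = 2

head_dim = HIDDEN // N_HEADS        # 128
kv_dim = N_KV_HEADS * head_dim      # 1024, grouped-query attention

# q_proj and o_proj are HIDDEN x HIDDEN; k_proj and v_proj are HIDDEN x kv_dim.
attn_params = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim
# gate_proj, up_proj, and down_proj are each HIDDEN x INTERMEDIATE.
mlp_params = 3 * HIDDEN * INTERMEDIATE

block_params = attn_params + mlp_params
block_gb = block_params * BYTES_BF16 / 1e9
all_blocks_gb = block_gb * N_LAYERS

print(f"one block:  {block_params:,} params, {block_gb:.2f} GB in bf16")
print(f"all blocks: {all_blocks_gb:.1f} GB in bf16")
```

One block comes to about 1.71 GB of bf16 weights, against roughly 137 GB for all 80 blocks. Adding calibration activations on top of a single block is consistent with the 2-3 GB figure, and the whole-model figure shows why a resident-model approach cannot fit on a 32 GB card.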
Llama-3.1-70B compressed in 12 hours on a single RTX 5090. Mixtral-8x7B in 4 hours on the same card. Hermes-3-Llama-3.1-405B compressed end-to-end on dual RTX 5090s in 13 hours.
What this unlocks for customers
The reason this matters commercially is the audit-trail story. Defense, healthcare, and regulated finance customers have a category of need that the open-source LLM stack does not currently serve: they need to demonstrate to an auditor that the model running on their hardware produces the same outputs the vendor measured at acceptance test. Today, with AWQ or GPTQ in the stack, that demonstration fails because the compression-to-customer drift is real and measurable. Workaround: ship full-precision bf16. Cost: roughly 3.2× the disk and inference memory of a 5-bit compressed model (16 bits per weight versus 5, before any per-group metadata).
UltraCompress closes that gap. The customer reload is a deterministic, SHA-256-verifiable reconstruction of the validated artifact. The vendor can sign an acceptance test against the compressed artifact, hand the same artifact to the customer, and the customer can reproduce the reconstruction on their own hardware with byte-for-byte identical results.
Quick start
```bash
pip install ultracompress
hf download SipsaLabs/qwen3-1.7b-uc-v3-bpw5 --local-dir ./qwen3-1.7b-uc
uc verify ./qwen3-1.7b-uc
```
uc verify walks every layer, sha256-checks the pack, and confirms the declared layer count and per-file integrity. On a passing artifact it prints VERIFY: PASS.
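For intuition about what a check of this shape involves, here is a file-level sketch. It is not the implementation of uc verify and does not reflect the real artifact layout: the `manifest.json` name, its `layer_count` and `files` fields, and the `layer_N.bin` file names are all hypothetical. It hashes each file in fixed-size chunks, so large layer files never need to fit in memory, and it fails if the declared count or any digest is off.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so memory use stays flat for large files."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_dir(root):
    """True only if the declared layer count matches and every file digest matches."""
    root = Path(root)
    manifest = json.loads((root / "manifest.json").read_text())
    files = manifest["files"]
    if len(files) != manifest["layer_count"]:
        return False
    return all(
        (root / name).is_file() and sha256_file(root / name) == digest
        for name, digest in files.items()
    )

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    files = {}
    for i in range(3):
        path = root / f"layer_{i}.bin"
        path.write_bytes(bytes([i]) * 1024)  # stand-in for a packed layer
        files[path.name] = sha256_file(path)
    (root / "manifest.json").write_text(json.dumps({"layer_count": 3, "files": files}))

    ok_before = verify_dir(root)
    (root / "layer_1.bin").write_bytes(b"tampered")  # simulate a corrupted download
    ok_after = verify_dir(root)

print(ok_before, ok_after)
```

The untouched directory passes and the directory with one overwritten layer file fails, which is the behavior an integrity gate at customer load needs.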