Eighteen architectures, one pack format.
UltraCompress 5-bit lossless transformer compression validated across 18 architectures from 0.6B to 405B parameters — dense, mixture-of-experts, and state-space (Mamba). One pack format. Bit-identical reconstruction guaranteed by SHA-256 manifest at customer load.
If you compress a transformer with AWQ, GPTQ, EXL3, or bitsandbytes today, the model your customer runs is not the model you measured. It is a close relative. The perplexity will be 3-10% off from what you saw at training time. Most teams shrug and ship it.
We do not get to shrug. The customers who actually pay premium dollars for compression — defense primes, hospital networks, banks running on-device LLMs for advisor seats — need bit-exact behavior between the artifact they audit and the artifact running in production. “Close enough” is a compliance blocker, not an engineering note.
This post explains where we are after a week of architecture validation: what numbers we have, what we do not have, and where it goes next.
The drift nobody at the major labs admits
Quantization libraries today share one assumption: at inference time, the customer rebuilds the dequantized weights from the stored low-bit representation through a different code path than the trainer used. The customer-side reconstruction is then approximately equal to the trainer’s measured weights. Approximately. The two paths drift because of:
- numerical order-of-operations differences between the trainer’s calibration and the customer’s bit-unpack
- different fp16 vs bf16 vs fp32 intermediate dtypes
- per-group scale rounding that did not exist at training time
- the customer’s dequantize kernel committing to slightly different arithmetic than the trainer’s did
The drift is small per-weight. It compounds. The output perplexity drift on 5-bit transformer weights is generally 3-10% across the major libraries. On regulated customer hardware, that drift is the difference between “model behaves as audited” and “we cannot ship this.”
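To make the failure mode concrete, here is a minimal sketch (synthetic tensors, not any library's actual kernel): the same 5-bit codes and per-group scales, dequantized once with fp32 intermediates and once entirely in bf16, do not land on the same bf16 weights.

```python
import torch

torch.manual_seed(0)

# Synthetic 5-bit state: integer codes in [0, 32) plus one fp32 scale per
# 128-weight group.
codes = torch.randint(0, 32, (4096,), dtype=torch.int32)
scales = torch.rand(4096 // 128).repeat_interleave(128)

# Path A: dequantize with fp32 intermediates, cast the result to bf16 once.
w_a = ((codes.float() - 15.5) * scales).to(torch.bfloat16)

# Path B: same math, but every intermediate is bf16 (the scale rounds first,
# then the multiply rounds again).
w_b = (codes.to(torch.bfloat16) - 15.5) * scales.to(torch.bfloat16)

diff = (w_a.float() - w_b.float()).abs()
print(f"weights that differ: {(diff > 0).sum().item()} / {codes.numel()}")
print(f"max abs diff: {diff.max().item():.2e}")
```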
You cannot trivially fix this in AWQ or GPTQ without changing the wire format — and once you change the wire format, you have invented a new format. Which is what we did.
The insight: persist the trainer’s codec
The thing that bothered us for months: the trainer literally has the right answer in memory. It already has the learned codebook. It already has the per-block scales. It already has the integer codes. Then we throw all three away and write the dequantized weights to disk as a bf16 tensor.
Why? Because the convention of every prior library is that the customer-facing artifact is a state_dict of full-precision-ish weights. So you bake in the dequantize.
If instead you ship the customer the codec state itself — the explicit non-uniform learned representation, the per-block scales, the bit-packed integer codes — and have the customer reconstruct on load, then customer-side reconstruction is the same arithmetic operation the trainer performed. Bit identical. Not approximately. Identically.
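A sketch of the contract (the field names are illustrative, not the actual uc pack layout): the pack carries the codec state, and both sides call the same reconstruction function over it, so equality holds by construction rather than by numerical luck.

```python
import torch

def reconstruct(codebook: torch.Tensor, codes: torch.Tensor,
                scales: torch.Tensor, group: int = 128) -> torch.Tensor:
    """The single reconstruction op. The trainer measures perplexity on its
    output; the customer calls the same function on load."""
    values = codebook[codes.long()]  # look up the non-uniform learned levels
    return (values * scales.repeat_interleave(group)).to(torch.bfloat16)

codebook = torch.sort(torch.randn(32)).values             # 2**5 learned levels
codes = torch.randint(0, 32, (4096,), dtype=torch.int16)  # bit-packed on disk
scales = torch.rand(4096 // 128)                          # per-block scales

# Trainer side and customer side are literally the same call, so the
# reconstructed tensors match bit for bit.
assert torch.equal(reconstruct(codebook, codes, scales),
                   reconstruct(codebook, codes, scales))
```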
We measured Qwen3-1.7B compressed perplexity at the trainer side: 18.3748. We then packed the model with uc pack v3, uploaded it to HuggingFace, downloaded it on a clean machine through the customer flow, ran uc verify, and re-measured perplexity: 18.3748. The delta is the precision of printf, not the precision of the format. We also verified bit-identity at the state_dict level: every quantized weight tensor reconstructs with max_abs_diff = 0.0 against the trainer’s saved post-quantize state.
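The state_dict-level check is easy to reproduce; a sketch, assuming hypothetical file names for the trainer's saved post-quantize state and the customer-side reconstruction:

```python
import hashlib
import torch

def sha256_file(path: str) -> str:
    """Hash a pack file for comparison against its manifest entry."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical paths; substitute your own artifacts.
trainer_sd = torch.load("trainer_post_quantize.pt")
customer_sd = torch.load("customer_reconstructed.pt")

worst = max((trainer_sd[k].float() - customer_sd[k].float()).abs().max().item()
            for k in trainer_sd)
print("max_abs_diff:", worst)  # bit-identical reconstruction => 0.0
```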
The compression vs the bf16 baseline is still lossy — this is 5 bits per weight, not magic. But the trainer-to-customer step is no longer a source of drift.
The eighteen-architecture matrix
We did not want to ship a single proof point. The release covers eighteen architectures from ten vendors, spanning 0.6B to 405B parameters, including two Mixture-of-Experts variants and one state-space model. All are live on HuggingFace under BUSL-1.1 with a research / sub-$1M-ARR / individual-use grant.
| Model | Params | Type | PPL ratio (5-bit / bf16) |
|---|---|---|---|
| Phi-3-mini-4k-instruct | 3.8B | dense | 1.00262x |
| Mixtral-8x7B-v0.1 | 47B | MoE | 1.00368x |
| Qwen3-1.7B-Base | 1.7B | dense | 1.00401x |
| Qwen3-14B | 14B | dense | 1.00403x |
| Yi-1.5-9B | 8.8B | dense | 1.00414x |
| Qwen3-8B | 8B | dense | 1.00440x |
| Mistral-7B-v0.3 | 7B | dense | 1.00548x |
| Phi-3-mini-4k | 3.8B | dense | 1.00624x |
| Hermes-3-Llama-3.1-405B | 405B | dense | 1.0066x |
| Qwen3-0.6B | 0.6B | dense | 1.0069x |
| OLMo-2-0425-1B | 1B | dense | 1.0073x |
| SmolLM2-1.7B-Instruct | 1.7B | dense | 1.0075x |
| SmolLM2-1.7B | 1.7B | dense | 1.0085x |
| Mamba-2.8B | 2.8B | SSM | 1.0119x |
| Llama-3.1-8B | 8B | dense | 1.0125x |
| Llama-3.1-70B | 70B | dense | measured; see card |
| Phi-3.5-MoE-instruct | 42B | MoE | measured; see card |
| TinyLlama-1.1B-Chat | 1.1B | dense | measured; see card |
The honest framing: most architectures land sub-1% perplexity drift at 5 bits per weight. The 8B-class records (Qwen3-8B 1.00440x, Mixtral-8x7B 1.00368x) are class-leading among public 5-bit results we have been able to find. Mistral-7B-v0.3 was the family that took the longest to dial in: the original streaming runner sat near 1.05x for several iterations before a tightened training objective dropped it to 1.0055x this week.
The cross-architecture surprise: it works on Mamba
State-space models (Mamba, Mamba-2, RWKV, Jamba) do not use transformer attention. The literature on quantizing them at low bit-widths is essentially empty. AWQ does not target them. GPTQ does not target them. EXL3 does not target them.
But the pack format is a property of the underlying linear-algebra operation on dense nn.Linear modules, and each Mamba block contains four of them across the model's 64 blocks. So it should just work. It does. We compressed all 256 SSM Linears in state-spaces/mamba-2.8b-hf at 5 bits per weight with the same trainer used for transformers. Bit-identical reconstruction verified per Linear. End-to-end perplexity ratio: 1.0119x.
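The Linear inventory is easy to confirm yourself (requires a transformers version with Mamba support; the per-role module names come from the HF implementation and may differ across versions):

```python
from collections import Counter

import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("state-spaces/mamba-2.8b-hf")

# Count the dense Linears inside the SSM blocks, grouped by role.
roles = Counter(name.split(".")[-1]
                for name, m in model.named_modules()
                if isinstance(m, nn.Linear) and ".layers." in name)
print(roles)                # in_proj / x_proj / dt_proj / out_proj, 64 each
print(sum(roles.values()))  # 256
```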
That is, as far as we can find in published work, the first ultra-low-bit compression result on a state-space architecture. It strongly implies the same approach extends to Mamba-2, RWKV, and Jamba-style transformer/SSM hybrids. Those are queued for the next runner adapter.
The streaming pipeline: 70B on a single 32GB GPU
The reason all eighteen models compressed on the same hardware is the per-layer streaming design. AWQ and GPTQ both want the entire model resident on GPU during compression. That is why most public AWQ checkpoints stop somewhere around 13B for consumer-class compression rigs.
UltraCompress processes one transformer block at a time. Peak GPU memory during compression is bounded by approximately one transformer block plus calibration activations, regardless of total model size. For Llama-3.1-70B that is roughly 2-3 GB during the compress step, comfortably inside a single 32 GB consumer card.
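In outline (compress_block is a stand-in for the actual per-block quantizer, and real decoder layers take attention masks and position arguments elided here), the streaming loop looks like this:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def compress_streaming(blocks: nn.ModuleList, calib: torch.Tensor,
                       compress_block, device: str = "cuda"):
    """Compress a model one block at a time.

    Blocks stay on CPU except the one being worked on, so peak GPU memory
    is roughly one block plus the calibration activations, independent of
    total model size.
    """
    hidden = calib.to(device)
    packed = []
    for block in blocks:
        block.to(device)
        packed.append(compress_block(block, hidden))  # calibrate + quantize
        hidden = block(hidden)  # activations feed the next block's calibration
        block.to("cpu")         # release GPU memory before the next block
        torch.cuda.empty_cache()
    return packed
```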
Llama-3.1-70B compressed in 12 hours on a single RTX 5090. Mixtral-8x7B in 4 hours on the same card. Hermes-3-Llama-3.1-405B compressed end-to-end on dual RTX 5090s in 13 hours.
What this unlocks for customers
The reason this matters commercially is the audit-trail story. Defense, healthcare, and regulated-finance customers have a category of need that the open-source LLM stack does not currently serve: they need to demonstrate to an auditor that the model running on their hardware produces the same outputs the vendor measured at acceptance test. Today, with AWQ or GPTQ in the stack, that demonstration fails, because the trainer-to-customer drift is real and measurable. The workaround is to ship full-precision bf16, at roughly 3x the disk and inference memory of a 5-bit compressed model (16 bits per weight versus ~5).
UltraCompress closes that gap. The customer reload is bit-identical to the trainer measurement. The vendor can sign an acceptance test against the compressed artifact, hand the same artifact to the customer, and the customer can reproduce the test bit-exactly on their own hardware.
Quick start
```bash
pip install ultracompress
hf download SipsaLabs/qwen3-1.7b-uc-v3-bpw5 --local-dir ./qwen3-1.7b-uc
uc verify ./qwen3-1.7b-uc
```
uc verify walks every layer, SHA-256-checks the pack against the manifest, reconstructs the quantized Linears, and confirms shape integrity. On a passing artifact it prints VERIFY: PASS.