We searched for our competition. The 5-bit band was empty.

A live HuggingFace Hub query for lossless 5-bit transformer compression returned zero competing artifacts. Twenty-two architectures, three new sub-1.005x perplexity ratios this week, SHA-256-verifiable bit-identical reconstruction. Here is why the band was empty — and why “lossless 5-bit” is a different category from “4-bit AWQ plus a bit.”

UltraCompress · HF Hub snapshot 2026-05-09 · Posted by the Sipsa Labs team

The cleanest signal you can get about a market is the one you have to do nothing to obtain. We did not set out to write this post. We set out to do due diligence on our own positioning — running the same competitive search any cold-call prospect would run before deciding whether to talk to us. We searched HuggingFace for “5-bit lossless transformer compression” the way a customer would. The Hub returned nothing. We re-ran the query three different ways and got the same answer. That answer is what this post is about.

The headline finding: as of 2026-05-09, the public HuggingFace Hub contains zero 5-bit lossless transformer compression artifacts that are not ours. The mainstream quantization ecosystem — AWQ, GPTQ, GGUF, EXL3 — settled at 4 bits per weight as the practical floor for “good enough” perplexity drift, and at 8 bits as the quality-first ceiling. The 5-bit band fell through the crack between those two answers. The pack format we are shipping today writes into that gap.

The search that started this post

We ran four queries against the HuggingFace Hub on the morning of 2026-05-09. The point of writing them down is so anyone can reproduce them in their own browser and confirm the result before deciding whether to take the rest of this post seriously.

| Query | Models found | Notes |
|---|---|---|
| 5bit quantization | 1 | Single irrelevant whisper.cpp speech model from 2024-03; 0 downloads, 0 trending. |
| 5-bit lossless compression transformer | 0 | Empty. |
| AWQ 5bit OR GPTQ 5bit OR EXL3 | 0 | Empty. |
| SipsaLabs | 20 | All ours. |

We also looked manually at the top 50 results for quantized and quantization as catch-alls. Nothing in either list was a 5-bit lossless transformer compression artifact — the lists are dense with 4-bit AWQ, 4-bit GPTQ, and a long tail of GGUF Q4_K_M / Q5_K_M / Q8_0 quantizations. Q5_K_M in GGUF naming is a lossy, roughly 5-bit llama.cpp quantization; it is the closest neighbor in name, but it is not in the same category.
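
For readers who prefer an API to a search box, here is a minimal sketch of the same queries using the public huggingface_hub client. The query strings are the ones from the table above; result counts depend on the Hub's search backend and on the day you run it.

```python
# Reproduce the Hub queries programmatically. Uses only the public
# huggingface_hub client; nothing here depends on our tooling.
from huggingface_hub import HfApi

api = HfApi()
queries = [
    "5bit quantization",
    "5-bit lossless compression transformer",
    "AWQ 5bit OR GPTQ 5bit OR EXL3",
    "SipsaLabs",
]
for query in queries:
    models = list(api.list_models(search=query, limit=100))
    print(f"{query!r}: {len(models)} model(s)")
    for m in models[:5]:
        print("    ", m.id)
```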

The relevant disclaimer up front: this is a snapshot of the public Hub. Private 5-bit work almost certainly exists at the frontier labs. Our claim is not “first ever 5-bit.” Our claim is “first published, reproducible, customer-installable, bit-identically-verifiable 5-bit lossless framework.” That is a longer mouthful and a more defensible one.

Why 5 bits was the missing band

The structure of the public quantization ecosystem explains the gap better than any amount of marketing. Quantization research over the last two years bifurcated into two answers to a single question: how few bits can we get away with?

Group one — AWQ, GPTQ, OmniQuant, the more recent QTIP and SeedLM — pushed downward from fp16 toward the practical floor of model usability. AWQ ate the market at 4 bits. GPTQ at 4 bits is the broad fallback. Anything below 4 bits became research turf: 3 bits with significant quality degradation, 2 bits as a sport. The lower-bound competition is what publishes papers. The lower bound is also where the engineering risk lives.

Group two — the production deployment community — settled on 8 bits as the quality-first ceiling. Full-precision bf16 stays the gold standard; int8 weight-only quantization is the conservative deployment posture. The upper bound is where the audit trail lives. It is also where the storage savings stop being interesting.

That left 5, 6, and 7 bits as the negative space between those two answers. The market signal said: if you wanted maximum compression you went to 4 bits and tolerated the lossiness; if you wanted preservation you went to 8 bits and tolerated the storage. Five bits was assumed to be a wasted compromise — lossier than 8, larger than 4. Nobody asked the more interesting question: what if the trainer-side quantization can be persisted as a codec, so the customer reconstructs the same bits the trainer measured?

Lossless is not a smaller number. It is a different contract.

The vocabulary problem here is real and worth slowing down for. “Lossless” in the quantization literature usually means “lossless within the quantization grid.” That is, the quantized weights round-trip through their codes without further loss. It does not mean the quantized model matches the bf16 baseline.

Our use of “lossless” is narrower and more specific. The pack format persists the trainer’s codec state alongside its outputs, so the customer’s reconstruction is a deterministic per-Linear function of bytes-on-disk that runs the same arithmetic the trainer ran. SHA-256 over the reconstructed tensor bytes is computed at pack write and re-checked at consumer load. uc verify reports per-layer pass/fail and does not require a GPU.
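
A minimal sketch of what that check amounts to. The reconstruction function, manifest layout, and field names below are illustrative assumptions, not the actual uc pack format; the point is that verification is ordinary hashing over reconstructed bytes, with no GPU involved.

```python
# Toy version of the write-then-recheck hash contract. reconstruct_layer()
# stands in for the deterministic per-Linear reconstruction; the real codec
# is different, but the verification step has exactly this shape.
import hashlib
import numpy as np

def reconstruct_layer(codes: np.ndarray, scale: float) -> bytes:
    # Same arithmetic at pack-write time and at consumer-load time.
    return (codes.astype(np.float32) * scale).astype(np.float16).tobytes()

# Pack write: hash the reconstructed tensor bytes into a manifest.
codes = np.random.randint(0, 32, size=(4, 8), dtype=np.uint8)  # 5-bit codes
scale = 0.01
manifest = {"model.layers.0.mlp.up_proj": hashlib.sha256(
    reconstruct_layer(codes, scale)).hexdigest()}

# Consumer load: re-run the same reconstruction, re-check per layer.
for name, expected in manifest.items():
    actual = hashlib.sha256(reconstruct_layer(codes, scale)).hexdigest()
    print("PASS" if actual == expected else "FAIL", name)
```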

The compression step (bf16 weights to 5 bits-per-weight plus correction) is lossy by construction. The distribution step (trainer’s compressed artifact to customer’s reconstructed weights) is bit-identical. That is the contract.
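
To make the two steps concrete, here is a toy version of the contract using a plain 5-bit grid, not the uc codec: the quantization error is real, but the decode of the packed bytes is the same on every load.

```python
# Toy contract: compression is lossy, distribution is bit-identical.
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)  # stand-in for bf16 weights

# Compression step (lossy by construction): round onto a 32-level grid.
scale = float(np.abs(w).max()) / 15.0
codes = np.clip(np.round(w / scale), -16, 15).astype(np.int8)

# Distribution step (deterministic): every load reconstructs the same bytes.
def reconstruct(codes: np.ndarray, scale: float) -> bytes:
    return (codes.astype(np.float32) * scale).tobytes()

assert reconstruct(codes, scale) == reconstruct(codes, scale)  # bit-identical
err = np.abs(np.frombuffer(reconstruct(codes, scale), dtype=np.float32) - w).max()
print(f"max abs reconstruction error vs original: {err:.4f}")  # nonzero: lossy
```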

Why does this matter beyond the engineering aesthetics? Because in regulated environments — on-prem, air-gapped, audit-bound — the gap between “model we measured” and “model we shipped” is itself a compliance object. Every prior quantization scheme treats that gap as small enough to ignore. We treat it as zero, provably.

Twenty-two architectures, three new sub-1.005x records this week

Filling an empty band on the Hub is interesting only if the artifacts in it are good. Per-layer telemetry written by the streaming runner, aggregated into the perplexity-eval JSON outputs, gives us the ground-truth ratios below. Every entry has been cross-checked against its source JSON within a ±0.0005 tolerance. The results that landed this week are marked NEW.

| Model | Family | Params | PPL ratio | Conditions |
|---|---|---|---|---|
| Phi-3-mini-4k-instruct | phi3 dense | 3.8 B | 1.00262x | seq_len=128 caveat |
| Mixtral-8x7B | MoE | 47 B | 1.00368x | NEW — class-leading MoE record |
| Qwen3-1.7B-Base | qwen3 dense | 1.7 B | 1.00401x | n=30, seq=1024 |
| Qwen3-14B | qwen3 dense | 14 B | 1.00403x | NEW — ties small-decoder record at 14B class |
| Yi-1.5-9B | llama dense | 8.8 B | 1.00414x | n=50, seq=1024 (>8B record) |
| Qwen3-8B | qwen3 dense | 8 B | 1.00440x | NEW — 8B class record |
| Mistral-7B-v0.3 | mistral dense | 7 B | 1.00548x | v10, breakthrough this week |
| Phi-3-mini-4k | phi3 dense | 3.8 B | 1.00624x | v10 cross-arch confirm |
| Hermes-3-405B | llama dense | 405 B | 1.0066x | verified large-decoder result |
| Qwen3-0.6B | qwen3 dense | 0.6 B | 1.0069x | standard eval |
| OLMo-2-0425-1B | olmo dense | 1 B | 1.0073x | standard eval |
| SmolLM2-1.7B-Instruct | llama dense | 1.7 B | 1.0075x | standard eval |
| Mamba-2.8B | SSM | 2.8 B | 1.0119x | SSM cross-arch result |
| Llama-3.1-8B | llama dense | 8 B | 1.0125x | standard eval |

The honest framing: many of these architectures land sub-1% perplexity drift at 5 bits per weight. Not all of them. Mixtral-8x7B at 1.00368x is the tightest mixture-of-experts result we have seen anywhere in the public quantization literature at 5 bits. Qwen3-14B at 1.00403x effectively ties the 1.7B small-decoder record at the 14B-parameter class — the codec scales without losing precision. Qwen3-8B at 1.00440x closes the 8B-class gap. Mistral-7B-v0.3 at 1.00548x is a fresh result this week from a tightened training objective. We report the numbers we have measured, when we measured them, and under what conditions — and we do not report numbers we have not measured.
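
If you want to re-run the ±0.0005 cross-check yourself, the arithmetic is a one-liner. The JSON field names below (baseline_ppl, compressed_ppl) are assumptions for illustration; substitute whatever keys your copy of the eval output uses.

```python
# Cross-check a published ratio against a perplexity-eval JSON output.
import json

TOLERANCE = 0.0005

def check_ratio(eval_json_path: str, published_ratio: float) -> bool:
    with open(eval_json_path) as f:
        result = json.load(f)
    measured = result["compressed_ppl"] / result["baseline_ppl"]
    return abs(measured - published_ratio) <= TOLERANCE

# Example: the Qwen3-8B row above.
# check_ratio("qwen3-8b-eval.json", 1.00440)
```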

What we tried, didn’t work, and published anyway

The honest version of every quantization-research roadmap includes the cures that did not work. Our published catalogue of refuted hypotheses sits at fourteen entries as of this week.

We publish negative results because cherry-picking is the death of compression-research credibility, and because the next direction is empirically motivated by the refutations, not by the cures. If you are evaluating compression vendors, the negative-results catalogue is the document you should read first.

Try it yourself in three commands

The whole point of bit-identical reconstruction is that anyone can verify the pack against the public artifact. uc bench runs a baseline-vs-compressed perplexity comparison against a held-out tail of FineWeb-edu — the same eval harness that produced every number in the table above. No GPU required for verify; one consumer GPU sufficient for bench.

pip install ultracompress
hf download SipsaLabs/qwen3-8b-uc-v3-bpw5 --local-dir ./qwen3-8b-uc-v3-bpw5
uc bench ./qwen3-8b-uc-v3-bpw5
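
If you would rather see what a baseline-vs-compressed comparison measures before trusting our harness, here is a minimal perplexity sketch built only on transformers. It is not the uc bench implementation; the model IDs, eval text, and sequence length are placeholders.

```python
# Minimal single-window perplexity, for sanity-checking the metric itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str, seq_len: int = 1024) -> float:
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    model.eval()
    ids = tok(text, return_tensors="pt").input_ids[:, :seq_len]
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean next-token cross-entropy
    return float(torch.exp(loss))

# ratio = perplexity("<compressed-model>", text) / perplexity("<baseline-model>", text)
```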

The full set of 22+ Sipsa Labs HuggingFace artifacts, all under the same v3 pack format and all verifiable by the same uc verify CLI, is at huggingface.co/SipsaLabs. The runner and verification harness are available under BUSL-1.1 in the github.com/sipsalabs/ultracompress repository.

If uc bench reports a number that disagrees with the table above by more than the eval tolerance, we want to know. That is the kind of bug report that pays for itself.

Patents and contact

Method and pack-format protections were filed with the United States Patent and Trademark Office on April 25, 2026: provisionals 64/049,511 and 64/049,517. A supplementary batch is queued.