Verified inference benchmarks.
Every row below has a public HuggingFace artifact, a SHA-256 manifest you can re-verify on your hardware, and a JSON evaluation receipt. No "trust me bro": run `uc verify` and confirm the contract holds.
The verified matrix
14 architectures end-to-end at 5 bits per weight. Every PPL ratio is measured under a streaming per-layer reconstruction comparator at seq_len=1024, seed=42, on the FineWeb-edu held-out tail (n=30–50 documents unless noted).
| Model | Params | PPL ratio | Drift | Notes | HF artifact | Status |
|---|---|---|---|---|---|---|
| Phi-3-mini-4k-instruct | 3.8B | 1.00262× | +0.262% | seq_len=128 caveat | phi-3-mini-4k-instruct-uc-v3-bpw5 | live |
| Mixtral-8x7B-v0.1 (MoE) | 47B (13B active) | 1.00368× | +0.368% | — | mixtral-8x7b-v0.1-uc-v3-bpw5 | gated |
| Qwen3-1.7B-Base | 1.7B | 1.00401× | +0.401% | tightest small-decoder | qwen3-1.7b-base-uc-v3-bpw5 | live |
| Qwen3-14B | 14.0B | 1.00403× | +0.403% | scale-invariant codec | qwen3-14b-uc-v3-bpw5 | gated |
| Yi-1.5-9B | 9.0B | 1.00414× | +0.414% | tightest 8-9B dense | yi-1.5-9b-uc-v3-bpw5 | gated |
| Qwen3-8B | 8.0B | 1.00440× | +0.440% | 8B class record | qwen3-8b-uc-v3-bpw5 | live |
| Mistral-7B-v0.3 ⚡ NEW | 7.0B | 1.00548× | +0.548% | 5th cure attempt cracked it (4 prior refuted) | mistral-7b-v0.3-uc-v3-bpw5 | live |
| Hermes-3-Llama-3.1-405B 🔥 HEADLINE | 405B | 1.0066× | +0.66% | 5.0358 → 5.0692, single 32 GB GPU | hermes-3-llama-3.1-405b-uc-v3-bpw5 | gated |
| Qwen3-0.6B | 0.6B | 1.0069× | +0.69% | — | qwen3-0.6b-uc-v3-bpw5 | live |
| OLMo-2-0425-1B | 1.0B | 1.0073× | +0.73% | — | olmo-2-0425-1b-uc-v3-bpw5 | live |
| OLMo-2-0425-1B-Instruct | 1.0B | 0.9998× | −0.02% | regularization observed | olmo-2-0425-1b-instruct-uc-v3-bpw5 | live |
| SmolLM2-1.7B-Instruct | 1.7B | 1.0075× | +0.75% | — | smollm2-1.7b-instruct-uc-v3-bpw5 | live |
| SmolLM2-1.7B | 1.7B | 1.0085× | +0.85% | — | smollm2-1.7b-uc-v3-bpw5 | live |
| Llama-3.1-8B | 8.0B | 1.0125× | +1.25% | baseline; in-band | llama-3.1-8b-uc-v3-bpw5 | live |
Mean across 14 verified records: 1.00554×. Median: 1.00494×.
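The summary statistics can be re-derived from the PPL ratio column above. A quick sanity check using Python's `statistics` module (the last digit of the published mean presumably reflects rounding in the per-row ratios, so expect agreement to ~1e-5):

```python
import statistics

# PPL ratios copied from the verified matrix above
ratios = [
    1.00262, 1.00368, 1.00401, 1.00403, 1.00414, 1.00440, 1.00548,
    1.0066, 1.0069, 1.0073, 0.9998, 1.0075, 1.0085, 1.0125,
]

mean = statistics.mean(ratios)      # ≈ 1.00553
median = statistics.median(ratios)  # 1.00494
print(f"mean={mean:.5f} median={median:.5f}")
```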
Reproduce any row in 3 commands
Pick a row above. Copy the HF artifact name. Run:
```shell
pip install ultracompress
hf download SipsaLabs/qwen3-1.7b-base-uc-v3-bpw5 --local-dir ./pack
uc verify ./pack
```
For gated artifacts (10B+), click "Request access" on the HF page. Manual approval, usually within 24h. Free for sub-$1M ARR companies, individuals, research.
`uc verify` reads the SHA-256 manifest and confirms every layer reconstructs to the bytes the trainer wrote. If a single byte drifts, it fails loudly. This is the difference between "lossless" as marketing and "lossless" as a contract you can audit.
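To make the contract concrete, here is a minimal sketch of the kind of check `uc verify` performs. The manifest filename and schema used here (a `manifest.json` mapping relative paths to hex digests) are illustrative assumptions, not the tool's actual format:

```python
import hashlib
import json
from pathlib import Path

def verify_pack(pack_dir: str) -> bool:
    """Hash every file listed in the manifest and compare to the recorded digest."""
    pack = Path(pack_dir)
    # Assumed manifest layout: {"relative/path": "sha256 hex digest", ...}
    manifest = json.loads((pack / "manifest.json").read_text())
    ok = True
    for rel_path, expected in manifest.items():
        actual = hashlib.sha256((pack / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            print(f"FAIL {rel_path}: {actual} != {expected}")
            ok = False
    return ok
```

The real tool additionally reconstructs each layer from the compressed stream before hashing; the point is that a single flipped byte changes the digest and the check fails loudly.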
Eval methodology
- Comparator: per-layer streaming reconstruction. Both bf16 baseline and 5-bit compressed model use the same procedure on the same hardware (single 32 GB GPU), so the ratio isolates the codec contribution rather than confounding it with serving-stack differences.
- Dataset: FineWeb-edu held-out tail (no overlap with calibration). Seq length 1024 unless noted.
- n: 30–50 documents per row. Seed 42, deterministic.
- Hardware: dual RTX 5090 (32 GB each). The 405B-class fits inside a 32 GB peak via streaming reconstruction; running on a single consumer GPU is part of the value proposition.
- JSON receipts: every row's underlying eval JSON is in scripts/overlay/artifacts/ in the public repo. We don't ship round numbers without the source data.
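For completeness, the arithmetic behind each PPL ratio is standard (a sketch with made-up loss values, not data from any row above): perplexity is the exponential of the mean per-token negative log-likelihood over the held-out documents, and the ratio divides compressed by baseline.

```python
import math

def perplexity(nll_per_token: list[float]) -> float:
    """PPL = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# Illustrative numbers only, not taken from any eval receipt
baseline_ppl = perplexity([1.62, 1.58, 1.65])
compressed_ppl = perplexity([1.63, 1.59, 1.66])
ratio = compressed_ppl / baseline_ppl
print(f"{ratio:.5f}x")  # drift (%) = (ratio - 1) * 100
```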
Honest negative results
We catalogue what doesn't work at the same level of detail as what does. A few examples:
- Qwen3-1.7B-Base 1.0040× is the empirical floor. 5 cure paths refuted (rank+steps push, per-Linear adaptive bpw, depth-adaptive train_steps, multi-pass cascade, AWQ-style channel pre-scaling). Three within noise; two catastrophic regressions (1.0682× and 1.1306×).
- State-space models past scalar-only: Mamba-2.8B at 1.0119× with the codec alone. Two correction-overlay attempts (SVD warm-start; per-Linear KL on Gaussian inputs) made it worse, not better.
- Mistral-7B took multiple iterations. Several training-objective variants on the same architecture sat above 1.05× before a methodologically distinct objective (recipe patent-protected) landed 1.00548×. Lesson: when several refinements within a hypothesis class all fail, the bottleneck isn’t the search, it’s the hypothesis class.
- TinyLlama-1.1B-Chat: pack verifies clean but PPL eval throws CUDA device-side assert. Documented as deferred, not a fabricated number.
Full catalog (15+ entries) at github.com/sipsalabs/ultracompress/blob/main/docs/HONEST_NEGATIVE_RESULTS_2026_05_08.md.
Want to use these via API?
Same model menu, OpenAI-compatible.
```python
from openai import OpenAI

client = OpenAI(base_url="https://api.sipsalabs.com/v1", api_key="sk-...")
resp = client.chat.completions.create(
    model="hermes-3-405b",
    messages=[{"role": "user", "content": "test"}],
)
print(resp.choices[0].message.content)
```
$5 free credits on signup, no card. See /pricing for the full per-model token rates.
Questions about a specific row?
Direct line to the founder. Solo operation; you'll hear back within 4–8 h during US business hours.
- Reproduce verification: founder@sipsalabs.com
- Architecture not in the list: Compression-as-a-Service, see /pricing
- Press / investor: press@sipsalabs.com