Verified inference benchmarks.
Every row below has a public HuggingFace artifact, a SHA-256 manifest you can re-verify on your hardware, and a JSON evaluation receipt. No "trust me bro": run `uc verify` and confirm the contract holds.
The verified matrix
14 architectures end-to-end at 5 bits per weight. Every PPL ratio is measured under a streaming per-layer reconstruction comparator at seq_len=1024, seed=42, on the FineWeb-edu held-out tail (n=30–50 documents unless noted).
| Model | Params | PPL ratio | Drift | Notes | HF artifact | Status |
|---|---|---|---|---|---|---|
| Phi-3-mini-4k-instruct | 3.8B | 1.00262× | +0.262% | seq_len=128 caveat | phi-3-mini-4k-instruct-uc-v3-bpw5 | live |
| Mixtral-8x7B-v0.1 (MoE) | 47B (13B active) | 1.00368× | +0.368% | — | mixtral-8x7b-v0.1-uc-v3-bpw5 | gated |
| Qwen3-1.7B-Base | 1.7B | 1.00401× | +0.401% | tightest small-decoder | qwen3-1.7b-base-uc-v3-bpw5 | live |
| Qwen3-14B | 14.0B | 1.00403× | +0.403% | scale-invariant codec | qwen3-14b-uc-v3-bpw5 | gated |
| Yi-1.5-9B | 9.0B | 1.00414× | +0.414% | tightest 8-9B dense | yi-1.5-9b-uc-v3-bpw5 | gated |
| Qwen3-8B | 8.0B | 1.00440× | +0.440% | 8B class record | qwen3-8b-uc-v3-bpw5 | live |
| Mistral-7B-v0.3 ⚡ NEW | 7.0B | 1.00548× | +0.548% | 5th cure attempt cracked it (4 prior refuted) | mistral-7b-v0.3-uc-v3-bpw5 | live |
| Hermes-3-Llama-3.1-405B 🔥 HEADLINE | 405B | 1.0066× | +0.66% | 5.0358 → 5.0692, single 32 GB GPU | hermes-3-llama-3.1-405b-uc-v3-bpw5 | gated |
| Qwen3-0.6B | 0.6B | 1.0069× | +0.69% | — | qwen3-0.6b-uc-v3-bpw5 | live |
| OLMo-2-0425-1B | 1.0B | 1.0073× | +0.73% | — | olmo-2-0425-1b-uc-v3-bpw5 | live |
| OLMo-2-0425-1B-Instruct | 1.0B | 0.9998× | −0.02% | regularization observed | olmo-2-0425-1b-instruct-uc-v3-bpw5 | live |
| SmolLM2-1.7B-Instruct | 1.7B | 1.0075× | +0.75% | — | smollm2-1.7b-instruct-uc-v3-bpw5 | live |
| SmolLM2-1.7B | 1.7B | 1.0085× | +0.85% | — | smollm2-1.7b-uc-v3-bpw5 | live |
| Llama-3.1-8B | 8.0B | 1.0125× | +1.25% | baseline; in-band | llama-3.1-8b-uc-v3-bpw5 | live |
Mean across 14 verified records: 1.00554×. Median: 1.00494×.
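The summary statistics can be re-derived from the PPL ratio column above. A quick sanity check using Python's `statistics` module (the last digit of the published mean presumably reflects rounding in the per-row ratios, so expect agreement to ~1e-5):

```python
import statistics

# PPL ratios copied from the verified matrix above
ratios = [
    1.00262, 1.00368, 1.00401, 1.00403, 1.00414, 1.00440, 1.00548,
    1.0066, 1.0069, 1.0073, 0.9998, 1.0075, 1.0085, 1.0125,
]

mean = statistics.mean(ratios)      # ≈ 1.00553
median = statistics.median(ratios)  # 1.00494
print(f"mean={mean:.5f} median={median:.5f}")
```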
Reproduce any row in 3 commands
Pick a row above. Copy the HF artifact name. Run:
```shell
pip install ultracompress
hf download SipsaLabs/qwen3-1.7b-base-uc-v3-bpw5 --local-dir ./pack
uc verify ./pack
```
For gated artifacts (10B+), click "Request access" on the HF page. Manual approval, usually within 24h. Free for sub-$1M ARR companies, individuals, research.
`uc verify` reads the SHA-256 manifest and confirms every layer reconstructs to the bytes the trainer wrote. If a single byte drifts, it fails loudly. This is the difference between "lossless" as marketing and "lossless" as a contract you can audit.
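To make the contract concrete, here is a minimal sketch of the kind of check `uc verify` performs. The manifest filename and schema used here (a `manifest.json` mapping relative paths to hex digests) are illustrative assumptions, not the tool's actual format:

```python
import hashlib
import json
from pathlib import Path

def verify_pack(pack_dir: str) -> bool:
    """Hash every file listed in the manifest and compare to the recorded digest."""
    pack = Path(pack_dir)
    # Assumed manifest layout: {"relative/path": "sha256 hex digest", ...}
    manifest = json.loads((pack / "manifest.json").read_text())
    ok = True
    for rel_path, expected in manifest.items():
        actual = hashlib.sha256((pack / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            print(f"FAIL {rel_path}: {actual} != {expected}")
            ok = False
    return ok
```

The real tool additionally reconstructs each layer from the compressed stream before hashing; the point is that a single flipped byte changes the digest and the check fails loudly.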
Eval methodology
- Comparator: per-layer streaming reconstruction. Both bf16 baseline and 5-bit compressed model use the same procedure on the same hardware (single 32 GB GPU), so the ratio isolates the codec contribution rather than confounding it with serving-stack differences.
- Dataset: FineWeb-edu held-out tail (no overlap with calibration). Seq length 1024 unless noted.
- n: 30–50 documents per row. Seed 42, deterministic.
- Hardware: dual RTX 5090 (32 GB each). The 405B-class fits inside a 32 GB peak via streaming reconstruction; running on a single consumer GPU is part of the value proposition.
- JSON receipts: every row's underlying eval JSON is in scripts/overlay/artifacts/ in the public repo. We don't ship round numbers without the source data.
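For completeness, the arithmetic behind each PPL ratio is standard (a sketch with made-up loss values, not data from any row above): perplexity is the exponential of the mean per-token negative log-likelihood over the held-out documents, and the ratio divides compressed by baseline.

```python
import math

def perplexity(nll_per_token: list[float]) -> float:
    """PPL = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# Illustrative numbers only, not taken from any eval receipt
baseline_ppl = perplexity([1.62, 1.58, 1.65])
compressed_ppl = perplexity([1.63, 1.59, 1.66])
ratio = compressed_ppl / baseline_ppl
print(f"{ratio:.5f}x")  # drift (%) = (ratio - 1) * 100
```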
Honest negative results
We catalogue what doesn't work at the same level of detail as what does. A few examples:
- Qwen3-1.7B-Base 1.0040× is the empirical floor. 5 cure paths refuted (rank+steps push, per-Linear adaptive bpw, depth-adaptive train_steps, multi-pass cascade, AWQ-style channel pre-scaling). Three within noise; two catastrophic regressions (1.0682× and 1.1306×).
- State-space models past scalar-only: Mamba-2.8B at 1.0119× with the codec alone. Two correction-overlay attempts (SVD warm-start; per-Linear KL on Gaussian inputs) made it worse, not better.
- Mistral-7B took multiple iterations. Several training-objective variants on the same architecture sat above 1.05× before a methodologically distinct objective (recipe patent-protected) landed 1.00548×. Lesson: when several refinements within a hypothesis class all fail, the bottleneck isn’t the search, it’s the hypothesis class.
- TinyLlama-1.1B-Chat: pack verifies clean but PPL eval throws CUDA device-side assert. Documented as deferred, not a fabricated number.
Full catalog (15+ entries) at github.com/sipsalabs/ultracompress/blob/main/docs/HONEST_NEGATIVE_RESULTS_2026_05_08.md.
Want to use these via API?
Same model menu, OpenAI-compatible.
```python
from openai import OpenAI

client = OpenAI(base_url="https://api.sipsalabs.com/v1", api_key="sk-...")
resp = client.chat.completions.create(
    model="hermes-3-405b",
    messages=[{"role": "user", "content": "test"}],
)
print(resp.choices[0].message.content)
```
$5 free credits on signup, no card. See /pricing for the full per-model token rates.
Questions about a specific row?
Direct line to the founder. Solo operation; you'll hear back within 4–8 h during US business hours.
- Reproduce verification: founder@sipsalabs.com
- Architecture not in the list: Compression-as-a-Service, see /pricing
- Press / investor: press@sipsalabs.com