UltraCompress documentation
Customer-side, reproducible 5-bit lossless transformer compression with SHA-256-verifiable, bit-identical reconstruction. 22 architectures verified end-to-end. Drops into any PyTorch / vLLM / TensorRT-LLM / sglang inference stack.
Install
pip install ultracompress
# or with uv:
uv pip install ultracompress
Requires Python 3.10+, PyTorch 2.5+, and CUDA 12.x for GPU paths. The verifier (uc verify) is CPU-only and works on any machine.
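To confirm the CLI is on your PATH after installing:
uc --version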
Pull a compressed model
Every Sipsa-compressed model is hosted on the public HuggingFace org SipsaLabs. Use either the hf CLI or any HuggingFace download path:
# Option A — HF CLI (recommended)
hf download SipsaLabs/qwen3-8b-uc-v3-bpw5 --local-dir ./qwen3-8b
# Option B — Python
from huggingface_hub import snapshot_download
snapshot_download("SipsaLabs/qwen3-8b-uc-v3-bpw5", local_dir="./qwen3-8b")
Available models
22 architectures, 0.6B–405B params, ~40 model artifacts total. Browse the full list at huggingface.co/SipsaLabs or see the canonical PPL matrix at /inference. Highlights:
| Model | Params | HF repo | PPL ratio |
|---|---|---|---|
| Hermes-3-Llama-3.1-405B | 405B | SipsaLabs/hermes-3-llama-3.1-405b-uc-v3-bpw5 | 1.0066× |
| Mixtral-8x7B | 47B (MoE) | SipsaLabs/mixtral-8x7b-v0.1-uc-v3-bpw5 | 1.00368× |
| Qwen3-14B | 14B | SipsaLabs/qwen3-14b-uc-v3-bpw5 | 1.00403× |
| Qwen3-8B | 8B | SipsaLabs/qwen3-8b-uc-v3-bpw5 | 1.00440× |
| Mistral-7B-v0.3 | 7B | SipsaLabs/mistral-7b-v0.3-uc-v3-bpw5 | 1.00548× |
| Phi-3-mini-4k-instruct | 3.8B | SipsaLabs/phi-3-mini-4k-instruct-uc-v3-bpw5 | 1.00624× |
Verify SHA-256 reconstruction
Every Sipsa-compressed artifact ships with a per-Linear SHA-256 manifest. Verify locally with zero trust in the vendor:
uc verify ./qwen3-8b
# → SHA-256 manifest verified across N Linear layers
# → PASS (or per-layer hash diffs on FAIL)
Runs entirely on CPU; no GPU is required. This is the cryptographic primitive that AWQ / GPTQ / EXL3 cannot offer: those formats leave reproducibility ambiguous.
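Under the hood this is plain hashing. Below is a minimal sketch of the idea, assuming a hypothetical manifest.json that maps weight names to SHA-256 hex digests over the raw reconstructed bytes; the real manifest format and location are defined by the uc tool:
import hashlib
import json
import torch
from transformers import AutoModelForCausalLM
from ultracompress import unpack

# Hypothetical layout: {"model.layers.0.self_attn.q_proj.weight": "<sha256 hex>", ...}
with open("./qwen3-8b/manifest.json") as f:
    manifest = json.load(f)

model_dir = unpack("./qwen3-8b", out_dir="./qwen3-8b-reconstructed")
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16)

for name, tensor in model.state_dict().items():
    if name not in manifest:
        continue  # per-Linear manifest covers Linear weights only
    # Reinterpret the bf16 weight as raw bytes and hash them
    raw = tensor.detach().cpu().contiguous().view(torch.uint8).numpy().tobytes()
    ok = hashlib.sha256(raw).hexdigest() == manifest[name]
    print(f"{name}: {'OK' if ok else 'FAIL'}")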
Benchmark TTFT / tokens-per-sec / VRAM
uc bench ./qwen3-8b
# → TTFT: 142 ms
# → tokens/sec: 87.3
# → peak VRAM: 6.2 GB
Measures time-to-first-token, tokens-per-second, and peak GPU memory. Useful for capacity planning and side-by-side comparison with bf16 / AWQ / GPTQ baselines.
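If you want to sanity-check these numbers without the CLI, all three quantities can be measured on the reconstructed model with stock transformers. A rough sketch follows; the streamer yields roughly one text chunk per generated token, so expect noisier figures than uc bench reports:
import time
from threading import Thread
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

tok = AutoTokenizer.from_pretrained("./qwen3-8b-reconstructed")
model = AutoModelForCausalLM.from_pretrained(
    "./qwen3-8b-reconstructed", torch_dtype=torch.bfloat16, device_map="cuda"
)
inputs = tok("Hello, lossless world", return_tensors="pt").to(model.device)

streamer = TextIteratorStreamer(tok, skip_prompt=True)
torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
Thread(target=model.generate,
       kwargs=dict(**inputs, streamer=streamer, max_new_tokens=128)).start()

first, n = None, 0
for _ in streamer:  # roughly one chunk per generated token
    n += 1
    if first is None:
        first = time.perf_counter()
elapsed = time.perf_counter() - start

print(f"TTFT: {(first - start) * 1e3:.0f} ms")
print(f"tokens/sec: {n / elapsed:.1f}")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 2**30:.1f} GB")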
Integrate with your inference stack
UltraCompress reconstructs the original PyTorch model bit-identically before any inference framework runs. Once reconstructed, it's a standard transformers model that drops into any inference stack you already use.
vLLM
from vllm import LLM
from ultracompress import unpack
# Reconstruct the model from the compressed pack:
model_dir = unpack("./qwen3-8b", out_dir="./qwen3-8b-reconstructed")
# Then load with vLLM as normal:
llm = LLM(model=model_dir, dtype="bfloat16")
outputs = llm.generate(["Hello, lossless world"])
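The same reconstructed directory also works with vLLM's OpenAI-compatible server (recent vLLM versions expose this as vllm serve):
vllm serve ./qwen3-8b-reconstructed --dtype bfloat16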
TensorRT-LLM
# Reconstruct first, then build the TRT-LLM engine as normal:
uc unpack ./qwen3-8b --out ./qwen3-8b-reconstructed
# Note: trtllm-build consumes a TensorRT-LLM-format checkpoint; depending on your
# TRT-LLM version, run the per-architecture convert_checkpoint.py step on the
# reconstructed directory first (see the TensorRT-LLM docs):
trtllm-build --checkpoint_dir ./qwen3-8b-reconstructed --output_dir ./qwen3-8b-engine
sglang
uc unpack ./qwen3-8b --out ./qwen3-8b-reconstructed
python -m sglang.launch_server --model-path ./qwen3-8b-reconstructed
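The launched server speaks the OpenAI-compatible protocol (on port 30000 by default), so any OpenAI client can query it; model="default" is sglang's placeholder name for the single loaded model:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello, lossless world"}],
)
print(response.choices[0].message.content)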
Plain transformers
from ultracompress import load_compressed
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("./qwen3-8b")
model = load_compressed("./qwen3-8b") # reconstructs in <10 sec for 7B-class
inputs = tokenizer("Hello, lossless world", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
Use the managed API (private beta)
The OpenAI-compatible inference API at api.sipsalabs.com/v1 is currently in private beta. Email founder@sipsalabs.com for access (24-hour turnaround) or use the structured intake at /get-access. Once you have a key:
import os
from openai import OpenAI
client = OpenAI(
    base_url="https://api.sipsalabs.com/v1",
    api_key=os.environ["SIPSA_API_KEY"],
)
response = client.chat.completions.create(
    model="hermes-3-405b",
    messages=[{"role": "user", "content": "hello, lossless world"}],
)
print(response.choices[0].message.content)
Streaming, function calling, and the other endpoints the openai SDK supports all work the same way.
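For example, token streaming is just the SDK's usual stream=True flag:
stream = client.chat.completions.create(
    model="hermes-3-405b",
    messages=[{"role": "user", "content": "hello, lossless world"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)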
CLI reference
| Command | Purpose |
|---|---|
| uc verify <dir> | Verify SHA-256 manifest reconstruction (CPU only) |
| uc bench <dir> | Measure TTFT / tokens-per-sec / VRAM (GPU required) |
| uc unpack <dir> --out <out> | Reconstruct compressed pack into standard transformers checkpoint |
| uc verify-org SipsaLabs | Verify all artifacts in a HuggingFace org match their SHA-256 manifests |
| uc status | Print local cache + version info |
| uc --version | Print package version |
FAQ
What's the difference between Sipsa and AWQ / GPTQ / EXL3?
Tightest PPL drift in the 4–5 bpw band, plus SHA-256 bit-identical reconstruction: the cryptographic guarantee that the customer-loaded model produces output identical to the trainer's reference. AWQ / GPTQ / EXL3 leave reproducibility ambiguous; you cannot bit-compare two runs. Sipsa proves bit-identical reconstruction via a per-Linear hash manifest. For SOC 2 / SR 11-7 / FDA / DoD audits, the difference is qualitative.
Does this require special hardware?
No. The reconstructed model runs on the standard PyTorch path: a CUDA GPU (consumer or datacenter) for inference, CPU only for uc verify. Tested on RTX 5090 / A100 / H100 / consumer Apple Silicon (PyTorch MPS backend on Metal).
What's the inference speed overhead?
Reconstruction happens once, at load time (~5–10 sec for 70B-class). Subsequent inference uses the standard PyTorch / vLLM / TensorRT-LLM path on the reconstructed model, with no per-token overhead. The win is GPU memory: a 3–4× lower footprint means more concurrent context per GPU-hour.
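You can confirm that the cost is load-time-only with a quick timing check, using load_compressed from the transformers example above:
import time
from ultracompress import load_compressed

start = time.perf_counter()
model = load_compressed("./qwen3-8b")  # reconstruction happens here, once
print(f"reconstruction + load: {time.perf_counter() - start:.1f} s")
# generate() then runs on the standard PyTorch path with no per-token overhead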
What's the license?
BUSL-1.1 with Additional Use Grant: free for personal use, research, and any company under $1M ARR. Auto-converts to Apache 2.0 four years after each release. Commercial license at scale via founder@sipsalabs.com.
What about 100B+ models?
Hermes-3-Llama-3.1-405B is the largest fully-verified artifact today; Mixtral-8x22B (141B MoE), Qwen3-235B-A22B, and Phi-3.5-MoE are all verified as well. The trillion-class roadmap targets DeepSeek-V3 (685B) once disk and MLA architecture support land in the trainer (Q3 2026).
How do I cite Sipsa in a paper?
@misc{sipsa2026,
  title  = {UltraCompress: SHA-256 verifiable lossless 5-bit transformer compression},
  author = {Ounnar, Sip},
  year   = {2026},
  note   = {Sipsa Labs, Inc. \url{https://sipsalabs.com}}
}