Docs v0.6.3

UltraCompress documentation

Customer-side reproducible 5-bit lossless transformer compression with SHA-256 verifiable bit-identical reconstruction. 22 architectures verified end-to-end. Drop into any PyTorch / vLLM / TensorRT-LLM / sglang inference stack.

Install

pip install ultracompress
# or with uv:
uv pip install ultracompress

Requires Python 3.10+, PyTorch 2.5+, CUDA 12.x for GPU paths. The verifier (uc verify) is CPU-only and works on any machine.

Pull a compressed model

Every Sipsa-compressed model is hosted on the public HuggingFace org SipsaLabs. Use either the hf CLI or any HuggingFace download path:

# Option A — HF CLI (recommended)
hf download SipsaLabs/qwen3-8b-uc-v3-bpw5 --local-dir ./qwen3-8b

# Option B — Python
from huggingface_hub import snapshot_download
snapshot_download("SipsaLabs/qwen3-8b-uc-v3-bpw5", local_dir="./qwen3-8b")

Available models

22 architectures, 0.6B–405B params, ~40 model artifacts total. Browse the full list at huggingface.co/SipsaLabs or see the canonical PPL matrix at /inference. Highlights:

Model                     Params      HF repo                                        PPL ratio
Hermes-3-Llama-3.1-405B   405B        SipsaLabs/hermes-3-llama-3.1-405b-uc-v3-bpw5   1.0066×
Mixtral-8x7B              47B (MoE)   SipsaLabs/mixtral-8x7b-v0.1-uc-v3-bpw5         1.00368×
Qwen3-14B                 14B         SipsaLabs/qwen3-14b-uc-v3-bpw5                 1.00403×
Qwen3-8B                  8B          SipsaLabs/qwen3-8b-uc-v3-bpw5                  1.00440×
Mistral-7B-v0.3           7B          SipsaLabs/mistral-7b-v0.3-uc-v3-bpw5           1.00548×
Phi-3-mini-4k-instruct    3.8B        SipsaLabs/phi-3-mini-4k-instruct-uc-v3-bpw5    1.00624×

Verify SHA-256 reconstruction

Every Sipsa-compressed artifact ships with a per-Linear SHA-256 manifest. Verify locally with zero trust in the vendor:

uc verify ./qwen3-8b
# → SHA-256 manifest verified across N Linear layers
# → PASS (or per-layer hash diffs on FAIL)

Runs on CPU only. No GPU required. This is the cryptographic primitive that AWQ / GPTQ / EXL3 cannot offer (they leave reproducibility ambiguous).
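
Conceptually, the check is a hash comparison per layer. A minimal sketch of per-layer verification; the manifest layout and layer name here are illustrative, not the actual uc manifest format:

```python
import hashlib

def verify_manifest(weights: dict, manifest: dict) -> list:
    """Compare SHA-256 of each reconstructed weight against the manifest.

    `weights` maps layer names to raw tensor bytes; `manifest` maps the
    same names to expected hex digests. Returns the names that mismatch.
    """
    mismatches = []
    for name, expected in manifest.items():
        digest = hashlib.sha256(weights[name]).hexdigest()
        if digest != expected:
            mismatches.append(name)
    return mismatches

# Toy example: one "layer" whose bytes hash to the recorded digest.
blob = b"\x00\x01\x02\x03"
manifest = {"model.layers.0.q_proj": hashlib.sha256(blob).hexdigest()}
print(verify_manifest({"model.layers.0.q_proj": blob}, manifest))  # → []
```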

Benchmark TTFT / tokens-per-sec / VRAM

uc bench ./qwen3-8b
# → TTFT: 142 ms
# → tokens/sec: 87.3
# → peak VRAM: 6.2 GB

Measures time-to-first-token, tokens-per-second, and peak GPU memory. Useful for capacity planning and side-by-side comparison with bf16 / AWQ / GPTQ baselines.
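
The two latency numbers are easy to reproduce by hand around any token stream. A rough sketch of the measurement, where fake_stream stands in for a real model's token iterator:

```python
import time

def measure(stream):
    """Time-to-first-token and average tokens/sec for any token iterator."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        count += 1
        if first is None:
            first = time.perf_counter() - start  # TTFT: first token latency
    total = time.perf_counter() - start
    return {"ttft_s": first, "tokens_per_s": count / total}

# Simulated stream: 50 tokens at ~1 ms each stands in for model output.
def fake_stream():
    for _ in range(50):
        time.sleep(0.001)
        yield "tok"

print(measure(fake_stream()))
```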

Integrate with your inference stack

UltraCompress reconstructs the original PyTorch model bit-identically before any inference framework runs. Once reconstructed, it's a standard transformers model that drops into any inference stack you already use.

vLLM

from vllm import LLM
from ultracompress import unpack

# Reconstruct the model from the compressed pack:
model_dir = unpack("./qwen3-8b", out_dir="./qwen3-8b-reconstructed")

# Then load with vLLM as normal:
llm = LLM(model=model_dir, dtype="bfloat16")
outputs = llm.generate(["Hello, lossless world"])

TensorRT-LLM

# Reconstruct first, then build TRT-LLM engine as normal:
uc unpack ./qwen3-8b ./qwen3-8b-reconstructed
trtllm-build --checkpoint_dir ./qwen3-8b-reconstructed --output_dir ./qwen3-8b-engine

sglang

uc unpack ./qwen3-8b ./qwen3-8b-reconstructed
python -m sglang.launch_server --model-path ./qwen3-8b-reconstructed

Plain transformers

from ultracompress import load_compressed
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./qwen3-8b")
model = load_compressed("./qwen3-8b")  # reconstructs in <10 sec for 7B-class

inputs = tokenizer("Hello, lossless world", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))

Use the managed API (private beta)

The OpenAI-compatible inference API at api.sipsalabs.com/v1 is currently in private beta. Email founder@sipsalabs.com for access (24-hour turnaround) or use the structured intake at /get-access. Once you have a key:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.sipsalabs.com/v1",
    api_key=os.environ["SIPSA_API_KEY"],
)

response = client.chat.completions.create(
    model="hermes-3-405b",
    messages=[{"role": "user", "content": "hello, lossless world"}],
)
print(response.choices[0].message.content)

Streaming, function calling, and everything else the openai SDK supports work the same way.

CLI reference

Command                        Purpose
uc verify <dir>                Verify SHA-256 manifest reconstruction (CPU only)
uc bench <dir>                 Measure TTFT / tokens-per-sec / VRAM (GPU required)
uc unpack <dir> --out <out>    Reconstruct compressed pack into standard transformers checkpoint
uc verify-org SipsaLabs        Verify all artifacts in a HuggingFace org match their SHA-256 manifests
uc status                      Print local cache + version info
uc --version                   Print package version

FAQ

What's the difference between Sipsa and AWQ / GPTQ / EXL3?

Tightest PPL drift in the 4–5 bpw band, plus SHA-256 bit-identical reconstruction (the cryptographic guarantee that the customer-loaded model produces output identical to the trainer's reference). AWQ / GPTQ / EXL3 leave reproducibility ambiguous — you cannot bit-compare two runs. Sipsa proves bit-identical via per-Linear hash manifest. For SOC 2 / SR-11-7 / FDA / DoD audit, the difference is qualitative.

Does this require special hardware?

No. The reconstructed model runs on the standard PyTorch path: CUDA GPU for inference (consumer or datacenter), CPU only for uc verify. Tested on RTX 5090 / A100 / H100 and consumer Apple Silicon (PyTorch MPS backend).

What's the inference speed overhead?

Reconstruction happens at load time (~5–10 sec for 70B-class). Subsequent inference uses the standard PyTorch / vLLM / TensorRT-LLM path on the reconstructed model, with no per-token overhead. The win is GPU memory: a 3–4× lower footprint means more concurrent context per GPU-hour.
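
The footprint figure falls out of the bit widths alone. A back-of-the-envelope calculation for an 8B model, counting weights only (KV cache and activations excluded):

```python
def weight_footprint_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB for a given bit width."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

bf16 = weight_footprint_gb(8, 16)   # 16.0 GB for an 8B model in bf16
uc5  = weight_footprint_gb(8, 5)    # 5.0 GB at 5 bits per weight
print(bf16, uc5, round(bf16 / uc5, 1))  # → 16.0 5.0 3.2
```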

What's the license?

BUSL-1.1 with Additional Use Grant: free for personal use, research, and any company under $1M ARR. Auto-converts to Apache 2.0 four years after each release. Commercial license at scale via founder@sipsalabs.com.

What about 100B+ models?

Hermes-3-Llama-3.1-405B is the largest fully-verified artifact today. Mixtral-8x22B (141B MoE), Qwen3-235B-A22B, and Phi-3.5-MoE are all verified. The trillion-class roadmap targets DeepSeek-V3 685B once disk + MLA arch support land in the trainer (Q3 2026).

How do I cite Sipsa in a paper?

@misc{sipsa2026,
  title  = {UltraCompress: SHA-256 verifiable lossless 5-bit transformer compression},
  author = {Ounnar, Sip},
  year   = {2026},
  note   = {Sipsa Labs, Inc. \url{https://sipsalabs.com}}
}
