Docs v0.6.3

UltraCompress documentation

Customer-side reproducible 5-bit lossless transformer compression with SHA-256 verifiable bit-identical reconstruction. 22 architectures verified end-to-end. Drop into any PyTorch / vLLM / TensorRT-LLM / sglang inference stack.

Install

pip install ultracompress
# or with uv:
uv pip install ultracompress

Requires Python 3.10+, PyTorch 2.5+, CUDA 12.x for GPU paths. The verifier (uc verify) is CPU-only and works on any machine.

Pull a compressed model

Every Sipsa-compressed model is hosted on the public HuggingFace org SipsaLabs. Use either the hf CLI or any HuggingFace download path:

# Option A — HF CLI (recommended)
hf download SipsaLabs/qwen3-8b-uc-v3-bpw5 --local-dir ./qwen3-8b

# Option B — Python
from huggingface_hub import snapshot_download
snapshot_download("SipsaLabs/qwen3-8b-uc-v3-bpw5", local_dir="./qwen3-8b")

Available models

22 architectures, 0.6B–405B params, ~40 model artifacts total. Browse the full list at huggingface.co/SipsaLabs or see the canonical PPL matrix at /inference. Highlights:

Model                     Params      HF repo                                        PPL ratio
Hermes-3-Llama-3.1-405B   405B        SipsaLabs/hermes-3-llama-3.1-405b-uc-v3-bpw5   1.0066×
Mixtral-8x7B              47B (MoE)   SipsaLabs/mixtral-8x7b-v0.1-uc-v3-bpw5         1.00368×
Qwen3-14B                 14B         SipsaLabs/qwen3-14b-uc-v3-bpw5                 1.00403×
Qwen3-8B                  8B          SipsaLabs/qwen3-8b-uc-v3-bpw5                  1.00440×
Mistral-7B-v0.3           7B          SipsaLabs/mistral-7b-v0.3-uc-v3-bpw5           1.00548×
Phi-3-mini-4k-instruct    3.8B        SipsaLabs/phi-3-mini-4k-instruct-uc-v3-bpw5    1.00624×

Verify SHA-256 reconstruction

Every Sipsa-compressed artifact ships with a per-Linear SHA-256 manifest. Verify locally with zero trust in the vendor:

uc verify ./qwen3-8b
# → SHA-256 manifest verified across N Linear layers
# → PASS (or per-layer hash diffs on FAIL)

Runs on CPU only. No GPU required. This is the cryptographic primitive that AWQ / GPTQ / EXL3 cannot offer (they leave reproducibility ambiguous).
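
Conceptually, the check is a hash comparison per layer. A minimal sketch of per-layer verification; the manifest layout and layer name here are illustrative, not the actual uc manifest format:

```python
import hashlib

def verify_manifest(weights: dict, manifest: dict) -> list:
    """Compare SHA-256 of each reconstructed weight against the manifest.

    `weights` maps layer names to raw tensor bytes; `manifest` maps the
    same names to expected hex digests. Returns the names that mismatch.
    """
    mismatches = []
    for name, expected in manifest.items():
        digest = hashlib.sha256(weights[name]).hexdigest()
        if digest != expected:
            mismatches.append(name)
    return mismatches

# Toy example: one "layer" whose bytes hash to the recorded digest.
blob = b"\x00\x01\x02\x03"
manifest = {"model.layers.0.q_proj": hashlib.sha256(blob).hexdigest()}
print(verify_manifest({"model.layers.0.q_proj": blob}, manifest))  # → []
```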

Benchmark TTFT / tokens-per-sec / VRAM

uc bench ./qwen3-8b
# → TTFT: 142 ms
# → tokens/sec: 87.3
# → peak VRAM: 6.2 GB

Measures time-to-first-token, tokens-per-second, and peak GPU memory. Useful for capacity planning and side-by-side comparison with bf16 / AWQ / GPTQ baselines.
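
The two latency numbers are easy to reproduce by hand around any token stream. A rough sketch of the measurement, where fake_stream stands in for a real model's token iterator:

```python
import time

def measure(stream):
    """Time-to-first-token and average tokens/sec for any token iterator."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        count += 1
        if first is None:
            first = time.perf_counter() - start  # TTFT: first token latency
    total = time.perf_counter() - start
    return {"ttft_s": first, "tokens_per_s": count / total}

# Simulated stream: 50 tokens at ~1 ms each stands in for model output.
def fake_stream():
    for _ in range(50):
        time.sleep(0.001)
        yield "tok"

print(measure(fake_stream()))
```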

Integrate with your inference stack

UltraCompress reconstructs the original PyTorch model bit-identically before any inference framework runs. Once reconstructed, it's a standard transformers model that drops into any inference stack you already use.

vLLM

from vllm import LLM
from ultracompress import unpack

# Reconstruct the model from the compressed pack:
model_dir = unpack("./qwen3-8b", out_dir="./qwen3-8b-reconstructed")

# Then load with vLLM as normal:
llm = LLM(model=model_dir, dtype="bfloat16")
outputs = llm.generate(["Hello, lossless world"])

TensorRT-LLM

# Reconstruct first, then build TRT-LLM engine as normal:
uc unpack ./qwen3-8b ./qwen3-8b-reconstructed
trtllm-build --checkpoint_dir ./qwen3-8b-reconstructed --output_dir ./qwen3-8b-engine

sglang

uc unpack ./qwen3-8b ./qwen3-8b-reconstructed
python -m sglang.launch_server --model-path ./qwen3-8b-reconstructed

Plain transformers

from ultracompress import load_compressed
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./qwen3-8b")
model = load_compressed("./qwen3-8b")  # reconstructs in <10 sec for 7B-class

inputs = tokenizer("Hello, lossless world", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))

Use the managed API (private beta)

The OpenAI-compatible inference API at api.sipsalabs.com/v1 is currently in private beta. Email founder@sipsalabs.com for access (24-hour turnaround) or use the structured intake at /get-access. Once you have a key:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.sipsalabs.com/v1",
    api_key=os.environ["SIPSA_API_KEY"],
)

response = client.chat.completions.create(
    model="hermes-3-405b",
    messages=[{"role": "user", "content": "hello, lossless world"}],
)
print(response.choices[0].message.content)

Streaming, function calling, and everything else the openai SDK supports work the same way.

CLI reference

Command                        Purpose
uc verify <dir>                Verify SHA-256 manifest reconstruction (CPU only)
uc bench <dir>                 Measure TTFT / tokens-per-sec / VRAM (GPU required)
uc unpack <dir> --out <out>    Reconstruct compressed pack into standard transformers checkpoint
uc verify-org SipsaLabs        Verify all artifacts in a HuggingFace org match their SHA-256 manifests
uc status                      Print local cache + version info
uc --version                   Print package version

FAQ

What's the difference between Sipsa and AWQ / GPTQ / EXL3?

Tightest PPL drift in the 4–5 bpw band, plus SHA-256 bit-identical reconstruction (the cryptographic guarantee that the customer-loaded model produces output identical to the trainer's reference). AWQ / GPTQ / EXL3 leave reproducibility ambiguous — you cannot bit-compare two runs. Sipsa proves bit-identical via per-Linear hash manifest. For SOC 2 / SR-11-7 / FDA / DoD audit, the difference is qualitative.

Does this require special hardware?

No. The reconstructed model runs on the standard PyTorch path: CUDA GPU for inference (consumer or datacenter), CPU only for uc verify. Tested on RTX 5090 / A100 / H100 and consumer Apple Silicon (PyTorch MPS backend).

What's the inference speed overhead?

Reconstruction happens at load time (~5–10 sec for 70B-class). Subsequent inference uses the standard PyTorch / vLLM / TensorRT-LLM path on the reconstructed model, with no per-token overhead. The win is GPU memory: a 3–4× lower footprint means more concurrent context per GPU-hour.
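
The footprint figure falls out of the bit widths alone. A back-of-the-envelope calculation for an 8B model, counting weights only (KV cache and activations excluded):

```python
def weight_footprint_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB for a given bit width."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

bf16 = weight_footprint_gb(8, 16)   # 16.0 GB for an 8B model in bf16
uc5  = weight_footprint_gb(8, 5)    # 5.0 GB at 5 bits per weight
print(bf16, uc5, round(bf16 / uc5, 1))  # → 16.0 5.0 3.2
```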

What's the license?

BUSL-1.1 with Additional Use Grant: free for personal use, research, and any company under $1M ARR. Auto-converts to Apache 2.0 four years after each release. Commercial license at scale via founder@sipsalabs.com.

What about 100B+ models?

Hermes-3-Llama-3.1-405B is the largest fully-verified artifact today. Mixtral-8x22B (141B MoE), Qwen3-235B-A22B, and Phi-3.5-MoE are all verified. The trillion-class roadmap targets DeepSeek-V3 685B once disk + MLA arch support land in the trainer (Q3 2026).

How do I cite Sipsa in a paper?

@misc{sipsa2026,
  title  = {UltraCompress: SHA-256 verifiable lossless 5-bit transformer compression},
  author = {Ounnar, Sip},
  year   = {2026},
  note   = {Sipsa Labs, Inc. \url{https://sipsalabs.com}}
}
