Live on PyPI · v0.6.9 · uc bench-ppl quality verifier LIVE · $5 free credit, no card
SIPSA LABS / ULTRACOMPRESS
Run a 405B model on a single 32 GB GPU.
UltraCompress is lossless 5-bit transformer compression with an OpenAI-API-compatible inference layer. 5× smaller weights, same task quality. SHA-256-verifiable bit-identical reconstruction — the model your audit reviewed is the model you ship. Same OpenAI SDK, just change the base URL. $5 free credit to start, no card.
# Public substrate (no API key required, full inference today):
pip install ultracompress
hf download SipsaLabs/qwen3-8b-uc-v3-bpw5 --local-dir ./qwen3-8b
uc bench ./qwen3-8b
# Managed API (self-serve at sipsalabs.com/pricing — Pro $99/mo, Team $499/mo):
curl https://api.sipsalabs.com/v1/models
# or set OPENAI_BASE_URL=https://api.sipsalabs.com/v1 with the openai SDK
/ Who this is for
Three ways in. Pick your lane.
Same 5-bit substrate, three delivery modes — from a free pip install on your laptop to a managed OpenAI-compatible endpoint to a deployment inside your security boundary.
For developers
I want to run big models on my laptop.
"I have a 5090 / 4090 / Mac with 32 GB. Give me the weights."
Free public substrate. pip install ultracompress, hf download the artifact, run uc bench. No signup. No API key. The same SHA-256 manifest your security team would audit, on your hardware, today.
For companies
I need OpenAI-compatible inference at lower cost.
"Swap the base URL. Cut the bill. Don't rewrite the app."
Managed API. Same OpenAI SDK, just point OPENAI_BASE_URL at api.sipsalabs.com/v1. $5 free credit, no card. Then Pro $99/mo or Team $499/mo — self-serve, instant.
For enterprise
I need on-prem, air-gapped, or FedRAMP-ready.
"It has to run inside our boundary, under our audit log."
Bit-identical reconstruction inside your VPC, bare-metal cluster, or air-gapped enclave. SOC 2 / SR-11-7 / FDA / DoD-ready architecture. Direct line to founder — 24-hour reply.
/ Verified records
The numbers, with receipts.
Every record below has a public Hugging Face artifact and an SHA-256 manifest you can re-verify on your hardware. Perplexity ratio is measured against the bf16 baseline at seq_len=1024, FineWeb-edu held-out tail. Run uc verify and confirm the contract holds — no "trust me."
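For example, the Qwen3-8B record re-verifies end-to-end with the published artifact and the documented commands:
hf download SipsaLabs/qwen3-8b-uc-v3-bpw5 --local-dir ./qwen3-8b
uc verify ./qwen3-8b   # byte-for-byte reconstruction, SHA-256 check against the signed manifest
uc bench ./qwen3-8b    # reproduces the perplexity measurement on your GPU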
Hermes-3-Llama-3.1-405B
1.0066×
405B params · runs on a single 32 GB GPU · +0.66% perplexity vs bf16 baseline
Mixtral-8x7B-v0.1 (MoE)
1.00368×
47B params (13B active) · +0.368% perplexity · mixture-of-experts
Qwen3-14B
1.00403×
14.0B params · +0.403% perplexity · scale-invariant codec
Qwen3-8B
1.00440×
8.0B params · +0.440% perplexity · 8B class record
Qwen3-1.7B-Base
1.00401×
1.7B params · +0.401% perplexity · tightest small-decoder record
Mistral-7B-v0.3
1.00548×
7.0B params · +0.548% perplexity · hardest architecture cracked to date
/ Why customers care
What you actually get.
Four things make this different from every other "model compression" pitch you've seen. Each one is verifiable on your hardware before you sign or pay anything.
Same OpenAI SDK. No rewrite.
Set OPENAI_BASE_URL=https://api.sipsalabs.com/v1 and your existing inference code keeps working. Chat, completions, embeddings — same surface area, same response shape. Drop-in replacement for the OpenAI client in Python, Node, Go, Rust.
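A minimal sketch of the swap, assuming your app uses the standard openai SDK (which reads its endpoint and key from the environment); the script name is illustrative:
export OPENAI_BASE_URL=https://api.sipsalabs.com/v1
export OPENAI_API_KEY=$SIPSA_API_KEY
python your_existing_app.py   # illustrative name; unchanged SDK code now hits the new endpoint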
SHA-256 reproducibility.
Every artifact ships with a signed manifest. uc verify reconstructs the weights byte-for-byte and confirms the SHA matches. The model your audit reviewed in March is the model your endpoint serves in October. SR-11-7 and FDA SaMD reviews carry through — no "compressed-variant" governance lane.
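A sketch of what the check amounts to; uc verify is the documented path, and the manifest filename in the manual spot-check is an assumption for illustration:
# Documented path: reconstruct the weights, confirm digests match the signed manifest
uc verify ./qwen3-8b
# Manual spot-check with coreutils (manifest filename assumed; use whatever ships in the artifact root)
cd ./qwen3-8b && sha256sum -c manifest.sha256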
Lossless, not lossy.
Task quality preserved, measured, published. Perplexity ratios between 1.0037× and 1.0066× against the bf16 baseline on the records above — not "looks fine to me," not eyeball-tested on three prompts. Reproduce the eval on your hardware with one command: uc bench.
5× lower memory footprint.
Fits on consumer GPUs you already own, or 5× the throughput on the GPUs you already rent. Hermes-3-405B on a single RTX 5090. Mixtral-8x7B on a 4090. The cost-per-token math changes when the weights stop spilling into a second box.
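Back-of-envelope, weights-only arithmetic for the 8B record (KV cache and activations excluded, so read it as a floor, not a full memory budget):
# Qwen3-8B weights: 8e9 params x 16 bits / 8 = 16 GB in bf16; at 5 bits per weight, 5 GB
python3 -c "p = 8e9; print(p*16/8/1e9, 'GB bf16 ->', p*5/8/1e9, 'GB at 5 bpw')"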
/ Quick start
Three paths. Pick one.
Each one runs end-to-end, today. The free path needs no signup. The managed path needs an email. The enterprise path needs a conversation.
01
Free — run it on your hardware
No signup
Install the CLI, pull a published artifact, run the verifier and benchmark. Three steps, four commands. The full substrate, free, MIT-permissive on the runtime.
# 1. Install
pip install ultracompress
# 2. Pull an artifact (Hugging Face)
hf download SipsaLabs/qwen3-8b-uc-v3-bpw5 --local-dir ./qwen3-8b
# 3. Verify SHA-256 + run benchmark on your hardware
uc verify ./qwen3-8b
uc bench ./qwen3-8b
PyPI package →
02
Managed — OpenAI-compatible endpoint
$5 free credit, no card
Sign up with email, get $5 in free inference credit, point your existing OpenAI client at our base URL. Same SDK, same code path, lower bill.
# 1. Get a key at sipsalabs.com/get-access (no card)
export SIPSA_API_KEY=sk-...
# 2. Use the standard OpenAI SDK, just change the base URL
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.sipsalabs.com/v1",
    api_key=os.environ["SIPSA_API_KEY"],
)
resp = client.chat.completions.create(
    model="qwen3-8b-uc-v3-bpw5",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
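The same call without an SDK; the /v1/chat/completions path and Bearer-token header are assumptions that follow from the OpenAI compatibility the endpoint advertises:
curl https://api.sipsalabs.com/v1/chat/completions \
  -H "Authorization: Bearer $SIPSA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-8b-uc-v3-bpw5", "messages": [{"role": "user", "content": "Hello"}]}'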
Claim $5 free credit →
03
Enterprise — on-prem, air-gapped, custom
Direct from founder
Deploy inside your security boundary. Bring your fine-tune for compression. SOC 2 / SR-11-7 / FDA / DoD-ready architecture. One email, the founder reads it — 24-hour reply.
# Email founder@sipsalabs.com — include:
# Use case
# Scale (GPU count, expected token throughput, models)
# Security boundary (on-prem, VPC, air-gapped)
# Timeline
See deployment paths →
/ Pricing
Self-serve. Instant.
No sales call to start. No card to try. Pick a tier, point your SDK, ship. Upgrade when your usage outgrows the free credit.
Free
$0/mo
For evaluation and small workloads. Real inference, real models, real SDK.
- $5 inference credit, no card
- OpenAI-compatible API access
- All public 5-bit artifacts
- Public verifier & benchmark CLI
Start free →
Pro
$99/mo
For solo developers and small teams shipping production inference.
- Higher rate limits
- Priority routing on shared GPUs
- All compressed architectures
- Email support
Subscribe Pro →
Team
$499/mo
For teams with steady throughput needs and shared keys.
- Team-wide rate limits
- Shared API keys & usage dashboard
- Dedicated routing on premium GPUs
- Slack-shared support channel
Subscribe Team →