Live on PyPI · v0.6.9 · uc bench-ppl quality verifier LIVE · $5 free credit, no card
SIPSA LABS / ULTRACOMPRESS
Run a 405B model on a single 32 GB GPU.
UltraCompress is lossless 5-bit transformer compression with an OpenAI-API-compatible inference layer. 5× smaller weights, same task quality. SHA-256-verifiable bit-identical reconstruction — the model your audit reviewed is the model you ship. Same OpenAI SDK, just change the base URL. $5 free credit to start, no card.
# Public substrate (no API key required, full inference today):
pip install ultracompress
hf download SipsaLabs/qwen3-8b-uc-v3-bpw5 --local-dir ./qwen3-8b
uc bench ./qwen3-8b
# Managed API (self-serve at sipsalabs.com/pricing — Pro $99/mo, Team $499/mo):
curl https://api.sipsalabs.com/v1/models
# or set OPENAI_BASE_URL=https://api.sipsalabs.com/v1 with the openai SDK
/ Who this is for
Three ways in. Pick your lane.
Same 5-bit substrate, three delivery modes — from a free pip install on your laptop to a managed OpenAI-compatible endpoint to a deployment inside your security boundary.
For developers
I want to run big models on my laptop.
"I have a 5090 / 4090 / Mac with 32 GB. Give me the weights."
Free public substrate. pip install ultracompress, hf download the artifact, run uc bench. No signup. No API key. The same SHA-256 manifest your security team would audit, on your hardware, today.
For companies
I need OpenAI-compatible inference at lower cost.
"Swap the base URL. Cut the bill. Don't rewrite the app."
Managed API. Same OpenAI SDK, just point OPENAI_BASE_URL at api.sipsalabs.com/v1. $5 free credit, no card. Then Pro $99/mo or Team $499/mo — self-serve, instant.
For enterprise
I need on-prem, air-gapped, or FedRAMP-ready.
"It has to run inside our boundary, under our audit log."
Bit-identical reconstruction inside your VPC, bare-metal cluster, or air-gapped enclave. SOC 2 / SR-11-7 / FDA / DoD-ready architecture. Direct line to founder — 24-hour reply.
/ Verified records
The numbers, with receipts.
Every record below has a public Hugging Face artifact and an SHA-256 manifest you can re-verify on your hardware. Perplexity ratio is measured against the bf16 baseline at seq_len=1024, FineWeb-edu held-out tail. Run uc verify and confirm the contract holds — no "trust me."
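For example, the Qwen3-8B record re-verifies end-to-end with the published artifact and the documented commands:
hf download SipsaLabs/qwen3-8b-uc-v3-bpw5 --local-dir ./qwen3-8b
uc verify ./qwen3-8b   # byte-for-byte reconstruction, SHA-256 check against the signed manifest
uc bench ./qwen3-8b    # reproduces the perplexity measurement on your GPU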
Hermes-3-Llama-3.1-405B
1.0066×
405B params · runs on a single 32 GB GPU · +0.66% perplexity vs bf16 baseline
Mixtral-8x7B-v0.1 (MoE)
1.00368×
47B params (13B active) · +0.368% perplexity · mixture-of-experts
Qwen3-14B
1.00403×
14.0B params · +0.403% perplexity · scale-invariant codec
Qwen3-8B
1.00440×
8.0B params · +0.440% perplexity · 8B class record
Qwen3-1.7B-Base
1.00401×
1.7B params · +0.401% perplexity · tightest small-decoder record
Mistral-7B-v0.3
1.00548×
7.0B params · +0.548% perplexity · hardest architecture cracked to date
/ Why customers care
What you actually get.
Four things make this different from every other "model compression" pitch you've seen. Each one is verifiable on your hardware before you sign or pay anything.
Same OpenAI SDK. No rewrite.
Set OPENAI_BASE_URL=https://api.sipsalabs.com/v1 and your existing inference code keeps working. Chat, completions, embeddings — same surface area, same response shape. Drop-in replacement for the OpenAI client in Python, Node, Go, Rust.
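A minimal sketch of the swap, assuming your app uses the standard openai SDK (which reads its endpoint and key from the environment); the script name is illustrative:
export OPENAI_BASE_URL=https://api.sipsalabs.com/v1
export OPENAI_API_KEY=$SIPSA_API_KEY
python your_existing_app.py   # illustrative name; unchanged SDK code now hits the new endpoint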
SHA-256 reproducibility.
Every artifact ships with a signed manifest. uc verify reconstructs the weights byte-for-byte and confirms the SHA matches. The model your audit reviewed in March is the model your endpoint serves in October. SR-11-7 and FDA SaMD reviews carry through — no "compressed-variant" governance lane.
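A sketch of what the check amounts to; uc verify is the documented path, and the manifest filename in the manual spot-check is an assumption for illustration:
# Documented path: reconstruct the weights, confirm digests match the signed manifest
uc verify ./qwen3-8b
# Manual spot-check with coreutils (manifest filename assumed; use whatever ships in the artifact root)
cd ./qwen3-8b && sha256sum -c manifest.sha256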
Lossless, not lossy.
Task quality preserved, measured, published. Perplexity ratios between 1.0037× and 1.0066× against the bf16 baseline on the records above — not "looks fine to me," not eyeball-tested on three prompts. Reproduce the eval on your hardware with one command: uc bench.
5× lower memory footprint.
Fits on consumer GPUs you already own, or 5× the throughput on the GPUs you already rent. Hermes-3-405B on a single RTX 5090. Mixtral-8x7B on a 4090. The cost-per-token math changes when the weights stop spilling into a second box.
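Back-of-envelope, weights-only arithmetic for the 8B record (KV cache and activations excluded, so read it as a floor, not a full memory budget):
# Qwen3-8B weights: 8e9 params x 16 bits / 8 = 16 GB in bf16; at 5 bits per weight, 5 GB
python3 -c "p = 8e9; print(p*16/8/1e9, 'GB bf16 ->', p*5/8/1e9, 'GB at 5 bpw')"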
/ Quick start
Three paths. Pick one.
Each one runs end-to-end, today. The free path needs no signup. The managed path needs an email. The enterprise path needs a conversation.
01
Free — run it on your hardware
No signup
Install the CLI, pull a published artifact, run the verifier and benchmark. Three steps, four commands. The full substrate, free, MIT-permissive on the runtime.
# 1. Install
pip install ultracompress
# 2. Pull an artifact (Hugging Face)
hf download SipsaLabs/qwen3-8b-uc-v3-bpw5 --local-dir ./qwen3-8b
# 3. Verify SHA-256 + run benchmark on your hardware
uc verify ./qwen3-8b
uc bench ./qwen3-8b
PyPI package →
02
Managed — OpenAI-compatible endpoint
$5 free credit, no card
Sign up with email, get $5 in free inference credit, point your existing OpenAI client at our base URL. Same SDK, same code path, lower bill.
# 1. Get a key at sipsalabs.com/get-access (no card)
export SIPSA_API_KEY=sk-...
# 2. Use the standard OpenAI SDK, just change the base URL
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.sipsalabs.com/v1",
    api_key=os.environ["SIPSA_API_KEY"],
)
resp = client.chat.completions.create(
    model="qwen3-8b-uc-v3-bpw5",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
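The same call without an SDK; the /v1/chat/completions path and Bearer-token header are assumptions that follow from the OpenAI compatibility the endpoint advertises:
curl https://api.sipsalabs.com/v1/chat/completions \
  -H "Authorization: Bearer $SIPSA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-8b-uc-v3-bpw5", "messages": [{"role": "user", "content": "Hello"}]}'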
Claim $5 free credit →
03
Enterprise — on-prem, air-gapped, custom
Direct from founder
Deploy inside your security boundary. Bring your fine-tune for compression. SOC 2 / SR-11-7 / FDA / DoD-ready architecture. One email, the founder reads it — 24-hour reply.
# Email founder@sipsalabs.com — include:
# Use case
# Scale (GPU count, expected token throughput, models)
# Security boundary (on-prem, VPC, air-gapped)
# Timeline
See deployment paths →
/ Pricing
Self-serve. Instant.
No sales call to start. No card to try. Pick a tier, point your SDK, ship. Upgrade when your usage outgrows the free credit.
Free
$0/mo
For evaluation and small workloads. Real inference, real models, real SDK.
- $5 inference credit, no card
- OpenAI-compatible API access
- All public 5-bit artifacts
- Public verifier & benchmark CLI
Start free →
Pro
$99/mo
For solo developers and small teams shipping production inference.
- Higher rate limits
- Priority routing on shared GPUs
- All compressed architectures
- Email support
Subscribe Pro →
Team
$499/mo
For teams with steady throughput needs and shared keys.
- Team-wide rate limits
- Shared API keys & usage dashboard
- Dedicated routing on premium GPUs
- Slack-shared support channel
Subscribe Team →