Sipsa Labs · Pricing

Pay for what you use. Scale on your terms.

Free credit to start, no card. Pro from $20/mo, Max from $100/mo, Team from $25/seat/mo. Enterprise contracts for regulated, dedicated, or on-prem deployments. Same OpenAI-compatible SDK on every tier.

1.0066× Hermes-3-405B PPL ratio on a single 32 GB consumer GPU
23 architectures verified · 22 PPL + 1 ViT cosine (public registry)
SHA-256 reproducible reconstruction, verifiable in seconds with uc verify
30:23 honest-negatives:wins ratio (public failure ledger)
Archive notice — this page is preserved for historical reference. The Sipsa Inference service was discontinued in June 2026 and nothing here can be purchased.
/ Self-serve plans

Four tiers. Pick the one that fits.

Every tier ships the same OpenAI-compatible API and the same SHA-256-verifiable substrate. What changes: monthly throughput, model catalog access, and seat count. Cancel any time.

Free

$0
$5 startup credit

For testing the API before you commit. No card.

  • 100 reqs/day rate limit
  • 8 free models (Qwen3-0.6B/1.7B/1.7B-Base, TinyLlama-1.1B, SmolLM2-1.7B, OLMo-2-1B, Mamba-2.8B, Phi-3-mini-4k)
  • $5 of usage credit on signup
  • OpenAI-compatible REST API
  • Real-time usage dashboard
  • Community support (GitHub issues)
  • No credit card required

Max

$100 / month
5× Pro quota

For production apps with real customer traffic.

  • 5× Pro monthly quota
  • Full catalog except 405B-flagship
  • Highest priority queue
  • Email support, 1-business-day reply
  • 99.5% SLA on inference availability

Team

$25 / seat / month
5 seats minimum

For engineering teams with central billing and admin.

  • Max-5× quota per seat
  • Full catalog including 405B-flagship
  • 5–150 seats, central billing
  • SSO + admin controls
  • Audit logs (per-key, per-request)
  • 99.5% SLA on inference availability
/ Sales-led

Beyond self-serve. Enterprise contracts.

For dedicated capacity, SOC 2 + audit support, on-prem appliances, or FDA SaMD / SR 11-7 / DoD ATO compliance work. Contracts are value-based and structured around the compute savings the deployment unlocks.

Enterprise

Custom
Sales-led · dedicated capacity · on-prem

For organizations that need dedicated capacity, compliance support, air-gapped deployments, or Compression-as-a-Service for custom fine-tunes.

Typical pricing: 30–40% of your monthly compute savings, paid as a flat monthly fee with a dedicated SLA, plus per-deployment-attestation pricing for regulated production.
  • Everything in Max 20× / Team
  • Dedicated capacity (reserved GPU slice)
  • Full SLA with contractual remedies
  • On-prem appliance option (air-gapped capable)
  • FDA SaMD / SR 11-7 / DoD ATO support
  • SOC 2 questionnaire support
  • 24/7 support + direct line to founder
  • Compression-as-a-Service for your fine-tunes
  • MSA negotiable · NDAs accepted
Need a one-time evaluation? The Phase 0 POC at sipsalabs.com/poc is a fixed-fee $5K engagement: your model, your hardware, five business days, one signed reconstruction audit report. Use it to evaluate UltraCompress before you commit to Enterprise.
/ One-time top-ups

Hit a spike? Top up without upgrading.

Top-ups apply to any plan. Credits roll over for 12 months. Use top-ups for traffic spikes instead of upgrading the recurring tier mid-month.

$25 · Cover a spike
$100 · Pro overage · first-purchase from Free
$500 · Production spike · bulk credit
/ Compare tiers

Side by side. No surprises.

Every line item below is what you actually get when you sign up, not a rounded-up marketing version.

Feature Free Pro Max 5× Max 20× Team Enterprise
Monthly price $0 $20 $100 $200 $25/seat Custom
Annual price $204/yr $1,000/yr $2,000/yr $240/seat/yr Custom
Free $5 credit on signup n/a
Monthly quota (vs Pro baseline) 100/day 20× 5× per seat contract
8 free models
Full catalog (all served models) ✓ (except 405B) ✓ (except 405B)
405B-flagship (Hermes-3-405B)
Priority queue highest highest highest dedicated
Seats 1 1 1 1 5–150 custom
SSO + admin
Audit logs (per-request)
SLA on inference availability best-effort best-effort 99.5% 99.5% 99.5% contract
Support channel community email 1bd email 1bd email 1bd email 1bd 24/7 + founder
SOC 2 questionnaire support
On-prem / air-gapped deploy
FDA SaMD / SR 11-7 / DoD ATO
/ What you actually get

In plain English. One paragraph per tier.

Cards and tables are useful. So is a normal paragraph. Here's what each plan looks like the day after you sign up.

Free · $5 credit

You sign up at /get-access with an email, no card. We hand you an API key in three seconds and credit your account with $5. You can call any of the 8 free models (Qwen3-0.6B, Qwen3-1.7B, Qwen3-1.7B-Base, TinyLlama-1.1B, SmolLM2-1.7B, OLMo-2-1B, Mamba-2.8B, Phi-3-mini-4k), up to 100 requests per day. Larger models like Qwen3-8B, Mistral-7B-v0.3 and Yi-1.5-9B are available on request or on paid tiers; run uc catalog for the live free/request/POC tier of every model. The dashboard shows your credit burn-down in real time. When the $5 runs out, calls return HTTP 402 instead of a Stripe surprise. Use this to validate the SDK swap and benchmark a model against your real workload.

Pro · $20 / mo

The main individual tier. You get a generous monthly request quota, access to the full served catalog except the 405B-flagship class, and priority over Free traffic. One-business-day email support. Top up with $25 / $100 / $500 if you blow through your quota. Cancel any time — or save 15% with the annual plan ($204/yr, $17/mo effective).

Max 5× · $100 / mo

Five times the Pro quota. Everything in Pro plus highest-priority queue and a 99.5% SLA on inference availability. For production apps that have graduated from prototyping and need predictable throughput. Annual: $1,000/yr ($83/mo effective).

Max 20× · $200 / mo

Twenty times the Pro quota, plus 405B-flagship access (Hermes-3-405B, 1.0066× PPL) and per-request audit logs. The top individual tier. Annual: $2,000/yr ($167/mo effective).

Team · $25 / seat / mo

Everything in Max 5×, plus 405B-flagship access, multi-seat central billing, SSO, admin controls, and per-request audit logs. 5–150 seats. Pick this when you have a real product, real customers, and a CTO who wants traffic to not be one of the unknowns. Annual: $240/seat/yr ($20/seat/mo effective).

Enterprise · custom

Dedicated GPU capacity, full SLA with contractual remedies, on-prem appliance option (air-gapped capable for ITAR / classified workloads), FDA SaMD / SR 11-7 / DoD ATO compliance support, dedicated security review documentation, 24/7 support with a direct line to the founder, Compression-as-a-Service for your fine-tunes, and per-deployment attestation pricing for regulated production. Talk to us through the Phase 0 scope form.

/ Estimate your bill

Per-model token pricing. Do the math yourself.

Self-serve credit is metered per million tokens, per model. Input tokens are what you send. Output tokens are what the model returns. Same OpenAI-style schema as the SDK you already use.

Model Input (per 1M tok) Output (per 1M tok) vs incumbent
Qwen3-0.6B $0.05 $0.10 −67%
Qwen3-8B $0.15 $0.60 parity
Mistral-7B-v0.3 $0.15 $0.30 −25%
Qwen3-14B $0.20 $0.80 parity
Mixtral-8x7B $0.22 $0.70 −8%
Hermes-3-405B (1.0066× PPL vs bf16 · Max 20× / Team / Enterprise) $2.50 $2.50 −44% vs Together

Quick math: 1,000 calls to Qwen3-8B with 500 input + 200 output tokens each = 500K input + 200K output = $0.075 + $0.12 = $0.195, or about $0.20 total. Your $5 free credit covers ~25K calls of that shape. Pro's $20/mo covers typical individual workloads. Max 5× at $100/mo covers a production app. Hermes-3-405B is the largest verified architecture; available on Max 20×, Team, and Enterprise.
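The quick math generalizes to any model and request shape. A minimal sketch, assuming the per-million-token rates in the table above; the `RATES` dictionary and the `estimate_cost` helper are illustrative names, not part of any Sipsa SDK:

```python
# Per-million-token rates in USD (input, output), copied from the pricing table.
RATES = {
    "Qwen3-0.6B": (0.05, 0.10),
    "Qwen3-8B": (0.15, 0.60),
    "Mistral-7B-v0.3": (0.15, 0.30),
    "Qwen3-14B": (0.20, 0.80),
    "Mixtral-8x7B": (0.22, 0.70),
    "Hermes-3-405B": (2.50, 2.50),
}


def estimate_cost(model, calls, input_tokens, output_tokens):
    """Metered cost in USD for `calls` requests that all share one shape."""
    in_rate, out_rate = RATES[model]
    input_cost = calls * input_tokens / 1_000_000 * in_rate
    output_cost = calls * output_tokens / 1_000_000 * out_rate
    return input_cost + output_cost


# The worked example from the text: 1,000 calls, 500 input + 200 output tokens.
cost = estimate_cost("Qwen3-8B", calls=1_000, input_tokens=500, output_tokens=200)

# How many calls of that shape a $5 credit would cover at these rates.
calls_per_credit = int(5.00 / (cost / 1_000))

print(f"${cost:.3f} per 1,000 calls; $5 covers about {calls_per_credit:,} calls")
```

Swap in your own model name and token counts to size a workload before picking a tier.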

# Drop-in OpenAI SDK swap. Same code, new base_url.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.sipsalabs.com/v1",
    api_key="sk-...",  # your Sipsa key
)

resp = client.chat.completions.create(
    model="sipsa-qwen3-0.6b",
    messages=[{"role": "user", "content": "hello, compressed world"}],
)
print(resp.choices[0].message.content)
/ How does this compare?

Mapped to what you’re already paying. Pro tier vs the alternatives.

Most teams are already paying somebody for inference — OpenAI, a cloud GPU rental, or their own AWS bill. Here’s what the same monthly traffic costs on Sipsa versus the three most common alternatives.

vs OpenAI gpt-4o-mini

1M input tokens (OpenAI): $0.15
1M input tokens (Sipsa / Qwen3-8B): $0.15
Reproducibly reconstructible weights: Yes
Same on OpenAI? No (closed)

Token cost is at parity. Sipsa adds a verifiable deployment-integrity contract: the served weights are provably the validated artifact. If you need FDA, SR 11-7, or DoD audit of the served weights, OpenAI is not viable. Sipsa is.

vs self-hosting on Lambda H100

H100 spot, 24×7 (~$2/hr): ~$1,440/mo
vLLM on Qwen3-8B @ ~10 RPS: ~equivalent throughput
Sipsa Max 5× base: $100/mo
+ included 5× Pro quota: no top-up needed

Self-hosting wins above ~$1K/mo equivalent traffic. Sipsa wins below it — you skip GPU procurement, vLLM ops, and the on-call rotation. Same OpenAI-compatible SDK either way.

vs vLLM on AWS p5.48xlarge

p5 spot, 24×7 (~$30/hr): ~$21,600/mo
Hermes-3-405B serving: moderate throughput
Sipsa Enterprise (405B included): contract
Typical pricing: 30–40% of compute savings

Want 405B-class quality? Sipsa Enterprise pricing scales with the savings it unlocks: typically 30–40% of the compute savings measured against running Hermes-3-405B yourself on AWS at the same throughput cap.
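The rental figures in the two self-hosting comparisons come from a plain hourly-to-monthly conversion. A minimal sketch, assuming a 30-day month running 24×7 (which is what the ~$1,440 and ~$21,600 figures imply); the function name is illustrative:

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, instance never stops


def monthly_rental_cost(hourly_rate_usd: float) -> float:
    """Cost in USD of keeping one rented instance up for a full month."""
    return hourly_rate_usd * HOURS_PER_MONTH


h100 = monthly_rental_cost(2.0)   # Lambda H100 spot at ~$2/hr
p5 = monthly_rental_cost(30.0)    # AWS p5.48xlarge spot at ~$30/hr

print(f"H100: ${h100:,.0f}/mo, p5.48xlarge: ${p5:,.0f}/mo")
```

Spot prices move, so treat the hourly rates as inputs to re-check rather than constants.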

/ Estimate your bill

Plug in your traffic. See what you’d actually pay.

Move the sliders to match your real workload. We compute the monthly bill on Sipsa, what the same traffic would cost at OpenAI rates, and what self-hosting on a rented H100 would run. Pure JavaScript — nothing leaves your browser.

Slider defaults: 1,000 requests/day · 500 tokens (split 2:1 in:out)

Your monthly bill, output fields: Monthly volume · Sipsa Pro ($20 base) · OpenAI equivalent · Self-host (Lambda H100, 24×7) · You save vs self-host

All numbers are estimates. Sipsa Pro is $20/mo base; Max 5× is $100/mo (5× quota); Max 20× is $200/mo (20× quota + 405B); Team is $25/seat/mo (5 seat min). Overage is metered at the per-model rates shown above. Top up with $25 / $100 / $500.
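The calculator's core arithmetic can be approximated offline. A rough sketch using the slider defaults and the Qwen3-8B rates from the token pricing table. It assumes that the 500 tokens are per request, that a month has 30 days, and that the 2:1 split means two thirds input and one third output. It computes metered usage only and does not model how a plan's base fee and included quota interact, which the page does not spell out. `monthly_estimate` is a hypothetical name:

```python
def monthly_estimate(requests_per_day, tokens_per_request, in_rate, out_rate, days=30):
    """Return (total monthly tokens, metered cost in USD).

    in_rate and out_rate are USD per 1M tokens. Tokens are split 2:1
    between input and output, matching the calculator's default.
    """
    total_tokens = requests_per_day * days * tokens_per_request
    input_tokens = total_tokens * 2 / 3
    output_tokens = total_tokens / 3
    metered = input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate
    return total_tokens, metered


# Slider defaults: 1,000 requests/day, 500 tokens each, priced at Qwen3-8B rates.
volume, usage_cost = monthly_estimate(1_000, 500, in_rate=0.15, out_rate=0.60)

print(f"{volume:,} tokens/month, ${usage_cost:.2f} metered usage")
```

Compare the result against a flat self-hosting figure, such as the ~$1,440/mo H100 rental above, to see where the crossover sits for your traffic.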

/ FAQ

Honest answers. The questions every buyer asks.

If your question isn't here, email founder@sipsalabs.com — the founder reads it.

What does “near-lossless” actually mean?
Reproducible, cryptographically verifiable reconstruction of the compressed weights, backed by a SHA-256 manifest shipped with every artifact. The compressed pack reproduces the same numerical values recorded at pack time, on every load and on any hardware. The 5-bit compression itself is lossy with respect to the original bf16 weights; what is deterministic is the reconstruction. End-to-end inference behavior matches the bf16 baseline up to fp16 reduction-order on the matmul itself. Across 22 PPL-verified architectures (17 dense + 4 MoE + 1 SSM) plus 1 ViT cosine-verified, the perplexity ratio falls between 1.0013× and 1.0125× vs the original weights. See the verified matrix →
Is this OpenAI-API-compatible?
Yes. Set OPENAI_BASE_URL=https://api.sipsalabs.com/v1 and your existing openai SDK code keeps working. Same chat.completions schema, same message format, same streaming, same error envelope. We're a drop-in for self-hosted inference — not a replacement for ChatGPT's product surface.
How does the annual discount work?
Annual saves 15–20% off the monthly rate. Pro $204/yr ($17/mo effective), Max 5× $1,000/yr ($83/mo effective), Max 20× $2,000/yr ($167/mo effective), Team $240/seat/yr ($20/seat/mo effective). Toggle "Annual" at the top of the page to see the effective monthly rate. Annual customers can still purchase top-ups during the term.
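The effective monthly rates and the 15–20% range can be checked directly from the listed prices. A small sketch; the `PLANS` table and `annual_summary` helper are illustrative names:

```python
# plan: (monthly price, annual price) in USD, as listed on this page
PLANS = {
    "Pro": (20, 204),
    "Max 5×": (100, 1000),
    "Max 20×": (200, 2000),
    "Team (per seat)": (25, 240),
}


def annual_summary(monthly, annual):
    """Return (effective monthly rate, discount vs paying monthly for 12 months)."""
    effective = annual / 12
    discount = 1 - annual / (monthly * 12)
    return effective, discount


for name, (monthly, annual) in PLANS.items():
    effective, discount = annual_summary(monthly, annual)
    print(f"{name}: ${effective:.2f}/mo effective, {discount:.1%} off")
```

Pro lands at the low end of the range and Team at the high end, with both Max tiers in between.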
What is the difference between Max 5× and Max 20×?
Both are individual tiers. Max 5× ($100/mo) gives you 5× the Pro quota with highest-priority queue and a 99.5% SLA. Max 20× ($200/mo) gives you 20× the Pro quota, plus 405B-flagship access (Hermes-3-405B, 1.0066× PPL vs bf16) and per-request audit logs. If you need the 405B class or audit logs, that's Max 20×.
How does Team per-seat billing work?
Team is $25/seat/month (minimum 5 seats). Each seat gets Max-5×-level quota and 405B-flagship access. Central billing, SSO, admin controls, and per-request audit logs are included. Add or remove seats at any time — billing is prorated. Annual plan: $240/seat/year ($20/seat/mo effective).
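Seat-based billing is easy to sanity-check. A minimal sketch, assuming the 5–150 seat range and the listed per-seat prices; it does not model mid-cycle proration, whose exact rules are not given here, and `team_cost` is a hypothetical helper:

```python
MONTHLY_PER_SEAT = 25   # USD per seat per month
ANNUAL_PER_SEAT = 240   # USD per seat per year
MIN_SEATS, MAX_SEATS = 5, 150


def team_cost(seats: int, annual: bool = False) -> int:
    """Total Team price in USD for one billing period (a month or a year)."""
    if not MIN_SEATS <= seats <= MAX_SEATS:
        raise ValueError(f"Team plans cover {MIN_SEATS} to {MAX_SEATS} seats")
    return seats * (ANNUAL_PER_SEAT if annual else MONTHLY_PER_SEAT)


print(team_cost(5))                 # smallest monthly Team bill
print(team_cost(12, annual=True))   # 12 seats on the annual plan
```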
Can I use my own fine-tuned model?
Yes — via the Compression-as-a-Service path on the Enterprise tier. You hand us the safetensors of your fine-tune; we compress it to the same near-lossless-quality 5-bit substrate your stock models run on; you get back a SHA-256-verified artifact that drops into the same OpenAI-compatible endpoint. Pricing is part of the Enterprise contract. The $5K Phase 0 POC is the on-ramp if you want to evaluate first.
How is this different from AWQ / GPTQ / GGUF?
Two ways. First, our reconstruction is reproducible and cryptographically verifiable — the SHA-256 manifest proves the dequantized tensor is the artifact you validated, every load, on any hardware. (Like AWQ / GPTQ / GGUF, the 5-bit compression itself is lossy with respect to the original bf16 weights; the difference is the verifiable reconstruction contract, not the loss.) Second, we publish a verifier: pip install ultracompress && uc verify reproduces the SHA-256 contract on your hardware. Your security team can audit the codec independently. Other quantizers don't ship that; they ship the artifact and ask you to trust the loss is acceptable.
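To make the idea of a SHA-256 manifest concrete, here is a generic sketch of checking files against recorded digests with Python's standard library. It is not the uc verify implementation: the manifest layout (a JSON object mapping relative file names to hex digests) and the function names are assumptions for illustration only, and it hashes file bytes rather than dequantized tensors.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path):
    """Return the names of files whose digest differs from the manifest entry.

    Assumed (hypothetical) manifest format: {"relative/file/name": "hex digest"}.
    An empty list means every listed file matched.
    """
    manifest_path = Path(manifest_path)
    expected = json.loads(manifest_path.read_text())
    root = manifest_path.parent
    return [
        name
        for name, digest in expected.items()
        if sha256_of(root / name) != digest
    ]
```

A single flipped byte in any listed file changes its digest, so the function reports that file by name.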
Do you train on my data?
No. We don't log prompt or completion content. Metered billing uses token counts only. See /privacy for the full data-handling policy. On-prem deploys (Enterprise) never touch our infrastructure at all — inference data stays inside your boundary by construction.
What happens if I exceed my monthly quota?
On Free tier, calls return HTTP 402 ("payment required") cleanly — no charges, no surprise bill. On paid tiers, you can buy a top-up ($25 / $100 / $500) which extends your monthly quota and rolls over for 12 months. Top-ups are the way to handle traffic spikes without upgrading your recurring tier mid-month. We will not silently over-charge a card.
How is Enterprise priced?
Value-based. Typical structure: 30–40% of your monthly compute savings, paid as a flat monthly fee with a dedicated SLA. Enterprise adds per-deployment-attestation pricing for regulated production (FDA SaMD, SR 11-7, DoD ATO). The entry point is the $5K Phase 0 POC — we compress your model, deliver a signed reconstruction audit, and scope the contract from a measured savings baseline.
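The value-based structure reduces to a simple range once a savings baseline has been measured. A sketch under stated assumptions: the $10,000/mo savings figure is purely hypothetical, per-deployment attestation pricing is not modeled, and `enterprise_fee_range` is an illustrative name:

```python
def enterprise_fee_range(monthly_savings_usd: float) -> tuple[float, float]:
    """Low and high flat monthly fee at 30% and 40% of measured compute savings."""
    return 0.30 * monthly_savings_usd, 0.40 * monthly_savings_usd


# Hypothetical baseline: $10,000/mo of measured compute savings.
low, high = enterprise_fee_range(10_000)

print(f"${low:,.0f} to ${high:,.0f} per month")
```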
What's the refund policy?
Self-serve subscriptions (Pro, Max, Team) are pro-rated refundable for the first 14 days, no questions asked. After 14 days, cancel any time and we won't bill the next cycle. Top-up credits roll over for 12 months from purchase. Annual plans are pro-rated refundable for 14 days; after that, the annual commitment runs to term. Enterprise contracts have refund terms specified in the MSA.
Who pays for the GPU?
We do, on all self-serve tiers (Free through Team). Inference runs on Sipsa Labs infrastructure (served from Sipsa-managed GPU capacity behind a Cloudflare-fronted edge; capacity scales with reserved-instance demand). Per-model token rates above are what we charge; what we pay for the GPU is on us. On Enterprise on-prem, you pay for your own GPU because the model runs in your VPC — that's the whole point of the security boundary.
/ Trust & verification

Every claim is checkable. By you.

We don't ask you to take our word for any of this. The verifier is public, the artifacts are open, and the billing meter is transparent.

SHA-256 verifiable artifacts · Open verifier (pip) · No training on customer data · Stripe payments · 23 architectures (22 PPL + 1 ViT cosine) · ITAR-aware on-prem path

Full security posture and compliance status: /security. Compliance roadmap (SOC 2 / SR 11-7 / FDA / FedRAMP): /enterprise#compliance.

/ Get started

Pick a path. Ship today.

The fastest way to know if Sipsa Inference fits your workload is to run it. Three seconds to a key, no card.

Start with $5 free credit. No card →

Or skip straight to Pro ($20/mo) / Max 5× ($100/mo) / Max 20× ($200/mo) / Team ($25/seat) via Stripe. For Enterprise contracts, start with the Phase 0 scope form.

3-second signup · $5 free credit · No card · Cancel anytime