Free credit to start, no card. Pro from $20/mo, Max from $100/mo, Team from $25/seat/mo. Enterprise contracts for regulated, dedicated, or on-prem deployments. Same OpenAI-compatible SDK on every tier.
1.0066×
Hermes-3-405B PPL ratio on a single 32 GB consumer GPU
Archive notice — this page is preserved for historical reference. The Sipsa Inference service was discontinued in June 2026 and nothing here can be purchased.
/ Self-serve plans
Four tiers. Pick the one that fits.
Every tier ships the same OpenAI-compatible API and the same SHA-256-verifiable substrate. What changes: monthly throughput, model catalog access, and seat count. Cancel any time.
For dedicated capacity, SOC 2 + audit support, on-prem appliances, or FDA SaMD / SR 11-7 / DoD ATO compliance work. Contracts are value-based and structured around the compute savings the deployment unlocks.
Enterprise
Custom
Sales-led · dedicated capacity · on-prem
For organizations that need dedicated capacity, compliance support, air-gapped deployments, or Compression-as-a-Service for custom fine-tunes.
Typical pricing: 30–40% of your monthly compute savings, paid as a flat monthly fee with a dedicated SLA, plus per-deployment-attestation pricing for regulated production.
Need a one-time evaluation? The Phase 0 POC at sipsalabs.com/poc is a fixed-fee $5K engagement: your model, your hardware, five business days, one signed reconstruction audit report. Use it to evaluate UltraCompress before you commit to Enterprise.
/ One-time top-ups
Hit a spike? Top up without upgrading.
Top-ups apply to any plan. Credits roll over for 12 months. Use top-ups for traffic spikes instead of upgrading the recurring tier mid-month.
$25
Cover a spike
$100
Pro overage · first-purchase from Free
$500
Production spike · bulk credit
/ Compare tiers
Side by side. No surprises.
Every line item below is what you actually get when you sign up — not a marketing rounded-up version.
Feature
Free
Pro
Max 5×
Max 20×
Team
Enterprise
Monthly price
$0
$20
$100
$200
$25/seat
Custom
Annual price
—
$204/yr
$1,000/yr
$2,000/yr
$240/seat/yr
Custom
Free $5 credit on signup
✓
✓
✓
✓
✓
n/a
Monthly quota (vs Pro baseline)
100/day
1×
5×
20×
5× per seat
contract
8 free models
✓
✓
✓
✓
✓
✓
Full catalog (all served models)
—
✓ (except 405B)
✓ (except 405B)
✓
✓
✓
405B-flagship (Hermes-3-405B)
—
—
—
✓
✓
✓
Priority queue
—
✓
highest
highest
highest
dedicated
Seats
1
1
1
1
5–150
custom
SSO + admin
—
—
—
—
✓
✓
Audit logs (per-request)
—
—
—
✓
✓
✓
SLA on inference availability
best-effort
best-effort
99.5%
99.5%
99.5%
contract
Support channel
community
email 1bd
email 1bd
email 1bd
email 1bd
24/7 + founder
SOC 2 questionnaire support
—
—
—
—
—
✓
On-prem / air-gapped deploy
—
—
—
—
—
✓
FDA SaMD / SR 11-7 / DoD ATO
—
—
—
—
—
✓
/ What you actually get
In plain English. One paragraph per tier.
Cards and tables are useful. So is a normal paragraph. Here's what each plan looks like the day after you sign up.
Free$5 credit
You sign up at /get-access with an email, no card. We hand you an API key in three seconds and credit your account with $5. You can call any of the 8 free models (Qwen3-0.6B, Qwen3-1.7B, Qwen3-1.7B-Base, TinyLlama-1.1B, SmolLM2-1.7B, OLMo-2-1B, Mamba-2.8B, Phi-3-mini-4k) with 100 requests per day. Larger models like Qwen3-8B, Mistral-7B-v0.3 and Yi-1.5-9B are request / paid-tier — run uc catalog for the live free/request/POC tier of every model. The dashboard shows your credit burn-down in real time. When the $5 runs out, calls return a 402 instead of a Stripe surprise. Use this to validate the SDK swap and benchmark a model against your real workload.
Pro$20 / mo
The main individual tier. You get a generous monthly request quota, access to the full served catalog except the 405B-flagship class, and priority over Free traffic. One-business-day email support. Top up with $25 / $100 / $500 if you blow through your quota. Cancel any time — or save 15% with the annual plan ($204/yr, $17/mo effective).
Max 5×$100 / mo
Five times the Pro quota. Everything in Pro plus highest-priority queue and a 99.5% SLA on inference availability. For production apps that have graduated from prototyping and need predictable throughput. Annual: $1,000/yr ($83/mo effective).
Max 20×$200 / mo
Twenty times the Pro quota, plus 405B-flagship access (Hermes-3-405B, 1.0066× PPL) and per-request audit logs. The top individual tier. Annual: $2,000/yr ($167/mo effective).
Team$25 / seat / mo
Everything in Max 5×, plus 405B-flagship access, multi-seat central billing, SSO, admin controls, and per-request audit logs. 5–150 seats. Pick this when you have a real product, real customers, and a CTO who wants traffic to not be one of the unknowns. Annual: $240/seat/yr ($20/seat/mo effective).
Enterprisecustom
Dedicated GPU capacity, full SLA with contractual remedies, on-prem appliance option (air-gapped capable for ITAR / classified workloads), FDA SaMD / SR 11-7 / DoD ATO compliance support, dedicated security review documentation, 24/7 support with a direct line to the founder, Compression-as-a-Service for your fine-tunes, and per-deployment attestation pricing for regulated production. Talk to us through the Phase 0 scope form.
/ Estimate your bill
Per-model token pricing. Do the math yourself.
Self-serve credit is metered per million tokens, per model. Input tokens are what you send. Output tokens are what the model returns. Same OpenAI-style schema as the SDK you already use.
Model
Input (per 1M tok)
Output (per 1M tok)
vs incumbent
Qwen3-0.6B
$0.05
$0.10
−67%
Qwen3-8B
$0.15
$0.60
parity
Mistral-7B-v0.3
$0.15
$0.30
−25%
Qwen3-14B
$0.20
$0.80
parity
Mixtral-8x7B
$0.22
$0.70
−8%
Hermes-3-405B (1.0066× PPL vs bf16 · Max 20× / Team / Enterprise)
$2.50
$2.50
−44% vs Together
Quick math: 1,000 calls to Qwen3-8B with 500 input + 200 output tokens each = 500K input + 200K output = $0.075 + $0.12 = $0.20 total. Your $5 free credit covers ~25K calls of that shape. Pro's $20/mo covers typical individual workloads. Max 5× at $100/mo covers a production app. Hermes-3-405B is the largest verified architecture; available on Max 20×, Team, and Enterprise.
Mapped to what you’re already paying. Pro tier vs the alternatives.
Most teams are already paying somebody for inference — OpenAI, a cloud GPU rental, or their own AWS bill. Here’s what the same monthly traffic costs on Sipsa versus the three most common alternatives.
vs OpenAI gpt-4o-mini
1M input tokens (OpenAI)$0.15
1M input tokens (Sipsa / Qwen3-8B)$0.15
Reproducibly reconstructible weightsYes
Same on OpenAI?No (closed)
Token cost is at parity. Sipsa adds a verifiable deployment-integrity contract — the served weights are provably the validated artifact. If you need FDA, SR-11-7, or DoD audit of the served weights, OpenAI is not viable. Sipsa is.
vs self-hosting on Lambda H100
H100 spot, 24×7 (~$2/hr)~$1,440/mo
vLLM on Qwen3-8B @ ~10 RPS~equivalent throughput
Sipsa Max 5× base$100/mo
+ included 5× Pro quotano top-up needed
Self-hosting wins above ~$1K/mo equivalent traffic. Sipsa wins below it — you skip GPU procurement, vLLM ops, and the on-call rotation. Same OpenAI-compatible SDK either way.
vs vLLM on AWS p5.48xlarge
p5 spot, 24×7 (~$30/hr)~$21,600/mo
Hermes-3-405B servingmoderate throughput
Sipsa Enterprise (405B included)contract
Typical pricing30–40% of compute savings
Want 405B-class quality? Sipsa Enterprise pricing scales with the savings it unlocks — typically 30–40% of what you'd spend running Hermes-3-405B yourself on AWS at the same throughput cap.
/ Estimate your bill
Plug in your traffic. See what you’d actually pay.
Move the sliders to match your real workload. We compute the monthly bill on Sipsa, what the same traffic would cost at OpenAI rates, and what self-hosting on a rented H100 would run. Pure JavaScript — nothing leaves your browser.
1,000 requests/day
500 tokens (split 2:1 in:out)
Your monthly bill
Monthly volume—
Sipsa Pro ($20 base)—
OpenAI equivalent—
Self-host (Lambda H100, 24×7)—
You save vs self-host—
All numbers are estimates. Sipsa Pro is $20/mo base; Max 5× is $100/mo (5× quota); Max 20× is $200/mo (20× quota + 405B); Team is $25/seat/mo (5 seat min). Overage is metered at the per-model rates shown above. Top up with $25 / $100 / $500.
Reproducible, cryptographically verifiable reconstruction of the compressed weights, verified by an SHA-256 manifest shipped with every artifact. The compressed pack reproduces the same numerical values recorded at pack time — every load, on any hardware. The 5-bit compression itself is lossy with respect to the original bf16 weights; what is deterministic is the reconstruction. End-to-end inference behavior matches the bf16 baseline up to fp16 reduction-order on the matmul itself. Across 22 PPL-verified architectures (17 dense + 4 MoE + 1 SSM) plus 1 ViT cosine-verified, the perplexity ratio falls between 1.0013× and 1.0125× vs the original weights. See the verified matrix →
Is this OpenAI-API-compatible?
Yes. Set OPENAI_BASE_URL=https://api.sipsalabs.com/v1 and your existing openai SDK code keeps working. Same chat.completions schema, same message format, same streaming, same error envelope. We're a drop-in for self-hosted inference — not a replacement for ChatGPT's product surface.
How does the annual discount work?
Annual saves 15–20% off the monthly rate. Pro $204/yr ($17/mo effective), Max 5× $1,000/yr ($83/mo effective), Max 20× $2,000/yr ($167/mo effective), Team $240/seat/yr ($20/seat/mo effective). Toggle "Annual" at the top of the page to see the effective monthly rate. Annual customers can still purchase top-ups during the term.
What is the difference between Max 5× and Max 20×?
Both are individual tiers. Max 5× ($100/mo) gives you 5× the Pro quota with highest-priority queue and a 99.5% SLA. Max 20× ($200/mo) gives you 20× the Pro quota, plus 405B-flagship access (Hermes-3-405B, 1.0066× PPL vs bf16) and per-request audit logs. If you need the 405B class or audit logs, that's Max 20×.
How does Team per-seat billing work?
Team is $25/seat/month (minimum 5 seats). Each seat gets Max-5×-level quota and 405B-flagship access. Central billing, SSO, admin controls, and per-request audit logs are included. Add or remove seats at any time — billing is prorated. Annual plan: $240/seat/year ($20/seat/mo effective).
Can I use my own fine-tuned model?
Yes — via the Compression-as-a-Service path on the Enterprise tier. You hand us the safetensors of your fine-tune; we compress it to the same near-lossless-quality 5-bit substrate your stock models run on; you get back a SHA-256-verified artifact that drops into the same OpenAI-compatible endpoint. Pricing is part of the Enterprise contract. The $5K Phase 0 POC is the on-ramp if you want to evaluate first.
How is this different from AWQ / GPTQ / GGUF?
Two ways. First, our reconstruction is reproducible and cryptographically verifiable — the SHA-256 manifest proves the dequantized tensor is the artifact you validated, every load, on any hardware. (Like AWQ / GPTQ / GGUF, the 5-bit compression itself is lossy with respect to the original bf16 weights; the difference is the verifiable reconstruction contract, not the loss.) Second, we publish a verifier: pip install ultracompress && uc verify reproduces the SHA-256 contract on your hardware. Your security team can audit the codec independently. Other quantizers don't ship that; they ship the artifact and ask you to trust the loss is acceptable.
Do you train on my data?
No. We don't log prompt or completion content. Metered billing uses token counts only. See /privacy for the full data-handling policy. On-prem deploys (Enterprise) never touch our infrastructure at all — inference data stays inside your boundary by construction.
What happens if I exceed my monthly quota?
On Free tier, calls return HTTP 402 ("payment required") cleanly — no charges, no surprise bill. On paid tiers, you can buy a top-up ($25 / $100 / $500) which extends your monthly quota and rolls over for 12 months. Top-ups are the way to handle traffic spikes without upgrading your recurring tier mid-month. We will not silently over-charge a card.
How is Enterprise priced?
Value-based. Typical structure: 30–40% of your monthly compute savings, paid as a flat monthly fee with a dedicated SLA. Enterprise adds per-deployment-attestation pricing for regulated production (FDA SaMD, SR 11-7, DoD ATO). The entry point is the $5K Phase 0 POC — we compress your model, deliver a signed reconstruction audit, and scope the contract from a measured savings baseline.
What's the refund policy?
Self-serve subscriptions (Pro, Max, Team) are pro-rated refundable for the first 14 days, no questions asked. After 14 days, cancel any time and we won't bill the next cycle. Top-up credits roll over for 12 months from purchase. Annual plans are pro-rated refundable for 14 days; after that, the annual commitment runs to term. Enterprise contracts have refund terms specified in the MSA.
Who pays for the GPU?
We do, on all self-serve tiers (Free through Team). Inference runs on Sipsa Labs infrastructure (served from Sipsa-managed GPU capacity behind a Cloudflare-fronted edge; capacity scales with reserved-instance demand). Per-model token rates above are what we charge; what we pay for the GPU is on us. On Enterprise on-prem, you pay for your own GPU because the model runs in your VPC — that's the whole point of the security boundary.
/ Trust & verification
Every claim is checkable. By you.
We don't ask you to take our word for any of this. The verifier is public, the artifacts are open, and the billing meter is transparent.
SHA-256 verifiable artifactsOpen verifier (pip)No training on customer dataStripe payments23 architectures (22 PPL + 1 ViT cosine)ITAR-aware on-prem path
Full security posture and compliance status: /security. Compliance roadmap (SOC 2 / SR-11-7 / FDA / FedRAMP): /enterprise#compliance.
/ Get started
Pick a path. Ship today.
The fastest way to know if Sipsa Inference fits your workload is to run it. Three seconds to a key, no card.