Sipsa Inference: API live.
An OpenAI-compatible inference API at api.sipsalabs.com/v1 serving the same 22 near-lossless 5-bit transformer architectures we ship to the public HuggingFace Hub. Drop-in compatible with the official openai SDK. Publicly self-serve: subscribe to Pro at $20/mo, Max from $100/mo, or Team at $25/seat/mo at sipsalabs.com/pricing, or grab $5 in free credits (no card) at sipsalabs.com/get-access. The pip install ultracompress substrate is in production today (no API key required for self-host).
api.sipsalabs.com/v1 is now publicly self-serve. Pro $20/mo, Max from $100/mo, and Team $25/seat/mo at sipsalabs.com/pricing; free $5 credits with no card at sipsalabs.com/get-access. The original launch post (below) is preserved for the historical record.
For most of the past six months, the Sipsa Labs work happened on the compression side: how to compress frontier-scale transformers from the bf16 published weights into a SHA-256-verifiable 5-bit format that loses no quality the customer can measure. We shipped 40 of those artifacts to HuggingFace, made them pip install-reproducible end-to-end, and kept the perplexity ratios under 1.013x across all 14 PPL-verified records (of 22 shipped architectures).
Today we opened the other half. api.sipsalabs.com/v1 is an OpenAI-compatible inference API that serves those compressed artifacts behind the same surface that the openai Python SDK already speaks. Drop-in. No code change on the customer side beyond the base URL and API key, both of which can be set through environment variables. The API is publicly self-serve: subscribe to Pro at $20/mo, Max from $100/mo, or Team at $25/seat/mo at sipsalabs.com/pricing, or grab $5 in free credits with no card at sipsalabs.com/get-access.
What's served as of today
The /v1/models endpoint returns the live model list. Today's roster:
$ curl https://api.sipsalabs.com/v1/models -H "Authorization: Bearer $SIPSA_API_KEY"
{
  "object": "list",
  "data": [
    {"id": "sipsa-hermes-3-llama-3.1-405b", "object": "model", "owned_by": "sipsalabs"},
    {"id": "sipsa-mixtral-8x7b", "object": "model", "owned_by": "sipsalabs"},
    {"id": "sipsa-phi-3.5-moe", "object": "model", "owned_by": "sipsalabs"},
    {"id": "sipsa-qwen3-14b", "object": "model", "owned_by": "sipsalabs"},
    {"id": "sipsa-qwen3-8b", "object": "model", "owned_by": "sipsalabs"},
    ...
  ]
}
The full 22-architecture shipped matrix (14 PPL-verified end-to-end), including this week's three new verified records under 1.006x (Mixtral-8x7B at 1.00368x, Qwen3-14B at 1.00403x, Mistral-7B-v0.3 at 1.00548x), is rolling onto the API as we tier-promote artifacts from the staging bench to the production fleet. The public benchmark dashboard is at /inference.
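To check programmatically which models are live, the list response shown above can be filtered on the client. This is a minimal sketch, assuming the response keeps the OpenAI-style shape from the curl output; the helper name `sipsa_model_ids` and the three-entry sample payload are illustrative and not part of any Sipsa SDK.

```python
import json


def sipsa_model_ids(payload: dict) -> list[str]:
    """Return the sorted model IDs from a /v1/models-style response body."""
    return sorted(
        entry["id"]
        for entry in payload.get("data", [])
        if entry.get("object") == "model"
    )


# Sample body copied from the roster above (truncated to three entries).
sample = json.loads("""
{
  "object": "list",
  "data": [
    {"id": "sipsa-qwen3-8b", "object": "model", "owned_by": "sipsalabs"},
    {"id": "sipsa-mixtral-8x7b", "object": "model", "owned_by": "sipsalabs"},
    {"id": "sipsa-qwen3-14b", "object": "model", "owned_by": "sipsalabs"}
  ]
}
""")

available = sipsa_model_ids(sample)
print(available)
```

With the openai SDK, `client.models.list()` should return the same entries as objects exposing an `id` attribute, so the equivalent check there is a loop over that result.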
Drop-in for the openai SDK
The whole point of an OpenAI-compatible API is that you don't rewrite your stack to use it. Set the base URL to our endpoint and supply your Sipsa API key; the rest of your code stays the same:
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.sipsalabs.com/v1",
    api_key=os.environ["SIPSA_API_KEY"],
)

response = client.chat.completions.create(
    model="sipsa-qwen3-0.6b",  # instant. 70B and 405B available, see /pricing
    messages=[{"role": "user", "content": "hello, compressed world"}],
)
print(response.choices[0].message.content)
The same goes for streaming, function calling, and the other endpoints the openai SDK calls. We serve the same routes the SDK uses; the model behavior on our side is the bf16-equivalent output of the compressed artifact, verified reproducible at load time via a SHA-256 manifest. Model IDs on the API are prefixed sipsa-*; the full list is at /v1/models.
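Streaming uses the SDK's standard `stream=True` flag. The following is a minimal sketch, assuming the endpoint emits OpenAI-style chunks; `stream_reply` and `main` are illustrative helper names, and the model ID is the one used in the example above.

```python
from typing import Any, Iterator


def stream_reply(client: Any, model: str, prompt: str) -> Iterator[str]:
    """Yield text fragments from a streamed chat completion as they arrive."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # ask for incremental chunks instead of one response
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            yield text


def main() -> None:
    import os

    from openai import OpenAI  # requires `pip install openai`

    client = OpenAI(
        base_url="https://api.sipsalabs.com/v1",
        api_key=os.environ["SIPSA_API_KEY"],
    )
    for piece in stream_reply(client, "sipsa-qwen3-0.6b", "hello, compressed world"):
        print(piece, end="", flush=True)
    print()
```

Calling `main()` with `SIPSA_API_KEY` set prints the reply token by token. Keeping the client as a parameter of `stream_reply` means the same helper works unchanged against any other OpenAI-compatible endpoint.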
Why reproducibility matters for serving
Most public quantization techniques (AWQ, GPTQ, EXL3, QTIP, SeedLM) regenerate quantizer state at load time, which means the weights the customer's inference call actually executes against are not byte-identical to the weights the implementer evaluated during qualification. That typically shows up as a 2-10% perplexity drift between qualification-time eval and customer inference.
The Sipsa v3 binary format persists every byte of quantizer state inside the pack. Result: the customer-loaded weights are byte-identical to the weights the implementer evaluated during qualification — provable with a single SHA-256 manifest check on the customer's own hardware. The PPL number on /inference is the PPL the customer gets in production. No drift, no quality cliff between dev and deploy.
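The manifest check described above amounts to hashing each file in the pack and comparing the result against recorded digests. Here is a minimal sketch using only the Python standard library; the manifest layout (a JSON object mapping relative file paths to hex SHA-256 digests) and the file name `manifest.json` are assumptions for illustration, not the actual Sipsa v3 pack format.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large shards are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_pack(pack_dir: Path, manifest_name: str = "manifest.json") -> list[str]:
    """Return the relative paths whose digest is missing or does not match.

    Assumed (hypothetical) manifest layout: {"relative/path": "<hex sha256>"}.
    An empty list means every listed file matches its recorded digest.
    """
    manifest = json.loads((pack_dir / manifest_name).read_text())
    mismatched = []
    for rel_path, expected in manifest.items():
        target = pack_dir / rel_path
        if not target.is_file() or sha256_file(target) != expected.lower():
            mismatched.append(rel_path)
    return mismatched
```

A deployment script could call `verify_pack(Path("/path/to/pack"))` before loading weights and refuse to start if the returned list is non-empty.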
For regulated workloads — FDA-grade clinical AI, SR-11-7 financial models, defense edge inference — that reproducibility property is the regulatory-equivalence floor. It's what makes the deployed binary auditable as the same model that passed qualification.
Pricing & access
Tiers and per-token pricing are at /pricing: Pro is $20/mo, Max is $100–$200/mo, Team is $25/seat/mo, all self-serve. Every account gets the first $5 of usage on us — sign up at sipsalabs.com/get-access with no card required. Compression-as-a-Service (custom architecture compression for your workload) and on-prem MSA contracts are available for production-tier deployments.
The substrate is open: pip install ultracompress (PyPI v0.6.27 under BUSL-1.1 + Additional Use Grant — free for sub-$1M ARR companies, research, and individuals). All 40 compressed artifacts are at huggingface.co/SipsaLabs for self-host. The hosted API is the easy-button path.
Sipsa Labs is an experimental deep-tech and software company. UltraCompress is our first publicly shipped product; Sipsa Inference is the second. More products are in flight.