Sipsa Inference is live.

An OpenAI-compatible inference API at api.sipsalabs.com/v1, serving the same 22 lossless 5-bit transformer architectures we ship to the public HuggingFace Hub. Drop-in compatible with the official openai SDK. Live today.

UltraCompress · 2026-05-11 · Posted by the Sipsa Labs team

22 architectures served · 405B largest model (Hermes-3) · 1.0040x tightest PPL ratio · 200 OK live status

For most of the past six months, the work at Sipsa Labs happened on the trainer side: compressing frontier-scale transformers from their published bf16 weights into a SHA-256-verifiable 5-bit format that loses no quality the customer can measure. We shipped 40 of those artifacts to HuggingFace, made them pip-install-reproducible end to end, and kept the perplexity ratios under 1.013x across the full 22-architecture matrix.

Today we shipped the other half. api.sipsalabs.com/v1 is an OpenAI-compatible inference API that serves those compressed artifacts behind the same surface the openai Python SDK already speaks. Drop-in: no code changes on the customer side beyond an environment variable.
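Concretely: the v1 openai Python SDK resolves OPENAI_BASE_URL and OPENAI_API_KEY from the environment, so code that constructs the client with no arguments switches over without an edit:

# Set in the environment before running:
#   OPENAI_BASE_URL=https://api.sipsalabs.com/v1
#   OPENAI_API_KEY=<your Sipsa key>
from openai import OpenAI

client = OpenAI()  # base_url and api_key are picked up from the environment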

What's served as of today

The /v1/models endpoint returns the live model list. Today's roster:

$ curl https://api.sipsalabs.com/v1/models -H "Authorization: Bearer $SIPSA_API_KEY"

{
  "object": "list",
  "data": [
    {"id": "hermes-3-405b",   "object": "model", "owned_by": "sipsalabs"},
    {"id": "mixtral-8x7b",    "object": "model", "owned_by": "sipsalabs"},
    {"id": "phi-3.5-moe",     "object": "model", "owned_by": "sipsalabs"},
    {"id": "qwen3-14b",       "object": "model", "owned_by": "sipsalabs"},
    {"id": "qwen3-8b",        "object": "model", "owned_by": "sipsalabs"},
    ...
  ]
}
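The same roster is reachable through the SDK's models resource, which hits the same route; a minimal sketch:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.sipsalabs.com/v1",
    api_key=os.environ["SIPSA_API_KEY"],
)

# client.models.list() pages through /v1/models; each entry mirrors the JSON above.
for model in client.models.list():
    print(model.id)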

The full 22-architecture matrix — including the three new sub-1.005x records this week (Mixtral-8x7B at 1.00368x, Qwen3-14B at 1.00403x, Mistral-7B-v0.3 at 1.00548x) — is rolling onto the API as we tier-promote artifacts from the staging bench to the production fleet. Public benchmark dashboard at /inference.

Drop-in for the openai SDK

The whole point of an OpenAI-compatible API is that you don't rewrite your stack to use it. Point the base URL at our endpoint; the rest of your code stays the same:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.sipsalabs.com/v1",
    api_key=os.environ["SIPSA_API_KEY"],
)

response = client.chat.completions.create(
    model="hermes-3-405b",
    messages=[{"role": "user", "content": "hello, lossless world"}],
)
print(response.choices[0].message.content)

The same goes for streaming, function calling, and the other endpoints the openai SDK speaks. We serve the same routes the SDK calls; the model behavior on our side is the bf16-equivalent output of the compressed artifact, verified bit-identical at load time against its SHA-256 manifest.
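For example, streaming is the standard stream=True flag, reusing the client constructed above; chunks arrive in the OpenAI delta format:

# Stream tokens as they arrive instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="hermes-3-405b",
    messages=[{"role": "user", "content": "hello, lossless world"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)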

Why bit-identical matters for serving

Most public quantization techniques (AWQ, GPTQ, EXL3, QTIP, SeedLM) regenerate quantizer state at load time, which means the weights the customer's inference call actually executes against are not byte-identical to the weights the implementer evaluated during qualification. That's typically a 2-10% perplexity drift between training-time eval and customer inference.

The Sipsa v3 binary format persists every byte of quantizer state inside the pack. Result: the customer-loaded model produces hidden states that are bit-identical (within fp32 numerical noise) to the trainer's reference. The PPL number on /inference is the PPL the customer gets in production. No drift, no quality cliff between dev and deploy.
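For illustration, here is a minimal sketch of what a load-time manifest check looks like, assuming a JSON manifest that maps pack filenames to SHA-256 digests — a hypothetical layout, not the published Sipsa v3 format:

import hashlib
import json
import pathlib

def verify_pack(pack_dir: str, manifest_path: str) -> None:
    # Hypothetical manifest shape: {"shard-00001.sipsa": "<sha256 hex>", ...}
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    for name, expected in manifest.items():
        digest = hashlib.sha256(
            pathlib.Path(pack_dir, name).read_bytes()
        ).hexdigest()
        if digest != expected:
            raise ValueError(f"{name}: SHA-256 mismatch, refusing to serve")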

For regulated workloads — FDA-grade clinical AI, SR-11-7 financial models, defense edge inference — that bit-identical guarantee is the regulatory-equivalence floor. It's what makes the deployed binary auditable as the same model that passed qualification.

Pricing & access

Tiers and per-token pricing are at /pricing. The first $5 of usage on every account is on us — ping founder@sipsalabs.com for an API key. Compression-as-a-Service (custom architecture compression for your workload) and on-prem MSA contracts are available for production-tier deployments.

The substrate is open: pip install ultracompress (PyPI v0.6.2 under BUSL-1.1 + Additional Use Grant; free for sub-$1M ARR companies, research, and individuals). All 40 compressed artifacts are at huggingface.co/SipsaLabs for self-hosting. The hosted API is the easy-button path.


Sipsa Labs is an experimental deep-tech and software company. UltraCompress is the first publicly shipped product. Sipsa Inference is the second. More products in flight. USPTO Provisionals 64/049,511 + 64/049,517 (filed 2026-04-25). Live on Hacker News today.