Live · streaming · OpenAI-compatible
Try Sipsa-compressed inference live.
You're about to talk to a 5-bit-compressed transformer running on a single RTX 5090. Same OpenAI API your code already speaks — weights are about 5× smaller and reconstruct bit-identically under uc verify.
Bring your own Sipsa key. The demo POSTs to a same-origin proxy, so your key is forwarded upstream only and is never stored or logged.
No key? Sign up for $5 of free credit, no card required. Your key is saved only in your browser's local storage.
Default: Qwen3-1.7B-Base (smallest, fastest cold-start). All models served at 5 bits per weight.
Cmd/Ctrl + Enter to submit. Output streams token-by-token.
Streamed response
Output will appear here once you click Generate.
Time to first token: --
Total time: --
Tokens out: --
Tokens / second: --
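The four metrics above can all be computed client-side from the token stream. A minimal sketch of that arithmetic; the generator here is a stand-in (a real client would consume SSE deltas from the demo proxy):

```python
import time

def stream_metrics(token_iter):
    """Consume a token stream; return (ttft, total_time, tokens_out, tokens_per_sec)."""
    start = time.monotonic()
    ttft = None
    n = 0
    for _tok in token_iter:
        if ttft is None:
            ttft = time.monotonic() - start  # time to first token
        n += 1
    total = time.monotonic() - start
    return ttft, total, n, (n / total if total > 0 else 0.0)

def fake_stream():
    # Stand-in for streamed deltas; a real client would yield each SSE chunk's text.
    for tok in ["Compressed", " weights", ",", " same", " API", "."]:
        time.sleep(0.005)
        yield tok

ttft, total, n, tps = stream_metrics(fake_stream())
```

Time to first token and tokens per second diverge under load: TTFT measures queueing plus prefill, while tokens/second measures steady-state decode throughput.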
The same call from your terminal
This is real inference against a Sipsa-compressed model, not a recording. The compressed weights pass SHA-256 reconstruction checks under uc verify.
Inspect the model on HuggingFace, or read the full benchmark matrix at /inference.
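If you are scripting the call rather than using the terminal, the request body follows the standard OpenAI chat-completions schema, and streamed chunks arrive as `data: {...}` server-sent events. A sketch with placeholder endpoint and key (the base URL, key, and exact served model id are assumptions; only the schema is standard):

```python
import json

# Placeholders: substitute the demo's real base URL and your Sipsa key.
BASE_URL = "https://example.invalid/v1"
API_KEY = "sk-..."

# Standard chat-completions request body.
payload = {
    "model": "Qwen3-1.7B-Base",  # exact served model id may differ
    "messages": [{"role": "user", "content": "Say hello."}],
    "stream": True,              # ask for token-by-token SSE output
}
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
body = json.dumps(payload)
# POST `body` to f"{BASE_URL}/chat/completions" with any HTTP client.

# Each streamed line looks like this; the text lives in choices[0].delta.content.
sample = 'data: {"choices": [{"delta": {"content": "Hi"}}]}'
chunk = json.loads(sample[len("data: "):])
delta = chunk["choices"][0]["delta"].get("content", "")
```

Because the schema is unchanged, existing OpenAI SDK code only needs its base URL pointed at the proxy.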
Demo proxy strips cookies and forwarded headers, never logs your key, and is rate-limited to 60 req/min per IP. Source code: github.com/sipsalabs.