Live · streaming · OpenAI-compatible

Try Sipsa-compressed inference live.

You're about to talk to a 5-bit-compressed transformer running on a single RTX 5090. It speaks the same OpenAI API your code already does; the weights are about 5× smaller and reconstruct bit-identically under uc verify.

Bring your own Sipsa key. The demo POSTs to a same-origin proxy, so your key travels only inside that request and is never logged.

No key? Sign up for $5 of free credit (no card required). Your key is saved only in your browser's local storage.
Default model: Qwen3-1.7B-Base (smallest, fastest cold start). All models are served at 5 bits per weight.
Press Cmd/Ctrl + Enter to submit. Output streams token by token.
Endpoint: POST /v1/chat/completions · stream: true · max_tokens: 400
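Each streamed chunk arrives as an OpenAI-style server-sent event. As a minimal sketch, here is how the content tokens can be pulled out of those chunks in Python; the field names follow the standard OpenAI streaming format, which this demo mirrors, and the function name is illustrative:

```python
import json

def deltas(sse_lines):
    """Yield content tokens from OpenAI-style streaming chunk lines."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alives and SSE comments
        data = line[len("data: "):]
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]
```

Feeding it the raw lines of a streamed response yields one string per token, which is what drives the token-by-token display above.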
Streamed response: idle
Output will appear here once you click Generate.
Time to first token: --
Total time: --
Tokens out: --
Tokens / second: --
The same call from your terminal

    

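You can make the same request with curl or any HTTP client. As a self-contained sketch, here it is built with Python's standard library; the base URL below is a placeholder assumption, not a documented Sipsa endpoint, so substitute the one from your dashboard:

```python
# ASSUMPTION: https://api.sipsa.example/v1 is a placeholder base URL,
# not Sipsa's real endpoint; replace it with your own.
import json
import urllib.request

def build_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the same POST the demo page submits."""
    payload = {
        "model": "Qwen3-1.7B-Base",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "stream": True,
        "max_tokens": 400,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

if __name__ == "__main__":
    req = build_request("https://api.sipsa.example/v1", "YOUR_SIPSA_KEY")
    print(req.get_method(), req.full_url)
    # To actually stream: urllib.request.urlopen(req), then read line by line.
```

Because the request is OpenAI-compatible, any existing OpenAI client can send it as-is once pointed at the right base URL.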
This is real inference against a Sipsa-compressed model, not a recording. The compressed weights pass uc verify's SHA-256 reconstruction check. Inspect the model on HuggingFace, or read the full benchmark matrix at /inference.
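uc verify itself is the source of truth for that check. Purely as an illustration of the underlying idea, here is what comparing a reconstructed weight file's SHA-256 against an expected digest looks like in Python; the function name and calling convention are illustrative, not uc's API:

```python
import hashlib

def sha256_matches(path: str, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Stream a file through SHA-256 and compare to the expected digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so multi-gigabyte weight files don't need to fit in RAM.
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest() == expected_hex
```

If the reconstructed weights produce the same digest as the originals, the two files are bit-identical, which is the property "reconstruct bit-identically" refers to above.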