Live · streaming · OpenAI-compatible

Try Sipsa-compressed inference live.

You're about to talk to a 5-bit-compressed transformer running on a single RTX 5090. It speaks the same OpenAI API your code already does; the weights are about 5× smaller and reconstruct bit-identically under uc verify.

Bring your own Sipsa key. The demo POSTs to a same-origin proxy, so your key travels only inside that request and is never logged.

No key? Sign up for $5 of free credit (no card required). Your key is saved only in your browser's local storage.
Default model: Qwen3-1.7B-Base (smallest, fastest cold start). All models are served at 5 bits per weight.
Press Cmd/Ctrl + Enter to submit. Output streams token by token.
Endpoint: POST /v1/chat/completions · stream: true · max_tokens: 400
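Each streamed chunk arrives as an OpenAI-style server-sent event. As a minimal sketch, here is how the content tokens can be pulled out of those chunks in Python; the field names follow the standard OpenAI streaming format, which this demo mirrors, and the function name is illustrative:

```python
import json

def deltas(sse_lines):
    """Yield content tokens from OpenAI-style streaming chunk lines."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alives and SSE comments
        data = line[len("data: "):]
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]
```

Feeding it the raw lines of a streamed response yields one string per token, which is what drives the token-by-token display above.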
Streamed response: idle
Output will appear here once you click Generate.
Time to first token: --
Total time: --
Tokens out: --
Tokens / second: --
The same call from your terminal

    

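You can make the same request with curl or any HTTP client. As a self-contained sketch, here it is built with Python's standard library; the base URL below is a placeholder assumption, not a documented Sipsa endpoint, so substitute the one from your dashboard:

```python
# ASSUMPTION: https://api.sipsa.example/v1 is a placeholder base URL,
# not Sipsa's real endpoint; replace it with your own.
import json
import urllib.request

def build_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the same POST the demo page submits."""
    payload = {
        "model": "Qwen3-1.7B-Base",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "stream": True,
        "max_tokens": 400,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

if __name__ == "__main__":
    req = build_request("https://api.sipsa.example/v1", "YOUR_SIPSA_KEY")
    print(req.get_method(), req.full_url)
    # To actually stream: urllib.request.urlopen(req), then read line by line.
```

Because the request is OpenAI-compatible, any existing OpenAI client can send it as-is once pointed at the right base URL.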
This is real inference against a Sipsa-compressed model, not a recording. The compressed weights pass uc verify's SHA-256 reconstruction check. Inspect the model on HuggingFace, or read the full benchmark matrix at /inference.
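uc verify itself is the source of truth for that check. Purely as an illustration of the underlying idea, here is what comparing a reconstructed weight file's SHA-256 against an expected digest looks like in Python; the function name and calling convention are illustrative, not uc's API:

```python
import hashlib

def sha256_matches(path: str, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Stream a file through SHA-256 and compare to the expected digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so multi-gigabyte weight files don't need to fit in RAM.
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest() == expected_hex
```

If the reconstructed weights produce the same digest as the originals, the two files are bit-identical, which is the property "reconstruct bit-identically" refers to above.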