Why we put the inference API in private beta first

2026-05-12 · Sipsa Labs · ~5 min read

Yesterday's plan was to ship Sipsa Inference as an OpenAI-compatible drop-in replacement, public from day one, "first $5 of usage on us, just hit the endpoint." A friend hit the endpoint late last night and got back a debug placeholder where the model output should have been.

That kind of breakage is the most common failure mode in solo-founder launches: the substrate is real, the GPU is real, the model artifacts are real, but the customer-facing path from "user pasted my OPENAI_BASE_URL" to "real tokens come back" hadn't been wired end-to-end yet. The mock returned a placeholder. The placeholder said "vLLM not wired yet." A real customer wouldn't find that funny.

The right response is not to hide the breakage. The right response is to label what's actually shipped, label what's actually in beta, and route customers to the path that works today.

What's actually production today

The substrate is fully in production: pip install ultracompress, pull any of the 40 customer-side reproducible artifacts from huggingface.co/SipsaLabs, and run locally with the standard PyTorch path:

pip install ultracompress
hf download SipsaLabs/qwen3-8b-uc-v3-bpw5 --local-dir ./qwen3-8b
uc verify ./qwen3-8b   # confirms bit-identical reconstruction
uc bench ./qwen3-8b    # TTFT / tokens/sec / VRAM
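
Once uc verify passes, "run locally with the standard PyTorch path" looks roughly like the sketch below. This is my own illustration, not a Sipsa-documented snippet: it assumes the verified directory is a plain Hugging Face checkpoint that transformers can load directly, and the generation settings are arbitrary.

# Minimal sketch: load the verified artifact via the standard
# transformers/PyTorch path. Assumes ./qwen3-8b is a plain HF
# checkpoint directory after reconstruction (an assumption, not
# a documented guarantee).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./qwen3-8b"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,  # assumption: bf16 fits your GPU
    device_map="auto",           # requires the accelerate package
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))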

22 architectures verified end-to-end. SHA-256 bit-identical reconstruction. The PPL ratios published on the /inference page are the PPL ratios the customer measures locally. Hermes-3-Llama-3.1-405B at 1.0066×. Mixtral-8x7B at 1.00368×. The full matrix is published. Nothing about the substrate is in beta.
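
The "measures locally" claim is checkable. Below is a hedged sketch of one way to do it; it is not Sipsa's published eval harness, and the corpus file, context length, and non-overlapping chunking are my assumptions. Run it once on the reconstructed model and once on the original weights, and the ratio of the two numbers is what you compare against the published matrix.

# Hedged sketch: estimate perplexity of the reconstructed model on
# your own held-out text. Non-overlapping windows are slightly
# pessimistic at chunk boundaries but keep the script short.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./qwen3-8b"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

ids = tokenizer(open("eval_sample.txt").read(), return_tensors="pt").input_ids
ids = ids.to(model.device)

ctx = 2048                      # assumption: eval context length
total_nll, total_tokens = 0.0, 0
for start in range(0, ids.size(1) - 1, ctx):
    window = ids[:, start : start + ctx]
    if window.size(1) < 2:
        break
    with torch.no_grad():
        # labels=window makes transformers shift internally and
        # return the mean NLL over window.size(1) - 1 predictions
        loss = model(window, labels=window).loss
    total_nll += loss.item() * (window.size(1) - 1)
    total_tokens += window.size(1) - 1

print(f"perplexity: {math.exp(total_nll / total_tokens):.4f}")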

What's in private beta

The managed API at api.sipsalabs.com/v1 is the convenience surface. It serves the same compressed artifacts behind an OpenAI-compatible endpoint so customers don't have to self-host. The substrate behind the API is production-grade. The serving layer is in private beta.
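
For concreteness, here is the drop-in shape once a beta key is provisioned, using the standard OpenAI Python client. The base URL is the one above; the model identifier is my assumption (that the Hugging Face artifact ID doubles as the model name), not confirmed API behavior.

# Hedged sketch: calling the private-beta endpoint through the
# standard OpenAI client (openai>=1.0).
from openai import OpenAI

client = OpenAI(
    base_url="https://api.sipsalabs.com/v1",  # the endpoint named above
    api_key="sk-...",                         # your private-beta key
)

resp = client.chat.completions.create(
    model="SipsaLabs/qwen3-8b-uc-v3-bpw5",  # assumption: artifact ID as model name
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)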

"Private beta" here means a specific, not-vague thing:

1. Capacity scales carefully

Sipsa runs on real GPU infrastructure. Right now that means our two RTX 5090s in Aurora behind a Cloudflare Tunnel, with hot-swappable scale-up paths to Lambda Labs / Modal / RunPod H100 capacity for any customer who needs more than the home cluster can sustain. Onboarding customers in batches lets me watch every customer's request pattern, tune the routing, and decide when to spin up paid cloud capacity for which model. The alternative, opening the firehose on day one, means either dropping requests or pre-paying for cloud GPUs before customer demand justifies it. Neither is good for a six-month-old solo-founder operation that has spent ~$5K total to get here.

2. High-density customer conversations

The first 10 conversations with real beta customers are worth more than the next 1,000 anonymous API calls. Every early-beta conversation tells me which architectures customers actually want compressed next (today the answer is >100B-class MoE), which pricing tier matters (today the answer is "On-Prem MSA for the regulated stuff, Self-Serve $5 for the indie hackers"), and which compliance feature is the deal-closer (today the answer is bit-identical reconstruction for the SR 11-7 + FDA crowd). Private beta forces every account through a direct-to-founder onboarding email with a 24-hour turnaround. That direct line is the highest-bandwidth customer-discovery channel I have.

Once the substrate roadmap and the pricing model are anchored to real customer signal across enough beta accounts, the API opens. Not before.

What this means for you

If you want to use Sipsa today, two paths, both available now:

Self-host the substrate: pip install ultracompress, pull any of the 40 model artifacts from huggingface.co/SipsaLabs, and run on whatever GPU you have. No API key, no waitlist, no rate limit. BUSL-1.1 + Additional Use Grant means free for sub-$1M-ARR companies, research, and individuals.

Get a private-beta API key: email founder@sipsalabs.com with a one-line description of your use case and I'll provision a key within 24 hours. First $5 of usage on every approved account is on us. Or use the structured intake form at /get-access, which pre-fills a template.

Honest aside

I'd rather ship something narrow that works than something wide that pretends to. The "API live, drop-in, $5 free, just paste your OPENAI_BASE_URL" framing was my error in yesterday's launch copy. The mock returning a placeholder when a real customer hit it is on me. Today's correction: the substrate is production today, the API is private beta with a 24-hour-turnaround human-onboarding loop. Both paths are usable. Neither pretends to be something it isn't.

If you tried the API yesterday and got the placeholder, sorry — please email me directly and I'll fast-track your beta access. If you want to build on the substrate today, pip install ultracompress and reach me at founder@sipsalabs.com.

— Sip
