Blog

Engineering notes, lab discoveries, and shipping milestones from Sipsa Labs: the active Sentio program plus the archived compression era. We tell you which numbers we have measured, when, and under what conditions, and we do not quote numbers we have not measured.

2026-06-10 · SENTIO

Hearing what radios can't: why fiber-guided drones force a sensing rethink

The most consequential small drone on the battlefield carries its control link on a spool of optical fiber — no radio, so RF-based counter-drone systems are blind to it by construction. A plain-language note on the threat, why passive acoustics is the cue that can't be turned off, how sound-cues-sight-confirms covers each sensor's failure mode, and exactly where our work stands: end-to-end on edge hardware, sim-validated, field campaign next.

2026-05-29

"100T on one GPU" — what's actually loaded where, honestly

When we say a 405B-parameter model runs on a single 32 GB consumer GPU, we mean streamed from durable storage during reconstruction, not resident in VRAM. Full VRAM residency is information-theoretically impossible at that scale. The engineering claim is still real; here's the honest picture of what gets moved where, and why the SHA-256 reconstruction contract still holds end-to-end.

2026-05-29

The first near-lossless 5-bit state-space model — Mamba-2.8B at 1.00593× canonical PPL

Mamba-2.8B-hf joined the public registry as the 22nd PPL-verified pack and the first state-space model packed end-to-end with a canonical perplexity measurement. The headline number is 1.00593× against an architecture-compatible bf16 reference, with a comparator caveat that we document in plain sight and will cover in a separate methodology note.

2026-05-28

What landed overnight at Sipsa Labs: 22 PPL-verified, Vision Transformers, Apple Silicon, and an audit primitive

Tuesday into Wednesday: two more architectures graduated (TinyLlama at 1.00317×, the 3rd-tightest in the registry, and Llama-3.1-70B at 1.00898×), the codec generalized to Vision Transformers, audio, and diffusion (with a 1-week format extension), the Apple Silicon mlx-lm path went from "no" to "0.9966 forward logit cosine," and the customer-side audit-receipt primitive, uc audit, shipped in the public package. DINOv2-Large is live as the first non-LLM pack at huggingface.co/SipsaLabs.

2026-05-28

Compressed 8B packs cold-start in ~1/3 the vLLM load time — throughput unchanged

Phase 2 benchmark on RTX 5090: an UltraCompress pack of Qwen3-8B or Llama-3.1-8B loads into vLLM in about 45 seconds versus about 145 seconds for the bf16 baseline. Decode tokens-per-second across batch sizes 1, 4, and 8 is at parity within measurement noise. Cold-start cost drops by roughly two thirds; hot-path performance does not move.

2026-05-28

The reconstruction-overhead attack — and where the ceiling lands

Yesterday's vLLM throughput-parity post had one honest weakness: the pack reconstructs to bf16 once at load time, and that one-time cost is non-trivial — roughly 203 seconds for Qwen3-8B today. This is the post where we tell you what we measured, where the hotspot lives, and what the realistic ceiling looks like once we ship the parallelization-aware path.

2026-05-27

A real engineering day: 2.69× VRAM reduction, the MoE mechanism, and a wall that keeps standing

Three results from one Tuesday: a 2.69× VRAM reduction on Qwen3-8B at bit-exact PPL match (streamed weight-loading on a single 32 GB GPU), the mechanism for why MoE compresses tighter than dense (a publishable finding), and the 6th refuted attempt to get below the Llama-3.1-8B 1.0125× floor. Two wins, one honest negative, every shipped pack download-verifiable in seconds via uc verify.

2026-05-27

DoD ATO and the deployed-equals-accredited check: what compression breaks and how to fix it

The DoD Risk Management Framework expects the AI model on the edge platform to be the model the accreditation board reviewed, byte for byte. Standard 4-bit quantization breaks that link — the SI-7 control becomes documentary fiction. Reproducible 5-bit reconstruction with a SHA-256 manifest restores the deployed-equals-accredited primitive cryptographically.

2026-05-27

Why we filed comments on FDA-2026-N-4390

Sipsa Labs is a small, pre-revenue, deep-tech company. We are also the kind of company that writes formal public comments to the U.S. Food and Drug Administration. This post explains why those two facts are not in tension — and what we argued the early-phase clinical-trial AI pilot docket should do.

2026-05-25

Model-risk guidance and the deployment-validation gap quantization opens

The revised interagency model-risk guidance (OCC Bulletin 2026-13, which superseded SR 11-7 on April 17, 2026) expects the model in production to be the model that was validated. Standard 4-bit quantization breaks that link silently — the deployed weights are a numerical approximation of the validated ones, drifting with kernel, CUDA, and GPU generation. Cryptographic reconstruction with a SHA-256 manifest turns deployed-equals-validated into a one-line audit primitive.

2026-05-23

How FDA SaMD reviews handle quantized AI models — and the audit gap nobody talks about

FDA's December 2024 PCCP final guidance pins cleared AI to a specific validated artifact. Standard quantization — AWQ, GPTQ, EXL3, bnb-nf4 — makes the deployed weights numerically different from the validated ones, and model-signing proves only the file at rest, not that the deployed weights still match. Here is the gap, and what a regulated team can do this quarter.

2026-05-22

Can you prove the model in production is the one you validated?

Model-governance frameworks — the interagency model-risk guidance (OCC Bulletin 2026-13, successor to SR 11-7), device clearance, airworthiness — all assume the deployed model is the validated model. Compression makes that harder, not easier. Reproducible 5-bit reconstruction plus SHA-256 turns deployed-equals-validated into a one-line cryptographic check.

2026-05-15

v0.6.9: Security release + new Mistral 7B 1.00548× record

Closes an RCE-class deserialization vulnerability across all six customer-facing load sites. v0.6.7 and v0.6.8 were yanked from PyPI. Mistral-7B-v0.3 hits 1.00548×, the tightest dense 7B-class near-lossless 5-bit number published.

2026-05-12

Architecture-specific floors: why we publish negative results

13 of 14 PPL-verified architectures land sub-1.01×. Llama-3.1-8B sits at 1.0125× — and we ran three independent cure experiments to establish that this is the architecture-specific floor, not a bug in our method. Honest negative results are a competitive moat, so we publish them.

2026-05-11

Sipsa Inference is live.

OpenAI-compatible inference API at api.sipsalabs.com/v1 serving 19 of the 22 packs in the PPL-verified catalog (3 flagships are gated:manual under MOAT policy; Hermes-3-Llama-3.1-405B is available on request at a 1.0066× PPL ratio). Works as a drop-in endpoint for the official openai SDK. First $5 of usage on us.

2026-05-09

We searched for our competition. The 5-bit band was empty.

A live HuggingFace Hub query for near-lossless 5-bit transformer compression returned zero competing artifacts. Twenty-two architectures, three new sub-1.005× records this week, SHA-256 verifiable, reproducible reconstruction.

2026-05-09

Hermes-3-405B compressed at 5 bits, near-lossless quality.

Largest dense transformer artifact published to HuggingFace at 5-bit near-lossless quality. Compressed end-to-end on dual RTX 5090s in 13 hours of wall-clock time. 251 GB pack on disk. Verified perplexity ratio 1.0066×.

2026-05-08

Eighteen architectures, one pack format.

UltraCompress 5-bit near-lossless transformer compression validated across 18 architectures from 0.6B to 405B parameters — dense, mixture-of-experts, and state-space (Mamba). One pack format. Reproducible reconstruction. (As of May 2026; now 23 verified architectures.)