What landed overnight at Sipsa Labs: 22 PPL-verified, Vision Transformers, Apple Silicon, and an audit primitive

Yesterday's blog post documented one engineering day. The work didn't stop when we hit publish; it kept landing through the evening and overnight. This is the compound — what shipped between Tuesday afternoon and Wednesday morning. Same publish-the-real-state policy as yesterday: the wins are visible right now in the public repo, the registry, and (for the audit primitive) on PyPI.

Sipsa Labs · 2026-05-28 · Engineering log (compound)

22
PPL-verified architectures (TinyLlama 1.00317× = 3rd-tightest)
0.9988
DINOv2 CLS-token cosine (first non-LLM pack live)
uc audit
customer-side audit-receipt primitive shipped

We hit 22 PPL-verified architectures, the codec generalized to Vision Transformers and audio models with no format changes, an Apple Silicon conversion path went from "no" to "0.9966 forward logit cosine," and the uc audit customer-side audit-receipt primitive shipped in the public package.

From twenty to twenty-two

The PPL-runner agent finished two canonical evaluations overnight:

The public auditor registry at github.com/sipsalabs/ultracompress/blob/main/docs/benchmarks.json now lists 22 verified records. The shipped count is also 22, so the verified set equals the shipped set for the first time since the registry started.

The codec works on Vision Transformers and audio models, too

Until tonight we sold "near-lossless 5-bit transformer compression" and meant transformer LLMs. The cross-architecture research lead spent the day running our codec against three non-LLM model families to find out where it actually generalizes. The result reframes what we sell.

DINOv2 (Vision Transformer) — near-lossless 5-bit reconstruction lands at 0.9987 CLS-token cosine against the bf16 reference. Near-lossless image-embedding quality. No format changes needed. DINOv2-Large shipped tonight at huggingface.co/SipsaLabs/dinov2-large-uc-v3-bpw5 — first public near-lossless 5-bit non-LLM pack.

Whisper-large-v3 (audio encoder-decoder, 1.54B params) — near-lossless 5-bit reconstruction lands at 0.9966 encoder cosine. Cross-attention between encoder and decoder works without special treatment. No format changes needed. Pack canonical-pending; will graduate to the public registry once we run the standard verification suite.

SDXL UNet (diffusion, 2.57B params, mix of Linear + Conv2d) — Linear layers compress identically to LLMs (0.9976 cosine). Conv2d layers needed a small format extension to reach Linear-band quality. Extension details are codec internals, available under NDA. We implemented it overnight; closing the remaining gap to 6bpw quality is on the engineering plan for next week.

The structural finding behind all three: ViT weights have 3–5× higher scalar-quantization error than LLM weights, but our pipeline handles that structural difference. The same machinery that produces 1.001–1.012× PPL ratios on transformer LLMs produces 0.997–1.000× cosine on vision and audio models. The codec is more general than we were selling.

The strategic reframe: Sipsa Labs is not a "LLM compression" company. We are a "near-lossless compression for any transformer-architecture model" company — and a 1-week format extension brings UNet-style diffusion into the same envelope. Vision, audio, multimodal, diffusion. The TAM has just expanded by an order of magnitude.

Apple Silicon: from "no" to a working converter

Yesterday's stated path was "we ship NVIDIA CUDA-first; portability is a multi-quarter project." We pulled that forward.

The hardware-portability agent built a converter that takes a UC v3 pack, reconstructs the dense bf16 Linear weight tensors locally (reconstruction details are codec internals, available under NDA or Phase 0 POC engagement), and writes a standard HuggingFace model layout that mlx-lm loads with zero custom code.

Verified on Qwen3-0.6B:

The AMD ROCm path turned out to require zero code changes — the entire UC reconstruction pipeline is torch.cuda.* against the standard PyTorch surface, and PyTorch-ROCm provides torch.cuda.* as a transparent shim over HIP. So the same converter that produces the mlx-lm-compatible safetensors also produces the ROCm-compatible safetensors. One converter, three hardware targets.

The internal converter lives in an NDA-gated tool path so the pack-internal key names aren't exposed publicly — the public documentation describes the workflow at docs/hardware/apple_silicon.md and docs/hardware/amd_rocm.md.

uc audit: the customer-side audit receipt

For regulated-AI deployers, "this model runs on the inference stack" is necessary but not sufficient. The compliance team needs an audit receipt that ties the deployed pack, by per-file SHA-256, to the exact validated artifact — so they can show the bytes serving production traffic match the bytes that were evaluated.

uc verify does the structure check and SHA-256 download integrity check — useful, but not a receipt.

uc audit is the receipt. The DevOps agent shipped it in the public package. It emits a versioned JSON receipt with: model class, bit count, per-file SHA-256 manifest + a stable pack fingerprint, the structural integrity checks, a PII-free host fingerprint (OS/arch class only — no hostname, user, MAC, or serials), schema version, and audit timestamp. It is a structural and download-integrity artifact — deliberately not a reconstruction proof, and unsigned unless you supply an Ed25519 key (the end-to-end reconstruction certificate is delivered by Sipsa Labs under engagement). Compliance teams attach it to FDA SaMD pre-submissions, SR 11-7 model risk validation packages, or DoD ATO accreditation packets as the integrity layer.

Five-case smoke test passed: zero-byte / count-mismatch / no-manifest / --stdout / determinism. Receipts produced from the same pack across runs hash to the same SHA-256.

This is the regulated-buyer credibility tile. Nobody else in the 5-bit-compression band has this primitive.

A structural finding worth surfacing: deep-layer dominance under one calibration pipeline

Yesterday's post described the MoE compression-tightness mechanism. Today's research closed two open follow-ups — one of which corrected a premise we had earlier in the day.

The deep-layer dominance finding: when we proposed redistributing correction capacity based on per-layer signals, the natural expectation was "deep layers have low variance, so they need less correction." We tested that. The result is the reverse: deep-layer reconstruction quality dominates end-to-end PPL even when deep-layer variance is low. This rules out the entire "redistribute capacity from deep to shallow" cure family under our current calibration pipeline.

The premise correction: earlier today the working assumption was that the Llama-3.1-8B 1.0125× floor was likely an architectural limit. Tonight's session invalidated that. Production 1.0125× was measured under one specific calibration pipeline. An earlier calibration approach achieved a tighter ratio on the same architecture under different assumptions. The 1.0125× floor is pipeline-specific, not architecture-specific. The 8-perturbation refutation chain we ran today is a chain against the production calibration pipeline; it does not imply Llama-3.1-8B is structurally uncompressible. The next-most-important engineering question is now what generalizes about that earlier pipeline.

Honest negative count is moving with us. The published ratio is 30:23; with the last 24 hours of additions (an exposure-bias diagnosis from our follow-up research plus two newly refuted cures) the working ratio is 33:23. We will publish the new entries once we have characterized what made the earlier pipeline different from the production one. Premature publication of "architectural limit" would have been wrong; we caught it before it landed in a buyer-facing surface.

Customer surfaces shipped overnight

The buyer funnel on sipsalabs.com is now seven tools deep:

Full funnel: discover → verify → prove → justify → compare → request → retain. Zero gaps.

What this means for whoever's reading

If you're a regulated-AI deployer evaluating compressed-model audit primitives: pip install ultracompress, run uc audit <your-pack>, attach the resulting receipt to your FDA / SR 11-7 / ATO submission. Same-day Phase 0 POC if you want the engagement to be formal — founder@sipsalabs.com, $0 / 5 business days / named case study published on close.
If you're a Mac developer wondering whether you can run a 405B-class model on an M-series Mac with near-lossless 5-bit weights: the converter exists. NDA / POC gates the codec internals; the resulting safetensors are vanilla HuggingFace format and mlx-lm loads them with zero customization.
If you're a researcher and the cross-architecture generalization interests you: the per-architecture results are reproducible. The DINOv2-Large pack ships publicly. The Whisper and SigLIP packs are queued. We expect the cross-modal-alignment-survives result to be the most contested finding; if you have a counterexample, tell us — founder@sipsalabs.com. Negative reports are useful.