What landed overnight at Sipsa Labs: 22 PPL-verified, Vision Transformers, Apple Silicon, and an audit primitive

Yesterday's blog post documented one engineering day. The work didn't stop when we hit publish; it kept landing through the evening and overnight. This post is the compound entry: what shipped between Tuesday afternoon and Wednesday morning. Same publish-the-real-state policy as yesterday: the wins are visible right now in the public repo, the registry, and (for the audit primitive) on PyPI.

Sipsa Labs · 2026-05-28 · Engineering log (compound)

Verified architectures: 22 PPL + 1 ViT cosine (TinyLlama 1.00317× = 3rd-tightest)
0.9988: DINOv2 CLS-token cosine (first non-LLM pack live)
uc audit: customer-side audit-receipt primitive shipped

We hit 22 PPL-verified architectures, the codec generalized to Vision Transformers and audio models with no format changes, an Apple Silicon conversion path went from "no" to a working converter (early internal logit-cosine result below, not yet in the public registry), and the uc audit customer-side audit-receipt primitive shipped in the public package.

From twenty to twenty-two

The PPL-runner agent finished two canonical evaluations overnight:

TinyLlama-1.1B-Chat-v1.0 → PPL ratio 1.00317× (baseline PPL 9.425031, compressed 9.454883, n=30, seq_len=1024, seed=42, FineWeb-edu held-out tail). This is the 3rd-best ratio in our entire 22-architecture registry — beaten only by Phi-3.5-MoE (1.00129×) and Phi-3-mini-4k (1.00262×). For a 1.1B-class model, this is the tightest small-architecture near-lossless 5-bit number we know about anywhere.
Llama-3.1-70B → PPL ratio 1.00898×. Verified via byte-exact safetensors conversion against existing canonical-protocol results rather than a fresh evaluation run. The 70B HuggingFace download is 140 GB, and at ~2.5 MB/s a re-download would have cost 15+ hours; instead we proved the safetensors layers match the .pt layers byte-for-byte and inherited the prior canonical eval.
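The arithmetic and the byte-for-byte check behind these two entries can be reproduced in a few lines. This is a minimal sketch, not the PPL-runner's code: ppl_ratio, download_hours, and layers_match are hypothetical helper names, decimal units (1 GB = 1000 MB) are assumed, and tensors are represented as raw bytes keyed by layer name rather than loaded from real .pt or safetensors files.

```python
import hashlib


def ppl_ratio(baseline_ppl: float, compressed_ppl: float) -> float:
    """Compressed perplexity divided by baseline perplexity (1.0 = lossless)."""
    return compressed_ppl / baseline_ppl


def download_hours(size_gb: float, rate_mb_per_s: float) -> float:
    """Transfer time in hours, assuming decimal units (1 GB = 1000 MB)."""
    return size_gb * 1000 / rate_mb_per_s / 3600


def layers_match(a: dict[str, bytes], b: dict[str, bytes]) -> bool:
    """True only if both checkpoints hold the same layer names and identical bytes."""
    if a.keys() != b.keys():
        return False
    return all(
        hashlib.sha256(a[name]).digest() == hashlib.sha256(b[name]).digest()
        for name in a
    )


# TinyLlama figures quoted above.
print(f"{ppl_ratio(9.425031, 9.454883):.5f}")   # 1.00317
# 140 GB at ~2.5 MB/s.
print(f"{download_hours(140, 2.5):.1f} h")      # 15.6 h
```

Comparing digests rather than decoded values is what makes the check byte-exact: two tensors that are numerically equal but stored with a different dtype or layout would fail it.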

The public auditor registry at github.com/sipsalabs/ultracompress/blob/main/docs/benchmarks.json now lists 23 verified records (22 PPL + 1 ViT cosine). The shipped count is also 23, so the verified set equals the shipped set for the first time since the registry started.

The codec works on Vision Transformers and audio models, too

Until tonight we sold "near-lossless 5-bit transformer compression" and meant transformer LLMs. The cross-architecture research lead spent the day running our codec against three non-LLM model families to find out where it actually generalizes. The result reframes what we sell.

DINOv2 (Vision Transformer) — near-lossless 5-bit reconstruction lands at 0.9988 CLS-token cosine against the bf16 reference. Near-lossless image-embedding quality. No format changes needed. DINOv2-Large shipped tonight at huggingface.co/SipsaLabs/dinov2-large-uc-v3-bpw5 — first public near-lossless 5-bit non-LLM pack.

Whisper-large-v3 (audio encoder-decoder, 1.54B params) — near-lossless 5-bit reconstruction lands at 0.9966 encoder cosine. Cross-attention between encoder and decoder works without special treatment. No format changes needed. Pack canonical-pending; will graduate to the public registry once we run the standard verification suite.

SDXL UNet (diffusion, 2.57B params, mix of Linear + Conv2d) — Linear layers compress identically to LLMs (0.9976 cosine). Conv2d layers needed a small format extension to reach Linear-band quality. Extension details are codec internals, available under NDA. We implemented it overnight; closing the remaining gap to 6bpw quality is on the engineering plan for next week.
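The cosine figures quoted for these three families are plain cosine similarity between a reference output vector and the output of the compressed model. A minimal sketch of the metric follows, using only the standard library; the function name and the toy vectors are illustrative assumptions, not values from our evaluation harness.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy stand-ins for a reference embedding and its compressed counterpart.
reference = [0.12, -0.85, 0.33, 0.47]
compressed = [0.11, -0.86, 0.35, 0.46]
print(f"{cosine_similarity(reference, compressed):.4f}")
```

Cosine ignores overall scale, so it measures whether the embedding points in the same direction, which is the property retrieval and classification heads typically depend on.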

The structural finding behind all three: ViT weights have 3–5× higher scalar-quantization error than LLM weights, but our pipeline handles that structural difference. The same machinery that produces 1.001–1.012× PPL ratios on transformer LLMs produces 0.997–1.000 cosine similarity on vision and audio models. The codec is more general than we were selling.
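One way to put a number on "scalar-quantization error" is to round a weight vector to a uniform grid and measure the relative reconstruction error. The sketch below does that with a generic symmetric max-abs quantizer. It is an assumption for illustration only: it is not the UC codec, the function names are hypothetical, and the outlier example shows how the metric responds to a stretched grid rather than explaining why ViT weights behave differently.

```python
import math
import random


def quantize_uniform(weights: list[float], bits: int) -> list[float]:
    """Round each weight to a symmetric uniform grid scaled by the max magnitude."""
    max_level = 2 ** (bits - 1) - 1          # e.g. 15 for 5 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / max_level
    return [round(w / scale) * scale for w in weights]


def relative_error(weights: list[float], bits: int) -> float:
    """Root-sum-square reconstruction error relative to the weight norm."""
    restored = quantize_uniform(weights, bits)
    err = sum((w - r) ** 2 for w, r in zip(weights, restored))
    ref = sum(w * w for w in weights)
    return math.sqrt(err / ref)


rng = random.Random(0)
bulk = [rng.gauss(0.0, 1.0) for _ in range(4096)]
with_outlier = bulk + [50.0]   # one large weight stretches the grid

print(relative_error(bulk, 5))
print(relative_error(with_outlier, 5))
```

Running the same measurement per layer on two model families gives a like-for-like comparison of how hard each is for a plain scalar quantizer, independent of any correction applied afterwards.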

The strategic reframe: Sipsa Labs is not an "LLM compression" company. We are a "near-lossless compression for any transformer-architecture model" company, and a 1-week format extension brings UNet-style diffusion into the same envelope. This extends our addressable model families to vision, audio, multimodal, and diffusion.

Apple Silicon: from "no" to a working converter

Yesterday's stated path was "we ship NVIDIA CUDA-first; portability is a multi-quarter project." We pulled that forward.

The hardware-portability agent built a converter that takes a UC v3 pack, reconstructs the dense bf16 Linear weight tensors locally (reconstruction details are codec internals, available under NDA or Phase 0 POC engagement), and writes a standard HuggingFace model layout that mlx-lm loads with zero custom code.

Converter-correctness measured on Qwen3-0.6B (see the honesty note below for where this ran):

Scaffold tensors (embeddings, model norm, lm_head): bit-exact match against the bf16 source
Layer weights: cosine similarity 0.9987 – 1.0000 vs bf16 original
Forward logit cosine: 0.9966 across 30 prompts (min 0.9898)
Greedy token match: 90.8% vs bf16 reference (the expected compression gap)
Conversion time: 31.4 seconds end-to-end
Output size: 1.50 GB (single safetensors file)
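The logit-cosine and greedy-match lines above are aggregates over per-prompt measurements. As a rough sketch of how such aggregates can be computed (hypothetical helper names, toy inputs, and no claim that this mirrors our internal harness):

```python
from statistics import mean
from typing import Sequence


def summarize_cosines(per_prompt: Sequence[float]) -> tuple[float, float]:
    """Return (mean, min) over per-prompt logit cosines."""
    if not per_prompt:
        raise ValueError("need at least one prompt")
    return mean(per_prompt), min(per_prompt)


def greedy_match_rate(reference: Sequence[int], candidate: Sequence[int]) -> float:
    """Fraction of positions where both models picked the same greedy token."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("token sequences must be non-empty and equal length")
    hits = sum(r == c for r, c in zip(reference, candidate))
    return hits / len(reference)


# Toy data, not the Qwen3-0.6B run.
avg, worst = summarize_cosines([0.999, 0.997, 0.990])
rate = greedy_match_rate([5, 9, 2, 7], [5, 9, 3, 7])
print(f"mean={avg:.4f} min={worst:.4f} match={rate:.1%}")
```

Reporting the minimum alongside the mean matters because a single badly reconstructed prompt can hide behind a high average.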

Honesty note on these numbers. The 0.9966 forward logit cosine was measured on our CUDA dev machine (Windows / RTX 5090), not on an Apple Silicon Mac — it is a converter-correctness check confirming the output is standard, loadable HuggingFace safetensors, not an on-device measurement. It is an internal-validation result, not yet in the public registry. We also no longer hold the exact original pack used for that run, so the precise number is not currently reproducible (the methodology is sound and the figure is consistent with Qwen3-0.6B's published 1.0069× PPL ratio). The on-device mlx-lm Apple Silicon test — actually loading and running the converted model on an M-series chip — is still pending. "mlx-lm loads it" is an engineering inference from the standard safetensors format, not yet a verified on-device run. We'll add it to the registry once that test lands.

The AMD ROCm path turned out to require zero code changes. The entire UC reconstruction pipeline is written against torch.cuda.* on the standard PyTorch surface, and PyTorch-ROCm provides torch.cuda.* as a transparent shim over HIP. So the same converter that produces the mlx-lm-compatible safetensors also produces the ROCm-compatible safetensors. One converter, three hardware targets.
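The reason no code changes are needed can be seen in how device selection looks from Python. The sketch below is an illustration under assumptions, not part of the converter: describe_backend and pick_device are hypothetical names, and the import is guarded so the snippet still runs on a machine without PyTorch.

```python
try:
    import torch
except ImportError:  # keeps the sketch runnable on machines without PyTorch
    torch = None


def describe_backend() -> str:
    """Name the GPU backend PyTorch would use here: "cuda", "rocm", or "cpu"."""
    if torch is None or not torch.cuda.is_available():
        return "cpu"
    # ROCm builds set torch.version.hip to a version string; CUDA builds leave it None.
    if getattr(torch.version, "hip", None):
        return "rocm"
    return "cuda"


def pick_device() -> str:
    """Device string for tensor placement; ROCm builds also answer to "cuda"."""
    return "cuda" if describe_backend() in ("cuda", "rocm") else "cpu"


print(describe_backend(), pick_device())
```

Code that only ever asks for the "cuda" device never needs to know which vendor is underneath, which is why a pipeline written this way carries over unchanged.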

The internal converter lives in an NDA-gated tool path so the pack-internal key names aren't exposed publicly. The workflow — a UC v3 pack reconstructed into a standard HuggingFace safetensors layout — is walked through under NDA / Phase 0 engagement.

uc audit: the customer-side audit receipt

For regulated-AI deployers, "this model runs on the inference stack" is necessary but not sufficient. The compliance team needs an audit receipt that ties the deployed pack, by per-file SHA-256, to the exact validated artifact — so they can show the bytes serving production traffic match the bytes that were evaluated.
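In practice, "the bytes match" comes down to re-hashing each deployed file and comparing it with the digest recorded at validation time. The following is a minimal sketch of that comparison using only the standard library; the function names, the file name, and the manifest shape (relative path mapped to hex digest) are assumptions for illustration and not the uc tool's actual format.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large shards never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(pack_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths whose bytes are missing or differ from the manifest."""
    bad = []
    for rel_path, expected in sorted(manifest.items()):
        target = pack_dir / rel_path
        if not target.is_file() or sha256_file(target) != expected:
            bad.append(rel_path)
    return bad


# Toy pack: one file, validated once, then tampered with.
with tempfile.TemporaryDirectory() as tmp:
    pack = Path(tmp)
    (pack / "layer_000.bin").write_bytes(b"validated weights")
    manifest = {"layer_000.bin": sha256_file(pack / "layer_000.bin")}
    print(find_mismatches(pack, manifest))    # []
    (pack / "layer_000.bin").write_bytes(b"different weights")
    print(find_mismatches(pack, manifest))    # ['layer_000.bin']
```

This sketch only checks files named in the manifest; a stricter check would also flag extra files present in the deployed directory.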

uc verify does the structure check and SHA-256 download integrity check — useful, but not a receipt.

uc audit is the receipt. The DevOps agent shipped it in the public package. It emits a versioned JSON receipt with: model class, bit count, per-file SHA-256 manifest + a stable pack fingerprint, the structural integrity checks, a PII-free host fingerprint (OS/arch class only — no hostname, user, MAC, or serials), schema version, and audit timestamp. It is a structural and download-integrity artifact — deliberately not a reconstruction proof, and unsigned unless you supply an Ed25519 key (the end-to-end reconstruction certificate is delivered by Sipsa Labs under engagement). Compliance teams attach it to FDA SaMD pre-submissions, SR 11-7 model risk validation packages, or DoD ATO accreditation packets as the integrity layer.

Five-case smoke test passed: zero-byte / count-mismatch / no-manifest / --stdout / determinism. Receipts produced from the same pack across runs hash to the same SHA-256.
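To make the receipt idea concrete, here is a rough sketch of how a receipt-like document with a stable fingerprint could be assembled. Everything in it is an assumption for illustration: the field names, the "0-sketch" schema tag, the fingerprint construction, and the choice to exclude the timestamp from the stable digest are ours for this example and do not describe the actual uc audit schema.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def pack_fingerprint(manifest: dict[str, str]) -> str:
    """Hash the sorted (path, digest) pairs so file order cannot change the result."""
    digest = hashlib.sha256()
    for rel_path in sorted(manifest):
        digest.update(f"{rel_path}\x00{manifest[rel_path]}\n".encode("utf-8"))
    return digest.hexdigest()


def build_receipt(manifest: dict[str, str], model_class: str, bits: int) -> dict:
    """Assemble a receipt-like dict; field names here are illustrative only."""
    return {
        "schema_version": "0-sketch",
        "model_class": model_class,
        "bits": bits,
        "files": dict(sorted(manifest.items())),
        "pack_fingerprint": pack_fingerprint(manifest),
        # OS and CPU architecture class only: no hostname, user, MAC, or serials.
        "host": {"os": platform.system(), "arch": platform.machine()},
        "audited_at": datetime.now(timezone.utc).isoformat(),
    }


def stable_digest(receipt: dict) -> str:
    """Hash the canonical JSON of everything except the timestamp."""
    stable = {k: v for k, v in receipt.items() if k != "audited_at"}
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


manifest = {"b.bin": "22" * 32, "a.bin": "11" * 32}
first = build_receipt(manifest, "ExampleModel", 5)
second = build_receipt(dict(reversed(list(manifest.items()))), "ExampleModel", 5)
print(stable_digest(first) == stable_digest(second))   # True
```

Sorting keys and fixing separators before hashing is what makes two runs comparable; without a canonical serialization, harmless differences in key order would change the digest.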

This is the regulated-buyer credibility tile. As far as we know, nobody else in the 5-bit-compression band has this primitive.

A structural finding worth surfacing: deep-layer dominance under one calibration pipeline

Yesterday's post described the MoE compression-tightness mechanism. Today's research closed two open follow-ups — one of which corrected a premise we had earlier in the day.

The deep-layer dominance finding: when we proposed redistributing correction capacity based on per-layer signals, the natural expectation was "deep layers have low variance, so they need less correction." We tested that. The result is the reverse: deep-layer reconstruction quality dominates end-to-end PPL even when deep-layer variance is low. This rules out the entire "redistribute capacity from deep to shallow" cure family under our current calibration pipeline.

The premise correction: earlier today the working assumption was that the Llama-3.1-8B 1.0125× floor was likely an architectural limit. Tonight's session invalidated that. The production 1.0125× figure was measured under one specific calibration pipeline, and an earlier calibration approach achieved a tighter ratio on the same architecture under different assumptions. The 1.0125× floor is therefore pipeline-specific, not architecture-specific. The 8-perturbation refutation chain we ran today tests only the production calibration pipeline; it does not imply Llama-3.1-8B is structurally uncompressible. The next-most-important engineering question is now what generalizes about that earlier pipeline.

Our honest-negative count is moving with us. The published ratio is 30:23; with the last 24 hours of additions (an exposure-bias diagnosis from our follow-up research plus two newly refuted cures) the working ratio is 33:23. We will publish the new entries once we have characterized what made the earlier pipeline different from the production one. Premature publication of "architectural limit" would have been wrong; we caught it before it landed in a buyer-facing surface.

Customer surfaces shipped overnight

The buyer funnel on sipsalabs.com is now seven tools deep:

/leaderboard — UltraCompress vs AWQ vs GPTQ vs EXL3 side-by-side, UC sweeps the top 23 sorted by PPL ratio
/tools/verify — paste any HuggingFace pack URL, get back the verification result with full audit log
/badge — one-line embed for customer model cards, four style variants
/calculator — cost-savings input form, drives Phase 0 POC inquiries
/compare — your AWQ or GPTQ pack vs the equivalent UC pack with a downloadable comparison card
/request — request-a-pack form, drives roadmap signal
/feed — RSS + Atom + JSON pack-drop feed, retention surface

Full funnel: discover → verify → prove → justify → compare → request → retain. Zero gaps.

What this means for whoever's reading

If you're a regulated-AI deployer evaluating compressed-model audit primitives: pip install ultracompress, run uc audit <your-pack>, attach the resulting receipt to your FDA / SR 11-7 / ATO submission. Same-day Phase 0 POC if you want the engagement to be formal — founder@sipsalabs.com, $0 / 5 business days / named case study published on close.

If you're a Mac developer wondering whether you can run a 405B-class model on an M-series Mac with near-lossless 5-bit weights: the converter exists. NDA / POC gates the codec internals; the resulting safetensors are vanilla HuggingFace format, which mlx-lm should load with zero customization (the on-device run is still pending, as the honesty note above explains).

If you're a researcher and the cross-architecture generalization interests you: the per-architecture results are reproducible. The DINOv2-Large pack ships publicly. The Whisper and SigLIP packs are queued. We expect the cross-modal-alignment-survives result to be the most contested finding; if you have a counterexample, tell us — founder@sipsalabs.com. Negative reports are useful.