Compressed 8B models load 3.2× faster into vLLM — decode throughput at parity

Cold starts get expensive at scale. We measured what an UltraCompress pack does to vLLM load time on a single RTX 5090. We tested two 8B-class models: both load in about 45 seconds from the reconstructed pack, where the bf16 baseline took about 145. Decode tokens-per-second is at parity across batch sizes, within measurement noise. The hot-path performance does not move; the cold-path cost drops by roughly two thirds.

Sipsa Inference · 2026-05-28 · Posted by the Sipsa Labs team

3.2×: Faster vLLM load, Qwen3-8B
+0.4%: Mean decode tok/s delta (noise)
3.1×: Smaller on disk
RTX 5090: Single 32 GB consumer GPU

Why cold start matters

If you run inference for anyone other than yourself, vLLM cold start is a real cost. Every autoscale-up event, every node failover, every CI deploy, every rolling restart spends model-load time before the worker can accept traffic. A 100-second improvement on an 8B model is a 100-second reduction in customer-visible warmup latency, multiplied by every restart event in your fleet.

Decode tokens-per-second is the metric that gets all the attention because it is the hot-path number. But cold-path latency shows up in your SLO data as tail-of-tail outliers — the 99th-percentile request that landed on a worker mid-warmup. Cutting that latency tightens the SLO without touching the hot path.

What we measured

Setup: vLLM 0.20.0, bf16, max_model_len=2048, single RTX 5090 (32 GB), WSL2 Ubuntu host. Each model was run end-to-end twice: once as an UltraCompress .uc pack (reconstructed once to bf16 safetensors, then loaded by vLLM) and once as the upstream bf16 baseline loaded from the HuggingFace cache. Decode throughput was measured with 256-token responses at batch sizes 1, 4, and 8.
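A minimal sketch of how a measurement like this could be scripted, assuming vLLM's offline `LLM` and `SamplingParams` API. This is not the harness used for the numbers in this post; the helper names (`time_call`, `decode_tok_per_s`, `benchmark`), the prompt, and the model path are illustrative assumptions. Note that timing `generate` end to end includes prefill, so with short prompts it only approximates decode throughput.

```python
import time


def time_call(fn):
    """Run fn() and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def decode_tok_per_s(token_counts, seconds):
    """Aggregate generated tokens per second across a batch."""
    return sum(token_counts) / seconds


def benchmark(model_path, batch_sizes=(1, 4, 8), max_tokens=256):
    """Time model load, then one generate call per batch size.

    model_path is a hypothetical placeholder: a local safetensors
    directory or a HuggingFace model id.
    """
    # Imported lazily so the helpers above work without vLLM installed.
    from vllm import LLM, SamplingParams

    llm, load_s = time_call(
        lambda: LLM(model=model_path, dtype="bfloat16", max_model_len=2048)
    )
    # ignore_eos forces every response to run the full max_tokens.
    params = SamplingParams(max_tokens=max_tokens, temperature=0.0, ignore_eos=True)

    results = {"load_s": load_s}
    for batch in batch_sizes:
        prompts = ["Explain cold-start latency in one paragraph."] * batch
        outputs, secs = time_call(lambda: llm.generate(prompts, params))
        counts = [len(out.outputs[0].token_ids) for out in outputs]
        results[f"b{batch}_tok_s"] = decode_tok_per_s(counts, secs)
    return results
```

Calling `benchmark(...)` once on the reconstructed safetensors directory and once on the upstream checkpoint, each in a fresh process, would give one load time and three throughput figures per path. Repeating each run several times is what lets you judge whether a delta sits inside run-to-run noise.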

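| Model | UC load | bf16 load | Load delta | Batch-8 UC tok/s | Batch-8 bf16 tok/s | Tok/s delta |
|---|---|---|---|---|---|---|
| Qwen3-1.7B | 30.6 s | 39.0 s | -22% | 2056.8 | 2045.6 | +0.6% |
| Qwen3-8B | 44.6 s | 144.8 s | -69% | 726.1 | 723.2 | +0.4% |
| Llama-3.1-8B | 44.8 s | 140.6 s | -68% | 724.1 | 720.0 | +0.6% |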

Pack sizes: Qwen3-1.7B is 1.11 GB versus 3.4 GB bf16. Qwen3-8B is 5.13 GB versus 16.0 GB bf16. Llama-3.1-8B is 5.13 GB versus 16.0 GB bf16. The pack-format compression ratio is 3.1× on disk, including the layer-level reconstruction metadata that ships in the pack.
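The headline ratios follow directly from the figures above. As a quick check, this sketch recomputes the load speedup, the load-time delta, and the on-disk ratio from the published load times and pack sizes. The dictionary layout and function names are illustrative; only the numbers come from this post.

```python
# Load times (seconds) and sizes (GB) copied from the results above.
RESULTS = {
    "Qwen3-1.7B": {"uc_load_s": 30.6, "bf16_load_s": 39.0, "uc_gb": 1.11, "bf16_gb": 3.4},
    "Qwen3-8B": {"uc_load_s": 44.6, "bf16_load_s": 144.8, "uc_gb": 5.13, "bf16_gb": 16.0},
    "Llama-3.1-8B": {"uc_load_s": 44.8, "bf16_load_s": 140.6, "uc_gb": 5.13, "bf16_gb": 16.0},
}


def load_speedup(row):
    """How many times faster the UC path loads than the bf16 baseline."""
    return row["bf16_load_s"] / row["uc_load_s"]


def load_delta_pct(row):
    """Relative change in load time, in percent (negative means faster)."""
    return 100.0 * (row["uc_load_s"] - row["bf16_load_s"]) / row["bf16_load_s"]


def disk_ratio(row):
    """How many times smaller the pack is than the bf16 checkpoint."""
    return row["bf16_gb"] / row["uc_gb"]


for name, row in RESULTS.items():
    print(
        f"{name}: {load_speedup(row):.1f}x faster load, "
        f"{load_delta_pct(row):+.0f}% load time, "
        f"{disk_ratio(row):.1f}x smaller on disk"
    )
```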

Where the load-time win comes from

Two effects compose, and the second dominates at the 8B scale.

Disk read time. The pack is roughly one third the size on disk, so reading it from durable storage takes proportionally less time. That alone is worth roughly 10-15 seconds at 8B sizes on this hardware.

Mounted-filesystem cost. On the WSL2 host we tested, the bf16 baseline lives in the HuggingFace cache, which is mounted through the Windows 9P bridge. Reconstruction writes the pack's safetensors file to Linux's native ext4 filesystem. vLLM's mmap-and-init path is much faster against native ext4 than against the bridge. Customers on bare-metal Linux will see less of this term; customers on Docker, containerd, or other layered-filesystem setups will see more.

The 1.7B model shows only a 22% load-time win because at that scale the absolute load time is small (30-40 seconds) and per-pack overheads dominate. The 8B class is where the win shows up cleanly, because the bf16 baseline takes long enough for the filesystem bridge to compound.
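Because the filesystem term depends on where the weights live, it is worth checking which filesystem backs your model directory before comparing against the numbers here. The sketch below is one way to do that on Linux by parsing `/proc/mounts`. The function names and the example paths are illustrative assumptions, and it does not decode escaped characters (such as `\040` for spaces) in mount points.

```python
import os


def parse_mounts(mounts_text):
    """Return (mount_point, fstype) pairs from /proc/mounts-style text."""
    entries = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            # Fields are: device, mount point, filesystem type, options, ...
            entries.append((fields[1], fields[2]))
    return entries


def fstype_for(path, mounts_text):
    """Filesystem type of the longest mount point that contains path."""
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for mount_point, fstype in parse_mounts(mounts_text):
        prefix = mount_point.rstrip("/") + "/"
        inside = path == mount_point or path.startswith(prefix)
        # ">=" lets a later mount on the same point shadow an earlier one.
        if inside and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fstype
    return best_type


def local_fstype(path):
    """Look up path against this machine's mount table (Linux only)."""
    with open("/proc/mounts", encoding="utf-8") as handle:
        return fstype_for(path, handle.read())
```

On a WSL2 host, a path under a Windows drive mount would typically report `9p`, while a path in the Linux home directory would report `ext4`. If both your baseline and your reconstructed weights report the same type, expect the load-time gap to come mostly from the disk-read term.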

Throughput parity, by design

The decode-side numbers are the same on both paths. Phase 2 reconstructs the pack once into a standard bf16 safetensors file at load time, so once vLLM is serving traffic it is serving from bf16 weights identical to what it would have served from the original checkpoint. The kernels are the same. The KV cache is the same. The CUDA graphs are the same. The benchmark confirms that empirically: every batch-1, batch-4, and batch-8 delta sits inside run-to-run measurement noise, with a mean delta of +0.4% across the three models.

That is the contract: compression buys you smaller artifacts and faster cold start; it does not cost you serving throughput.

What this is not

There are two things this measurement does not claim. We publish negative results alongside the wins.

Reconstruction overhead is real. Phase 2 reconstruction runs once per pack on CPU and takes 175-190 seconds at 8B sizes. After that, the safetensors are cached and reloaded directly. If you load the same model many times (production deployment, autoscale group), reconstruction amortizes quickly: at roughly 100 seconds saved per load, it is paid back by the second load from the cached safetensors. If you reconstruct, load once, and never load again (one-shot evaluation), the math goes the other way; the bf16 baseline is faster end-to-end.
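The amortization argument can be worked through with the Qwen3-8B numbers from this post. This is a sketch with illustrative function names, and it assumes every load after the first reuses the cached safetensors on the same machine.

```python
import math


def total_seconds(n_loads, load_s, one_time_s=0.0):
    """Total time spent on a one-time cost plus n_loads model loads."""
    return one_time_s + n_loads * load_s


def break_even_loads(reconstruct_s, uc_load_s, bf16_load_s):
    """Smallest number of loads at which the UC path is strictly faster.

    Solves reconstruct_s + n * uc_load_s < n * bf16_load_s for integer n.
    Returns None if the UC path never saves time per load.
    """
    saved_per_load = bf16_load_s - uc_load_s
    if saved_per_load <= 0:
        return None
    return math.floor(reconstruct_s / saved_per_load) + 1


# Qwen3-8B: 44.6 s UC load, 144.8 s bf16 load, 175-190 s reconstruction.
for reconstruct_s in (175.0, 190.0):
    n = break_even_loads(reconstruct_s, 44.6, 144.8)
    print(f"reconstruction {reconstruct_s:.0f} s -> ahead from load {n}")

# One-shot case: a single load, including reconstruction, versus plain bf16.
one_shot_uc = total_seconds(1, 44.6, one_time_s=190.0)
one_shot_bf16 = total_seconds(1, 144.8)
print(f"one-shot: UC {one_shot_uc:.1f} s vs bf16 {one_shot_bf16:.1f} s")
```

If new nodes in your fleet cannot share the cached safetensors, each node pays the reconstruction cost once, so the break-even applies per node.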

VRAM is not smaller in Phase 2. Reconstruction produces bf16 weights, so the in-VRAM footprint is identical to the bf16 baseline. An optimized inference kernel that keeps weights compressed in VRAM is on the roadmap and is not yet shipped. The VRAM win is open work.

How to try it

If you want to reproduce the measurement on your own hardware, the public catalog covers the small architectures end-to-end:

pip install ultracompress
uc try sipsa-qwen3-0.6b      # 30-second demo via the free inference API
uc catalog                    # browse the available models

For the larger gated packs and a SHA-256 reconstruction audit, contact us — see the CTA below.
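If you receive a published digest for a reconstructed file, checking it locally is straightforward. The sketch below streams a file through SHA-256 using Python's standard library. It is a generic integrity check, not the audit procedure itself; the function names are illustrative, and the expected digest would have to come from us.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """True if the file's SHA-256 equals the expected hex digest."""
    return sha256_file(path) == expected_hex.strip().lower()
```

Streaming in chunks keeps memory use flat, which matters for multi-gigabyte safetensors files.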


All benchmark numbers measured on Sipsa Labs hardware (RTX 5090, vLLM 0.20.0, WSL2 Ubuntu) on 2026-05-23. Full Phase 2 report is internal; reach out if you want a deep dive. Near-lossless 5-bit packs are released as BUSL-1.1 + Additional Use Grant, free for individuals, research, and commercial use under $1M ARR.