Hermes-3-405B compressed losslessly at 5 bits.

Largest dense transformer artifact published to HuggingFace at 5-bit lossless. Compressed end-to-end on dual RTX 5090s in 13 hours of wall-clock time. 251 GB pack on disk. Perplexity ratio of 1.0066x verified against the bf16 baseline. SHA-256 bit-identical customer-side reconstruction.

UltraCompress · 2026-05-09 · Posted by the Sipsa Labs team

405B · Parameters
1.0066x · PPL ratio (verified)
251 GB · Pack on disk
13 hr · Wall clock, dual 5090

Hermes-3-Llama-3.1-405B is NousResearch’s open-weights fine-tune of Meta’s Llama-3.1-405B. Half a trillion parameters minus a hundred billion. The bf16 weights ship as 810 GB across 191 safetensors shards. It is the largest single dense transformer the open-source community has practical access to, and the smallest meaningful proxy for what frontier-class compression will have to handle when DeepSeek-V3 685B and the trillion-class wave hit the public Hub.

We compressed it. End-to-end. On a single workstation with two RTX 5090 GPUs, streaming the model layer by layer rather than holding it all resident. Thirteen hours of wall-clock time. The output: a 251 GB UltraCompress v3 pack at 5 bits per weight, with a verified perplexity ratio of 1.0066x against the bf16 baseline.

Why 405B matters

The compression argument has always been clean at the small end. Squeezing a 1.7B-parameter base model into 1.1 GB with sub-1% perplexity drift makes for a nice demo. It does not move the production-deployment needle. The numbers that move the needle are the ones at the top of the parameter-count distribution — the models so large that running them at full precision is its own infrastructure problem.

405B is where the math starts to get interesting. The bf16 footprint of Hermes-3-405B is 810 GB. That is ten H100s of weights alone, before you have rented a single byte for activations or KV cache. The 5-bit compressed pack is 251 GB: three H100s of weights, or two H200s with room to spare. The compression ratio is 3.23x, and the perplexity drift is 0.66%. Sub-percent quality drift at the largest publicly compressible scale is, as far as we have been able to verify across the literature and the public Hub, a result that did not exist before this run.
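
The footprint arithmetic is worth checking explicitly. A back-of-envelope sketch in Python; the GPU capacities are public HBM specs, and the raw 5-bits-per-weight figure comes out a couple of GB above the 251 GB the pack actually occupies on disk:

# Back-of-envelope footprint math for the paragraph above.
PARAMS = 405e9               # Hermes-3-Llama-3.1-405B parameter count
H100_GB, H200_GB = 80, 141   # public HBM capacities per GPU

bf16_gb = PARAMS * 2 / 1e9       # 2 bytes/weight -> 810 GB
bpw5_gb = PARAMS * 5 / 8 / 1e9   # 5 bits/weight -> ~253 GB raw (251 GB packed on disk)

print(f"bf16:  {bf16_gb:.0f} GB  (~{bf16_gb / H100_GB:.1f} H100s of weights)")
print(f"5-bit: {bpw5_gb:.0f} GB  (~{bpw5_gb / H100_GB:.1f} H100s, vs 2x H200 = {2 * H200_GB} GB)")
print(f"ratio: {bf16_gb / 251:.2f}x against the on-disk pack")  # 3.23x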

What “1.0066x verified” means

The number is an end-to-end perplexity ratio measured on a held-out tail of FineWeb-edu, computed against the bf16 baseline run on the same eval split with the same eval harness. Both the baseline measurement and the compressed measurement come from the trainer side, but the compressed measurement was re-checked on the customer side after the pack was written to disk and the bytes had round-tripped through the public HuggingFace artifact.
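
The round-trip part of that check is plain hashing. A minimal sketch, assuming the pack was downloaded to ./hermes-3-405b as in the reproducer below; the per-file digest listing is generic, not our exact verification tooling:

import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    # Stream the file through SHA-256 so a 251 GB pack never has to fit in RAM.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

pack_dir = Path("./hermes-3-405b")
for shard in sorted(p for p in pack_dir.rglob("*") if p.is_file()):
    print(shard.relative_to(pack_dir), sha256_of(shard))
# Compare the printed digests against the publisher-side manifest to confirm
# the bytes survived the Hub round trip bit-identically.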

We do not invent a precision floor by measuring on a different eval set than the baseline. We do not pre-compute the reference perplexity once and re-use it. The same harness produced 6.851 baseline perplexity and 6.896 compressed perplexity on this eval. Ratio: 1.00657, rounded to 1.0066x in the public table.
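
The harness itself is a standard token-level perplexity loop. A hedged sketch of the protocol rather than our exact implementation: eval_texts stands in for the FineWeb-edu tail, and the compressed pack is loaded through uc rather than vanilla transformers, so the second call below is illustrative only.

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id, texts, max_len=2048):
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto")
    model.eval()
    total_nll = total_tokens = 0
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt",
                      truncation=True, max_length=max_len).input_ids.to(model.device)
            # labels=ids gives mean cross-entropy over the shifted next-token targets
            n = ids.numel() - 1
            total_nll += model(ids, labels=ids).loss.item() * n
            total_tokens += n
    return math.exp(total_nll / total_tokens)

# Same split, same harness, both checkpoints:
# ppl_base = perplexity("NousResearch/Hermes-3-Llama-3.1-405B", eval_texts)
# ppl_comp = perplexity(<reconstructed 5-bit checkpoint>, eval_texts)
# 6.896 / 6.851 = 1.00657 -> the published 1.0066x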

The reproducer is the same three commands every other artifact in our matrix uses:

pip install ultracompress
hf download SipsaLabs/hermes-3-llama-3.1-405b-uc-v3-bpw5 --local-dir ./hermes-3-405b
uc bench ./hermes-3-405b

The artifact is gated — access requires a one-click request on HuggingFace. We approve research and sub-$1M-ARR commercial use grants under BUSL-1.1. Production-tier deployment is a paid commercial conversation; write to founder@sipsalabs.com.

How dual 5090 + streaming made this fit

Two RTX 5090s add up to 64 GB of total VRAM. Hermes-3-405B at bf16 is 810 GB of weights. The naive arithmetic says compression is a ten-H100 problem at this scale; the per-layer streaming design says it is two consumer GPUs and time.

Peak GPU memory during the compress step is bounded by approximately one transformer block resident plus calibration activations, regardless of total model size. For 405B with 126 layers, each block is roughly 6.4 GB at bf16. Add a calibration batch of activations on the second GPU and we are operating well inside the dual-32GB envelope, with the rest of the model living on disk and being mmapped as the streaming scheduler asks for it.
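
The shape of that loop is easy to sketch. A generic illustration of the per-block streaming pattern, not UltraCompress v3 internals: quantize_block and write_to_pack are hypothetical stand-ins for the proprietary codec and pack writer.

from pathlib import Path
import torch
from safetensors import safe_open

SHARDS = sorted(Path("./hermes-3-405b-bf16").glob("*.safetensors"))
NUM_BLOCKS = 126  # Llama-3.1-405B depth

def quantize_block(w):
    raise NotImplementedError  # hypothetical stand-in for the 5-bpw codec

def write_to_pack(name, payload):
    raise NotImplementedError  # hypothetical stand-in for the pack writer

def block_tensors(block_idx):
    # safe_open mmaps each shard, so only the tensors we request get paged in.
    prefix = f"model.layers.{block_idx}."
    for shard in SHARDS:
        with safe_open(str(shard), framework="pt", device="cpu") as f:
            for name in f.keys():
                if name.startswith(prefix):
                    yield name, f.get_tensor(name)

for block in range(NUM_BLOCKS):
    for name, w in block_tensors(block):
        w = w.to("cuda:0")  # one ~6.4 GB block resident at a time;
                            # calibration activations would sit on cuda:1
        write_to_pack(name, quantize_block(w))
        del w
    torch.cuda.empty_cache()  # hand the block's VRAM back before the next one

The invariant that matters is in the loop shape: nothing outside the current block is ever resident on the GPU, which is why peak VRAM stays flat as parameter count grows.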

The result of that design choice: the cost of this 405B compression run was the electricity to keep the workstation running for thirteen hours. Not a cloud bill. Not a sign-up. The production-deployment economics of compressed 405B at sub-1% drift compound from there.

What is next

DeepSeek-V3 685B is the closest open trillion-class candidate. We are queueing the download as a follow-on. The same streaming design that handled 405B should handle 685B at roughly 18-20 hours wall clock; the bottleneck is bandwidth from the source bucket, not compute.

The artifact-level next direction: publishing verified perplexity ratios for the multi-shard MoE results in the matrix (Mixtral-8x22B, Phi-3.5-MoE, Qwen3-235B-A22B) at the same level of harness rigor we just applied to Hermes. Some of those compressed runs already exist on disk; what they still need are bf16-baseline measurements, which we have been deferring because the bf16 baselines are themselves 200+ GB problems.