"100T on one GPU" — what's actually loaded where, honestly

When we say a 405B-parameter model runs on a single 32 GB consumer GPU, we mean the weights are streamed from durable storage and reconstructed layer by layer, not resident in VRAM all at once. Full VRAM residency is impossible at that scale by simple arithmetic: even at 5 bits per parameter the weights are about 250 GB. The engineering claim is still real, but the honest picture is more interesting than the marketing one, and we'd rather be precise about it than get caught.

Sipsa Labs · 2026-05-29 · Engineering note

The arithmetic anyone can do

Hermes-3-Llama-3.1-405B has 405 billion parameters. At bf16 (2 bytes per parameter) that is 810 GB. A single RTX 5090 has 32 GB of VRAM. So no, the bf16 reference weights do not literally fit in VRAM. They fit on disk, where 810 GB is roughly a third of a 2.5 TB consumer SSD.

Compressing the bf16 weights to 5 bits per parameter reduces the on-disk footprint to about 250 GB. Still doesn't fit in 32 GB of VRAM. So either the claim is wrong, or the claim means something more specific than "the whole model is loaded in GPU memory." The claim means the second thing.
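The arithmetic above is short enough to check in a few lines. This sketch uses only figures stated in this post (405 billion parameters, 16 and 5 bits per parameter, 32 GB of VRAM); the helper name `footprint_gb` is illustrative, and GB here means 10^9 bytes.

```python
PARAMS = 405_000_000_000  # Hermes-3-Llama-3.1-405B parameter count
VRAM_GB = 32              # single RTX 5090


def footprint_gb(params: int, bits_per_param: int) -> float:
    """Storage needed for `params` weights at the given bit width, in GB."""
    return params * bits_per_param / 8 / 1e9


bf16_gb = footprint_gb(PARAMS, 16)
packed_gb = footprint_gb(PARAMS, 5)

for label, size in (("bf16", bf16_gb), ("5-bit pack", packed_gb)):
    verdict = "fits" if size <= VRAM_GB else "does not fit"
    print(f"{label}: {size:.1f} GB, {verdict} in {VRAM_GB} GB of VRAM")
```

Neither representation fits, which is why the claim has to be about streaming rather than residency.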

What "runs on a single 32 GB GPU" actually means

Two distinct things happen during inference on an UltraCompress pack:

1. Reconstruction. A compressed .uc pack lives on durable storage. To run inference, the pack is reconstructed layer by layer into bf16 weights. Each layer is reconstructed independently and deterministically: on every load, on any hardware, the output is byte-for-byte identical to the dequantized weights recorded in the pack at compress time. That reference is the validated artifact, not the original full-precision model, because 5-bit compression is lossy against the original. This is the SHA-256 contract the rest of the audit story depends on.

2. Streaming forward pass. Once reconstructed, the bf16 weights for the current layer are loaded into VRAM, the activations flow through, the next layer's reconstructed weights are loaded into VRAM, and so on. At no point does the full set of 405B parameters sit in VRAM at once. The peak in-VRAM footprint at any given time is bounded by the activations and KV cache for the current sequence plus the weights for the current layer and the next layer being prepared, which fits within 32 GB for 405B-class models.
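The two steps above can be sketched as a toy loop. This is a minimal illustration, not the UltraCompress implementation: the pack layout (`codes`, `scale`), the offset of 16, and the names `reconstruct` and `stream_forward` are all hypothetical, plain floats stand in for bf16, and each 5-bit code is stored in its own byte for readability.

```python
def reconstruct(layer):
    """Explicitly decode one layer's 5-bit codes (0..31) to float weights."""
    return [(code - 16) * layer["scale"] for code in layer["codes"]]


def stream_forward(pack, x):
    """Run x through every layer while holding one decoded layer at a time."""
    peak_resident = 0
    for layer in pack:
        weights = reconstruct(layer)  # decode only the current layer
        peak_resident = max(peak_resident, len(weights))
        n = len(x)
        # Toy "layer compute": an n x n matrix-vector product.
        x = [sum(weights[i * n + j] * x[j] for j in range(n)) for i in range(n)]
        del weights  # released before the next layer is decoded
    return x, peak_resident


# Hypothetical 4-layer pack of 3x3 matrices (9 codes per layer).
pack = [
    {"scale": 0.25, "codes": bytes([16 + ((k + i) % 5) for i in range(9)])}
    for k in range(4)
]
total_weights = sum(len(layer["codes"]) for layer in pack)
output, peak = stream_forward(pack, [1.0, 0.5, -1.0])
print(f"total weights: {total_weights}, peak resident: {peak}")
```

In this toy the peak resident count is one layer's worth of weights (9 of 36), and running the loop twice on the same pack and input gives the same output, which is the property the contract relies on.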

The engineering claim is real. The model in production produces real outputs from real prompts on the kind of GPU you can buy in a retail electronics shop. What is not real is the literal reading "all 405 billion bf16 parameters are sitting in VRAM at the same time." That is arithmetically impossible: 810 GB does not fit in 32 GB.

Why this distinction matters

It matters for three audiences, in different ways.

Researchers and reviewers. Anyone with a calculator can check the arithmetic and notice the gap. If the marketing prose says "405B in VRAM," that reader stops reading at that sentence and never gets to the actual engineering claim. The honest framing — "405B streamed from disk through a 32 GB VRAM window" — is harder to misread and stays defensible under technical scrutiny.

Production engineers evaluating deployment. The streaming-from-disk path has a different latency profile than full residency. Cold start is gated by reconstruction time (about three minutes on CPU for 8B-class models, longer at 405B scale). Once warm, decode throughput is gated by how fast reconstructed weights can be delivered into VRAM, not by VRAM memory bandwidth as in the dense, bf16-resident case. Anyone planning capacity needs the actual execution model, not the marketing summary of it.

Regulated buyers. The audit story rests on the SHA-256 reconstruction contract. That contract is between the pack on disk and the bf16 weights that go to VRAM, not between any in-memory representation and any other in-memory representation. The honest framing is what makes the audit primitive cleanly applicable: "the artifact we audited is the one on your storage layer, and reconstruction is deterministic from there into the GPU."
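One way to picture the contract is as a digest recorded at compress time and rechecked on every reconstruction. The sketch below is an assumption about the shape of such a check, not the .uc format: the layer dictionary, `make_layer`, `verify_layer`, and the float32 encoding (standing in for bf16) are all hypothetical. Only `hashlib.sha256` and `struct.pack` are real standard-library calls.

```python
import hashlib
import struct


def reconstruct_bytes(codes: bytes, scale: float) -> bytes:
    """Deterministically decode 5-bit codes to little-endian float32 bytes."""
    values = [(code - 16) * scale for code in codes]
    return struct.pack(f"<{len(values)}f", *values)


def make_layer(codes: bytes, scale: float) -> dict:
    """Compress-time step: record the digest of the dequantized weights."""
    digest = hashlib.sha256(reconstruct_bytes(codes, scale)).hexdigest()
    return {"codes": codes, "scale": scale, "sha256": digest}


def verify_layer(layer: dict) -> bool:
    """Load-time step: reconstruct and compare against the recorded digest."""
    actual = hashlib.sha256(
        reconstruct_bytes(layer["codes"], layer["scale"])
    ).hexdigest()
    return actual == layer["sha256"]


layer = make_layer(bytes([0, 16, 31, 7]), 0.25)
tampered = dict(layer, codes=bytes([0, 16, 31, 8]))  # one code changed
print(verify_layer(layer), verify_layer(tampered))
```

The check passes for the untouched layer and fails when a single code changes. Note what is being compared: the reconstructed weights against the artifact recorded at compress time, not against the original full-precision checkpoint.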

What is in VRAM at any given moment

For a typical 405B-class forward pass on a 32 GB GPU at bf16, peak VRAM usage looks roughly like:

Component	Approximate VRAM
One transformer layer's bf16 weights (during compute)	~6.4 GB (810 GB / 126 layers)
Activations for current sequence (bf16, seq_len up to 2048)	1-4 GB
KV cache for attention	1-8 GB (depends on context length and batch)
CUDA framework overhead + buffers	2-3 GB
Reconstruction working set (next layer being prepared)	~6.4 GB
Total peak	~17-28 GB

The numbers above are illustrative; exact figures depend on batch size, sequence length, and the specific layer being processed. The point is the peak is small relative to the total bf16 weight tensor — about an order of magnitude smaller, which is what makes the streaming-from-disk strategy work on consumer hardware.

What this is not

This is not the same as offload-based inference in tools like Hugging Face Accelerate or DeepSpeed. Those tools place already-materialized weights across GPU, CPU memory, and disk, and pay a PCIe transfer cost on every forward pass to move them into VRAM. The UltraCompress streaming-from-disk path instead reconstructs compressed weights to bf16 once per layer per generation step, so what is read from NVMe is the compressed pack (about 250 GB for this model) rather than the full-size weights (810 GB).

It is also not "the model is virtually in memory" the way memory-mapped files are virtually in memory but get faulted in on access. The reconstruction is an explicit decoding step, not a lazy page fault. The pack contains compressed weights; the bf16 weights are produced from them deterministically; the produced weights go to VRAM for the duration of one layer's compute and are then released.

And it is emphatically not "the whole model lives in VRAM." We have said this a few times because if any one sentence from this post propagates, that is the one we want propagating.

The engineering claim, stated cleanly

An UltraCompress pack of Hermes-3-Llama-3.1-405B, on a single RTX 5090 with 32 GB VRAM, produces real chat-completions output. The wall-clock latency profile is dominated by disk-to-VRAM bandwidth at warm-cache state and by reconstruction at cold start. The output tokens per second are within the same order of magnitude as a dense-bf16-resident deployment on a larger GPU; we publish the actual measured numbers per model in our verified benchmarks.

The SHA-256 reconstruction contract holds: on every reconstruction, the bf16 weights that go to VRAM are byte-for-byte identical to the dequantized weights recorded in the pack at compress time. Those recorded weights are the validated artifact we evaluated to produce the published PPL ratio. Because 5-bit compression is lossy against the original reference checkpoint, the contract guarantees exact reproduction of that validated artifact, not of the original. That contract is what regulated buyers actually need from a compression layer, and it is preserved exactly in the streaming-from-disk architecture.

That is the engineering claim. We would rather you read it and decide on your own whether the architecture fits your use case than meet you halfway through a forum thread arguing about whether "100T on one GPU" was an overclaim. It was a shorthand. The arithmetic is above. Now you have the long form.

If a specific claim in this post contradicts a future Sipsa Labs marketing surface, this post wins. The arithmetic above is what the engineering actually does.