Model-risk guidance and the deployment-validation gap quantization opens

The revised interagency model-risk guidance (OCC Bulletin 2026-13, which superseded SR 11-7 on April 17, 2026) expects the model in production to be the model that was validated. Standard 4-bit quantization breaks that link silently. Cryptographic reconstruction makes the deployed-equals-validated check a one-line audit primitive.

Sipsa Inference · 2026-05-25 · Posted by the Sipsa Labs team

OCC 2026-13: Model-risk guidance (superseded SR 11-7)

SHA-256: Deployed = validated, one check

4-bit: Where the link breaks

22 + 1: PPL-verified + 1 ViT cosine-verified

A note on which rule governs: the long-standing model-risk framework banks know as SR 11-7 was rescinded and replaced by the revised interagency model-risk guidance (OCC Bulletin 2026-13, which superseded SR 11-7 on April 17, 2026). The deployed-equals-validated expectation discussed below carried forward unchanged; throughout this piece “the guidance” refers to OCC Bulletin 2026-13, and SR 11-7 is named where the historical lineage matters.

Most discussions of this guidance focus on the validation side of the framework — the rigor of pre-deployment testing, the independence of the validation function, the documentation of model assumptions. That's where banks spend their compliance budget. But there's a quieter requirement the framework imposes that traditional model risk teams handle through process attestation rather than technical proof: the model in production must be the model that was validated.

For decades that requirement was operationally trivial because the validated model was a set of regression coefficients in a file, and the deployed model was that same file copied to a production server. The chain of custody was the file system + change control + a sign-off. Auditable, defensible, simple.

Then came large neural-net models in production. And then came compression, because a $300K deployment-grade bf16 GPU per branch wasn't going to happen, and quantization opened up the cost-per-inference math that made the deployment viable in the first place.

What nobody quite reckoned with: standard quantization breaks the deployed-equals-validated link silently.

The mechanism — why standard quantization breaks the guidance's quietest assumption

Consider a bank governed by the interagency model-risk guidance deploying a large language model for, say, complaint triage or first-pass anti-money-laundering review. The model risk function validates the model at full precision (bf16 or fp16) on a held-out historical dataset. The validation report is signed off. The model gets pushed to production.

In production, the model runs at 4-bit precision. AWQ, GPTQ, EXL3, or bnb-nf4: pick your kernel. The reasoning is rational: 4-bit weights occupy about a quarter of the memory of bf16 weights, which can translate into up to roughly 4x more requests per dollar on the same hardware.

But the 4-bit weights in production are not the bf16 weights that were validated. They are a numerical approximation of those weights. They differ by an amount that depends on:

Which quantization kernel was used
Which CUDA version compiled it
Which GPU SM generation runs it
Which dropout / cache state was active during the calibration pass

Across two different production servers running the "same" 4-bit pack, you can observe slightly different reconstructed weights. We have observed this ourselves, and it is published in Chen et al., arXiv 2408.05148. The outputs match in 999 cases out of 1000. The 1000th case is where the audit narrative breaks.
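One plausible contributor to this kind of stack dependence is that floating-point addition is not associative, so two kernels that accumulate the same terms in a different order can land on different bits. The sketch below illustrates only that general property, using Python floats and a hypothetical `digest` helper; it is not a reproduction of any specific quantization kernel or of the cited result.

```python
import hashlib
import struct

def digest(values):
    """SHA-256 over the little-endian float64 bytes of a list of floats."""
    raw = b"".join(struct.pack("<d", v) for v in values)
    return hashlib.sha256(raw).hexdigest()

# The same three terms, accumulated in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The two sums differ in the last bit, so their digests differ completely.
same_value = left_to_right == right_to_left
same_digest = digest([left_to_right]) == digest([right_to_left])
print(same_value, same_digest)
```

A difference in the last bit is irrelevant to almost every output, but a hash comparison has no notion of "almost": the digests either match or they do not.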

The guidance doesn't require that this 1000th case be detected. It requires that the bank can answer affirmatively when an examiner asks: "is the deployed model the validated model?"

For full-precision deployments, the answer was "yes, here's the file hash." For 4-bit quantized deployments, the honest answer is: "the deployed model is a numerically-approximate version of the validated model, with a drift that depends on the runtime stack." That answer doesn't survive a model risk audit even though the bank's actual compliance posture is reasonable.
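To make the "here's the file hash" problem concrete, here is a minimal sketch, assuming a naive symmetric round-to-nearest scheme rather than AWQ, GPTQ, EXL3, or bnb-nf4. The weight values and the helper names `sha256_of` and `quantize_roundtrip` are made up for illustration.

```python
import hashlib
import struct

def sha256_of(weights):
    """Hash the little-endian float32 byte representation of a weight list."""
    raw = struct.pack(f"<{len(weights)}f", *weights)
    return hashlib.sha256(raw).hexdigest()

def quantize_roundtrip(weights, bits=4):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    levels = 2 ** (bits - 1) - 1              # 7 positive levels for 4-bit
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]

validated = [0.12, -0.53, 0.98, -0.07, 0.31]  # illustrative "validated" weights
deployed = quantize_roundtrip(validated)      # what actually serves traffic

max_error = max(abs(a - b) for a, b in zip(validated, deployed))
hashes_match = sha256_of(validated) == sha256_of(deployed)
print(f"max abs error: {max_error:.4f}, hashes match: {hashes_match}")
```

The per-weight error is small, yet the hash of the validated weights no longer describes the deployed ones, which is exactly the answer an examiner is asking about.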

The gap is real and it's growing as more banks compress more models for production.

What a model-risk audit actually wants to see

The framework's text is precise. The revised interagency model-risk guidance (OCC Bulletin 2026-13, which on April 17, 2026 superseded SR 11-7 and its OCC companion, OCC 2011-12) requires that:

Model validation occurs prior to use
Validation documentation describes the specific model artifact that was validated
The model in use in production is consistent with the validated model
Changes to the model trigger re-validation

The third requirement is what compression breaks. The model in use is consistent with the validated model in spirit (same architecture, same training, same intended behavior) but not identical to the validated model in numerical fact. Examiners who care about defensible posture want a positive answer to "identical, byte for byte." Many production deployments today cannot provide one.

The conventional response is process: change-management logs, runtime configuration attestations, signed-off quantization-pipeline documentation, statistical drift monitoring. These work as compensating controls but they don't answer the underlying question; they substitute attestation for proof.

A cryptographic answer to a cryptographic question

The question "is the deployed model the validated model" is at its heart a cryptographic question. Two artifacts are either identical or they are not. The bytes match or they don't.

Near-lossless 5-bit compression with SHA-256 reproducible reconstruction reframes this. The compressed pack ships with a per-tensor SHA-256 manifest. When the weights are reconstructed at deployment time, every tensor's SHA-256 matches its manifest entry. The manifest is the cryptographic fingerprint of the validated model; the hashes of the reconstructed tensors are the fingerprint of the deployed model. The compliance check is:

uc verify <deployed_pack>

If it returns the same fingerprint as the validation report's manifest entry, the deployed model is byte-for-byte the validated model. Not approximately. Not statistically. Identically. The same property the model risk team had for a regression coefficient file in 2005 is restored for a 70B-parameter LLM in 2026.
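The idea behind a per-tensor manifest check can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation of `uc verify`: the manifest layout (tensor name mapped to a hex digest), the function names `build_manifest` and `verify_against_manifest`, and the toy byte strings standing in for tensors are all hypothetical.

```python
import hashlib

def build_manifest(tensors):
    """Map each tensor name to the SHA-256 hex digest of its raw bytes."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in tensors.items()}

def verify_against_manifest(tensors, manifest):
    """Return the sorted names of tensors that are missing, extra, or altered."""
    actual = build_manifest(tensors)
    names = set(actual) | set(manifest)
    return sorted(n for n in names if actual.get(n) != manifest.get(n))

# Toy stand-ins for reconstructed weight tensors (raw bytes).
validated = {"layer0.weight": b"\x01\x02\x03\x04", "layer0.bias": b"\x05\x06"}
manifest = build_manifest(validated)          # recorded at validation time

deployed_ok = dict(validated)
deployed_bad = {**validated, "layer0.bias": b"\x05\x07"}  # one byte differs

mismatches_ok = verify_against_manifest(deployed_ok, manifest)
mismatches_bad = verify_against_manifest(deployed_bad, manifest)
print(mismatches_ok, mismatches_bad)
```

An empty list means every tensor matched; any non-empty list names exactly which tensors diverged from the validated artifact.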

One caveat matters here: the 5-bit quantization step itself is lossy. There is a perplexity ratio (we publish 1.0125x for Llama-3.1-8B, 1.0066x for Hermes-3-405B, and so on) that represents the small distance between the bf16 reference and the 5-bit reconstruction. The bank validates the 5-bit reconstruction directly, because that is the deployable artifact, and from that point on the SHA-256 manifest is the bridge to production. You qualify the exact artifact you deploy; you don't ask the validator to qualify a different artifact from the one that serves traffic.
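For readers who want the arithmetic behind a perplexity ratio, here is a minimal sketch. It assumes the ratio is the perplexity of the reconstruction divided by the perplexity of the bf16 reference on the same tokens, with perplexity taken as the exponential of the mean negative log-probability per token. The log-probabilities are made-up values, not measurements from any model.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Made-up per-token log-probabilities on the same held-out text.
reference_log_probs = [-2.10, -0.35, -1.20, -0.80]
reconstructed_log_probs = [-2.13, -0.36, -1.22, -0.82]

# Assumed definition: reconstruction perplexity over reference perplexity.
ratio = perplexity(reconstructed_log_probs) / perplexity(reference_log_probs)
print(f"perplexity ratio: {ratio:.4f}")
```

A ratio slightly above 1 says the reconstruction is marginally less confident on the evaluation text than the reference, which is the quantity the validation report would qualify.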

What this changes operationally for the model risk function

Three concrete shifts:

1. The deployment audit becomes a control, not an attestation. Instead of "the deployed model is consistent with the validated model per process records," the bank can run uc verify against the deployed artifact and produce a cryptographic match against the validation manifest. That moves the compliance posture from "trust + paper trail" to "trust + paper trail + cryptographic check on demand."

2. The model inventory becomes a registry of digests. The guidance's expectation of a comprehensive model inventory typically lives in a SharePoint table or a GRC tool. Adding a SHA-256 manifest column for every compressed model — and a check that the production-running model matches that manifest — converts the inventory from a documentation artifact into a verification harness.

3. The change-management gate becomes binary. A re-quantization, re-fine-tune, or kernel-update changes the SHA-256 fingerprint. The model risk team's gate "did the model change in a way that requires re-validation?" reduces to "did the deployed manifest entry change since last validation?" — a one-line check that runs in CI.
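Shifts 2 and 3 can be sketched together: an inventory record that stores one digest per validated model, and a gate that compares it with what is running. Everything named here is hypothetical, including `InventoryEntry`, `manifest_digest`, `revalidation_gate`, the model id, and the report label; collapsing the per-tensor manifest into one digest through canonical JSON is an assumption made for the example, not a description of the product's format.

```python
import hashlib
import json
from dataclasses import dataclass

def manifest_digest(manifest):
    """Collapse a per-tensor manifest into one digest via canonical JSON."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

@dataclass(frozen=True)
class InventoryEntry:
    model_id: str            # hypothetical inventory key
    validation_report: str   # hypothetical report label
    validated_digest: str    # digest recorded at validation sign-off

def revalidation_gate(entry, deployed_manifest):
    """CI-style gate: 0 when the deployed digest matches, 1 otherwise."""
    return 0 if manifest_digest(deployed_manifest) == entry.validated_digest else 1

validated_manifest = {"layer0.weight": "ab" * 32, "layer0.bias": "cd" * 32}
entry = InventoryEntry("complaint-triage-llm", "VAL-0001",
                       manifest_digest(validated_manifest))

# Same content in a different key order: still the validated model.
reordered = {"layer0.bias": "cd" * 32, "layer0.weight": "ab" * 32}
# A re-quantization changes at least one tensor digest.
requantized = {**validated_manifest, "layer0.weight": "12" * 32}

exit_unchanged = revalidation_gate(entry, reordered)
exit_changed = revalidation_gate(entry, requantized)
print(exit_unchanged, exit_changed)
```

In a real pipeline the gate's return value would be handed to `sys.exit` so that a changed digest fails the job and routes the change to the model risk team.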

None of this is novel cryptographically. It's been the standard pattern in code signing, software supply chain (SLSA), TLS certificate transparency, and container image registries for years. The novel piece is applying the pattern to AI model deployment in a way that survives compression — which until reproducible 5-bit reconstruction wasn't possible.

Where the bank's procurement function asks the question

The honest reading of where this matters: most banks are not going to be the first adopters of cryptographic model provenance. The compliance posture today is "process attestation" and examiners are accepting it. The change happens when one examiner asks for a cryptographic check and the answer is no — at which point every model risk team in the country is reading the same memo about updating their compliance protocols.

The institutions that should be thinking about this now are:

The ones running large language models in regulated decision-support workflows (fraud, AML, complaint resolution, loan adjudication assistance)
The ones whose audit committee asks pointed questions about AI risk
The ones who anticipate examiner scrutiny in the next 12-24 months
The ones for whom "trust + paper trail" is a risk profile they want to upgrade

The audit primitive your model risk function needs already exists. The question is when your institution adopts it.

Sipsa Labs builds near-lossless 5-bit transformer compression with SHA-256 reproducible reconstruction. Patents pending. Engineering due-diligence under NDA: founder@sipsalabs.com. Live inference API at api.sipsalabs.com.