How FDA SaMD reviews handle quantized AI models — and the audit gap nobody talks about
FDA's Predetermined Change Control Plan framework assumes the deployed model is the validated model. Standard quantization (AWQ, GPTQ, EXL3, bnb-nf4) makes that assumption false by construction, and model signing proves only the integrity of the file at rest, not that the deployed weights still match what was validated. This post describes that audit gap and what a regulated AI team can do about it today.
If you are running a 510(k)- or De Novo-cleared AI in production, you have already learned that FDA cares about two things you may not have cared about in your last ML role: what the deployed model is, and what it is allowed to become without a new submission. The first is provenance. The second is a Predetermined Change Control Plan. Quantization sits squarely in between them, and the current state of the practice does not close the gap.
The setup: PCCPs and what they pin to
FDA's December 2024 final guidance — Marketing Submission Recommendations for a Predetermined Change Control Plan for Artificial Intelligence-Enabled Device Software Functions — formalized the mechanism for shipping iterative changes to a cleared AI without filing a new 510(k) for every update. Three things are load-bearing in that document. A PCCP must specify (i) the planned modifications, (ii) the Modification Protocol — the methodology to develop, validate, and implement them, and (iii) an impact assessment. Authorized modifications must not alter the device's intended use, and in most cases not the indications for use either.
The December 2024 final builds on the October 2023 Predetermined Change Control Plans for Machine Learning-Enabled Medical Devices: Guiding Principles (FDA / Health Canada / MHRA), the June 2024 Transparency for Machine Learning-Enabled Medical Devices: Guiding Principles (same three regulators), and the Good Machine Learning Practice principles before them. In January 2025 FDA released a draft guidance, Artificial Intelligence-Enabled Device Software Functions: Lifecycle Management and Marketing Submission Recommendations, that takes a Total Product Lifecycle (TPLC) approach and explicitly stitches PCCPs into a lifecycle framework: pre-market, post-market, and the seam between them. As of writing there is no further public final or draft in 2026 that supersedes the December 2024 PCCP final on this specific question; if that changes, the structural problem below does not.
The structural problem is this. A PCCP pins your future to a specific validated model. The Modification Protocol describes how it can change. Anything outside the protocol is, regulatorily, a different device. Which means the moment you ship, what runs in production has to be the artifact you described to the agency — within the tolerance the agency accepted. Not "the same model, retrained." Not "the same model, quantized for inference." The artifact.
The problem: standard quantization is lossy by contract
Every real-world inference team quantizes. You cannot serve a 70B-parameter model in production at FP16 economics. So engineers reach for AWQ, GPTQ, EXL3, or bnb-nf4. These are good methods. They are also, by design, lossy.
That is not a slur; it is a contract. AWQ and GPTQ minimize layer-output reconstruction error on a calibration set, subject to a low-bit weight format; perplexity is how the result is usually reported, not the objective being optimized. EXL3 (the ExLlamaV3 successor to EXL2) uses a search-based, trellis-coded scheme derived from QTIP. bitsandbytes nf4 maps full-precision weights, scaled per block, onto a fixed 16-value grid. In all four cases, the deployed weight matrix is numerically different from the weight matrix you validated. That difference is small in expectation, but it is not zero, and the error distribution is not the same across:
- different calibration data,
- different quantization runs on the same data (search-based and clustering-based methods have stochastic components),
- different CUDA / kernel / GPU generations downstream.
The third point matters more than people admit. Floating-point arithmetic is not associative, GPU reductions reorder partial sums in ways that depend on grid size and kernel version, and atomic accumulators are nondeterministic at the bit level. Impacts of floating-point non-associativity on reproducibility for HPC and deep learning applications (Shanmugavelu et al., arXiv 2408.05148, 2024) walks through this in detail for both training and inference; the punchline is that bitwise-identical outputs across hardware and library versions are not the default but an engineering achievement that costs throughput.
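The non-associativity point is easy to see without a GPU. The sketch below is an illustration in plain Python, not a reproduction of any particular kernel: it regroups a three-term sum, then accumulates the same list of values in two different orders, which is roughly what a reduction with a different grid size does. The helper name naive_sum and the synthetic values are assumptions made for the example.

```python
import math
import random

# Floating-point addition is not associative: grouping changes the rounded result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False on IEEE-754 doubles

def naive_sum(xs):
    """Plain left-to-right accumulation, with no compensation."""
    total = 0.0
    for x in xs:
        total += x
    return total

# The same effect at scale: identical values, different accumulation order.
rng = random.Random(0)  # fixed seed so the run is repeatable
values = [rng.uniform(-1.0, 1.0) * 10.0 ** rng.randint(-8, 8) for _ in range(10_000)]

forward = naive_sum(values)
backward = naive_sum(reversed(values))
exact = math.fsum(values)  # correctly rounded reference sum

# Each order lands at its own small distance from the reference.
print(forward - exact, backward - exact)
```

The two orders generally disagree in the last few bits. That is harmless for a single sum, but in a deep network the discrepancy feeds every subsequent layer, which is why pinning the kernel and hardware matters for reproducibility.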
Two operational consequences for a regulated AI team.
First, the deployed model is not the validated model. It is a numerically nearby approximation produced by a quantization step that happened after validation. The accuracy gap is known, and perhaps small, only if you measured it on the quantized artifact. It is unknown if you did not.
Second, you cannot tell a model fault from a quantization artifact in post-market monitoring. When a clinical-decision-support output drifts, your incident response has to consider both "the model is wrong" and "the kernel built the model differently this week." That is a worse position than a regulator should accept, and one your model risk management (MRM) peers will eventually flag.
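To make "lossy by contract" concrete, here is a minimal sketch of symmetric round-to-nearest quantization with a per-block absmax scale. It is deliberately simpler than AWQ, GPTQ, EXL3, or nf4 and is not an implementation of any of them; the function name quantize_dequantize, the block sizes, and the synthetic weights are assumptions for illustration. It shows two things: the dequantized weights differ from the validated ones, and a different quantizer setting yields a different deployed artifact from the same source weights.

```python
import random

def quantize_dequantize(weights, bits=4, block=8):
    """Symmetric round-to-nearest quantization with one absmax scale per block."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    out = []
    for start in range(0, len(weights), block):
        chunk = weights[start:start + block]
        # Fall back to a scale of 1.0 if the whole block is zero.
        scale = max(abs(w) for w in chunk) / qmax or 1.0
        # Integer code in [-qmax, qmax], then mapped back to a float.
        out.extend(round(w / scale) * scale for w in chunk)
    return out

rng = random.Random(0)  # fixed seed so the run is repeatable
validated = [rng.gauss(0.0, 0.02) for _ in range(64)]  # stand-in for FP weights

deployed_a = quantize_dequantize(validated, block=8)
deployed_b = quantize_dequantize(validated, block=16)

max_err = max(abs(v - d) for v, d in zip(validated, deployed_a))
print("max abs error vs validated weights:", max_err)
print("same artifact from both settings:", deployed_a == deployed_b)
```

The error is bounded by half a quantization step per weight, so it is small, but it is nonzero and it depends on the quantizer configuration. Whatever validation evidence you hold has to be about one specific output of this step.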
The audit gap: there is no standard primitive for "deployed = validated"
Now the part nobody talks about. Model-signing standards do exist — the OpenSSF Model Signing (OMS) spec and Sigstore prove the integrity and provenance of the model file at rest. What they do not prove is that the deployed, post-conversion weights the GPU actually executes still correspond to the artifact FDA reviewed. Standard quantization rebuilds that state at load time, so signing the submitted file leaves the deployed-equals-validated question open.
The closest analogues do not solve this:
- A SHA-256 of the safetensors file you submitted. Useful — and we recommend it below — but standard quantization toolchains rebuild quantized state at load time (kernel-specific packing, dequantization-on-the-fly, mixed-precision execution paths), so the bytes you hashed are not the bytes the GPU executes. The hash proves the file is identical. It does not prove the weights the kernel uses are identical across hardware.
- Reproducible builds / SLSA provenance. Tells you the binary came from the source you trust. Says nothing about whether the numerics on this GPU equal the numerics on that GPU.
- Container image digests. Same problem one layer up.
- Differential testing on a held-out validation set. Statistical, not cryptographic. Closes some of the gap. Does not close all of it, and does not survive an FDA inspector asking "show me the proof, not the approximation."
The honest summary: most cleared AI products today rely on process attestation that the deployment pipeline did not corrupt the model. They cannot give you a one-line cryptographic check that the bytes the GPU just used to score a patient are the bytes the validation report measured. Your MRM team's "model provenance" complaint is this complaint.
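The gap between hashing the file and hashing what the kernel consumes can be shown in a few lines. In this sketch the two loaders are hypothetical stand-ins, not real toolchain functions: one keeps the on-disk order, the other repacks values the way a kernel-specific layout might. Both start from byte-identical files.

```python
import hashlib
import struct

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Stand-in for the submitted weight file: eight little-endian float32 values.
weights = [0.5, -1.25, 2.0, 0.0, 3.5, -0.75, 1.0, 4.0]
file_bytes = struct.pack("<8f", *weights)

def load_row_major(raw: bytes) -> bytes:
    """Hypothetical loader A: keeps the on-disk order."""
    return raw

def load_interleaved(raw: bytes) -> bytes:
    """Hypothetical loader B: even-indexed values first, then odd-indexed,
    imitating a kernel-specific packing applied at load time."""
    vals = struct.unpack("<8f", raw)
    return struct.pack("<8f", *(vals[0::2] + vals[1::2]))

file_digest = sha256_hex(file_bytes)
digest_a = sha256_hex(load_row_major(file_bytes))
digest_b = sha256_hex(load_interleaved(file_bytes))

print("file hash matches loader A buffer:", file_digest == digest_a)
print("loader A and loader B buffers match:", digest_a == digest_b)
```

Both deployments would pass a file-level signature check, yet only one of them executes against bytes that match the digest. A pure reordering like this one is numerically benign; load-time requantization or mixed-precision conversion is not, and a file hash cannot distinguish the two cases.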
What near-lossless 5-bit reconstruction changes
A reconstruction contract changes the shape of the problem. The Sipsa Labs codec — one technical option among several worth comparing — compresses a model once into a .uc pack that reconstructs reproducibly on every load and on any hardware. The measured max-absolute-difference across reconstructions is 0.00e+00 in 32-bit float. Not "small," zero. Because reconstruction is exact, the post-reconstruction weights have a stable SHA-256, and that hash means something: it is the digest of the exact tensor the kernel will execute against.
We do not disclose codec internals here, by deliberate choice. The differentiator is not the how, it is the contract: artifact in, bit-identical artifact out, with a hash that survives reload and hardware change. Other approaches are working on similar contracts — full lossless coding of FP weights, fixed-point representations with published reconstruction proofs, or sealed inference enclaves that attest to the loaded weights. A regulated team should evaluate any of them on the same axis: does the deployment artifact equal the validated artifact, and can you prove it in one cryptographic check?
The honesty constraint here is the same one we publish for every Sipsa pack. Near-lossless 5-bit reconstruction is byte-for-byte identical against the quantized tensor; the quantized tensor itself has a measured, small perplexity ratio to the original full-precision model — for instance 1.0066× on a 405B Hermes-3 pack. You qualify the artifact you will deploy, including its measured deviation from FP16, and from that point forward every deployment is provably that artifact. That is a different claim than "lossless against full precision," and you should not let any vendor — including us — blur it.
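Any scheme that claims a reconstruction contract can be tested the same way, whoever the vendor is. The harness below is a sketch under stated assumptions: toy_reconstruct is a hypothetical deterministic decoder invented for the example (it is not the Sipsa codec or any real API), and check_contract simply reconstructs several times, reports the max absolute difference across runs, and collects the SHA-256 digests of the reconstructed float32 buffers.

```python
import hashlib
import struct

def toy_reconstruct(codes, scale):
    """Hypothetical decoder: integer codes times a scale, emitted as float32 bytes."""
    return struct.pack(f"<{len(codes)}f", *(c * scale for c in codes))

def check_contract(reconstruct, runs=3):
    """Call reconstruct() several times; return (max abs diff, set of digests)."""
    buffers = [reconstruct() for _ in range(runs)]
    digests = {hashlib.sha256(b).hexdigest() for b in buffers}
    ref = struct.unpack(f"<{len(buffers[0]) // 4}f", buffers[0])
    max_abs_diff = 0.0
    for buf in buffers[1:]:
        vals = struct.unpack(f"<{len(buf) // 4}f", buf)
        max_abs_diff = max(max_abs_diff, max(abs(x - y) for x, y in zip(ref, vals)))
    return max_abs_diff, digests

codes = [-15, -3, 0, 7, 12]
scale = 0.03125  # a power of two, so every product is exact in float32

max_abs_diff, digests = check_contract(lambda: toy_reconstruct(codes, scale))
print("max abs diff across reconstructions:", max_abs_diff)
print("distinct digests:", len(digests))
```

A passing result here covers repeated loads in one process on one machine only. The cross-hardware half of the contract needs the same digest recorded on every target platform and compared against the one in the validation report.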
What a regulated AI team should do today
You do not need to wait for a vendor decision to close most of this gap. The discipline that gets you most of the way there is mostly procedural, and FDA reviewers respond well to it because it maps cleanly onto the December 2024 PCCP guidance.
- Treat the deployed artifact as the regulated artifact, not the source weights. Whatever model the kernel actually loads — quantized, packed, sealed — is what your PCCP is pinned to. Validate that artifact end-to-end. Do not validate FP16 and ship INT4 and assume the gap is small.
- Maintain a SHA-256 manifest of the validated model weights at the byte representation the kernel consumes. Not the source safetensors file alone — the reconstructed, post-load tensors, named and hashed individually. This is your provenance ground truth. Store the manifest in your QMS document control alongside the validation report.
- Require every production deployment to match a registered manifest. Recompute hashes on load. A mismatch is a deployment halt, not a warning. This is a control your auditor can actually verify; "we trust the CI pipeline" is not.
- Prefer compression schemes that publish a reconstruction contract. If your quantizer's output is "approximately the same weights with high probability," you are paying for that uncertainty downstream — in monitoring ambiguity, in incident-response time, and in the PCCP impact-assessment surface you have to defend. If your quantizer's output is "bit-identical to the validated artifact on any hardware," the provenance question collapses to a hash compare.
- Make hardware and kernel versions part of the validated configuration. The non-associativity literature is explicit: even bit-identical weights can produce different outputs under different reduction orders. Pin the kernel; document the GPU generation; treat a CUDA major-version bump like any other modification under your Modification Protocol.
- Write the manifest check into the PCCP Modification Protocol itself. When you describe how a future update will be validated and rolled out, name the cryptographic check. FDA's December 2024 final asks for methodology; cryptographic equality of deployed-vs-validated is exactly the kind of objective, repeatable methodology the agency is looking for.
None of this requires a vendor. Every item except the fourth can be implemented this quarter with a Python script and a QMS update. The fourth, preferring a compression scheme with a published reconstruction contract, is the only one where the toolchain matters, and the right answer is to evaluate any contract that closes the bit-identity gap, ours or someone else's.
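As a rough sketch of what that Python script could look like, the example below builds a per-tensor SHA-256 manifest over the post-load buffers, records the pinned environment alongside it, and halts on any mismatch. Everything named here is an assumption for illustration: the tensor names, the environment keys and values, and the ManifestMismatch exception. How you obtain the post-load bytes depends on your serving stack and is not shown.

```python
import hashlib
import json

class ManifestMismatch(RuntimeError):
    """Raised to halt a deployment that does not match the registered manifest."""

def build_manifest(tensors: dict, environment: dict) -> dict:
    """Hash each named post-load buffer and record the pinned environment."""
    return {
        "environment": dict(sorted(environment.items())),
        "tensors": {name: hashlib.sha256(buf).hexdigest()
                    for name, buf in sorted(tensors.items())},
    }

def verify_on_load(tensors: dict, environment: dict, registered: dict) -> None:
    """Recompute the manifest at load time; any difference is a hard stop."""
    actual = build_manifest(tensors, environment)
    problems = []
    if actual["environment"] != registered["environment"]:
        problems.append("environment differs from validated configuration")
    for name in sorted(set(actual["tensors"]) | set(registered["tensors"])):
        if actual["tensors"].get(name) != registered["tensors"].get(name):
            problems.append(f"tensor {name!r} does not match")
    if problems:
        raise ManifestMismatch("; ".join(problems))

# Placeholder values; a real manifest would carry the actual GPU model,
# driver, and kernel library versions from the validated configuration.
env = {"gpu": "example-gpu", "kernel_build": "example-1.0"}
validated = {"layer0.weight": b"\x00\x01\x02\x03", "layer0.bias": b"\x10\x11"}

registered = build_manifest(validated, env)
stored = json.dumps(registered, indent=2, sort_keys=True)  # goes into document control

verify_on_load(validated, env, json.loads(stored))  # matches, returns silently

tampered = dict(validated, **{"layer0.bias": b"\x10\x12"})
try:
    verify_on_load(tampered, env, json.loads(stored))
    halted = False
except ManifestMismatch as exc:
    halted = True
    print("deployment halted:", exc)
```

Raising instead of logging is the point of the design: the check is only an auditable control if a mismatch cannot be ignored. Including the environment in the same manifest ties the kernel and hardware pin to the same gate as the weights.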
Closing
The deployed-equals-validated question is the question every framework — FDA SaMD, SR 11-7 in finance, defense ATO, EU AI Act high-risk Annex III — eventually arrives at. The reason it has not had a clean answer is that the dominant inference toolchain was built to optimize throughput, not to preserve a cryptographic chain of custody. That trade-off is now visible to your auditors, and the easier path is to fix it before it shows up in an inspection finding.
References
- FDA, Marketing Submission Recommendations for a Predetermined Change Control Plan for Artificial Intelligence-Enabled Device Software Functions (Final Guidance, December 2024). fda.gov
- FDA / Health Canada / MHRA, Predetermined Change Control Plans for Machine Learning-Enabled Medical Devices: Guiding Principles (October 2023). fda.gov
- FDA / Health Canada / MHRA, Transparency for Machine Learning-Enabled Medical Devices: Guiding Principles (June 2024). fda.gov
- FDA, Artificial Intelligence in Software as a Medical Device (program page, current). fda.gov
- FDA, Artificial Intelligence-Enabled Device Software Functions: Lifecycle Management and Marketing Submission Recommendations (Draft Guidance, January 2025). fda.gov
- Shanmugavelu, S. et al., Impacts of floating-point non-associativity on reproducibility for HPC and deep learning applications, arXiv:2408.05148, 2024. arxiv.org/abs/2408.05148
Sipsa Labs is an experimental deep-tech and software company. UltraCompress is its first publicly shipped product; Sipsa Inference is its second. More products are in flight.