Policy : Evaluate AI Models Rigorously Before Deployment

Commitment to Rigorous Pre-Deployment Evaluation

Deploying an AI model is not the same as deploying a feature. A feature either works or it does not — the logic is deterministic and the test cases are finite. An AI model produces probabilistic outputs across an effectively infinite input space, with failure modes that may only manifest in production under specific conditions. Our commitment is to establish and enforce rigorous evaluation gates that give us genuine confidence in model behaviour before it reaches users — not confidence theatre driven by headline accuracy numbers.

What This Means

Rigorous evaluation means going beyond standard train/test splits and aggregate accuracy metrics. It means stress-testing models against adversarial inputs, evaluating fairness across demographic subgroups, testing behaviour at distribution boundaries, and validating that the model's real-world operating conditions match the assumptions baked into its training data. Models that pass a single aggregate metric do not automatically pass our deployment gate — they must demonstrate reliable, safe behaviour across the conditions that matter.
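The risk of relying on a single aggregate metric can be made concrete with a few lines of Python. This is a minimal sketch with hypothetical data and function names: an overall accuracy that looks healthy while one subgroup performs no better than a coin flip.

```python
from collections import defaultdict

def accuracy_by_subgroup(results):
    """Return overall accuracy and per-subgroup accuracy.

    results: list of (subgroup_label, is_correct) tuples.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in results:
        totals[group] += 1
        correct[group] += int(is_correct)
    overall = sum(correct.values()) / sum(totals.values())
    per_group = {g: correct[g] / totals[g] for g in totals}
    return overall, per_group

# Hypothetical evaluation results: 90/100 correct for group A,
# but only 5/10 correct for the under-represented group B.
results = (
    [("A", True)] * 90 + [("A", False)] * 10
    + [("B", True)] * 5 + [("B", False)] * 5
)
overall, per_group = accuracy_by_subgroup(results)
# Overall accuracy is ~86% and looks deployable; group B sits at 50%.
```

Because group B is a small fraction of the evaluation set, its failure barely moves the headline number — which is exactly why subgroup breakdowns are a separate gate, not a nice-to-have.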

Our commitment to rigorous pre-deployment evaluation is built on:

  • Defined Evaluation Criteria – Before any model development begins, we define what good looks like: the metrics, thresholds, and behavioural requirements a model must meet before it can be deployed. Evaluation criteria are agreed by product, engineering, and domain experts — not set unilaterally by the team that built the model.
  • Held-Out Evaluation Sets – Evaluation data is strictly separated from training data. We maintain curated, representative evaluation datasets that reflect real-world conditions, edge cases, and known difficult scenarios — not just the easy examples.
  • Subgroup and Fairness Analysis – Models are evaluated across relevant subgroups to identify differential performance, bias, or harm that aggregate metrics may obscure. Fairness is not assumed from overall accuracy.
  • Adversarial and Edge Case Testing – We actively probe models with inputs designed to expose failure modes: out-of-distribution examples, adversarial perturbations, ambiguous cases, and known-difficult scenarios identified from domain expertise.
  • Comparison Against Baselines – New models are always evaluated against the current production model or a well-understood baseline. A new model must demonstrably outperform what it replaces — not just perform acceptably in isolation.
  • Independent Evaluation Review – Where stakes are material, model evaluation is reviewed by someone not directly involved in building the model. This catches evaluation errors, metric gaming, and blind spots that self-assessment misses.
  • Documentation of Known Limitations – Every model that passes evaluation is accompanied by documentation of its known limitations, failure modes, and operating conditions. Deploying a model means accepting its documented constraints — not ignoring them.
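The gate described by the bullets above can be sketched as code. Everything here is hypothetical — the class name, thresholds, and metric inputs are illustrative, not a prescribed implementation — but it shows the shape: agreed thresholds encoded up front, a subgroup-gap limit, a must-beat-baseline rule, and failure reasons returned for documentation rather than a bare boolean.

```python
from dataclasses import dataclass

@dataclass
class EvaluationGate:
    """Deployment gate with criteria agreed before model development."""
    min_accuracy: float = 0.90           # agreed threshold, not team-chosen
    max_subgroup_gap: float = 0.05       # worst-vs-best subgroup accuracy
    min_gain_over_baseline: float = 0.0  # new model must not regress

    def check(self, candidate_acc, subgroup_accs, baseline_acc):
        """Return (passed, reasons) so the outcome can be documented."""
        failures = []
        if candidate_acc < self.min_accuracy:
            failures.append(
                f"accuracy {candidate_acc:.3f} below {self.min_accuracy}")
        gap = max(subgroup_accs.values()) - min(subgroup_accs.values())
        if gap > self.max_subgroup_gap:
            failures.append(
                f"subgroup gap {gap:.3f} exceeds {self.max_subgroup_gap}")
        if candidate_acc - baseline_acc < self.min_gain_over_baseline:
            failures.append(
                f"does not beat baseline ({baseline_acc:.3f})")
        return len(failures) == 0, failures

gate = EvaluationGate()
ok, reasons = gate.check(
    candidate_acc=0.92,                     # passes the aggregate threshold
    subgroup_accs={"A": 0.94, "B": 0.85},   # 0.09 gap fails the fairness check
    baseline_acc=0.90,                      # candidate beats the incumbent
)
# ok is False: the headline metric passes, but the subgroup gap blocks deploy.
```

Returning the list of failure reasons, rather than a single pass/fail bit, is what makes the independent review and limitations documentation in the bullets above possible.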

Why This Matters

The cost of deploying a poorly evaluated model is not symmetric. A model that fails in subtle or systematic ways can harm users, erode trust, trigger regulatory scrutiny, and create reputational damage that dwarfs the cost of a longer evaluation process. The history of AI is littered with examples of models that performed brilliantly on benchmark datasets and failed catastrophically in production. Rigorous evaluation is the mechanism by which we prevent that history from repeating itself in our systems.

Our Expectation

No AI model is deployed to production without passing a documented evaluation process that meets the standards defined for its risk tier. Teams that skip, abbreviate, or self-certify evaluation are not moving faster — they are borrowing against future crises. Building AI systems that have genuinely earned the right to go live is how we deliver outcomes that are reliably, demonstrably Better.
