Standard: Model performance is benchmarked against defined baselines before release
Purpose and Strategic Importance
This standard requires that every AI model be evaluated against a documented baseline — such as a previous model version, a rule-based heuristic, or a human performance benchmark — before it is released to production. It supports the policy of rigorous pre-deployment evaluation by ensuring that "better" is defined objectively rather than assumed. Without baseline comparison, teams cannot determine whether a model is genuinely improving outcomes or simply replacing one set of failure modes with another.
Strategic Impact
- Establishes an objective definition of improvement that the organisation can hold teams accountable to
- Prevents regressions from reaching production by catching performance degradation before deployment
- Builds confidence among stakeholders and end users that AI systems are genuinely better than alternatives
- Creates a continuous improvement culture where each model release must justify itself against measurable evidence
- Reduces wasted deployment effort on models that do not deliver meaningful uplift over existing solutions
Risks of Not Having This Standard
- Teams deploy models that perform worse than the system they replaced, eroding trust in AI initiatives
- Stakeholders lose confidence in AI delivery because "improvement" is asserted but never demonstrated
- Model regressions go undetected until they surface as customer complaints or operational failures
- Engineering effort is misdirected toward model complexity when simpler baselines would suffice
- Regulatory and audit scrutiny increases when release decisions cannot be evidenced with comparative performance data
CMMI Maturity Model
Level 1 – Initial
| Category | Description |
| --- | --- |
| People & Culture | Performance evaluation is informal and subjective; individual engineers decide when a model is "good enough" |
| Process & Governance | No baseline definition exists; release decisions are based on gut feel or deadline pressure |
| Technology & Tools | Evaluation is conducted ad hoc using whatever metrics the developer chooses at the time |
| Measurement & Metrics | No formal metrics are captured pre-release; performance claims are anecdotal |
Level 2 – Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams agree on a small set of metrics to evaluate before release; baselines are informally recorded |
| Process & Governance | A simple benchmark checklist is in place; sign-off requires comparison against the prior model version |
| Technology & Tools | Evaluation scripts are maintained alongside model code; results are stored in a shared document |
| Measurement & Metrics | Core metrics (e.g. accuracy, F1, RMSE) are recorded per release and compared to the previous version |
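The Level 2 practice of recording core metrics per release and comparing them to the previous version can be sketched minimally as follows. All data, metric values, and names here are hypothetical, and a real team would typically use a metrics library rather than hand-rolled functions:

```python
# Minimal sketch of a Level 2 comparison: compute a core metric for the
# candidate model and compare it to the value recorded for the previous
# version. Labels, predictions, and the stored baseline are hypothetical.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical held-out labels and candidate predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

candidate_metrics = {"accuracy": accuracy(y_true, y_pred)}

# Metrics recorded for the previous model version, e.g. loaded from the
# team's shared results document.
previous_metrics = {"accuracy": 0.70}

# Positive values indicate uplift over the previous version.
uplift = {
    name: candidate_metrics[name] - previous_metrics[name]
    for name in previous_metrics
}
```

Even this simple shape makes "improvement" a recorded number rather than an assertion, which is the essence of the Level 2 practice.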
Level 3 – Defined
| Category | Description |
| --- | --- |
| People & Culture | Baseline comparison is part of the definition of done; teams understand that human and rule-based baselines are valid comparators |
| Process & Governance | Baselines are formally defined per use case at project inception; evaluation reports are required for all model releases |
| Technology & Tools | Automated evaluation pipelines generate benchmark reports on every model candidate; results are versioned alongside model artefacts |
| Measurement & Metrics | Multiple baseline types are tracked per use case (previous model, human expert, heuristic); performance thresholds gate promotion to production |
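A Level 3 promotion gate over multiple baseline types can be sketched as below. The baseline names, scores, and thresholds are hypothetical; the point is that every tracked comparator must be cleared by its own minimum margin before a candidate is promoted:

```python
# Minimal sketch of a multi-baseline promotion gate. All scores and
# thresholds are hypothetical values for illustration.

baselines = {
    "previous_model": 0.82,
    "human_expert": 0.80,
    "heuristic": 0.65,
}
thresholds = {  # minimum required uplift over each baseline type
    "previous_model": 0.01,
    "human_expert": 0.00,
    "heuristic": 0.05,
}

def gate(candidate_score, baselines, thresholds):
    """Return (promote?, per-baseline pass/fail) for a candidate score."""
    results = {
        name: candidate_score - score >= thresholds[name]
        for name, score in baselines.items()
    }
    return all(results.values()), results

promote, detail = gate(0.84, baselines, thresholds)
```

Keeping the per-baseline detail alongside the overall decision gives the evaluation report the evidence trail this level requires.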
Level 4 – Quantitatively Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams are accountable to quantitative improvement targets set at the start of each model development cycle |
| Process & Governance | Release gates enforce minimum improvement thresholds; exceptions require documented risk acceptance |
| Technology & Tools | Evaluation pipelines include statistical significance testing to prevent noise from being interpreted as improvement |
| Measurement & Metrics | Improvement rate over baseline is tracked per domain; teams report on variance from target and root-cause analysis for underperforming releases |
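One way to implement the Level 4 significance check is a paired permutation test: if randomly swapping the baseline and candidate outcomes for each example can reproduce the observed uplift, the uplift should be treated as noise. The per-sample outcomes below are hypothetical, and a real pipeline would use many more samples:

```python
import random

# Minimal sketch of a paired permutation test on per-sample correctness
# (1 = correct). Both result lists are hypothetical and deliberately small.
random.seed(0)

baseline  = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1]
candidate = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1]

observed = sum(candidate) / len(candidate) - sum(baseline) / len(baseline)

def permuted_diff(b, c):
    """Randomly swap each paired outcome, as if model identity were arbitrary."""
    diffs = []
    for x, y in zip(b, c):
        if random.random() < 0.5:
            x, y = y, x
        diffs.append(y - x)
    return sum(diffs) / len(diffs)

n_perms = 5000
extreme = sum(permuted_diff(baseline, candidate) >= observed
              for _ in range(n_perms))
p_value = extreme / n_perms
# A gate would require both a positive uplift AND e.g. p_value < 0.05;
# here the apparent uplift is too small for this sample to rule out noise.
```

On this toy data the candidate looks better, but the permutation test shows chance alone could plausibly produce the gap, which is exactly the case a Level 4 release gate should block.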
Level 5 – Optimising
| Category | Description |
| --- | --- |
| People & Culture | Teams continuously refine what "better" means as the problem domain matures and user needs evolve |
| Process & Governance | Baseline standards are reviewed on a defined cadence and updated to reflect advances in the field and shifts in business context |
| Technology & Tools | Benchmark tooling incorporates adversarial test sets, distribution-shift scenarios, and real-world sample replays |
| Measurement & Metrics | Benchmarking data feeds organisational learning systems that inform future model development prioritisation |
Key Measures
- Percentage of model releases accompanied by a formal baseline comparison report
- Average performance uplift over baseline across all released models per quarter
- Number of releases blocked due to failure to meet baseline improvement threshold
- Rate of post-deployment performance regression relative to pre-release benchmarks
- Time taken to complete the benchmark evaluation cycle per model candidate
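The first two key measures can be derived mechanically from release records. The record shape and field names below are invented for illustration; a real organisation would pull these from its release register or MLOps tooling:

```python
# Minimal sketch: deriving two key measures from hypothetical release
# records. Field names ("baseline_report", "uplift") are illustrative.

releases = [
    {"id": "r1", "baseline_report": True,  "uplift": 0.03},
    {"id": "r2", "baseline_report": False, "uplift": None},
    {"id": "r3", "baseline_report": True,  "uplift": -0.01},
]

# Percentage of releases accompanied by a formal baseline comparison report.
pct_with_report = 100 * sum(r["baseline_report"] for r in releases) / len(releases)

# Average performance uplift over baseline across releases that were measured.
uplifts = [r["uplift"] for r in releases if r["uplift"] is not None]
avg_uplift = sum(uplifts) / len(uplifts)
```

Computing these from structured records, rather than compiling them by hand, keeps the measures auditable and cheap to report each quarter.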