Standard: Model performance is benchmarked against defined baselines before release
Purpose and Strategic Importance
This standard requires that every AI model be evaluated against a documented baseline — such as a previous model version, a rule-based heuristic, or a human performance benchmark — before it is released to production. It supports the policy of rigorous pre-deployment evaluation by ensuring that "better" is defined objectively rather than assumed. Without baseline comparison, teams cannot determine whether a model is genuinely improving outcomes or simply replacing one set of failure modes with another.
Strategic Impact
- Establishes an objective definition of improvement that the organisation can hold teams accountable to
- Prevents regressions from reaching production by catching performance degradation before deployment
- Builds confidence among stakeholders and end users that AI systems are genuinely better than alternatives
- Creates a continuous improvement culture where each model release must justify itself against measurable evidence
- Reduces wasted deployment effort on models that do not deliver meaningful uplift over existing solutions
Risks of Not Having This Standard
- Teams deploy models that perform worse than the system they replaced, eroding trust in AI initiatives
- Stakeholders lose confidence in AI delivery because "improvement" is asserted but never demonstrated
- Model regressions go undetected until they surface as customer complaints or operational failures
- Engineering effort is misdirected toward model complexity when simpler baselines would suffice
- Regulatory and audit scrutiny increases when release decisions cannot be evidenced with comparative performance data
CMMI Maturity Model
Level 1 – Initial
| Category | Description |
| --- | --- |
| People & Culture | Performance evaluation is informal and subjective; individual engineers decide when a model is "good enough" |
| Process & Governance | No baseline definition exists; release decisions are based on gut feel or deadline pressure |
| Technology & Tools | Evaluation is conducted ad hoc using whatever metrics the developer chooses at the time |
| Measurement & Metrics | No formal metrics are captured pre-release; performance claims are anecdotal |
Level 2 – Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams agree on a small set of metrics to evaluate before release; baselines are informally recorded |
| Process & Governance | A simple benchmark checklist is in place; sign-off requires comparison against the prior model version |
| Technology & Tools | Evaluation scripts are maintained alongside model code; results are stored in a shared document |
| Measurement & Metrics | Core metrics (e.g. accuracy, F1, RMSE) are recorded per release and compared to the previous version |
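The Level 2 practice of recording core metrics per release and comparing them to the previous version can be sketched minimally as follows. All data, metric values, and names here are hypothetical, and a real team would typically use a metrics library rather than hand-rolled functions:

```python
# Minimal sketch of a Level 2 comparison: compute a core metric for the
# candidate model and compare it to the value recorded for the previous
# version. Labels, predictions, and the stored baseline are hypothetical.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical held-out labels and candidate predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

candidate_metrics = {"accuracy": accuracy(y_true, y_pred)}

# Metrics recorded for the previous model version, e.g. loaded from the
# team's shared results document.
previous_metrics = {"accuracy": 0.70}

# Positive values indicate uplift over the previous version.
uplift = {
    name: candidate_metrics[name] - previous_metrics[name]
    for name in previous_metrics
}
```

Even this simple shape makes "improvement" a recorded number rather than an assertion, which is the essence of the Level 2 practice.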
Level 3 – Defined
| Category | Description |
| --- | --- |
| People & Culture | Baseline comparison is part of the definition of done; teams understand that human and rule-based baselines are valid comparators |
| Process & Governance | Baselines are formally defined per use case at project inception; evaluation reports are required for all model releases |
| Technology & Tools | Automated evaluation pipelines generate benchmark reports on every model candidate; results are versioned alongside model artefacts |
| Measurement & Metrics | Multiple baseline types are tracked per use case (previous model, human expert, heuristic); performance thresholds gate promotion to production |
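A Level 3 promotion gate over multiple baseline types can be sketched as below. The baseline names, scores, and thresholds are hypothetical; the point is that every tracked comparator must be cleared by its own minimum margin before a candidate is promoted:

```python
# Minimal sketch of a multi-baseline promotion gate. All scores and
# thresholds are hypothetical values for illustration.

baselines = {
    "previous_model": 0.82,
    "human_expert": 0.80,
    "heuristic": 0.65,
}
thresholds = {  # minimum required uplift over each baseline type
    "previous_model": 0.01,
    "human_expert": 0.00,
    "heuristic": 0.05,
}

def gate(candidate_score, baselines, thresholds):
    """Return (promote?, per-baseline pass/fail) for a candidate score."""
    results = {
        name: candidate_score - score >= thresholds[name]
        for name, score in baselines.items()
    }
    return all(results.values()), results

promote, detail = gate(0.84, baselines, thresholds)
```

Keeping the per-baseline detail alongside the overall decision gives the evaluation report the evidence trail this level requires.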
Level 4 – Quantitatively Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams are accountable to quantitative improvement targets set at the start of each model development cycle |
| Process & Governance | Release gates enforce minimum improvement thresholds; exceptions require documented risk acceptance |
| Technology & Tools | Evaluation pipelines include statistical significance testing to prevent noise from being interpreted as improvement |
| Measurement & Metrics | Improvement rate over baseline is tracked per domain; teams report on variance from target and root-cause analysis for underperforming releases |
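One way to implement the Level 4 significance check is a paired permutation test: if randomly swapping the baseline and candidate outcomes for each example can reproduce the observed uplift, the uplift should be treated as noise. The per-sample outcomes below are hypothetical, and a real pipeline would use many more samples:

```python
import random

# Minimal sketch of a paired permutation test on per-sample correctness
# (1 = correct). Both result lists are hypothetical and deliberately small.
random.seed(0)

baseline  = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1]
candidate = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1]

observed = sum(candidate) / len(candidate) - sum(baseline) / len(baseline)

def permuted_diff(b, c):
    """Randomly swap each paired outcome, as if model identity were arbitrary."""
    diffs = []
    for x, y in zip(b, c):
        if random.random() < 0.5:
            x, y = y, x
        diffs.append(y - x)
    return sum(diffs) / len(diffs)

n_perms = 5000
extreme = sum(permuted_diff(baseline, candidate) >= observed
              for _ in range(n_perms))
p_value = extreme / n_perms
# A gate would require both a positive uplift AND e.g. p_value < 0.05;
# here the apparent uplift is too small for this sample to rule out noise.
```

On this toy data the candidate looks better, but the permutation test shows chance alone could plausibly produce the gap, which is exactly the case a Level 4 release gate should block.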
Level 5 – Optimising
| Category | Description |
| --- | --- |
| People & Culture | Teams continuously refine what "better" means as the problem domain matures and user needs evolve |
| Process & Governance | Baseline standards are reviewed on a defined cadence and updated to reflect advances in the field and shifts in business context |
| Technology & Tools | Benchmark tooling incorporates adversarial test sets, distribution-shift scenarios, and real-world sample replays |
| Measurement & Metrics | Benchmarking data feeds organisational learning systems that inform future model development prioritisation |
Key Measures
- Percentage of model releases accompanied by a formal baseline comparison report
- Average performance uplift over baseline across all released models per quarter
- Number of releases blocked due to failure to meet baseline improvement threshold
- Rate of post-deployment performance regression relative to pre-release benchmarks
- Time taken to complete the benchmark evaluation cycle per model candidate
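The first two key measures can be derived mechanically from release records. The record shape and field names below are invented for illustration; a real organisation would pull these from its release register or MLOps tooling:

```python
# Minimal sketch: deriving two key measures from hypothetical release
# records. Field names ("baseline_report", "uplift") are illustrative.

releases = [
    {"id": "r1", "baseline_report": True,  "uplift": 0.03},
    {"id": "r2", "baseline_report": False, "uplift": None},
    {"id": "r3", "baseline_report": True,  "uplift": -0.01},
]

# Percentage of releases accompanied by a formal baseline comparison report.
pct_with_report = 100 * sum(r["baseline_report"] for r in releases) / len(releases)

# Average performance uplift over baseline across releases that were measured.
uplifts = [r["uplift"] for r in releases if r["uplift"] is not None]
avg_uplift = sum(uplifts) / len(uplifts)
```

Computing these from structured records, rather than compiling them by hand, keeps the measures auditable and cheap to report each quarter.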