Standard: AI output quality is measured against human baseline performance
Purpose and Strategic Importance
This standard requires that AI output quality be benchmarked against the performance of a human doing the same task — whether that is a domain expert, a trained operator, or an average practitioner — to establish whether AI delivers genuine value over human alternatives. It supports the policy of measuring what AI delivers, not just what it predicts, by anchoring evaluation in the real-world context where AI and humans are alternatives or collaborators. A model that outperforms a trivial heuristic but underperforms a junior analyst provides little practical value.
Strategic Impact
- Provides a grounded, business-meaningful definition of AI success that goes beyond statistical metrics
- Informs human-AI collaboration design by identifying where AI augments, replaces, or should defer to human judgement
- Creates evidence-based arguments for AI adoption that resonate with business stakeholders and end users
- Surfaces use cases where the human performance bar is too high for current AI capability, preventing premature deployment
- Guides model improvement investment by revealing the gap between current AI performance and the human standard worth matching
Risks of Not Having This Standard
- AI systems are deployed that perform worse than the humans they were intended to augment or replace
- Business cases for AI investment rely on benchmark metrics that have no relationship to operational performance
- End users reject AI tools because they perceive them as inferior to their own judgement — and they are correct
- Organisations over-invest in AI for tasks where human performance is highly variable, making the bar superficially easy to exceed
- The human cost of reviewing and correcting poor AI output exceeds the value the AI was expected to generate
CMMI Maturity Model
Level 1 – Initial
| Category | Description |
| --- | --- |
| People & Culture | AI output quality is evaluated in isolation against statistical benchmarks with no human comparison |
| Process & Governance | No requirement to establish a human performance baseline; model quality is judged on loss metrics alone |
| Technology & Tools | Evaluation infrastructure is limited to model-level metrics; no mechanism to capture or compare human performance |
| Measurement & Metrics | Human performance on the target task has never been measured; the AI's relative value is unknown |
Level 2 – Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams discuss expected human performance informally and document assumptions about the human baseline |
| Process & Governance | A requirement to estimate human baseline performance is added to the use case evaluation process |
| Technology & Tools | Human performance data is collected through annotation studies or historical records and stored alongside model results |
| Measurement & Metrics | The AI-to-human performance gap is calculated for key metrics; results are included in release documentation |
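To make the Level 2 measurement row concrete, the following is a minimal Python sketch of calculating the AI-to-human performance gap for a single key metric, of the kind a release document might record. The metric and scores are illustrative assumptions, not values taken from this standard.

```python
# Minimal sketch: the AI-to-human performance gap for one key metric,
# as it might be recorded in release documentation.
# The metric and scores below are illustrative assumptions.

human_baseline_f1 = 0.88   # e.g. measured via an annotation study
ai_model_f1 = 0.82         # model score on the same held-out task

gap = human_baseline_f1 - ai_model_f1      # absolute shortfall vs the human baseline
ratio = ai_model_f1 / human_baseline_f1    # AI performance as a fraction of human

print(f"Gap: {gap:.2f}  |  AI-to-human ratio: {ratio:.1%}")
```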
Level 3 – Defined
| Category | Description |
| --- | --- |
| People & Culture | Human baseline measurement is a standard phase in the AI project lifecycle; domain experts are engaged to provide performance reference data |
| Process & Governance | A defined methodology for capturing human baseline performance (annotation studies, expert reviews, historical accuracy data) is applied per use case |
| Technology & Tools | Side-by-side comparison tooling enables structured human-AI evaluation; results are version-controlled and reported |
| Measurement & Metrics | AI performance is reported as a percentage of the human baseline across multiple quality dimensions; gaps are tracked over model generations |
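As one way to realise the Level 3 reporting expectation, the sketch below expresses AI performance as a percentage of the human baseline across several quality dimensions. The dimension names and scores are hypothetical; only the shape of the report is suggested.

```python
# Minimal sketch: reporting AI performance as a percentage of the human
# baseline across multiple quality dimensions. Dimension names and scores
# are hypothetical.

human_baseline = {"accuracy": 0.91, "completeness": 0.88, "consistency": 0.95}
ai_results = {"accuracy": 0.87, "completeness": 0.90, "consistency": 0.93}

def baseline_report(human: dict, ai: dict) -> dict:
    """Per-dimension AI score as a percentage of the human baseline, plus the gap."""
    report = {}
    for dimension, human_score in human.items():
        ai_score = ai[dimension]
        report[dimension] = {
            "pct_of_baseline": round(100 * ai_score / human_score, 1),
            "gap": round(human_score - ai_score, 3),
        }
    return report

for dimension, row in baseline_report(human_baseline, ai_results).items():
    print(f"{dimension}: {row['pct_of_baseline']}% of human baseline (gap {row['gap']:+.3f})")
```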
Level 4 – Quantitatively Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams set targets for AI-to-human performance parity by use case; progress against parity targets is reviewed in sprint reviews and governance forums |
| Process & Governance | Deployment decisions are informed by a defined minimum performance threshold relative to the human baseline per risk tier |
| Technology & Tools | Continuous evaluation platforms track AI and human performance on the same test sets over time; drift in the human baseline is monitored |
| Measurement & Metrics | Parity achievement rate, performance gap trend, and human review rate changes are tracked per model as quality evidence |
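A minimal sketch of the Level 4 deployment gate is shown below. The risk tiers and parity thresholds are illustrative assumptions; an organisation would set its own values per risk tier.

```python
# Minimal sketch of a deployment gate: the AI-to-human performance ratio must
# clear a minimum threshold defined per risk tier. Tiers and thresholds are
# illustrative assumptions, not prescribed values.

PARITY_THRESHOLDS = {
    "high_risk": 1.00,    # must match or exceed the human baseline
    "medium_risk": 0.95,  # small shortfall tolerated, with human review
    "low_risk": 0.85,
}

def deployment_gate(ai_score: float, human_baseline: float, risk_tier: str) -> bool:
    """Return True if the AI-to-human ratio meets the tier's parity threshold."""
    ratio = ai_score / human_baseline
    return ratio >= PARITY_THRESHOLDS[risk_tier]

# Example: 0.87 AI score vs a 0.91 human baseline on a medium-risk use case
print(deployment_gate(0.87, 0.91, "medium_risk"))  # ratio ~0.956 -> True
```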
Level 5 – Optimising
| Category | Description |
| --- | --- |
| People & Culture | Human-AI comparative data is shared organisationally to inform decisions about task allocation between humans and AI systems |
| Process & Governance | Baseline standards are continuously updated as human workforce capability changes and task definitions evolve |
| Technology & Tools | Dynamic evaluation environments update human baselines in real time from operational performance data |
| Measurement & Metrics | Human-AI performance comparison data feeds workforce planning, training investment, and AI capability roadmap decisions |
Key Measures
- Percentage of AI use cases with a formally measured human performance baseline
- AI-to-human performance ratio per use case at the time of production deployment
- Number of use cases where AI performance has reached or exceeded human baseline
- Rate of human review interventions required to correct AI output in production (proxy for quality gap)
- Improvement in AI-to-human performance ratio over successive model generations per use case
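The key measures above can be derived from per-use-case evaluation records. The sketch below shows one possible calculation; the use case names, fields, and figures are assumptions for illustration only.

```python
# Minimal sketch: deriving portfolio-level key measures from per-use-case
# evaluation records. Use case names, fields, and figures are illustrative.

use_cases = [
    {"name": "claims_triage", "has_baseline": True, "ai": 0.89, "human": 0.86},
    {"name": "contract_review", "has_baseline": True, "ai": 0.78, "human": 0.92},
    {"name": "email_routing", "has_baseline": False, "ai": 0.94, "human": None},
]

with_baseline = [u for u in use_cases if u["has_baseline"]]

# Percentage of use cases with a formally measured human baseline
pct_with_baseline = 100 * len(with_baseline) / len(use_cases)

# AI-to-human performance ratio per use case at deployment
ratio_summary = {u["name"]: round(u["ai"] / u["human"], 2) for u in with_baseline}

# Number of use cases at or above the human baseline
at_or_above_parity = sum(1 for r in ratio_summary.values() if r >= 1.0)

print(f"Use cases with a measured human baseline: {pct_with_baseline:.0f}%")
print(f"AI-to-human ratio at deployment: {ratio_summary}")
print(f"Use cases at or above the human baseline: {at_or_above_parity}")
```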