Standard: AI output quality is measured against human baseline performance
Purpose and Strategic Importance
This standard requires that AI output quality be benchmarked against the performance of a human doing the same task — whether that is a domain expert, a trained operator, or an average practitioner — to establish whether AI delivers genuine value over human alternatives. It supports the policy of measuring what AI delivers, not just what it predicts, by anchoring evaluation in the real-world context where AI and humans are alternatives or collaborators. A model that outperforms a trivial heuristic but underperforms a junior analyst provides little practical value.
Strategic Impact
- Provides a grounded, business-meaningful definition of AI success that goes beyond statistical metrics
- Informs human-AI collaboration design by identifying where AI augments, replaces, or should defer to human judgement
- Creates evidence-based arguments for AI adoption that resonate with business stakeholders and end users
- Surfaces use cases where the human performance bar is too high for current AI capability, preventing premature deployment
- Guides model improvement investment by revealing the gap between current AI performance and the human standard worth matching
Risks of Not Having This Standard
- AI systems are deployed that perform worse than the humans they were intended to augment or replace
- Business cases for AI investment rely on benchmark metrics that have no relationship to operational performance
- End users reject AI tools because they perceive them as inferior to their own judgement — and they are correct
- Organisations over-invest in AI for tasks where human performance is highly variable, making the bar superficially easy to exceed
- The human cost of reviewing and correcting poor AI output exceeds the value the AI was expected to generate
CMMI Maturity Model
Level 1 – Initial
| Category | Description |
| --- | --- |
| People & Culture | AI output quality is evaluated in isolation against statistical benchmarks with no human comparison |
| Process & Governance | No requirement to establish a human performance baseline; model quality is judged on loss metrics alone |
| Technology & Tools | Evaluation infrastructure is limited to model-level metrics; no mechanism to capture or compare human performance |
| Measurement & Metrics | Human performance on the target task has never been measured; the AI's relative value is unknown |
Level 2 – Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams discuss expected human performance informally and document assumptions about the human baseline |
| Process & Governance | A requirement to estimate human baseline performance is added to the use case evaluation process |
| Technology & Tools | Human performance data is collected through annotation studies or historical records and stored alongside model results |
| Measurement & Metrics | The AI-to-human performance gap is calculated for key metrics; results are included in release documentation |
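To make the Level 2 measurement row concrete, the following is a minimal Python sketch of calculating the AI-to-human performance gap for a single key metric, of the kind a release document might record. The metric and scores are illustrative assumptions, not values taken from this standard.

```python
# Minimal sketch: the AI-to-human performance gap for one key metric,
# as it might be recorded in release documentation.
# The metric and scores below are illustrative assumptions.

human_baseline_f1 = 0.88   # e.g. measured via an annotation study
ai_model_f1 = 0.82         # model score on the same held-out task

gap = human_baseline_f1 - ai_model_f1      # absolute shortfall vs the human baseline
ratio = ai_model_f1 / human_baseline_f1    # AI performance as a fraction of human

print(f"Gap: {gap:.2f}  |  AI-to-human ratio: {ratio:.1%}")
```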
Level 3 – Defined
| Category | Description |
| --- | --- |
| People & Culture | Human baseline measurement is a standard phase in the AI project lifecycle; domain experts are engaged to provide performance reference data |
| Process & Governance | A defined methodology for capturing human baseline performance (annotation studies, expert reviews, historical accuracy data) is applied per use case |
| Technology & Tools | Side-by-side comparison tooling enables structured human-AI evaluation; results are version-controlled and reported |
| Measurement & Metrics | AI performance is reported as a percentage of the human baseline across multiple quality dimensions; gaps are tracked over model generations |
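As one way to realise the Level 3 reporting expectation, the sketch below expresses AI performance as a percentage of the human baseline across several quality dimensions. The dimension names and scores are hypothetical; only the shape of the report is suggested.

```python
# Minimal sketch: reporting AI performance as a percentage of the human
# baseline across multiple quality dimensions. Dimension names and scores
# are hypothetical.

human_baseline = {"accuracy": 0.91, "completeness": 0.88, "consistency": 0.95}
ai_results = {"accuracy": 0.87, "completeness": 0.90, "consistency": 0.93}

def baseline_report(human: dict, ai: dict) -> dict:
    """Per-dimension AI score as a percentage of the human baseline, plus the gap."""
    report = {}
    for dimension, human_score in human.items():
        ai_score = ai[dimension]
        report[dimension] = {
            "pct_of_baseline": round(100 * ai_score / human_score, 1),
            "gap": round(human_score - ai_score, 3),
        }
    return report

for dimension, row in baseline_report(human_baseline, ai_results).items():
    print(f"{dimension}: {row['pct_of_baseline']}% of human baseline (gap {row['gap']:+.3f})")
```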
Level 4 – Quantitatively Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams set targets for AI-to-human performance parity by use case; progress against parity targets is reviewed in sprint reviews and governance forums |
| Process & Governance | Deployment decisions are informed by a defined minimum performance threshold relative to the human baseline per risk tier |
| Technology & Tools | Continuous evaluation platforms track AI and human performance on the same test sets over time; drift in the human baseline is monitored |
| Measurement & Metrics | Parity achievement rate, performance gap trend, and human review rate changes are tracked per model as quality evidence |
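A minimal sketch of the Level 4 deployment gate is shown below. The risk tiers and parity thresholds are illustrative assumptions; an organisation would set its own values per risk tier.

```python
# Minimal sketch of a deployment gate: the AI-to-human performance ratio must
# clear a minimum threshold defined per risk tier. Tiers and thresholds are
# illustrative assumptions, not prescribed values.

PARITY_THRESHOLDS = {
    "high_risk": 1.00,    # must match or exceed the human baseline
    "medium_risk": 0.95,  # small shortfall tolerated, with human review
    "low_risk": 0.85,
}

def deployment_gate(ai_score: float, human_baseline: float, risk_tier: str) -> bool:
    """Return True if the AI-to-human ratio meets the tier's parity threshold."""
    ratio = ai_score / human_baseline
    return ratio >= PARITY_THRESHOLDS[risk_tier]

# Example: 0.87 AI score vs a 0.91 human baseline on a medium-risk use case
print(deployment_gate(0.87, 0.91, "medium_risk"))  # ratio ~0.956 -> True
```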
Level 5 – Optimising
| Category | Description |
| --- | --- |
| People & Culture | Human-AI comparative data is shared organisationally to inform decisions about task allocation between humans and AI systems |
| Process & Governance | Baseline standards are continuously updated as human workforce capability changes and task definitions evolve |
| Technology & Tools | Dynamic evaluation environments update human baselines in real time from operational performance data |
| Measurement & Metrics | Human-AI performance comparison data feeds workforce planning, training investment, and AI capability roadmap decisions |
Key Measures
- Percentage of AI use cases with a formally measured human performance baseline
- AI-to-human performance ratio per use case at the time of production deployment
- Number of use cases where AI performance has reached or exceeded human baseline
- Rate of human review interventions required to correct AI output in production (proxy for quality gap)
- Improvement in AI-to-human performance ratio over successive model generations per use case
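The key measures above can be derived from per-use-case evaluation records. The sketch below shows one possible calculation; the use case names, fields, and figures are assumptions for illustration only.

```python
# Minimal sketch: deriving portfolio-level key measures from per-use-case
# evaluation records. Use case names, fields, and figures are illustrative.

use_cases = [
    {"name": "claims_triage", "has_baseline": True, "ai": 0.89, "human": 0.86},
    {"name": "contract_review", "has_baseline": True, "ai": 0.78, "human": 0.92},
    {"name": "email_routing", "has_baseline": False, "ai": 0.94, "human": None},
]

with_baseline = [u for u in use_cases if u["has_baseline"]]

# Percentage of use cases with a formally measured human baseline
pct_with_baseline = 100 * len(with_baseline) / len(use_cases)

# AI-to-human performance ratio per use case at deployment
ratio_summary = {u["name"]: round(u["ai"] / u["human"], 2) for u in with_baseline}

# Number of use cases at or above the human baseline
at_or_above_parity = sum(1 for r in ratio_summary.values() if r >= 1.0)

print(f"Use cases with a measured human baseline: {pct_with_baseline:.0f}%")
print(f"AI-to-human ratio at deployment: {ratio_summary}")
print(f"Use cases at or above the human baseline: {at_or_above_parity}")
```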