Practice: Human Baseline Benchmarking
Purpose and Strategic Importance
The most meaningful question about an AI system is not "is it accurate?" but "is it better than what we have now?" For tasks currently performed by humans, the relevant comparison is human performance on the same task. Without this baseline, teams have no objective way to determine whether the AI system represents genuine improvement, whether it meets the bar required for deployment, or whether it is making decisions that humans would also get wrong — or right.
Human baseline benchmarking also suggests a practical upper bound on achievable AI performance. For tasks where human agreement is high, AI systems should be able to approach human accuracy; for tasks where humans themselves disagree significantly, expecting AI to achieve high consistency may be unrealistic. Understanding inter-human agreement provides a realistic frame for interpreting model performance metrics and setting appropriate deployment expectations.
Description of the Practice
- Measures human performance on the target task using a representative sample of domain experts, producing an empirical baseline rather than relying on informal assumptions about what humans can do.
- Calculates inter-annotator agreement on the benchmarking task to establish the ceiling on expected model performance and identify tasks where human performance itself is variable or uncertain.
- Compares model performance against the human baseline using matched test sets, ensuring that model and human performance are evaluated on the same data under comparable conditions.
- Reports benchmarking results transparently in model documentation, enabling stakeholders to make informed judgements about whether AI performance is sufficient for deployment.
- Repeats human baseline benchmarking when models are significantly retrained or deployed in new contexts, ensuring that the comparison remains valid as both the model and the task evolve.
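The inter-annotator agreement mentioned above is often measured with a chance-corrected statistic such as Cohen's kappa. A minimal sketch, using stdlib Python and entirely hypothetical annotation data (the labels and expert names are illustrative, not from any real benchmark):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    # Observed proportion of items where the two annotators agree
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label marginals
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    pe = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical labels from two domain experts on the same ten items
expert_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes"]
expert_2 = ["yes", "no",  "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]
print(round(cohens_kappa(expert_1, expert_2), 3))  # → 0.583
```

A kappa well below 1.0, as here, signals that the task itself carries genuine ambiguity, which should temper the ceiling expected of any model on the same data.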
How to Practise It (Playbook)
1. Getting Started
- Identify the human experts who currently perform the task — or who are qualified to evaluate it — and recruit a representative sample to participate in benchmarking.
- Design a benchmarking protocol that measures human performance on the same test set used for model evaluation, with clear task instructions and consistent conditions across participants.
- Calculate both aggregate human performance and inter-human agreement, using the agreement metric to understand how much variability exists in human performance on the task.
- Use the human baseline to inform deployment criteria — for tasks where human performance is high and consistent, the model should be held to a high standard; for tasks with high human variability, adjust expectations accordingly.
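The matched-set comparison in the steps above can be sketched as follows. All labels and scores here are hypothetical placeholders; the only point is that human and model answers are scored against the same gold labels on the same test set:

```python
def accuracy(preds, gold):
    """Fraction of items where the prediction matches the gold label."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

# Hypothetical gold labels plus answers from one expert and one model,
# collected on the same matched test set under comparable conditions
gold  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
human = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
model = [1, 0, 1, 1, 0, 0, 0, 0, 1, 1]

human_acc = accuracy(human, gold)  # empirical human baseline
model_acc = accuracy(model, gold)
print(f"human={human_acc:.2f} model={model_acc:.2f} gap={model_acc - human_acc:+.2f}")
```

In practice the baseline would aggregate several experts rather than one, and the gap would be reported with a confidence interval before feeding into deployment criteria.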
2. Scaling and Maturing
- Develop a benchmarking programme that periodically refreshes human baseline measurements as task requirements, user populations, and domain knowledge evolve.
- Extend benchmarking to cover not just accuracy but speed, cost, and consistency — capturing the full comparative picture between AI and human performance on the task.
- Use human benchmarking data to identify which aspects of the task are hardest for both humans and AI, and which the AI handles markedly better or worse than humans, so that development investment goes where the comparative advantage is greatest.
- Build benchmarking artefacts — task protocols, annotator pools, test sets — that can be reused across model versions, reducing the cost of repeating baseline comparisons over time.
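Extending benchmarking beyond accuracy means tracking dimensions where "better" points in different directions (higher accuracy, lower latency, lower cost). One way to sketch that, with entirely made-up measurements:

```python
def better_on(human, model, higher_is_better):
    """Which system wins a single dimension; ties go to the human incumbent."""
    if model == human:
        return "human"
    return "model" if (model > human) == higher_is_better else "human"

# Hypothetical per-dimension measurements: (human value, model value, higher is better?)
dimensions = {
    "accuracy":     (0.92, 0.89, True),
    "seconds/item": (45.0, 0.30, False),
    "usd/item":     (1.50, 0.002, False),
}
for name, (h, m, hib) in dimensions.items():
    print(f"{name:12s} human={h:<6g} model={m:<8g} better={better_on(h, m, hib)}")
```

A table like this makes the "partial superiority" trade-off explicit: here the model loses slightly on accuracy but wins decisively on speed and cost, which may or may not be the dimensions that matter most for the deployment.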
3. Team Behaviours to Encourage
- Design benchmarking protocols carefully to avoid inadvertent advantage to either humans or the model — the goal is a fair comparison, not a favourable one.
- Share benchmarking results with stakeholders, including cases where the AI does not outperform humans on every dimension — partial superiority may still be sufficient for deployment if the AI advantage is on the dimensions that matter most.
- Use human benchmarking to identify cases where AI and humans make complementary errors, exploring human-AI collaboration approaches that leverage the strengths of both.
- Be explicit about the limitations of human benchmarking — the sample of human experts may not reflect the full distribution of practitioners who would perform the task in production.
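The complementary-error analysis suggested above amounts to partitioning test items by who answered them correctly. A minimal sketch with hypothetical answers:

```python
def error_profile(gold, human, model):
    """Bucket each item index by who answered it correctly,
    exposing complementary strengths between human and model."""
    names = {(True, True): "both_right", (True, False): "human_only",
             (False, True): "model_only", (False, False): "both_wrong"}
    buckets = {v: [] for v in names.values()}
    for i, (g, h, m) in enumerate(zip(gold, human, model)):
        buckets[names[(h == g, m == g)]].append(i)
    return buckets

# Hypothetical matched answers on a six-item test set
profile = error_profile(gold=[1, 0, 1, 0, 1, 0],
                        human=[1, 0, 0, 0, 1, 1],
                        model=[1, 1, 1, 0, 0, 0])
print(profile)
```

Large `human_only` and `model_only` buckets with a small `both_wrong` bucket indicate complementary errors, the pattern that makes human-AI collaboration (for example, routing or review workflows) worth exploring.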
4. Watch Out For…
- Using non-representative human benchmarkers — expert researchers rather than the practitioners who will interact with the AI in production — producing baselines that do not reflect real operational performance.
- Benchmarking on test sets that are too clean and curated, producing comparisons that are not representative of the messy, ambiguous inputs that characterise real production tasks.
- Interpreting human baseline comparisons too rigidly — a model that slightly underperforms humans on average may still be deployable if its performance profile is better on the highest-volume or highest-value inputs.
- Failing to account for human performance improvements over time — humans learn and improve through feedback, while a static model does not, changing the comparative picture as deployment duration increases.
5. Signals of Success
- Every AI system deployed to automate or augment human decision-making has a documented human baseline comparison using a representative sample of practitioners.
- Human baseline data is used to set deployment criteria — models must achieve defined performance relative to human baseline before being approved for production.
- Benchmarking results are communicated transparently to stakeholders, including leadership and affected user groups, enabling informed decisions about deployment and scope.
- The benchmarking process has been used to identify use cases where AI genuinely outperforms humans and cases where it does not, directing investment towards uses with real comparative advantage.
- Human baseline comparisons are refreshed when models are retrained or deployed in new contexts, maintaining a current and valid comparative picture.