
Standard: Model performance is benchmarked against defined baselines before release

Purpose and Strategic Importance

This standard requires that every AI model be evaluated against a documented baseline — such as a previous model version, a rule-based heuristic, or a human performance benchmark — before it is released to production. It supports the policy of rigorous pre-deployment evaluation by ensuring that "better" is defined objectively rather than assumed. Without baseline comparison, teams cannot determine whether a model is genuinely improving outcomes or simply replacing one set of failure modes with another.
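The comparison itself can be made concrete in a few lines of code. The sketch below is illustrative only: the `beats_baseline` helper, the metric names, and the `min_uplift` margin are assumptions rather than part of the standard. The point is that "better" becomes a testable predicate instead of an asserted claim.

```python
def beats_baseline(candidate: dict, baseline: dict, min_uplift: float = 0.0) -> bool:
    """Return True only if the candidate improves on every shared metric
    by at least min_uplift (absolute). Metrics are higher-is-better."""
    shared = set(candidate) & set(baseline)
    if not shared:
        raise ValueError("no common metrics to compare")
    return all(candidate[m] - baseline[m] >= min_uplift for m in shared)

# Illustrative: a rule-based heuristic serves as the documented baseline.
heuristic_baseline = {"accuracy": 0.81, "f1": 0.74}
candidate_model = {"accuracy": 0.86, "f1": 0.79}

print(beats_baseline(candidate_model, heuristic_baseline, min_uplift=0.02))
```

Requiring improvement on every shared metric (rather than any one metric) is a deliberately strict choice here; a real release policy would document which metrics gate and which merely inform.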

Strategic Impact

  • Establishes an objective definition of improvement that the organisation can hold teams accountable to
  • Prevents regressions from reaching production by catching performance degradation before deployment
  • Builds confidence among stakeholders and end users that AI systems are genuinely better than alternatives
  • Creates a continuous improvement culture where each model release must justify itself against measurable evidence
  • Reduces wasted deployment effort on models that do not deliver meaningful uplift over existing solutions

Risks of Not Having This Standard

  • Teams deploy models that perform worse than the system they replaced, eroding trust in AI initiatives
  • Stakeholders lose confidence in AI delivery because "improvement" is asserted but never demonstrated
  • Model regressions go undetected until they surface as customer complaints or operational failures
  • Engineering effort is misdirected toward model complexity when simpler baselines would suffice
  • Regulatory and audit scrutiny increases when release decisions cannot be evidenced with comparative performance data

CMMI Maturity Model

Level 1 – Initial

  • People & Culture: Performance evaluation is informal and subjective; individual engineers decide when a model is "good enough"
  • Process & Governance: No baseline definition exists; release decisions are based on gut feel or deadline pressure
  • Technology & Tools: Evaluation is conducted ad hoc using whatever metrics the developer chooses at the time
  • Measurement & Metrics: No formal metrics captured pre-release; performance claims are anecdotal

Level 2 – Managed

  • People & Culture: Teams agree on a small set of metrics to evaluate before release; baselines are informally recorded
  • Process & Governance: A simple benchmark checklist is in place; sign-off requires comparison against the prior model version
  • Technology & Tools: Evaluation scripts are maintained alongside model code; results stored in a shared document
  • Measurement & Metrics: Core metrics (e.g. accuracy, F1, RMSE) are recorded per release and compared to the previous version
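At this level the comparison can be as simple as recomputing core metrics for the candidate and diffing them against the prior release's recorded values. A minimal pure-Python sketch, where the record format and the choice of binary-classification metrics are illustrative assumptions:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy and F1 for binary labels, computed from first principles."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": round(accuracy, 3), "f1": round(f1, 3)}

# Hypothetical evaluation run for the candidate release.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
current = binary_metrics(y_true, y_pred)

# Metrics recorded for the previous version (illustrative values).
previous = {"accuracy": 0.70, "f1": 0.65}
delta = {m: round(current[m] - previous[m], 3) for m in current}
print(current, delta)
```

Even a shared document holding the `previous` record is enough to turn "it seems better" into a per-release, per-metric delta.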

Level 3 – Defined

  • People & Culture: Baseline comparison is part of the definition of done; teams understand that human and rule-based baselines are valid comparators
  • Process & Governance: Baselines are formally defined per use case at project inception; evaluation reports are required for all model releases
  • Technology & Tools: Automated evaluation pipelines generate benchmark reports on every model candidate; results are versioned alongside model artefacts
  • Measurement & Metrics: Multiple baseline types are tracked per use case (previous model, human expert, heuristic); performance thresholds gate promotion to production
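A promotion gate of the kind described here might require a candidate to clear every baseline type defined for the use case. The baseline names, scores, and per-baseline thresholds below are hypothetical:

```python
# Illustrative per-use-case baseline registry: each baseline type carries
# its recorded score and the minimum uplift required to clear it.
BASELINES = {
    "previous_model": {"f1": 0.78, "min_uplift": 0.01},
    "human_expert":   {"f1": 0.75, "min_uplift": 0.00},
    "heuristic":      {"f1": 0.70, "min_uplift": 0.05},
}

def promotion_report(candidate_f1: float) -> dict:
    """Check the candidate against every baseline; all must pass to promote."""
    checks = {
        name: candidate_f1 - spec["f1"] >= spec["min_uplift"]
        for name, spec in BASELINES.items()
    }
    return {"checks": checks, "promote": all(checks.values())}

print(promotion_report(0.80))  # clears all three baselines
print(promotion_report(0.76))  # fails against the previous model
```

Versioning a report like this alongside the model artefact gives the evaluation evidence the standard asks for.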

Level 4 – Quantitatively Managed

  • People & Culture: Teams are accountable to quantitative improvement targets set at the start of each model development cycle
  • Process & Governance: Release gates enforce minimum improvement thresholds; exceptions require documented risk acceptance
  • Technology & Tools: Evaluation pipelines cover statistical significance testing to prevent noise from being interpreted as improvement
  • Measurement & Metrics: Improvement rate over baseline is tracked per domain; teams report on variance from target and root cause analysis for underperforming releases
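One common way to implement the significance check named here is a paired bootstrap over a shared evaluation set, which estimates how often the observed uplift survives resampling. The per-example correctness flags, sample size, and confidence cut-off below are illustrative assumptions:

```python
import random

def paired_bootstrap_uplift(correct_cand, correct_base, n_boot=2000, seed=0):
    """Fraction of bootstrap resamples (over the same examples) in which
    the candidate scores strictly more correct answers than the baseline."""
    rng = random.Random(seed)
    n = len(correct_cand)
    wins = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(correct_cand[i] for i in idx) > sum(correct_base[i] for i in idx):
            wins += 1
    return wins / n_boot

# Per-example correctness flags from one shared evaluation run (illustrative).
cand = [1] * 70 + [0] * 4 + [1] * 16 + [0] * 10   # 86/100 correct
base = [1] * 74 + [0] * 26                         # 74/100 correct

confidence = paired_bootstrap_uplift(cand, base)
print(confidence)  # treat the uplift as real only above the agreed level, e.g. 0.95
```

Pairing matters: because both systems are scored on the same examples, shared difficulty cancels out and the test is far less likely to mistake noise for improvement than comparing two aggregate scores.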

Level 5 – Optimising

  • People & Culture: Teams continuously refine what "better" means as the problem domain matures and user needs evolve
  • Process & Governance: Baseline standards are reviewed at a cadence and updated to reflect advances in the field and shifts in business context
  • Technology & Tools: Benchmark tooling incorporates adversarial test sets, distribution shift scenarios, and real-world sample replays
  • Measurement & Metrics: Benchmarking data feeds organisational learning systems that inform future model development prioritisation
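Reporting the worst-performing slice alongside the average is one simple way to make adversarial and distribution-shift results visible in a benchmark report rather than averaged away. The slice names and scores below are hypothetical:

```python
def slice_report(scores_by_slice: dict) -> dict:
    """Summarise per-slice scores: overall average plus the weakest slice."""
    avg = sum(scores_by_slice.values()) / len(scores_by_slice)
    worst_slice = min(scores_by_slice, key=scores_by_slice.get)
    return {
        "average": round(avg, 3),
        "worst_slice": worst_slice,
        "worst_score": scores_by_slice[worst_slice],
    }

# Illustrative scores for one candidate across three evaluation slices.
scores = {"standard_test": 0.88, "adversarial": 0.71, "distribution_shift": 0.79}
print(slice_report(scores))
```

A gate that looks only at `average` would hide the adversarial weakness; surfacing `worst_slice` keeps it in the release conversation.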

Key Measures

  • Percentage of model releases accompanied by a formal baseline comparison report
  • Average performance uplift over baseline across all released models per quarter
  • Number of releases blocked due to failure to meet baseline improvement threshold
  • Rate of post-deployment performance regression relative to pre-release benchmarks
  • Time taken to complete the benchmark evaluation cycle per model candidate
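The first two measures can be computed directly from a release log. The record fields below (`has_baseline_report`, `uplift`) are hypothetical illustrations of what such a log might contain:

```python
# Hypothetical release log; in practice this would come from the model registry.
releases = [
    {"model": "churn-v3", "has_baseline_report": True,  "uplift": 0.04},
    {"model": "churn-v4", "has_baseline_report": True,  "uplift": 0.01},
    {"model": "fraud-v2", "has_baseline_report": False, "uplift": None},
    {"model": "nlp-v5",   "has_baseline_report": True,  "uplift": -0.02},
]

with_report = [r for r in releases if r["has_baseline_report"]]
pct_with_report = 100 * len(with_report) / len(releases)
avg_uplift = sum(r["uplift"] for r in with_report) / len(with_report)

print(f"{pct_with_report:.0f}% of releases had a baseline comparison report")
print(f"average uplift over baseline: {avg_uplift:+.3f}")
```

Note that the averages only cover releases that produced a report; the gap itself (here, `fraud-v2`) is the first measure's job to expose.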
Associated Policies
Associated Practices
  • Adversarial Testing
  • AI Quality Gates
  • Hyperparameter Tuning Practices
  • Human Baseline Benchmarking
  • Data Quality Assessment
