Standard: Bias and fairness assessments are conducted at every model release
Purpose and Strategic Importance
This standard mandates that every AI model release includes a formal bias and fairness assessment that evaluates model performance across relevant demographic and contextual subgroups before the release reaches production. It supports the policy of managing bias as an ongoing operational concern by recognising that bias is not a one-time problem to be solved at the start of a project but a property that must be re-evaluated whenever the model, data, or deployment context changes. A model that was fair at launch can become unfair after retraining on shifted data or deployment in a new context.
Strategic Impact
- Protects individuals from discriminatory outcomes caused by AI systems that perform unequally across demographic groups
- Reduces legal and regulatory exposure in jurisdictions where automated decision-making is subject to equality legislation
- Builds user trust, particularly among communities who have historically been harmed by biased algorithmic systems
- Creates an organisational capability to identify and address bias systematically rather than reacting to incidents
- Ensures that model improvement initiatives do not inadvertently introduce new fairness disparities while fixing other issues
Risks of Not Having This Standard
- Biased model releases cause discriminatory outcomes at scale before the issue is surfaced through complaints or media scrutiny
- Regulatory investigations and enforcement actions result from failure to demonstrate fairness due diligence
- Reputational damage from bias incidents is disproportionately severe when it emerges that no formal assessment was conducted
- Model improvements in aggregate performance mask growing disparities for subgroups that are not separately measured
- Teams lack the data to distinguish bias introduced by model changes from bias present in the original system
CMMI Maturity Model
Level 1 – Initial
| Category | Description |
| --- | --- |
| People & Culture | Bias is acknowledged as a concern in principle but not systematically addressed; assessment is informal and ad hoc |
| Process & Governance | No fairness assessment requirement in the release process; bias checking is left to individual engineer discretion |
| Technology & Tools | No dedicated fairness tooling; subgroup analysis is not performed as standard |
| Measurement & Metrics | Only aggregate performance metrics are reported; subgroup performance disparities are invisible |
Level 2 – Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams identify relevant protected characteristics and demographic groups for their use case; fairness considerations are discussed at release reviews |
| Process & Governance | A fairness checklist is added to the release process; teams must document which subgroups were evaluated and the results |
| Technology & Tools | Basic subgroup performance analysis is conducted using standard disaggregation of evaluation metrics (an illustrative sketch follows this table) |
| Measurement & Metrics | Performance metrics are reported separately for key demographic subgroups; disparities above a defined threshold are flagged |
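At Level 2, disaggregation can be as simple as computing the release evaluation metric per subgroup and comparing the gap against the defined threshold. The sketch below is a minimal illustration; the column names, sample data, and the 5-percentage-point threshold are assumptions, not values mandated by this standard.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical release-evaluation frame; column names, data, and the
# threshold below are illustrative, not mandated values.
eval_df = pd.DataFrame({
    "y_true":   [1, 0, 1, 1, 0, 1, 0, 0],
    "y_pred":   [1, 0, 0, 1, 0, 1, 1, 1],
    "subgroup": ["A", "A", "A", "B", "B", "B", "B", "A"],
})
DISPARITY_THRESHOLD = 0.05  # flag gaps above 5 percentage points

# Disaggregate the evaluation metric by subgroup
per_group = {
    name: accuracy_score(grp["y_true"], grp["y_pred"])
    for name, grp in eval_df.groupby("subgroup")
}
gap = max(per_group.values()) - min(per_group.values())

print(per_group)
if gap > DISPARITY_THRESHOLD:
    print(f"Flag for release review: subgroup accuracy gap {gap:.2f} exceeds threshold")
```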
Level 3 – Defined
| Category | Description |
| --- | --- |
| People & Culture | Fairness assessment is owned by a named individual per release; affected communities are consulted in defining fairness criteria |
| Process & Governance | A formal fairness assessment framework defines applicable fairness metrics (demographic parity, equalised odds, etc.) per use case type; releases that fail thresholds are blocked |
| Technology & Tools | Dedicated fairness tooling (e.g. Fairlearn, AI Fairness 360) is integrated into the evaluation pipeline; reports are generated automatically (an illustrative sketch follows this table) |
| Measurement & Metrics | Multiple fairness metrics are reported per release; trends across model versions are tracked to detect deterioration |
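A minimal sketch of what a Fairlearn-based gate in the evaluation pipeline might look like, assuming a binary classifier's release evaluation outputs; the threshold values and the function name `fairness_release_gate` are illustrative rather than part of this standard.

```python
from sklearn.metrics import accuracy_score
from fairlearn.metrics import (
    MetricFrame,
    selection_rate,
    demographic_parity_difference,
    equalized_odds_difference,
)

# Illustrative thresholds; real values are set per use case type by the framework.
THRESHOLDS = {"demographic_parity": 0.10, "equalized_odds": 0.10}

def fairness_release_gate(y_true, y_pred, sensitive_features):
    """Produce a per-subgroup report and decide whether the release is blocked."""
    report = MetricFrame(
        metrics={"accuracy": accuracy_score, "selection_rate": selection_rate},
        y_true=y_true,
        y_pred=y_pred,
        sensitive_features=sensitive_features,
    )
    dpd = demographic_parity_difference(
        y_true, y_pred, sensitive_features=sensitive_features
    )
    eod = equalized_odds_difference(
        y_true, y_pred, sensitive_features=sensitive_features
    )
    blocked = (
        dpd > THRESHOLDS["demographic_parity"]
        or eod > THRESHOLDS["equalized_odds"]
    )
    return {
        "per_subgroup": report.by_group,
        "demographic_parity_difference": dpd,
        "equalized_odds_difference": eod,
        "release_blocked": blocked,
    }
```

The returned dictionary can feed the automatically generated fairness report and be archived with the release record.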
Level 4 – Quantitatively Managed
| Category | Description |
| --- | --- |
| People & Culture | Fairness is a shared engineering and product responsibility; teams are accountable for fairness performance in the same way as accuracy and reliability |
| Process & Governance | Fairness thresholds are defined quantitatively per use case; threshold breaches trigger a formal review before any production deployment proceeds |
| Technology & Tools | Counterfactual fairness, causal analysis, and intersectionality testing tools are applied to high-risk use cases (an intersectional analysis sketch follows this table) |
| Measurement & Metrics | Fairness metric trends, threshold breach rates, and time-to-remediation for fairness issues are tracked and reported in governance forums |
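One way to approach intersectionality testing is to disaggregate over the cross-product of sensitive attributes rather than each attribute in isolation. The sketch below uses Fairlearn's MetricFrame with two hypothetical attributes and data; counterfactual and causal analysis would typically sit alongside this for high-risk use cases.

```python
import pandas as pd
from sklearn.metrics import recall_score
from fairlearn.metrics import MetricFrame

# Hypothetical attributes and labels; real assessments use the attributes
# identified as relevant for the use case.
sensitive = pd.DataFrame({
    "gender":   ["F", "F", "M", "M", "F", "M"],
    "age_band": ["<30", "30+", "<30", "30+", "30+", "<30"],
})
y_true = [1, 1, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

mf = MetricFrame(
    metrics=recall_score,
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sensitive,
)
print(mf.by_group)      # recall for every (gender, age_band) intersection
print(mf.difference())  # largest intersectional gap, compared against the threshold
```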
Level 5 – Optimising
| Category | Description |
| --- | --- |
| People & Culture | Fairness learnings are shared across the organisation; teams contribute to the improvement of fairness standards based on deployment experience |
| Process & Governance | Fairness standards are continuously updated based on regulatory developments, academic research, and community feedback |
| Technology & Tools | Continuous fairness monitoring in production detects demographic performance drift and triggers alerts before disparities become material (a monitoring sketch follows this table) |
| Measurement & Metrics | Long-term outcome equity data (not just model performance equity) is tracked to assess whether AI decisions are producing fair real-world results |
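A minimal sketch of a production fairness-drift check, assuming subgroup accuracy is already computed per monitoring window; the baseline, tolerance, and alerting mechanism are placeholders for whatever the team's monitoring platform provides.

```python
# All names, thresholds, and the alerting hook below are assumptions;
# real deployments would wire this into the team's metrics platform.
BASELINE_GAP = 0.04      # subgroup accuracy gap recorded at release time
DRIFT_TOLERANCE = 0.03   # alert if the production gap grows beyond this margin

def check_fairness_drift(window_metrics):
    """window_metrics maps subgroup -> accuracy over the latest production window."""
    current_gap = max(window_metrics.values()) - min(window_metrics.values())
    drifted = current_gap - BASELINE_GAP > DRIFT_TOLERANCE
    if drifted:
        print(f"ALERT: subgroup gap {current_gap:.3f} exceeds release baseline "
              f"{BASELINE_GAP:.3f} by more than {DRIFT_TOLERANCE:.3f}")
    return drifted

# Example: a gap of 0.09 against a baseline of 0.04 triggers an alert
check_fairness_drift({"group_a": 0.91, "group_b": 0.82})
```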
Key Measures
- Percentage of model releases accompanied by a formal bias and fairness assessment report (see the computation sketch after this list)
- Number of releases blocked or modified due to fairness threshold breaches
- Maximum demographic parity gap and equalised odds gap across key subgroups per released model
- Time to remediate a fairness issue from identification to re-release
- Rate of fairness-related incidents in production attributed to models that passed the pre-release fairness assessment (calibration metric)
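A short sketch of how some of these measures could be derived from release records, assuming a simple list-of-records format; the field names and data are hypothetical.

```python
from statistics import median

# Hypothetical release records; field names and values are illustrative only.
releases = [
    {"id": "r1", "assessment_done": True,  "blocked_on_fairness": False, "remediation_days": None},
    {"id": "r2", "assessment_done": True,  "blocked_on_fairness": True,  "remediation_days": 12},
    {"id": "r3", "assessment_done": False, "blocked_on_fairness": False, "remediation_days": None},
]

assessed_pct = 100 * sum(r["assessment_done"] for r in releases) / len(releases)
blocked_count = sum(r["blocked_on_fairness"] for r in releases)
remediation_times = [r["remediation_days"] for r in releases if r["remediation_days"] is not None]

print(f"Releases with formal assessment: {assessed_pct:.0f}%")
print(f"Releases blocked or modified on fairness grounds: {blocked_count}")
if remediation_times:
    print(f"Median time to remediate a fairness issue: {median(remediation_times)} days")
```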