Standard: Model degradation triggers are defined and monitored in production
Purpose and Strategic Importance
This standard requires that for every AI model in production, specific quantitative degradation triggers are defined before deployment and actively monitored in live operation. When a trigger is breached, a defined response — alert, escalation, automatic rollback, or retraining initiation — must occur without relying on manual detection. It supports the policy of governing AI models throughout their lifecycle by treating post-deployment governance as an engineering concern, not a periodic management review. Without defined triggers, degradation is discovered through user complaints rather than instrumentation.
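The core requirement above — quantitative triggers defined before deployment, each mapped to a defined response — can be sketched as a small data structure. This is a minimal illustration, not a prescribed implementation; the names (`DegradationTrigger`, `Response`) and the metric shown are assumptions for the example:

```python
from dataclasses import dataclass
from enum import Enum


class Response(Enum):
    """The defined responses named by this standard."""
    ALERT = "alert"
    ESCALATE = "escalate"
    ROLLBACK = "rollback"
    RETRAIN = "retrain"


@dataclass
class DegradationTrigger:
    """One quantitative degradation trigger, agreed before deployment."""
    metric_name: str   # e.g. a hypothetical "auc_7d" rolling metric
    threshold: float   # breach boundary agreed with the business
    direction: str     # "below" or "above"
    response: Response # what must happen on breach, without manual detection

    def breached(self, observed: float) -> bool:
        """Return True when the observed value crosses the threshold."""
        if self.direction == "below":
            return observed < self.threshold
        return observed > self.threshold


# Example: alert if the 7-day rolling AUC falls under the agreed floor.
trigger = DegradationTrigger("auc_7d", threshold=0.82, direction="below",
                             response=Response.ALERT)
```

Defining the trigger as data (rather than burying thresholds in monitoring code) is what makes it reviewable at deployment time and auditable afterwards.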
Strategic Impact
- Enables proactive response to model degradation before it reaches the threshold of user impact or business harm
- Creates a contractual quality agreement between the AI team and the business about what "acceptable model performance" means in production
- Reduces mean time to detect and mean time to recover from AI performance incidents through automated alerting
- Provides the quantitative evidence needed to justify retraining investment at the right time rather than on an arbitrary schedule
- Supports lifecycle governance requirements in regulated industries where continuous model oversight is mandatory
Risks of Not Having This Standard
- Model degradation compounds silently for months before discovery, maximising harm and recovery cost
- Retraining decisions are made on gut feel or calendar schedules rather than evidence of actual degradation
- Incident response is slow because the team must first establish baseline performance before investigating the extent of the problem
- Business stakeholders lose confidence when they discover that the organisation's AI systems degrade without detection
- Regulatory scrutiny increases when organisations cannot demonstrate that they have mechanisms to detect degrading model performance
CMMI Maturity Model
Level 1 – Initial
| Category | Description |
| --- | --- |
| People & Culture | Model degradation is detected through user complaints or periodic manual spot-checks; there is no proactive detection |
| Process & Governance | No degradation trigger policy; the team has no formal agreement about what level of performance decline constitutes a problem |
| Technology & Tools | Production monitoring is limited to infrastructure metrics (latency, error rate); model quality metrics are absent |
| Measurement & Metrics | No production model quality metrics; degradation cannot be quantified until it has caused visible harm |
Level 2 – Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams identify the key quality metrics for each production model and discuss threshold levels informally |
| Process & Governance | A requirement to define at least one degradation trigger per production model is established; triggers are documented at deployment |
| Technology & Tools | Basic metric dashboards display proxy metrics (prediction score distributions, volume anomalies) that can indicate degradation |
| Measurement & Metrics | Trigger thresholds are defined per model; alerts are sent when thresholds are breached, though response procedures are informal |
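The proxy metric named at this level — drift in prediction score distributions — is commonly quantified with the Population Stability Index. A minimal sketch; the stability bands in the docstring are a widely used rule of thumb, not part of this standard:

```python
import numpy as np


def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline score distribution and a production window.

    Rule of thumb: PSI < 0.1 is stable, 0.1-0.25 warrants investigation,
    > 0.25 suggests significant drift.
    """
    # Bin edges come from the baseline so both windows are compared
    # on the same grid.
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    # Convert to proportions; a small floor avoids log(0) / division by zero.
    e_pct = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    a_pct = np.clip(a_counts / a_counts.sum(), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

A PSI trigger needs no ground truth labels, which is why it appears at this maturity level before ground-truth performance monitoring does.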
Level 3 – Defined
| Category | Description |
| --- | --- |
| People & Culture | Degradation trigger definition is part of the deployment readiness checklist; triggers are agreed between ML, product, and operations teams |
| Process & Governance | A formal trigger definition standard specifies required metric types (data drift, prediction drift, ground truth performance) and response procedures per trigger type |
| Technology & Tools | Model monitoring platforms track defined triggers automatically; automated alerts are routed to on-call channels with context to support rapid response |
| Measurement & Metrics | Trigger breach rate, alert response time, and false positive rate are tracked per model and reviewed in operational reviews |
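The routing described at this level — each trigger type mapped to a response procedure and an on-call channel, with context attached to the alert — might be sketched as follows. The channel names, procedure labels, and payload fields are all hypothetical:

```python
# Hypothetical routing table: the three metric types this standard requires,
# each mapped to an on-call destination and a response procedure.
TRIGGER_TYPES = {
    "data_drift":       {"channel": "#ml-oncall",    "procedure": "investigate-inputs"},
    "prediction_drift": {"channel": "#ml-oncall",    "procedure": "compare-baselines"},
    "ground_truth":     {"channel": "#ml-incidents", "procedure": "rollback-or-retrain"},
}


def build_alert(model_id, trigger_type, metric, observed, threshold):
    """Assemble an alert payload carrying the context responders need."""
    route = TRIGGER_TYPES[trigger_type]
    return {
        "model": model_id,
        "trigger_type": trigger_type,
        "metric": metric,
        "observed": observed,
        "threshold": threshold,
        "procedure": route["procedure"],
        "channel": route["channel"],
    }
```

Including the observed value, threshold, and procedure in the payload is what lets the on-call responder act without first reconstructing baseline performance.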
Level 4 – Quantitatively Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams are accountable for trigger coverage and response SLAs; trigger effectiveness is reviewed quarterly using incident retrospective data |
| Process & Governance | Trigger thresholds are calibrated quantitatively based on the cost of false positives (unnecessary retraining) and false negatives (undetected degradation) |
| Technology & Tools | Multi-metric anomaly detection combines multiple signals to reduce false positive rates while maintaining sensitivity |
| Measurement & Metrics | Mean time to detect degradation, mean time to recover, and trigger sensitivity and specificity are measured and reported |
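The cost-based calibration described under Process & Governance can be sketched as a search over candidate thresholds against labelled history. The function name and the shape of `history` are assumptions for illustration:

```python
def calibrate_threshold(history, candidate_thresholds, cost_fp, cost_fn):
    """Pick the threshold minimising expected cost over labelled history.

    history: list of (metric_value, truly_degraded: bool) from past
    monitoring windows, labelled in retrospectives. A window alerts when
    metric_value exceeds the threshold (higher = worse).
    """
    def expected_cost(t):
        # False positive: alerted, but the model was actually healthy.
        fp = sum(1 for v, degraded in history if v > t and not degraded)
        # False negative: stayed silent while the model was degrading.
        fn = sum(1 for v, degraded in history if v <= t and degraded)
        return cost_fp * fp + cost_fn * fn
    return min(candidate_thresholds, key=expected_cost)
```

Raising `cost_fn` relative to `cost_fp` pushes the chosen threshold lower (more sensitive), which makes the business trade-off explicit rather than implicit in an arbitrary number.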
Level 5 – Optimising
| Category | Description |
| --- | --- |
| People & Culture | Trigger design knowledge is shared across teams; the organisation develops a library of effective trigger patterns per AI use case type |
| Process & Governance | Trigger definitions are continuously refined based on incident retrospectives and advances in drift detection methodology |
| Technology & Tools | Adaptive trigger systems adjust thresholds dynamically based on seasonal patterns and known environmental changes |
| Measurement & Metrics | Long-term data on trigger effectiveness is used to build predictive models of when specific model types are likely to degrade, enabling proactive retraining |
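One simple form of the adaptive thresholding described at this level — per-weekday baselines rather than one static number — might look like this; the weekday keying and the `k`-sigma rule are illustrative assumptions:

```python
from collections import defaultdict
from statistics import mean, stdev


def adaptive_thresholds(history, k=3.0):
    """Per-weekday drift thresholds learned from historical metric values.

    history: list of (weekday: int 0-6, metric_value). The threshold for
    each weekday is mean + k * sample std of that weekday's past values,
    so the trigger adapts to known weekly seasonality instead of firing
    on every Monday traffic spike.
    """
    by_day = defaultdict(list)
    for day, value in history:
        by_day[day].append(value)
    # stdev needs at least two observations per weekday.
    return {day: mean(vals) + k * stdev(vals)
            for day, vals in by_day.items() if len(vals) >= 2}
```

The same pattern extends to monthly seasonality or known environmental changes by swapping the grouping key.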
Key Measures
- Percentage of production AI models with at least one formally defined and actively monitored degradation trigger
- Mean time to detect a degradation event from the point at which it first exceeded the trigger threshold
- Trigger false positive rate (alerts raised that did not correspond to genuine degradation requiring intervention)
- Trigger false negative rate (degradation events that were not detected by triggers before user impact)
- Mean time to recover from a triggered degradation event (retrain, recalibrate, or rollback)
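Two of the measures above — mean time to detect, and by the same pattern mean time to recover — reduce to averaging gaps between timestamped events. A sketch assuming a hypothetical event log of (breach start, detection time) pairs:

```python
from datetime import datetime, timedelta


def mean_time_to_detect(events):
    """MTTD: average gap between the first threshold breach and detection.

    events: list of (breach_start: datetime, detected_at: datetime) pairs,
    one per degradation event. The same function computes MTTR when fed
    (detected_at, recovered_at) pairs instead.
    """
    deltas = [(detected - start).total_seconds()
              for start, detected in events]
    return timedelta(seconds=sum(deltas) / len(deltas))
```

Measuring from the first threshold breach (not from when a human noticed) is what makes the metric an honest test of the instrumentation rather than of the on-call rota.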