Standard: Post-deployment model performance is monitored continuously
Purpose and Strategic Importance
This standard establishes the requirement that all AI models in production are subject to continuous performance monitoring, including tracking of prediction quality, data drift, concept drift, and system health metrics. It supports the policy of building AI systems that learn and improve continuously by ensuring that the team is never operating blind after deployment. AI models degrade silently as the world changes around them; without monitoring, that degradation becomes a business risk that compounds over time.
Strategic Impact
- Enables proactive intervention before model degradation reaches the threshold of user impact
- Creates the feedback signal needed to decide when retraining, recalibration, or replacement is warranted
- Supports regulatory compliance in domains that require evidence of ongoing model governance
- Reduces mean time to detect and resolve production AI incidents through automated alerting
- Transforms deployment from a one-time event into a continuous improvement loop anchored in real-world evidence
Risks of Not Having This Standard
- Model drift goes undetected for months, causing systematically degraded decisions that erode business outcomes
- Incident response is reactive and slow because there is no baseline to compare against when issues are reported
- Teams over-retrain or under-retrain models because they have no data-driven signal to guide retraining cadence
- Regulatory scrutiny increases when organisations cannot demonstrate that deployed models are operating as intended
- Trust in AI systems collapses after a high-visibility failure that monitoring would have predicted and prevented
CMMI Maturity Model
Level 1 – Initial
| Category | Description |
| --- | --- |
| People & Culture | Monitoring is absent or purely reactive; teams learn of model issues through user complaints |
| Process & Governance | No monitoring policy; model behaviour after deployment is assumed to remain stable |
| Technology & Tools | Only infrastructure metrics (CPU, latency) are tracked; model-level performance is invisible |
| Measurement & Metrics | No model performance metrics in production; post-deployment visibility is zero |
Level 2 – Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams acknowledge the need to monitor and assign responsibility for reviewing production metrics periodically |
| Process & Governance | Basic monitoring is included in deployment requirements; a weekly review of model output logs is established |
| Technology & Tools | Prediction logging is enabled in production; manual spot-checks of output quality are conducted |
| Measurement & Metrics | A small set of proxy metrics (e.g. prediction volume, score distributions) is tracked in a dashboard |
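At this level the proxy metrics are simple summaries of what the model is emitting, not ground-truth accuracy. A minimal sketch of what such a dashboard feed might compute, assuming scores are floats pulled from the prediction log (the function and field names here are illustrative, not prescribed by the standard):

```python
from statistics import mean, quantiles

def score_distribution_summary(scores):
    """Summarise a batch of prediction scores for dashboarding.

    `scores` is assumed to be a list of floats (e.g. probabilities)
    read from the production prediction log.
    """
    q = quantiles(scores, n=10)  # deciles of the score distribution
    return {
        "volume": len(scores),   # prediction volume for the window
        "mean": mean(scores),
        "p10": q[0],
        "p50": q[4],             # median score
        "p90": q[8],
    }

summary = score_distribution_summary(
    [0.1, 0.4, 0.45, 0.5, 0.62, 0.7, 0.9, 0.95]
)
```

Tracking these summaries per time window is enough to spot gross shifts (e.g. a collapse in prediction volume, or scores bunching at one extreme) before any label-based metric is available.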
Level 3 – Defined
| Category | Description |
| --- | --- |
| People & Culture | Monitoring ownership is explicit per model; teams treat monitoring alerts as high-priority work items |
| Process & Governance | A monitoring standard defines required metrics, alerting thresholds, and response SLAs for each model risk tier |
| Technology & Tools | A model monitoring platform (e.g. Evidently, Arize, Fiddler) tracks data drift, prediction drift, and ground truth performance where labels are available |
| Measurement & Metrics | Drift scores, prediction distribution statistics, and ground truth accuracy (where available) are reported continuously and compared to release baselines |
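A common drift score used to compare a production window against a release baseline is the Population Stability Index (PSI). A minimal sketch, assuming scores lie in [0, 1]; the rule-of-thumb thresholds in the docstring are illustrative conventions, not values mandated by this standard:

```python
import math

def population_stability_index(baseline, current, bins=10):
    """PSI between a release-baseline score sample and a current
    production window, assuming scores in [0, 1].

    Common (illustrative) interpretation: < 0.1 stable,
    0.1-0.25 moderate drift, > 0.25 significant drift.
    """
    def bin_fractions(scores):
        counts = [0] * bins
        for s in scores:
            counts[min(int(s * bins), bins - 1)] += 1
        # Floor at a small epsilon so empty bins don't produce log(0).
        return [max(c / len(scores), 1e-6) for c in counts]

    b, c = bin_fractions(baseline), bin_fractions(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))
```

In practice a monitoring platform computes this per feature and per model output on a schedule; the point of the sketch is that drift is a comparison against a stored baseline, which is why release baselines must be captured at deployment time.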
Level 4 – Quantitatively Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams set quantitative SLAs for model performance in production; breaches trigger structured incident response |
| Process & Governance | Retraining triggers are defined quantitatively (e.g. drift score exceeds threshold for N consecutive days) |
| Technology & Tools | Automated alerting, anomaly detection, and root cause analysis tooling is integrated into the monitoring stack |
| Measurement & Metrics | Mean time to detect model degradation, mean time to retrain, and post-retrain performance recovery are measured and reported |
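The "drift score exceeds threshold for N consecutive days" trigger mentioned above can be expressed directly in code. A minimal sketch; the threshold and window values are placeholders, since the standard expects them to be set per model risk tier:

```python
def retrain_due(daily_drift_scores, threshold=0.25, consecutive_days=3):
    """Return True once the drift score has exceeded `threshold`
    for `consecutive_days` days in a row.

    `daily_drift_scores` is assumed to be an ordered list of one
    drift score per day; threshold and window are illustrative.
    """
    streak = 0
    for score in daily_drift_scores:
        streak = streak + 1 if score > threshold else 0
        if streak >= consecutive_days:
            return True
    return False
```

Requiring consecutive breaches rather than a single spike is a deliberate design choice: it filters transient anomalies (a bad batch, a holiday traffic pattern) so retraining is triggered by sustained drift, not noise.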
Level 5 – Optimising
| Category | Description |
| --- | --- |
| People & Culture | Monitoring insights are shared as organisational learning to improve monitoring design for future models |
| Process & Governance | Monitoring standards are continuously refined based on incident retrospectives and advances in drift detection methodology |
| Technology & Tools | Monitoring feeds automated retraining pipelines that trigger, train, validate, and deploy updates within defined safety guardrails |
| Measurement & Metrics | Monitoring data is used to forecast model lifespan and inform proactive retraining investment decisions |
Key Measures
- Percentage of production AI models covered by automated performance monitoring
- Mean time to detect a model performance degradation event exceeding the defined threshold
- Rate of model performance alerts that result in a retraining or recalibration action
- Data drift score distribution across production models tracked over rolling 90-day windows
- Mean time from degradation detection to resolution (retrain, recalibrate, or retire)
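Time-based measures such as mean time to detect reduce to simple arithmetic over incident records. A minimal sketch, assuming records pair a degradation start time with a detection time (the record shape is illustrative; real data would come from the incident tracker):

```python
from datetime import datetime, timedelta

def mean_time_to_detect(incidents):
    """Mean time to detect, from (degradation_start, detected_at)
    pairs. Record shape is illustrative."""
    deltas = [detected - started for started, detected in incidents]
    return sum(deltas, timedelta()) / len(deltas)

mttd = mean_time_to_detect([
    (datetime(2024, 3, 1, 9), datetime(2024, 3, 1, 15)),  # 6 h to detect
    (datetime(2024, 3, 5, 0), datetime(2024, 3, 5, 12)),  # 12 h to detect
])
```

The same pattern applies to mean time from detection to resolution; the harder part is instrumenting the true degradation start time, which is usually back-dated from drift dashboards during the retrospective.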