Standard: Post-deployment model performance is monitored continuously
Purpose and Strategic Importance
This standard establishes the requirement that all AI models in production are subject to continuous performance monitoring, including tracking of prediction quality, data drift, concept drift, and system health metrics. It supports the policy of building AI systems that learn and improve continuously by ensuring that the team is never operating blind after deployment. AI models degrade silently as the world changes around them; without monitoring, that degradation becomes a business risk that compounds over time.
Strategic Impact
- Enables proactive intervention before model degradation reaches the threshold of user impact
- Creates the feedback signal needed to decide when retraining, recalibration, or replacement is warranted
- Supports regulatory compliance in domains that require evidence of ongoing model governance
- Reduces mean time to detect and resolve production AI incidents through automated alerting
- Transforms deployment from a one-time event into a continuous improvement loop anchored in real-world evidence
Risks of Not Having This Standard
- Model drift goes undetected for months, causing systematically degraded decisions that erode business outcomes
- Incident response is reactive and slow because there is no baseline to compare against when issues are reported
- Teams over-retrain or under-retrain models because they have no data-driven signal to guide retraining cadence
- Regulatory scrutiny increases when organisations cannot demonstrate that deployed models are operating as intended
- Trust in AI systems collapses after a high-visibility failure that monitoring would have predicted and prevented
CMMI Maturity Model
Level 1 – Initial
| Category | Description |
| --- | --- |
| People & Culture | Monitoring is absent or purely reactive; teams learn of model issues through user complaints |
| Process & Governance | No monitoring policy; model behaviour after deployment is assumed to remain stable |
| Technology & Tools | Only infrastructure metrics (CPU, latency) are tracked; model-level performance is invisible |
| Measurement & Metrics | No model performance metrics in production; post-deployment visibility is zero |
Level 2 – Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams acknowledge the need to monitor and assign responsibility for reviewing production metrics periodically |
| Process & Governance | Basic monitoring is included in deployment requirements; a weekly review of model output logs is established |
| Technology & Tools | Prediction logging is enabled in production; manual spot-checks of output quality are conducted |
| Measurement & Metrics | A small set of proxy metrics (e.g. prediction volume, score distributions) is tracked in a dashboard |
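At this level the proxy metrics are simple summaries of what the model is emitting, not ground-truth accuracy. A minimal sketch of what such a dashboard feed might compute, assuming scores are floats pulled from the prediction log (the function and field names here are illustrative, not prescribed by the standard):

```python
from statistics import mean, quantiles

def score_distribution_summary(scores):
    """Summarise a batch of prediction scores for dashboarding.

    `scores` is assumed to be a list of floats (e.g. probabilities)
    read from the production prediction log.
    """
    q = quantiles(scores, n=10)  # deciles of the score distribution
    return {
        "volume": len(scores),   # prediction volume for the window
        "mean": mean(scores),
        "p10": q[0],
        "p50": q[4],             # median score
        "p90": q[8],
    }

summary = score_distribution_summary(
    [0.1, 0.4, 0.45, 0.5, 0.62, 0.7, 0.9, 0.95]
)
```

Tracking these summaries per time window is enough to spot gross shifts (e.g. a collapse in prediction volume, or scores bunching at one extreme) before any label-based metric is available.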
Level 3 – Defined
| Category | Description |
| --- | --- |
| People & Culture | Monitoring ownership is explicit per model; teams treat monitoring alerts as high-priority work items |
| Process & Governance | A monitoring standard defines required metrics, alerting thresholds, and response SLAs for each model risk tier |
| Technology & Tools | A model monitoring platform (e.g. Evidently, Arize, Fiddler) tracks data drift, prediction drift, and ground truth performance where labels are available |
| Measurement & Metrics | Drift scores, prediction distribution statistics, and ground truth accuracy (where available) are reported continuously and compared to release baselines |
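A common drift score used to compare a production window against a release baseline is the Population Stability Index (PSI). A minimal sketch, assuming scores lie in [0, 1]; the rule-of-thumb thresholds in the docstring are illustrative conventions, not values mandated by this standard:

```python
import math

def population_stability_index(baseline, current, bins=10):
    """PSI between a release-baseline score sample and a current
    production window, assuming scores in [0, 1].

    Common (illustrative) interpretation: < 0.1 stable,
    0.1-0.25 moderate drift, > 0.25 significant drift.
    """
    def bin_fractions(scores):
        counts = [0] * bins
        for s in scores:
            counts[min(int(s * bins), bins - 1)] += 1
        # Floor at a small epsilon so empty bins don't produce log(0).
        return [max(c / len(scores), 1e-6) for c in counts]

    b, c = bin_fractions(baseline), bin_fractions(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))
```

In practice a monitoring platform computes this per feature and per model output on a schedule; the point of the sketch is that drift is a comparison against a stored baseline, which is why release baselines must be captured at deployment time.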
Level 4 – Quantitatively Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams set quantitative SLAs for model performance in production; breaches trigger structured incident response |
| Process & Governance | Retraining triggers are defined quantitatively (e.g. drift score exceeds threshold for N consecutive days) |
| Technology & Tools | Automated alerting, anomaly detection, and root cause analysis tooling is integrated into the monitoring stack |
| Measurement & Metrics | Mean time to detect model degradation, mean time to retrain, and post-retrain performance recovery are measured and reported |
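The "drift score exceeds threshold for N consecutive days" trigger mentioned above can be expressed directly in code. A minimal sketch; the threshold and window values are placeholders, since the standard expects them to be set per model risk tier:

```python
def retrain_due(daily_drift_scores, threshold=0.25, consecutive_days=3):
    """Return True once the drift score has exceeded `threshold`
    for `consecutive_days` days in a row.

    `daily_drift_scores` is assumed to be an ordered list of one
    drift score per day; threshold and window are illustrative.
    """
    streak = 0
    for score in daily_drift_scores:
        streak = streak + 1 if score > threshold else 0
        if streak >= consecutive_days:
            return True
    return False
```

Requiring consecutive breaches rather than a single spike is a deliberate design choice: it filters transient anomalies (a bad batch, a holiday traffic pattern) so retraining is triggered by sustained drift, not noise.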
Level 5 – Optimising
| Category | Description |
| --- | --- |
| People & Culture | Monitoring insights are shared as organisational learning to improve monitoring design for future models |
| Process & Governance | Monitoring standards are continuously refined based on incident retrospectives and advances in drift detection methodology |
| Technology & Tools | Monitoring feeds automated retraining pipelines that trigger, train, validate, and deploy updates within defined safety guardrails |
| Measurement & Metrics | Monitoring data is used to forecast model lifespan and inform proactive retraining investment decisions |
Key Measures
- Percentage of production AI models covered by automated performance monitoring
- Mean time to detect a model performance degradation event exceeding the defined threshold
- Rate of model performance alerts that result in a retraining or recalibration action
- Data drift score distribution across production models tracked over rolling 90-day windows
- Mean time from degradation detection to resolution (retrain, recalibrate, or retire)
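Time-based measures such as mean time to detect reduce to simple arithmetic over incident records. A minimal sketch, assuming records pair a degradation start time with a detection time (the record shape is illustrative; real data would come from the incident tracker):

```python
from datetime import datetime, timedelta

def mean_time_to_detect(incidents):
    """Mean time to detect, from (degradation_start, detected_at)
    pairs. Record shape is illustrative."""
    deltas = [detected - started for started, detected in incidents]
    return sum(deltas, timedelta()) / len(deltas)

mttd = mean_time_to_detect([
    (datetime(2024, 3, 1, 9), datetime(2024, 3, 1, 15)),  # 6 h to detect
    (datetime(2024, 3, 5, 0), datetime(2024, 3, 5, 12)),  # 12 h to detect
])
```

The same pattern applies to mean time from detection to resolution; the harder part is instrumenting the true degradation start time, which is usually back-dated from drift dashboards during the retrospective.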