Standard: Training data quality is validated before model development begins
Purpose and Strategic Importance
This standard mandates that training data pass defined quality checks — covering completeness, correctness, representativeness, and absence of harmful bias — before any model development work begins. It supports the policy of treating data quality as a first-class concern by preventing the well-documented "garbage in, garbage out" failure mode that undermines AI credibility. Teams that skip data validation waste engineering cycles building models on foundations that cannot support reliable predictions.
Strategic Impact
- Prevents model development effort from being wasted on data that will produce unreliable or biased outputs
- Reduces the likelihood of costly mid-project restarts when data quality problems are discovered late
- Creates a shared understanding of data standards that bridges data engineering and machine learning teams
- Supports fairness and compliance goals by surfacing demographic gaps and proxy variables before training
- Builds organisational capability in data profiling and curation that benefits all analytics work, not just AI
Risks of Not Having This Standard
- Models trained on incomplete or mislabelled data produce systematically incorrect predictions in production
- Biased training data propagates and amplifies discriminatory patterns at scale
- Late discovery of data quality issues causes project delays and budget overruns
- Compliance failures occur when regulated use cases are built on data that violates retention or consent requirements
- Teams lose credibility with business stakeholders when AI systems fail in ways attributable to poor data hygiene
CMMI Maturity Model
Level 1 – Initial
| Category | Description |
|---|---|
| People & Culture | Data is used for training without systematic review; quality concerns are raised informally, if at all |
| Process & Governance | No data validation gate exists before model development; data provenance is undocumented |
| Technology & Tools | Validation is conducted manually using spreadsheets or ad hoc scripts, if it is conducted at all |
| Measurement & Metrics | No data quality metrics are tracked; issues are discovered through model failure rather than upfront checks |
Level 2 – Managed
| Category | Description |
|---|---|
| People & Culture | Data engineers and ML engineers discuss data quality at project kickoff; known issues are documented |
| Process & Governance | A data readiness checklist exists covering completeness and null rates; sign-off is required before model training starts |
| Technology & Tools | Basic profiling scripts check row counts, null rates, and value distributions; outputs are reviewed by the team |
| Measurement & Metrics | Completeness and null-rate thresholds are defined; datasets that fail are sent back for remediation |
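The Level 2 profiling-and-threshold check described above can be sketched with a short standard-library script. The thresholds (`MIN_ROWS`, `MAX_NULL_RATE`) and field names are illustrative assumptions, not values this standard prescribes.

```python
from collections import Counter

MIN_ROWS = 3         # hypothetical minimum row count
MAX_NULL_RATE = 0.2  # hypothetical per-column null-rate ceiling

def profile(rows):
    """Basic profiling: row count, per-column null rates, value distributions."""
    columns = rows[0].keys()
    n = len(rows)
    null_rates = {c: sum(r[c] is None for r in rows) / n for c in columns}
    distributions = {c: Counter(r[c] for r in rows if r[c] is not None)
                     for c in columns}
    return {"row_count": n, "null_rates": null_rates,
            "distributions": distributions}

def passes_readiness(report):
    """Fail if the dataset is too small or any column is too sparse."""
    return (report["row_count"] >= MIN_ROWS and
            all(rate <= MAX_NULL_RATE for rate in report["null_rates"].values()))

rows = [
    {"age": 34, "segment": "a"},
    {"age": 29, "segment": "b"},
    {"age": None, "segment": "a"},
    {"age": 41, "segment": "a"},
]
report = profile(rows)
ok = passes_readiness(report)  # age is 25% null, above the 20% ceiling
```

At Level 2 the output of a script like this is reviewed by the team; automation of the gate itself comes at later levels.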
Level 3 – Defined
| Category | Description |
|---|---|
| People & Culture | Data quality responsibility is shared between data engineering and ML teams; a data quality owner is assigned per project |
| Process & Governance | A formal data validation framework covers completeness, consistency, timeliness, representativeness, and bias screening |
| Technology & Tools | Automated data validation pipelines run on every dataset; results are logged and must meet defined thresholds before training is permitted |
| Measurement & Metrics | Validation reports are produced per dataset covering all framework dimensions; pass/fail status is visible to project stakeholders |
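A per-dimension validation report of the kind Level 3 calls for might look like the sketch below, covering two of the framework's dimensions. The `group` field, the thresholds, and the record shape are all illustrative assumptions.

```python
from collections import Counter

def validate(records, group_field="group", min_group_share=0.1,
             min_completeness=0.95):
    """Return per-dimension pass/fail results plus an overall verdict."""
    results = {}
    # Completeness: share of non-missing field values across all records.
    total = sum(len(r) for r in records)
    filled = sum(v is not None for r in records for v in r.values())
    results["completeness"] = (filled / total) >= min_completeness
    # Representativeness / bias screening: every group must hold a minimum share.
    counts = Counter(r[group_field] for r in records)
    n = len(records)
    results["representativeness"] = all(
        c / n >= min_group_share for c in counts.values())
    results["overall"] = all(results.values())
    return results

records = [
    {"x": 1, "group": "a"},
    {"x": 2, "group": "a"},
    {"x": 3, "group": "b"},
    {"x": None, "group": "b"},  # one missing value drags completeness down
]
report = validate(records)
```

In a real pipeline each dimension's result would be logged, and a failed dimension would block training until the dataset is remediated.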
Level 4 – Quantitatively Managed
| Category | Description |
|---|---|
| People & Culture | Teams set quantitative data quality targets at project inception; failure to meet targets triggers escalation |
| Process & Governance | Data quality gates are enforced in the ML pipeline; training jobs are blocked until validation passes |
| Technology & Tools | Advanced profiling includes distribution shift detection, label quality assessment, and fairness-aware sampling analysis |
| Measurement & Metrics | Data quality scores per dimension are tracked over time; trends inform data engineering investment priorities |
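Distribution shift detection at Level 4 is often implemented with a statistic such as the Population Stability Index (PSI). The equal-width bucketing below and the convention of treating values above roughly 0.2 as significant shift are common practice, not requirements of this standard.

```python
import math

def psi(expected, actual, buckets=4):
    """Population Stability Index between a reference and a new sample."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / buckets for i in range(1, buckets)]

    def shares(values):
        counts = [0] * buckets
        for v in values:
            counts[sum(v > e for e in edges)] += 1
        # Small floor avoids log(0) when a bucket is empty.
        return [max(c / len(values), 1e-4) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [1, 2, 3, 4, 5, 6, 7, 8]
stable = [2, 3, 4, 5, 6, 7]   # similar spread: low PSI
drifted = [7, 8, 8, 8]        # mass piled into the top bucket: high PSI
shift = psi(train, stable)
```

A pipeline gate would compare the statistic against the agreed threshold and block training (or raise an alert) when it is exceeded.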
Level 5 – Optimising
| Category | Description |
|---|---|
| People & Culture | Teams proactively invest in data quality improvement as a competitive advantage; learnings are shared across the organisation |
| Process & Governance | Data validation standards are continuously refined based on model failure post-mortems and evolving regulatory guidance |
| Technology & Tools | Automated data lineage tools track quality from source to training; anomaly detection flags degradation in production data feeds before it affects models |
| Measurement & Metrics | Data quality metrics are correlated with model performance outcomes, enabling evidence-based quality investment decisions |
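The Level 5 practice of correlating data quality metrics with model performance can start as simply as a Pearson correlation across projects. The scores below are fabricated for illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-project figures: composite data quality score vs. model AUC.
quality_scores = [0.62, 0.71, 0.80, 0.88, 0.93]
model_auc      = [0.70, 0.74, 0.79, 0.84, 0.88]
r = pearson(quality_scores, model_auc)
```

A strong positive correlation across enough projects is the kind of evidence that justifies further data quality investment; a weak one suggests the quality score itself needs refining.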
Key Measures
- Percentage of model development projects that completed a formal data validation gate before training began
- Average data quality score (completeness, correctness, representativeness) across datasets entering training pipelines
- Number of model restarts or delays attributed to data quality issues discovered after training commenced
- Rate of bias-related findings detected at data validation versus discovered post-deployment
- Mean time to remediate a failed data validation check before model development can proceed
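Two of the key measures above can be computed directly from per-project tracking records; in this minimal sketch the field names and values are hypothetical.

```python
# Hypothetical per-project tracking records.
projects = [
    {"gate_completed": True,  "restart_due_to_data": False},
    {"gate_completed": True,  "restart_due_to_data": False},
    {"gate_completed": False, "restart_due_to_data": True},
    {"gate_completed": True,  "restart_due_to_data": False},
]

# Percentage of projects that completed the validation gate before training.
gate_rate = 100 * sum(p["gate_completed"] for p in projects) / len(projects)

# Number of restarts attributed to data quality issues discovered late.
data_restarts = sum(p["restart_due_to_data"] for p in projects)
```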