Standard: Training data quality is validated before model development begins
Purpose and Strategic Importance
This standard mandates that training data pass defined quality checks — covering completeness, correctness, representativeness, and absence of harmful bias — before any model development work begins. It supports the policy of treating data quality as a first-class concern by preventing the well-documented "garbage in, garbage out" failure mode that undermines AI credibility. Teams that skip data validation waste engineering cycles building models on foundations that cannot support reliable predictions.
Strategic Impact
- Prevents model development effort from being wasted on data that will produce unreliable or biased outputs
- Reduces the likelihood of costly mid-project restarts when data quality problems are discovered late
- Creates a shared understanding of data standards that bridges data engineering and machine learning teams
- Supports fairness and compliance goals by surfacing demographic gaps and proxy variables before training
- Builds organisational capability in data profiling and curation that benefits all analytics work, not just AI
Risks of Not Having This Standard
- Models trained on incomplete or mislabelled data produce systematically incorrect predictions in production
- Biased training data propagates and amplifies discriminatory patterns at scale
- Late discovery of data quality issues causes project delays and budget overruns
- Compliance failures occur when regulated use cases are built on data that violates retention or consent requirements
- Teams lose credibility with business stakeholders when AI systems fail in ways attributable to poor data hygiene
CMMI Maturity Model
Level 1 – Initial
| Category | Description |
|---|---|
| People & Culture | Data is used for training without systematic review; quality concerns are raised informally, if at all |
| Process & Governance | No data validation gate exists before model development; data provenance is undocumented |
| Technology & Tools | Validation is conducted manually using spreadsheets or ad hoc scripts, if it is conducted at all |
| Measurement & Metrics | No data quality metrics are tracked; issues are discovered through model failure rather than upfront checks |
Level 2 – Managed
| Category | Description |
|---|---|
| People & Culture | Data engineers and ML engineers discuss data quality at project kickoff; known issues are documented |
| Process & Governance | A data readiness checklist exists covering completeness and null rates; sign-off is required before model training starts |
| Technology & Tools | Basic profiling scripts check row counts, null rates, and value distributions; outputs are reviewed by the team |
| Measurement & Metrics | Completeness and null-rate thresholds are defined; datasets that fail are sent back for remediation |
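The Level 2 profiling-and-threshold check described above can be sketched with a short standard-library script. The thresholds (`MIN_ROWS`, `MAX_NULL_RATE`) and field names are illustrative assumptions, not values this standard prescribes.

```python
from collections import Counter

MIN_ROWS = 3         # hypothetical minimum row count
MAX_NULL_RATE = 0.2  # hypothetical per-column null-rate ceiling

def profile(rows):
    """Basic profiling: row count, per-column null rates, value distributions."""
    columns = rows[0].keys()
    n = len(rows)
    null_rates = {c: sum(r[c] is None for r in rows) / n for c in columns}
    distributions = {c: Counter(r[c] for r in rows if r[c] is not None)
                     for c in columns}
    return {"row_count": n, "null_rates": null_rates,
            "distributions": distributions}

def passes_readiness(report):
    """Fail if the dataset is too small or any column is too sparse."""
    return (report["row_count"] >= MIN_ROWS and
            all(rate <= MAX_NULL_RATE for rate in report["null_rates"].values()))

rows = [
    {"age": 34, "segment": "a"},
    {"age": 29, "segment": "b"},
    {"age": None, "segment": "a"},
    {"age": 41, "segment": "a"},
]
report = profile(rows)
ok = passes_readiness(report)  # age is 25% null, above the 20% ceiling
```

At Level 2 the output of a script like this is reviewed by the team; automation of the gate itself comes at later levels.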
Level 3 – Defined
| Category | Description |
|---|---|
| People & Culture | Data quality responsibility is shared between data engineering and ML teams; a data quality owner is assigned per project |
| Process & Governance | A formal data validation framework covers completeness, consistency, timeliness, representativeness, and bias screening |
| Technology & Tools | Automated data validation pipelines run on every dataset; results are logged and must meet defined thresholds before training is permitted |
| Measurement & Metrics | Validation reports are produced per dataset covering all framework dimensions; pass/fail status is visible to project stakeholders |
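A per-dimension validation report of the kind Level 3 calls for might look like the sketch below, covering two of the framework's dimensions. The `group` field, the thresholds, and the record shape are all illustrative assumptions.

```python
from collections import Counter

def validate(records, group_field="group", min_group_share=0.1,
             min_completeness=0.95):
    """Return per-dimension pass/fail results plus an overall verdict."""
    results = {}
    # Completeness: share of non-missing field values across all records.
    total = sum(len(r) for r in records)
    filled = sum(v is not None for r in records for v in r.values())
    results["completeness"] = (filled / total) >= min_completeness
    # Representativeness / bias screening: every group must hold a minimum share.
    counts = Counter(r[group_field] for r in records)
    n = len(records)
    results["representativeness"] = all(
        c / n >= min_group_share for c in counts.values())
    results["overall"] = all(results.values())
    return results

records = [
    {"x": 1, "group": "a"},
    {"x": 2, "group": "a"},
    {"x": 3, "group": "b"},
    {"x": None, "group": "b"},  # one missing value drags completeness down
]
report = validate(records)
```

In a real pipeline each dimension's result would be logged, and a failed dimension would block training until the dataset is remediated.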
Level 4 – Quantitatively Managed
| Category | Description |
|---|---|
| People & Culture | Teams set quantitative data quality targets at project inception; failure to meet targets triggers escalation |
| Process & Governance | Data quality gates are enforced in the ML pipeline; training jobs are blocked until validation passes |
| Technology & Tools | Advanced profiling includes distribution shift detection, label quality assessment, and fairness-aware sampling analysis |
| Measurement & Metrics | Data quality scores per dimension are tracked over time; trends inform data engineering investment priorities |
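Distribution shift detection at Level 4 is often implemented with a statistic such as the Population Stability Index (PSI). The equal-width bucketing below and the convention of treating values above roughly 0.2 as significant shift are common practice, not requirements of this standard.

```python
import math

def psi(expected, actual, buckets=4):
    """Population Stability Index between a reference and a new sample."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / buckets for i in range(1, buckets)]

    def shares(values):
        counts = [0] * buckets
        for v in values:
            counts[sum(v > e for e in edges)] += 1
        # Small floor avoids log(0) when a bucket is empty.
        return [max(c / len(values), 1e-4) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [1, 2, 3, 4, 5, 6, 7, 8]
stable = [2, 3, 4, 5, 6, 7]   # similar spread: low PSI
drifted = [7, 8, 8, 8]        # mass piled into the top bucket: high PSI
shift = psi(train, stable)
```

A pipeline gate would compare the statistic against the agreed threshold and block training (or raise an alert) when it is exceeded.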
Level 5 – Optimising
| Category | Description |
|---|---|
| People & Culture | Teams proactively invest in data quality improvement as a competitive advantage; learnings are shared across the organisation |
| Process & Governance | Data validation standards are continuously refined based on model failure post-mortems and evolving regulatory guidance |
| Technology & Tools | Automated data lineage tools track quality from source to training; anomaly detection flags degradation in production data feeds before it affects models |
| Measurement & Metrics | Data quality metrics are correlated with model performance outcomes, enabling evidence-based quality investment decisions |
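The Level 5 practice of correlating data quality metrics with model performance can start as simply as a Pearson correlation across projects. The scores below are fabricated for illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-project figures: composite data quality score vs. model AUC.
quality_scores = [0.62, 0.71, 0.80, 0.88, 0.93]
model_auc      = [0.70, 0.74, 0.79, 0.84, 0.88]
r = pearson(quality_scores, model_auc)
```

A strong positive correlation across enough projects is the kind of evidence that justifies further data quality investment; a weak one suggests the quality score itself needs refining.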
Key Measures
- Percentage of model development projects that completed a formal data validation gate before training began
- Average data quality score (completeness, correctness, representativeness) across datasets entering training pipelines
- Number of model restarts or delays attributed to data quality issues discovered after training commenced
- Rate of bias-related findings detected at data validation versus discovered post-deployment
- Mean time to remediate a failed data validation check before model development can proceed
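Two of the key measures above can be computed directly from per-project tracking records; in this minimal sketch the field names and values are hypothetical.

```python
# Hypothetical per-project tracking records.
projects = [
    {"gate_completed": True,  "restart_due_to_data": False},
    {"gate_completed": True,  "restart_due_to_data": False},
    {"gate_completed": False, "restart_due_to_data": True},
    {"gate_completed": True,  "restart_due_to_data": False},
]

# Percentage of projects that completed the validation gate before training.
gate_rate = 100 * sum(p["gate_completed"] for p in projects) / len(projects)

# Number of restarts attributed to data quality issues discovered late.
data_restarts = sum(p["restart_due_to_data"] for p in projects)
```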