Standard: Training data quality is validated before model development begins

Purpose and Strategic Importance

This standard mandates that training data must pass defined quality checks — covering completeness, correctness, representativeness, and absence of harmful bias — before any model development work begins. It supports the policy of treating data quality as a first-class concern by preventing the well-documented "garbage in, garbage out" failure mode that undermines AI credibility. Teams that skip data validation waste engineering cycles building models on foundations that cannot support reliable predictions.

Strategic Impact

  • Prevents model development effort from being wasted on data that will produce unreliable or biased outputs
  • Reduces the likelihood of costly mid-project restarts when data quality problems are discovered late
  • Creates a shared understanding of data standards that bridges data engineering and machine learning teams
  • Supports fairness and compliance goals by surfacing demographic gaps and proxy variables before training
  • Builds organisational capability in data profiling and curation that benefits all analytics work, not just AI

Risks of Not Having This Standard

  • Models trained on incomplete or mislabelled data produce systematically incorrect predictions in production
  • Biased training data propagates and amplifies discriminatory patterns at scale
  • Late discovery of data quality issues causes project delays and budget overruns
  • Compliance failures occur when regulated use cases are built on data that violates retention or consent requirements
  • Teams lose credibility with business stakeholders when AI systems fail in ways attributable to poor data hygiene

CMMI Maturity Model

Level 1 – Initial

Category Description
People & Culture - Data is used for training without systematic review; quality concerns are raised informally, if at all
Process & Governance - No data validation gate exists before model development; data provenance is undocumented
Technology & Tools - Validation is conducted manually using spreadsheets or ad hoc scripts, if it is conducted at all
Measurement & Metrics - No data quality metrics are tracked; issues are discovered through model failure rather than upfront checks

Level 2 – Managed

Category Description
People & Culture - Data engineers and ML engineers discuss data quality at project kickoff; known issues are documented
Process & Governance - A data readiness checklist exists covering completeness and null rates; sign-off required before model training starts
Technology & Tools - Basic profiling scripts check row counts, null rates, and value distributions; outputs reviewed by the team
Measurement & Metrics - Completeness and null rate thresholds are defined; datasets that fail are sent back for remediation
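The basic profiling described at this level can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the null markers and the example column are assumptions.

```python
from collections import Counter

NULL_MARKERS = (None, "", "NA")  # assumed set of null sentinels; adjust per dataset

def profile_column(values):
    """Return basic quality stats for one column: row count, null rate,
    and the distribution of distinct non-null values."""
    total = len(values)
    nulls = sum(1 for v in values if v in NULL_MARKERS)
    dist = Counter(v for v in values if v not in NULL_MARKERS)
    return {
        "rows": total,
        "null_rate": nulls / total if total else 0.0,
        "distribution": dict(dist),
    }

# Example: a small categorical column with one missing entry.
report = profile_column(["a", "b", "a", None, "a"])
print(report["null_rate"])      # 0.2
print(report["distribution"])   # {'a': 3, 'b': 1}
```

Outputs like these are what the team reviews against the completeness and null-rate thresholds defined at this level.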

Level 3 – Defined

Category Description
People & Culture - Data quality responsibility is shared between data engineering and ML teams; a data quality owner is assigned per project
Process & Governance - A formal data validation framework covers completeness, consistency, timeliness, representativeness, and bias screening
Technology & Tools - Automated data validation pipelines run on every dataset; results are logged and must meet defined thresholds before training is permitted
Measurement & Metrics - Validation reports are produced per dataset covering all framework dimensions; pass/fail status is visible to project stakeholders
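A Level 3 validation gate can be sketched as a simple threshold comparison. The dimension names and floor values below are illustrative assumptions; the point is that per-dimension scores are computed upstream and training is only permitted when every dimension clears its floor.

```python
def validation_gate(metrics, thresholds):
    """Compare per-dimension quality scores against minimum thresholds.
    A missing dimension counts as a failure, so unmeasured data cannot
    silently pass the gate."""
    results = {dim: metrics.get(dim, 0.0) >= floor
               for dim, floor in thresholds.items()}
    return {"dimensions": results, "passed": all(results.values())}

# Illustrative thresholds and scores for one candidate dataset.
thresholds = {"completeness": 0.98, "consistency": 0.95, "representativeness": 0.90}
metrics = {"completeness": 0.99, "consistency": 0.97, "representativeness": 0.85}

report = validation_gate(metrics, thresholds)
print(report["passed"])  # False: representativeness is below its floor
```

The per-dimension pass/fail map doubles as the stakeholder-visible validation report this level calls for.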

Level 4 – Quantitatively Managed

Category Description
People & Culture - Teams set quantitative data quality targets at project inception; failure to meet targets triggers escalation
Process & Governance - Data quality gates are enforced in the ML pipeline; training jobs are blocked until validation passes
Technology & Tools - Advanced profiling includes distribution shift detection, label quality assessment, and fairness-aware sampling analysis
Measurement & Metrics - Data quality scores per dimension are tracked over time; trends inform data engineering investment priorities
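One common way to implement the distribution shift detection mentioned at this level is the Population Stability Index (PSI) over binned feature values. The bin counts below and the usual 0.1 / 0.25 interpretation bands are assumptions for illustration, not mandated values.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.
    Common rule of thumb (tune per dataset): < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 significant shift."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)  # clamp to avoid log(0)
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score

baseline = [50, 30, 20]   # bin counts captured at training time
current = [48, 31, 21]    # bin counts from a fresh data feed
print(round(psi(baseline, current), 4))  # 0.0016, well under 0.1
```

A score crossing the upper band would block the training job, which is exactly the quantitative gate this level describes.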

Level 5 – Optimising

Category Description
People & Culture - Teams proactively invest in data quality improvement as a competitive advantage; learnings are shared across the organisation
Process & Governance - Data validation standards are continuously refined based on model failure post-mortems and evolving regulatory guidance
Technology & Tools - Automated data lineage tools track quality from source to training; anomaly detection flags degradation in production data feeds before it affects models
Measurement & Metrics - Data quality metrics are correlated with model performance outcomes, enabling evidence-based quality investment decisions
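The correlation between quality metrics and model outcomes described here can be as simple as a Pearson coefficient over per-dataset figures. The quality scores and AUC values below are invented for illustration.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-dataset quality scores and resulting model AUCs.
quality = [0.90, 0.95, 0.80, 0.99]
auc = [0.71, 0.74, 0.65, 0.78]
print(pearson(quality, auc) > 0.9)  # True: strongly positive relationship
```

A strong positive correlation like this is the evidence base for prioritising data quality investment over, say, more model tuning.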

Key Measures

  • Percentage of model development projects that completed a formal data validation gate before training began
  • Average data quality score (completeness, correctness, representativeness) across datasets entering training pipelines
  • Number of model restarts or delays attributed to data quality issues discovered after training commenced
  • Rate of bias-related findings detected at data validation versus discovered post-deployment
  • Mean time to remediate a failed data validation check before model development can proceed
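The first measure above reduces to a simple portfolio calculation; a minimal sketch, where the project records and the `gate_completed` flag are assumed shapes for illustration:

```python
def gate_pass_rate(projects):
    """Percentage of model development projects that completed the
    data validation gate before training began."""
    if not projects:
        return 0.0
    passed = sum(1 for p in projects if p.get("gate_completed"))
    return 100.0 * passed / len(projects)

# Hypothetical portfolio snapshot.
portfolio = [
    {"name": "churn-model", "gate_completed": True},
    {"name": "forecast", "gate_completed": True},
    {"name": "ranker", "gate_completed": False},
]
print(round(gate_pass_rate(portfolio), 1))  # 66.7
```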

Associated Policies

Associated Practices

  • Transfer Learning and Fine-Tuning
  • Feasibility and Data Readiness Assessment
  • Data Pipeline Automation
  • Data Versioning and Lineage
  • Feature Engineering and Selection
  • Data Labelling and Annotation
  • Data Quality Assessment
