Practice: Data Quality Assessment

Purpose and Strategic Importance

The quality of an AI model is fundamentally constrained by the quality of the data it is trained on. Garbage in, garbage out is not a cliché — it is a precise description of how model bias, poor generalisation, and unreliable predictions emerge. Teams that begin model development without systematically assessing their data quality are building on an uncertain foundation, and the consequences — models that fail in production, perpetuate historical bias, or produce outputs that cannot be trusted — can be severe and expensive to remediate.

Data quality assessment is also a fairness intervention. Historical datasets frequently reflect historical inequities: underrepresentation of certain demographic groups, labelling inconsistencies, or collection biases that encode the prejudices of the systems that generated the data. Surfacing these issues before training is far more effective than attempting to correct for them after the model has learned from them.


Description of the Practice

  • Evaluates training data across standard quality dimensions: completeness, accuracy, consistency, timeliness, and representativeness of the target population.
  • Identifies and documents data quality issues before model development begins, with severity ratings and remediation plans for each identified problem.
  • Assesses distributional characteristics of the dataset — including class imbalance, feature distributions, and demographic representation — that could affect model fairness and generalisability.
  • Validates data against the intended use case, checking that the data actually reflects the problem the model is designed to solve.
  • Produces a data quality report that informs go/no-go decisions for model development and is retained as part of the model's documentation.
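The dimensions above can be made concrete with a baseline profiling pass. The sketch below is illustrative, not a complete assessment: it covers completeness (missing-value rates), consistency (duplicate rows), and class imbalance, using hypothetical column names.

```python
import pandas as pd

def profile_dataset(df: pd.DataFrame, label_col: str) -> dict:
    """Compute a few baseline quality metrics for a training dataset."""
    return {
        # Completeness: fraction of missing values per column
        "missing_rate": df.isna().mean().to_dict(),
        # Consistency: exact duplicate rows inflate the apparent sample size
        "duplicate_rate": df.duplicated().mean(),
        # Class imbalance: normalised distribution of the target label
        "label_distribution": df[label_col].value_counts(normalize=True).to_dict(),
        "row_count": len(df),
    }

# Toy example: one missing value and one duplicate row
sample = pd.DataFrame({
    "age": [34, None, 51, 34],
    "label": ["approve", "approve", "deny", "approve"],
})
report = profile_dataset(sample, label_col="label")
```

A report like this is only the starting point; accuracy, timeliness, and representativeness require checks against external ground truth and the target population, which no generic profiler can supply.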

How to Practise It (Playbook)

1. Getting Started

  • Define data quality criteria relevant to your domain and use case — generic checklists are a starting point, but contextual criteria (e.g., temporal freshness requirements, acceptable missing value rates) matter more.
  • Run a baseline quality assessment on your current primary training dataset to understand the landscape and identify the most significant issues.
  • Build a simple data quality scorecard that can be completed for any dataset and provides a consistent basis for comparison across projects.
  • Establish a policy that model development does not begin until a data quality assessment has been completed and any critical issues resolved.
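A simple scorecard and go/no-go gate, as described above, might look like the following sketch. The dimension names, the 1-to-5 scale, and the threshold of 3 are all illustrative assumptions, not a prescribed standard.

```python
from dataclasses import dataclass, field

# Illustrative quality dimensions; adapt to your domain and use case.
DIMENSIONS = ["completeness", "accuracy", "consistency", "timeliness", "representativeness"]

@dataclass
class QualityScorecard:
    dataset_name: str
    scores: dict                 # dimension -> score from 1 (poor) to 5 (excellent)
    critical_issues: list = field(default_factory=list)

    def overall(self) -> float:
        """Mean score across all dimensions."""
        return sum(self.scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

    def ready_for_modelling(self) -> bool:
        """Policy gate: no open critical issues and no dimension below 3."""
        return not self.critical_issues and all(self.scores[d] >= 3 for d in DIMENSIONS)

# Hypothetical dataset: strong overall, but underrepresentative of the population
card = QualityScorecard(
    "loan_applications_2024",
    scores={"completeness": 4, "accuracy": 3, "consistency": 4,
            "timeliness": 5, "representativeness": 2},
)
```

Because the scorecard is a small, consistent structure, it can be completed for any dataset and compared across projects; here `ready_for_modelling()` returns `False` despite a decent mean score, which is exactly the behaviour the policy requires.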

2. Scaling and Maturing

  • Automate key data quality checks using tools such as Great Expectations, dbt tests, or custom validation scripts, making quality assessment a continuous process rather than a point-in-time activity.
  • Build quality metrics into data pipeline monitoring so that degradations in incoming data quality are detected and flagged before they affect model training or inference.
  • Extend quality assessment to cover the full data supply chain — not just the training dataset but the sources, transformations, and integration points that produce it.
  • Track data quality metrics over time to identify trends and systemic issues rather than treating each assessment as an isolated event.
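In place of a full Great Expectations or dbt suite, the continuous-check pattern can be sketched as a minimal custom runner suitable for a pipeline step. The check names, thresholds, and column names below are illustrative and do not reproduce any real tool's API.

```python
import pandas as pd

# Named checks in the spirit of an expectation suite; thresholds are assumptions.
CHECKS = [
    ("age_not_null", lambda df: df["age"].isna().mean() <= 0.05),
    ("age_in_range", lambda df: df["age"].dropna().between(18, 100).all()),
    ("no_duplicate_ids", lambda df: df["id"].is_unique),
]

def run_checks(df: pd.DataFrame) -> dict:
    """Run every check against an incoming data batch; in a real pipeline,
    a failing check would raise an alert or block the downstream job."""
    return {name: bool(check(df)) for name, check in CHECKS}

# Toy incoming batch
batch = pd.DataFrame({"id": [1, 2, 3], "age": [25, 67, 42]})
results = run_checks(batch)
failures = [name for name, ok in results.items() if not ok]
```

Persisting `results` per batch, with a timestamp, gives the time series needed to spot quality trends rather than treating each run as an isolated event.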

3. Team Behaviours to Encourage

  • Treat data quality issues as first-class engineering problems, not data team problems — every engineer working on an AI system is responsible for understanding and engaging with the quality of its data.
  • Build time for data quality assessment into project estimates from the outset, rather than treating it as an optional activity that gets cut when delivery pressure builds.
  • Document and communicate data quality limitations transparently, including in model documentation, so that downstream users and stakeholders understand the constraints.
  • Celebrate the discovery of data quality issues — finding them before training is a success, not a setback.

4. Watch Out For…

  • Conflating data quantity with data quality — large datasets can contain large amounts of low-quality or biased data, and volume does not compensate for systematic quality problems.
  • Assessing quality only at the aggregate level while missing subgroup-level issues that will manifest as unfair model behaviour for specific user populations.
  • Treating data quality assessment as a one-time pre-training activity rather than an ongoing responsibility across the model lifecycle.
  • Allowing delivery pressure to compress the time allocated to data quality assessment, accepting known quality problems without formally documenting and accepting the associated risks.
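The aggregate-versus-subgroup pitfall above is easy to demonstrate: an acceptable overall missing-value rate can conceal a subgroup for which the field is almost entirely absent. The group and column names in this sketch are hypothetical.

```python
import pandas as pd

def subgroup_missing_rates(df: pd.DataFrame, group_col: str, value_col: str) -> dict:
    """Missing-value rate per subgroup; aggregate averages can hide these gaps."""
    return df.groupby(group_col)[value_col].agg(lambda s: s.isna().mean()).to_dict()

# Toy data: the aggregate missing rate is 2/6 (~0.33), which might pass a
# lenient threshold, but every record in group B is missing the value.
df = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B"],
    "income": [50, 60, 55, 58, None, None],
})
rates = subgroup_missing_rates(df, "group", "income")
```

A model trained on this data would learn nothing reliable about group B's income, and that failure would surface only as unfair behaviour for that population in production.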

5. Signals of Success

  • No model development begins without a completed, reviewed data quality assessment that is retained as part of the model documentation.
  • Data quality issues identified during assessment lead to concrete remediation actions, not just documentation that is filed and forgotten.
  • Teams can articulate the key quality limitations of their training data and the implications for model behaviour and appropriate use.
  • Automated quality checks run continuously on data pipelines, with alerts triggered when quality thresholds are breached.
  • Data quality has improved measurably over time as a result of systematic assessment and remediation practices.

Associated Standards

  • Training data quality is validated before model development begins
  • Bias and fairness assessments are conducted at every model release
  • Model performance is benchmarked against defined baselines before release
