Label Quality Score measures the accuracy and consistency of the labels applied to training data in supervised learning systems. It is typically expressed as the inter-annotator agreement (IAA) rate — the degree to which multiple independent annotators assign the same label to the same data point — alongside error rate estimates from expert audit of a labelled sample. High label quality is a prerequisite for learning a well-calibrated model; label noise is one of the most insidious and under-measured sources of AI system failure.
Label quality problems compound in ways that are difficult to detect. A model trained on 10% mislabelled data will learn to replicate those errors, meaning the model's mistakes reflect its teachers' mistakes. Worse, if evaluation data carries the same label quality issues as training data, standard metrics will overestimate true model quality. By measuring label quality independently of model training, teams create an accountability mechanism for the annotation process that protects model quality upstream.
IAA Rate = (Agreements / Total Annotated Items) × 100
Cohen's Kappa = (Observed Agreement − Expected Agreement) / (1 − Expected Agreement)
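The two formulas above can be computed directly for the common two-annotator case. The sketch below assumes each annotator's labels arrive as a parallel list of items; expected agreement is estimated from each annotator's marginal label distribution, as in the standard Cohen's kappa definition.

```python
from collections import Counter

def iaa_rate(labels_a, labels_b):
    """IAA Rate = (Agreements / Total Annotated Items) x 100."""
    agreements = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * agreements / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence: for each label, the product
    # of the two annotators' marginal probabilities of using that label.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum((counts_a[lab] / n) * (counts_b[lab] / n) for lab in counts_a)
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(round(iaa_rate(a, b), 1))       # 66.7 — 4 of 6 items agree
print(round(cohens_kappa(a, b), 3))   # 0.333 — chance-corrected agreement is much lower
```

Note how the raw IAA rate (66.7%) overstates quality relative to kappa (0.333): with only two balanced classes, annotators agree half the time by chance alone, which is exactly what kappa discounts.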
Optional interpretation guide:
| Metric Range | Interpretation |
|---|---|
| Kappa ≥ 0.80, IAA ≥ 95% | Excellent — labels are highly consistent and reliable |
| Kappa 0.60–0.79, IAA 85–94% | Good — some ambiguity exists; review guidelines for edge cases |
| Kappa 0.40–0.59, IAA 70–84% | Needs improvement — significant annotator disagreement; labelling guidelines require revision |
| Kappa < 0.40 or IAA < 70% | Unacceptable — dataset is not suitable for supervised learning without remediation |
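The interpretation bands in the table combine two thresholds. A small helper (hypothetical, written here to match the table exactly) makes the joint condition explicit: a dataset must clear both the kappa and the IAA threshold to reach a band, otherwise it falls through to the next one.

```python
def label_quality_band(kappa: float, iaa_pct: float) -> str:
    """Map Cohen's kappa and IAA rate (%) to the interpretation bands
    from the table above. Both thresholds must be met for a band."""
    if kappa >= 0.80 and iaa_pct >= 95:
        return "excellent"
    if kappa >= 0.60 and iaa_pct >= 85:
        return "good"
    if kappa >= 0.40 and iaa_pct >= 70:
        return "needs improvement"
    return "unacceptable"

print(label_quality_band(0.85, 96))  # excellent
print(label_quality_band(0.85, 90))  # good — high kappa but IAA below 95%
```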
**Label noise directly caps the maximum achievable model accuracy.** Theoretical work by Frénay and Verleysen demonstrates that irreducible error floors exist for models trained on noisy labels, meaning no amount of architectural improvement can overcome poor labelling quality.

**Inconsistent labels are a hidden source of demographic bias.** If annotators apply labels less consistently for content from certain demographic groups — a documented phenomenon in content moderation and sentiment analysis — the resulting model will encode that inconsistency as a learned bias.

**Label quality problems are invisible in standard evaluation metrics.** When evaluation data has the same label quality issues as training data, accuracy metrics look healthy even as the model learns to replicate systematic mislabelling. Independent label quality measurement breaks this false reassurance.

**Early detection enables cheap remediation.** Identifying label quality problems before training is inexpensive. Discovering them after a model is deployed — through poor performance, bias audit findings, or user complaints — is significantly more costly.
**Frénay & Verleysen — Classification in the Presence of Label Noise: A Survey** (IEEE Transactions on Neural Networks and Learning Systems, 2014). This comprehensive survey formalises the relationship between label noise and model performance bounds, providing empirical evidence that even moderate noise rates (10–20%) produce substantial accuracy degradation across model families.

**Northcutt et al. — Confident Learning: Estimating Uncertainty in Dataset Labels** (JAIR, 2021). The Cleanlab paper introduces practical algorithms for estimating label error rates in large datasets without requiring full double-annotation, providing accessible tooling for teams that cannot annotate every item twice.
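The core thresholding idea behind confident learning can be sketched without the Cleanlab library: flag items whose model self-confidence in their assigned label falls below the per-class average self-confidence. This is a simplified illustration of the paper's approach, not the full algorithm; `pred_probs` is assumed to come from cross-validated (out-of-sample) model predictions.

```python
def flag_suspect_labels(pred_probs, given_labels):
    """Flag likely label errors via a simplified confident-learning
    heuristic: an item is suspect when the model's probability for its
    assigned label is below that class's average self-confidence.

    pred_probs: per-item lists of class probabilities (out-of-sample).
    given_labels: the class index each item was annotated with.
    Returns the indices of suspect items.
    """
    # Per-class average self-confidence serves as the class threshold.
    sums, counts = {}, {}
    for probs, label in zip(pred_probs, given_labels):
        sums[label] = sums.get(label, 0.0) + probs[label]
        counts[label] = counts.get(label, 0) + 1
    thresholds = {lab: sums[lab] / counts[lab] for lab in sums}
    return [i for i, (probs, label) in enumerate(zip(pred_probs, given_labels))
            if probs[label] < thresholds[label]]

probs = [[0.9, 0.1], [0.2, 0.8], [0.85, 0.15], [0.3, 0.7]]
labels = [0, 1, 0, 0]
print(flag_suspect_labels(probs, labels))  # [3] — item 3's label looks wrong
```

Item 3 is annotated as class 0, yet the model assigns it only 0.3 probability for that class, well below the class-0 average of roughly 0.68, so it is queued for expert review. For production use, the Cleanlab library implements the full algorithm with calibrated thresholds.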