Practice: Data Labelling and Annotation
Purpose and Strategic Importance
For supervised learning systems, the quality of labels is the quality of the model. Ambiguous labelling guidelines, inconsistent annotator behaviour, and unexamined annotator bias directly shape what the model learns to do. A model trained on labels that reflect the perspectives or biases of a narrow annotator group will encode those biases — and because the labels appear "objective" in the dataset, these biases can be invisible until the model causes harm in production.
High-quality data labelling is also expensive and time-consuming, making it a strategic investment that deserves rigorous engineering. Teams that invest in clear annotation guidelines, annotator training, inter-annotator agreement measurement, and quality assurance processes produce more reliable labels with less rework, and build datasets that retain their value across multiple model generations.
Description of the Practice
- Develops detailed annotation guidelines that cover edge cases, disambiguation rules, and examples of correctly and incorrectly labelled instances for every label class.
- Trains annotators on guidelines and measures inter-annotator agreement (e.g., Cohen's Kappa) before production labelling begins, iterating on guidelines until agreement is satisfactory.
- Implements quality assurance processes including gold standard sets, random sampling audits, and calibration sessions to monitor label quality throughout the labelling process.
- Manages annotator diversity deliberately, considering whether the annotator population reflects the diversity of the users and contexts for which the model will be deployed.
- Versions labelled datasets with full documentation of annotator composition, agreement metrics, and any known labelling issues or systematic disagreements.
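To make the agreement measurement above concrete, here is a minimal sketch of Cohen's kappa for two annotators labelling the same items. The labels and annotator data are illustrative, and real projects would typically use a library implementation:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labelled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Two annotators labelling the same ten items (illustrative data):
a = ["spam", "ham", "spam", "spam", "ham", "ham", "spam", "ham", "spam", "ham"]
b = ["spam", "ham", "spam", "ham",  "ham", "ham", "spam", "ham", "spam", "spam"]
print(round(cohens_kappa(a, b), 3))  # 0.6: raw agreement is 0.8, chance is 0.5
```

Note that kappa can be low even when raw agreement looks high, which is exactly why a chance-corrected metric is preferred over simple percent agreement.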
How to Practise It (Playbook)
1. Getting Started
- Before writing annotation guidelines, analyse a sample of the raw data to understand the edge cases and ambiguities that guidelines will need to address — guidelines written without this analysis miss the hard cases.
- Run a pilot labelling exercise with a small sample, measure inter-annotator agreement, and use disagreements as input to refine guidelines before scaling labelling effort.
- Define the minimum acceptable inter-annotator agreement threshold for your use case and do not proceed to production labelling until it is met.
- Establish a process for escalating genuinely ambiguous instances that cannot be resolved by guidelines alone, ensuring these decisions are made consistently and documented.
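The pilot-and-gate steps above can be sketched as a small check that blocks production labelling until the pilot meets a threshold and surfaces disagreement items for guideline refinement. The threshold, annotator names, and labels are assumptions; raw percent agreement is used for brevity, though a chance-corrected metric such as Cohen's kappa is preferable in practice:

```python
MIN_AGREEMENT = 0.8  # assumed threshold; define per use case before the pilot

def pilot_gate(annotations, threshold=MIN_AGREEMENT):
    """annotations: {annotator_name: [label per pilot item]}.
    Returns (proceed_to_production, item indices to discuss in guideline review).
    """
    names = list(annotations)
    n_items = len(annotations[names[0]])
    # Items without a unanimous label feed back into guideline refinement.
    disagreements = [i for i in range(n_items)
                     if len({annotations[a][i] for a in names}) > 1]
    agreement = 1 - len(disagreements) / n_items
    return agreement >= threshold, disagreements

# Illustrative pilot with three annotators over five items:
pilot = {
    "ann_1": ["pos", "neg", "pos", "neg", "pos"],
    "ann_2": ["pos", "neg", "neg", "neg", "pos"],
    "ann_3": ["pos", "neg", "pos", "neg", "pos"],
}
proceed, review_items = pilot_gate(pilot)
print(proceed, review_items)  # True [2] — item 2 goes back into guideline review
```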
2. Scaling and Maturing
- Build annotation workflows into tooling that tracks per-annotator quality metrics, enabling early identification of annotators who may need additional training or calibration.
- Implement active learning techniques to prioritise labelling effort on the instances most likely to improve model performance, reducing the volume of labels required.
- Develop processes for managing labelling at scale — whether through internal annotator teams or external labelling services — that maintain consistent quality standards regardless of volume.
- Conduct regular retrospectives on labelling quality, using model evaluation results to identify whether systematic labelling errors are contributing to model weaknesses.
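The active learning point above can be illustrated with uncertainty sampling: given a model's predicted class probabilities, prioritise the unlabelled items the model is least sure about. The pool and the toy `predict` function are purely illustrative stand-ins for a real model:

```python
import math

def entropy(probs):
    """Predictive entropy: higher means the model is less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labelling(pool, predict_proba, budget):
    """Uncertainty sampling: spend the labelling budget on the items
    the current model is least confident about."""
    ranked = sorted(pool, key=lambda x: entropy(predict_proba(x)), reverse=True)
    return ranked[:budget]

# Toy setup: each pool item is just the model's assumed P(class 1).
pool = [0.1, 0.45, 0.9, 0.5, 0.75]
predict = lambda x: (x, 1 - x)
print(select_for_labelling(pool, predict, 2))  # [0.5, 0.45] — the most uncertain
```

In practice the scoring function would call the current model generation, and the selected items would be routed into the annotation workflow rather than printed.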
3. Team Behaviours to Encourage
- Treat annotation guideline development as a collaborative process between data scientists, domain experts, and annotators — the people doing the labelling often have the deepest insight into edge cases.
- Build feedback loops between annotators and model developers so that labelling issues identified by annotators inform model development, and model failures inform annotation review.
- Document systematic disagreements rather than resolving them by majority vote — genuine ambiguity in labels is information about the problem that may be important for model design or use case scoping.
- Consider the welfare and working conditions of annotators — particularly for annotation tasks that involve exposure to harmful content — as an ethical responsibility, not just a quality management concern.
4. Watch Out For…
- Treating annotation guidelines as final once written, rather than living documents that are updated as edge cases emerge and annotator feedback is incorporated.
- Measuring only overall agreement metrics while missing systematic biases in labelling patterns across demographic subgroups of the data.
- Outsourcing labelling without adequate investment in annotator training, quality assurance, and cultural context, particularly for tasks that require nuanced cultural or domain knowledge.
- Failing to document the limitations and known issues of a labelled dataset, leaving downstream model developers and users unaware of the constraints on what it can support.
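The second pitfall above — overall agreement masking subgroup problems — can be checked with a per-subgroup breakdown. A minimal sketch, with illustrative subgroup names and raw percent agreement standing in for a fuller agreement metric:

```python
from collections import Counter

def agreement_by_subgroup(items):
    """items: list of (subgroup, label_a, label_b) tuples.
    Returns percent agreement per subgroup, to surface gaps that an
    overall agreement figure can hide."""
    hits, totals = Counter(), Counter()
    for group, a, b in items:
        totals[group] += 1
        hits[group] += (a == b)
    return {g: hits[g] / totals[g] for g in totals}

# Illustrative data: overall agreement is 0.75, which looks acceptable...
data = [
    ("group_x", "pos", "pos"), ("group_x", "neg", "neg"),
    ("group_x", "pos", "pos"), ("group_x", "neg", "neg"),
    ("group_y", "pos", "neg"), ("group_y", "neg", "neg"),
    ("group_y", "pos", "pos"), ("group_y", "neg", "pos"),
]
print(agreement_by_subgroup(data))  # ...but group_y sits at 0.5 vs group_x at 1.0
```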
5. Signals of Success
- Inter-annotator agreement metrics are tracked and reported for all labelled datasets, and meet defined thresholds before labelling is accepted as complete.
- Annotation guidelines are versioned alongside datasets and are detailed enough for a new annotator to reach acceptable agreement without requiring individual coaching.
- Labelling quality is validated through periodic audits, with results fed back into annotator calibration and guideline updates.
- Systematic labelling biases — particularly across demographic subgroups — are identified and documented before model training, informing decisions about mitigation.
- Every labelled dataset includes documentation of annotator composition, agreement statistics, and known limitations, retained as part of the model's documentation.
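The documentation signal above can be made mechanical by versioning a small documentation record alongside each labelled dataset. A sketch using a plain dataclass serialised to JSON; the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class LabelledDatasetCard:
    """Minimal documentation record stored next to a labelled dataset.
    Field names are an assumed, illustrative schema."""
    dataset_version: str
    guideline_version: str
    annotator_count: int
    annotator_composition: dict          # e.g. region/language breakdown
    mean_pairwise_kappa: float
    known_limitations: list = field(default_factory=list)

card = LabelledDatasetCard(
    dataset_version="2.1.0",
    guideline_version="1.4",
    annotator_count=12,
    annotator_composition={"en-GB": 7, "en-IN": 5},
    mean_pairwise_kappa=0.78,
    known_limitations=["systematic disagreement on the sarcasm class"],
)
# Serialise and commit next to the dataset files so it is versioned with them.
print(json.dumps(asdict(card), indent=2))
```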