Practice: Data Versioning and Lineage
Purpose and Strategic Importance
Reproducibility in AI is only possible when the data used to train a model is as carefully versioned and documented as the code that trained it. Without data versioning, teams cannot reproduce past experiments, cannot explain why a model's behaviour changed between releases, and cannot trace a model's outputs back to the data that shaped them. This lack of traceability is not just an engineering inconvenience — it is a governance and accountability failure that makes it impossible to investigate incidents, demonstrate regulatory compliance, or build genuine trust in AI systems.
Data lineage extends versioning to capture the full provenance story: where data came from, what transformations were applied, and how it flowed through the pipeline to become training input. This is essential for understanding and communicating the assumptions embedded in a model, and for identifying the upstream changes that could affect model behaviour without any change to model code.
Description of the Practice
- Versions all training datasets and feature stores with immutable identifiers, ensuring that every model training run can be associated with an exact, reproducible dataset snapshot.
- Documents data lineage — the complete chain from raw source to processed training data — including all transformations, filtering steps, and integration points.
- Stores dataset metadata alongside versions, capturing quality metrics, collection dates, source descriptions, and any known limitations.
- Integrates data versioning with experiment tracking so that every recorded experiment links to the exact dataset version used.
- Enables teams to reconstruct any past training run, including its data inputs, as a precondition for debugging, audit, and compliance.
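The immutable-identifier idea above can be sketched with nothing beyond the Python standard library: derive the version id from the data's content hash, and keep the required metadata in a record keyed by that id. The file layout and metadata fields here are illustrative, not taken from any particular tool:

```python
import hashlib
import json
from datetime import date
from pathlib import Path

def dataset_version_id(path: Path) -> str:
    """Derive an immutable version identifier from file contents.

    The same bytes always yield the same id, so the id can be recorded
    in experiment logs and used to detect silent data changes.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()[:16]}"

def write_metadata(path: Path, version_id: str, source: str, limitations: str) -> Path:
    """Store dataset metadata alongside the data, keyed by version id."""
    meta = {
        "version": version_id,
        "source": source,
        "collected": date.today().isoformat(),
        "limitations": limitations,
    }
    meta_path = path.with_suffix(".meta.json")
    meta_path.write_text(json.dumps(meta, indent=2))
    return meta_path
```

Tools such as DVC or lakeFS handle this bookkeeping (plus storage, remotes, and deduplication) for you; the sketch only shows why a content-derived id makes snapshots reproducible and silent changes detectable.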
How to Practise It (Playbook)
1. Getting Started
- Adopt a data versioning tool — DVC, Delta Lake, lakeFS, or similar — and integrate it with your existing version control and experiment tracking workflows.
- Start by versioning your most important training datasets and documenting their lineage, even if manually at first, to establish the habit and demonstrate its value.
- Create a data versioning convention that specifies how versions are tagged, what metadata is required, and how lineage is recorded across the team.
- Establish a policy that every model training run records the dataset version used, making this non-negotiable from the outset.
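The "non-negotiable" policy in the last step is easier to uphold when it is enforced by tooling rather than convention. A minimal sketch, assuming a hypothetical append-only JSON-lines run log (any experiment tracker offers an equivalent record):

```python
import json
from pathlib import Path

RUN_LOG = Path("runs.jsonl")  # hypothetical append-only log of training runs

def record_training_run(run_id: str, dataset_version: str, config: dict) -> None:
    """Enforce the policy: a run cannot be recorded without a dataset version."""
    if not dataset_version:
        raise ValueError(f"run {run_id!r} has no dataset version; refusing to record")
    entry = {"run_id": run_id, "dataset_version": dataset_version, "config": config}
    with open(RUN_LOG, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

Because the check lives in the code path every run must pass through, the dataset link is captured even when someone is in a hurry.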
2. Scaling and Maturing
- Automate lineage capture by instrumenting data pipelines to record transformation metadata as data flows through them, reducing manual documentation overhead.
- Build data lineage visualisation into your MLOps tooling so that teams can trace any model output back to its data inputs through a navigable graph.
- Extend versioning to cover feature stores, ensuring that features used in training and inference are versioned and the mapping between them is explicit.
- Implement data version pinning in training configurations so that pipeline reruns use the intended dataset version rather than defaulting to the latest.
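Version pinning from the last point can likewise be a hard check at the training entry point. A minimal sketch, assuming a hypothetical config shape with a `dataset_version` field:

```python
def resolve_dataset_version(config: dict) -> str:
    """Reject configs that float on 'latest'; require an explicit pinned version."""
    version = config.get("dataset_version")
    if version in (None, "", "latest"):
        raise ValueError(
            "training config must pin an explicit dataset version "
            f"(got {version!r}); 'latest' is not reproducible"
        )
    return version
```

A rerun then either uses exactly the intended snapshot or fails loudly, rather than silently training on whatever the newest data happens to be.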
3. Team Behaviours to Encourage
- Treat data versioning as non-negotiable from the start of every AI project — retrofitting it to existing systems is far more expensive than building it in from the beginning.
- Document lineage in enough detail to explain the data story to a non-technical stakeholder — if you cannot explain where the data came from, you cannot defend the model it trained.
- Review lineage documentation as part of model review and deployment approval, giving it the same status as code review.
- Use data versioning actively for debugging — when a model's behaviour is unexpected, the first question should be "what data did it train on?" and you should be able to answer it instantly.
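Answering "what data did it train on?" instantly becomes mechanical once lineage is recorded as a graph from outputs back to inputs. A minimal sketch with an illustrative in-memory lineage table (a real system would read this from pipeline metadata; the dataset names and transformation labels are hypothetical):

```python
# Maps each derived dataset id to (transformation applied, input dataset ids).
LINEAGE = {
    "train:v3": ("filter_nulls + dedupe", ["merged:v5"]),
    "merged:v5": ("join on user_id", ["events:v9", "profiles:v2"]),
}

def trace_to_sources(dataset_id: str, lineage: dict) -> list[str]:
    """Walk the lineage graph upstream from a dataset to its raw sources."""
    record = lineage.get(dataset_id)
    if record is None:
        return [dataset_id]  # no recorded parent: this is a raw source
    _, inputs = record
    sources: list[str] = []
    for parent in inputs:
        sources.extend(trace_to_sources(parent, lineage))
    return sources
```

Starting from a training set id recorded with the run, the walk yields the raw sources that shaped the model, which is exactly where an incident investigation needs to begin.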
4. Watch Out For…
- Versioning datasets but not feature transformations, creating a gap in lineage that makes it impossible to fully reproduce training.
- Storing lineage documentation outside the version control system in ways that become disconnected from actual data versions over time.
- Teams treating data versioning as an MLOps team responsibility rather than a shared engineering practice, leading to inconsistent adoption.
- Lineage documentation that captures what was done but not why — missing the rationale for key decisions about filtering, sampling, and transformation.
5. Signals of Success
- Any model training run can be reproduced exactly by checking out the recorded data version and running the recorded training configuration.
- Lineage documentation for all production models is complete, current, and accessible to anyone who needs to review or audit a system.
- When a model incident occurs, teams can trace unexpected behaviour back to its data root cause within hours rather than days.
- Data versioning is embedded in the team's standard workflow — it requires no special effort because it is built into the tooling and conventions.
- When regulators or auditors ask "where did this model's training data come from?", they receive a clear, complete, documented answer.