Practice: Model Reproducibility Standards
Purpose and Strategic Importance
Reproducibility is the bedrock of scientific and engineering rigour in AI. A model whose training process cannot be reproduced is a black box not just to users but to the team that built it. This matters practically — when a production model fails, you need to be able to recreate it; when a model needs to be retrained, you need to start from a known-good configuration. It also matters ethically and legally: demonstrating that a model behaves as intended requires being able to recreate and verify the process that produced it.
The ML reproducibility crisis — where published research results cannot be independently replicated — has a direct counterpart in production AI teams. Models that depend on undocumented random seeds, unversioned training data, specific hardware configurations, or the institutional knowledge of a particular engineer are fragile assets. Reproducibility standards address this fragility systematically, ensuring that AI systems remain auditable, maintainable, and resilient to team and infrastructure changes.
Description of the Practice
- Records all factors that affect model training output — code version, data version, random seeds, hardware configuration, framework versions, and hyperparameters — in a reproducible training specification.
- Manages dependencies through containerisation and dependency pinning, ensuring training environments can be recreated exactly on demand.
- Seeds all sources of randomness explicitly and consistently, enabling deterministic or near-deterministic training runs.
- Validates reproducibility by periodically rerunning training specifications and comparing outputs against baseline metrics, detecting silent drift in the training pipeline.
- Maintains training specifications alongside model artefacts in version control, so that every deployed model has an associated reproducible training recipe.
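The factors listed above can be captured in a single version-controlled record. Below is a minimal sketch in Python, assuming a simple JSON-serialised spec stored alongside the model artefact; the field names are illustrative, not a standard schema:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class TrainingSpec:
    """Everything needed to recreate a training run (illustrative fields)."""
    code_version: str            # e.g. a git commit SHA
    data_version: str            # e.g. a dataset snapshot tag
    random_seed: int
    framework_versions: dict     # {"python": "3.11", "torch": "2.3.0", ...}
    hardware: str                # e.g. "1x A100 80GB"
    hyperparameters: dict = field(default_factory=dict)

    def to_json(self) -> str:
        # Sort keys so the serialised spec is byte-stable and diff-friendly.
        return json.dumps(asdict(self), sort_keys=True, indent=2)

spec = TrainingSpec(
    code_version="3f2a9c1",
    data_version="dataset-v4.2",
    random_seed=42,
    framework_versions={"python": "3.11", "torch": "2.3.0"},
    hardware="1x A100 80GB",
    hyperparameters={"lr": 3e-4, "batch_size": 64, "epochs": 10},
)
print(spec.to_json())
```

Committing this record next to the model artefact gives every deployed model the "reproducible training recipe" the practice calls for.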
How to Practise It (Playbook)
1. Getting Started
- Audit your current most important model to identify all factors that would need to be known to recreate it — code, data, configuration, environment — and document any that are currently missing or informal.
- Containerise training environments using Docker or similar, capturing framework versions, library dependencies, and system configuration in a version-controlled definition.
- Implement explicit random seed management in all training code, ensuring seeds are recorded in experiment logs and can be set consistently for reproduction runs.
- Establish a reproducibility standard that defines what "reproducible" means for your organisation — exact bitwise reproducibility, statistical reproducibility within defined bounds, or functional equivalence — appropriate to your model types and hardware.
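Explicit seed management is often centralised in a single helper called at the start of every run. A sketch, assuming a Python training stack; the numpy and torch calls are attempted only if those libraries are installed, since not every environment carries both:

```python
import os
import random

def seed_everything(seed: int) -> int:
    """Seed the common sources of randomness (sketch; extend per framework)."""
    # Hash randomisation only takes full effect at interpreter start,
    # so this is best set in the launch environment as well.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass
    return seed  # record the returned seed in the experiment log

seed_everything(42)
a = [random.random() for _ in range(3)]
seed_everything(42)
b = [random.random() for _ in range(3)]
print(a == b)  # same seed, same sequence
```

Note that on GPU, seeding alone may not be enough for bitwise determinism; frameworks such as PyTorch additionally offer switches like `torch.use_deterministic_algorithms(True)`, at some performance cost.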
2. Scaling and Maturing
- Build automated reproducibility validation into your CI pipeline, periodically reproducing selected model training runs and comparing results against stored baselines to detect pipeline drift.
- Implement model training as fully parameterised pipeline code, with all configuration externalised to version-controlled configuration files rather than hardcoded in training scripts.
- Develop runbooks for model reproduction that guide engineers through the process of recreating any production model from first principles, testing these runbooks regularly.
- Extend reproducibility standards to cover inference environments as well as training, ensuring that serving infrastructure can be recreated as reliably as the models it runs.
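Externalising configuration can be as simple as making the training entry point take nothing but a path to a version-controlled config file. A minimal sketch, assuming a JSON config; the file name, keys, and `run_training` function are hypothetical:

```python
import json
import tempfile

def run_training(config_path: str) -> dict:
    """Hypothetical training entry point: every knob comes from the
    config file, nothing is hardcoded in the script itself."""
    with open(config_path) as f:
        cfg = json.load(f)
    # ... real training would run here, driven entirely by cfg ...
    return {"lr": cfg["lr"], "epochs": cfg["epochs"], "seed": cfg["seed"]}

# Simulate a version-controlled config file.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"lr": 3e-4, "epochs": 10, "seed": 42}, f)
    config_path = f.name

result = run_training(config_path)
print(result)
```

Because the script reads only the config file, reproducing a run reduces to checking out the right commit of the config and pipeline code.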
3. Team Behaviours to Encourage
- Treat any training run that cannot be reproduced from its logged specification as a defect — investigate and resolve the gap rather than accepting it as normal ML variability.
- Include reproducibility verification as part of model review, requiring teams to demonstrate that a candidate model can be recreated from its training specification before deployment approval.
- Maintain the "last known good" training specification for every production model actively, not just as a historical record — this is what the team will rely on when they need to retrain urgently.
- Document non-determinism explicitly when it is unavoidable — for example, in distributed training — explaining its expected bounds and why it does not compromise the model's intended behaviour.
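Documented non-determinism bounds can be enforced mechanically rather than left as prose. A sketch of a baseline comparison that tolerates a declared relative deviation; the metric names and the 1% tolerance are illustrative values, not a recommendation:

```python
def within_documented_bounds(baseline: dict, reproduced: dict,
                             rel_tol: float = 0.01) -> bool:
    """Return True if every reproduced metric is within the declared
    relative tolerance of its baseline value (1% here, illustrative)."""
    for name, base in baseline.items():
        if abs(reproduced[name] - base) > rel_tol * abs(base):
            return False
    return True

baseline   = {"accuracy": 0.912, "auc": 0.957}
reproduced = {"accuracy": 0.910, "auc": 0.955}  # small drift, e.g. from distributed training
print(within_documented_bounds(baseline, reproduced))
```

A check like this can back both the periodic reproducibility validation in CI and the documented bounds for unavoidable non-determinism: a deviation outside the declared tolerance becomes a defect to investigate, not normal variability.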
4. Watch Out For…
- Conflating code reproducibility (using the same code) with full training reproducibility (producing the same model) — the latter also requires data versioning, environment specification, and seed management.
- Hardware-specific dependencies that make models reproducible only on particular GPU models or cloud providers, creating operational fragility.
- Reproducibility standards that are documented but not tested, providing false assurance until the moment when reproduction is actually needed under pressure.
- Letting reproducibility slide for "experimental" models that subsequently get promoted to production without the reproducibility standards being applied retrospectively.
5. Signals of Success
- Any production model can be reproduced from its training specification by any engineer on the team, without relying on the knowledge of the original author.
- Reproducibility is validated periodically through automated re-runs, with deviations from baseline metrics triggering investigation.
- The time required to reproduce a production model from scratch is measured and tracked, with a clear target that reflects the organisation's operational resilience requirements.
- Model audits or incident investigations can be completed with full access to reproducible training artefacts, without gaps in the documentation chain.
- Engineers joining the team can reproduce existing production models within their first week, using documentation alone.