Standard: Experiment-to-Production Cycle Time

Description

Experiment-to-Production Cycle Time measures the total elapsed time from when an AI experiment hypothesis is formally initiated to when the resulting model is receiving production traffic and generating real-world feedback. Unlike Model Deployment Lead Time, which measures only the pipeline phase, this metric captures the full end-to-end journey — including experiment design, data preparation, training runs, evaluation, stakeholder approval, and deployment.

This is the most holistic measure of AI delivery velocity. It answers the question that ultimately matters for business impact: how long does it take to go from an idea about how AI can help, to a validated, deployed solution that real users are experiencing? Long cycle times accumulate opportunity cost, increase the risk of building the wrong thing, and prevent the organisation from learning quickly enough to course-correct.

How to Use

What to Measure

  • Total elapsed time from experiment ticket creation (or sprint start) to first production traffic
  • Breakdown by phase: experiment design, data preparation, training, evaluation, stakeholder review, deployment
  • Percentage of experiments completing within one sprint (two weeks) vs two sprints vs longer
  • Ratio of experiments that reach production vs experiments that are abandoned or deprioritised after initiation
  • Cycle time trend over rolling quarters
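As a sketch of how these measures might be computed from a team's tracking data — assuming a minimal, hypothetical experiment record with `initiated` and `deployed` timestamps (the field names are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

SPRINT = timedelta(weeks=2)

# Hypothetical experiment record pulled from the project tracking system.
@dataclass
class Experiment:
    initiated: datetime
    deployed: Optional[datetime]  # None if abandoned or still in flight

def delivery_stats(experiments: list[Experiment]) -> dict:
    """Summarise sprint-completion counts and the reached-production ratio."""
    completed = [e for e in experiments if e.deployed is not None]
    cycle_times = [e.deployed - e.initiated for e in completed]
    return {
        "within_1_sprint": sum(t <= SPRINT for t in cycle_times),
        "within_2_sprints": sum(SPRINT < t <= 2 * SPRINT for t in cycle_times),
        "longer": sum(t > 2 * SPRINT for t in cycle_times),
        "reached_production_ratio": len(completed) / len(experiments),
    }
```

Counting abandoned experiments in the denominator of the production ratio is deliberate; see the survivorship-bias pitfall below.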

Formula

Experiment-to-Production Cycle Time = Production Deployment Timestamp − Experiment Initiation Timestamp

Optional:

  • Phase contribution: time spent in each phase as a percentage of total cycle time
  • Cycle time efficiency: Active Working Time / Total Elapsed Time — low values indicate queue time and waiting
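A worked example of the formula and the two optional measures, using made-up timestamps and phase durations purely for illustration:

```python
from datetime import datetime, timedelta

# Illustrative timestamps; in practice these come from the tracking system.
initiated = datetime(2024, 3, 4, 9, 0)    # experiment initiation
deployed = datetime(2024, 3, 22, 16, 30)  # first production traffic

# Experiment-to-Production Cycle Time = deployment timestamp - initiation timestamp
cycle_time = deployed - initiated

# Phase contribution: each phase's share of total elapsed time.
phases = {
    "experiment design":  timedelta(days=2),
    "data preparation":   timedelta(days=4),
    "training":           timedelta(days=3),
    "evaluation":         timedelta(days=2),
    "stakeholder review": timedelta(days=5, hours=7, minutes=30),
    "deployment":         timedelta(days=2),
}
contribution = {name: d / cycle_time for name, d in phases.items()}

# Cycle time efficiency: active working time / total elapsed time.
active = timedelta(days=9)  # hands-on time logged against the experiment
efficiency = active / cycle_time  # low values indicate queue time and waiting
```

Dividing one `timedelta` by another yields a plain float, which makes both ratios one-liners.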

Instrumentation Tips

  • Create a standard experiment ticket template in the team's project tracking system with defined start and completion events
  • Use the experiment tracking system (MLflow, Weights & Biases) to automate capture of training and evaluation timestamps
  • Track stakeholder review wait time separately from technical execution time to identify organisational friction
  • Review cycle time distributions — not just averages — to identify whether a small number of long-running experiments are inflating the mean
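To illustrate the last tip — examining distributions rather than averages — a small sketch using only the standard library, with made-up cycle times in days:

```python
import statistics

# Illustrative cycle times (days) for ten completed experiments.
cycle_days = [9, 11, 12, 13, 14, 15, 16, 18, 21, 63]

mean = statistics.mean(cycle_days)      # pulled up by the one outlier
median = statistics.median(cycle_days)  # the typical experiment
p90 = statistics.quantiles(cycle_days, n=10)[8]  # 90th percentile

# A mean well above the median suggests a few long-running experiments
# are inflating the average; investigate the tail, not the mean.
skewed = mean > 1.2 * median
```

Here the mean (19.2 days) sits well above the median (14.5 days), so the team should look at the single 63-day experiment rather than conclude that typical cycles are slow.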

Benchmarks

  • < 2 weeks (1 sprint): Excellent — team is operating with true agility; fast learning cycles
  • 2–4 weeks (1–2 sprints): Good — reasonable velocity for most AI work; watch for creep
  • 4–8 weeks: Needs improvement — experiment scope may be too large or organisational friction is high
  • > 8 weeks: Problematic — cycle time is too long for effective learning; redesign the approach to AI delivery
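The bands can be captured as a small helper for dashboards. This is a sketch; as an assumption, the shared 4-week edge is assigned to the better band:

```python
def benchmark_band(cycle_time_weeks: float) -> str:
    """Map a cycle time in weeks onto the benchmark bands."""
    if cycle_time_weeks < 2:
        return "Excellent"
    if cycle_time_weeks <= 4:
        return "Good"
    if cycle_time_weeks <= 8:
        return "Needs improvement"
    return "Problematic"
```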

Why It Matters

  • Cycle time is the rate-limiting factor on AI learning velocity. An organisation that can complete experiment-to-production cycles in two weeks learns six times faster than one taking twelve weeks. Over a year, this compounds into a decisive competitive advantage.

  • Long cycle times increase the cost of being wrong. An experiment that takes eight weeks to reach production has consumed significant investment before the team knows whether the approach works. Sprint-scale cycles mean wrong directions are discovered and abandoned cheaply.

  • Cycle time reveals where organisational friction lives. Detailed phase breakdowns often reveal that technical execution is fast but stakeholder approval or compliance review takes weeks. This points to process redesign opportunities that are often more valuable than technical optimisations.

  • Short cycles enable user-driven iteration. When each cycle takes two weeks, the team can iterate based on production feedback four times in two months. When cycles take eight weeks, the team is locked into a direction for months before real-world learning can inform a change.

Best Practices

  • Apply timeboxing discipline: define the maximum experiment duration upfront and enforce it rather than allowing experiments to expand indefinitely
  • Use the "walking skeleton" pattern — deploy the simplest possible version of a model to production quickly, then iterate, rather than attempting to build the perfect model before any deployment
  • Run stakeholder alignment in parallel with technical work rather than sequentially to eliminate approval wait time from the critical path
  • Maintain a single, agreed "definition of done" for experiments that is understood by data scientists, product managers, and engineering leads
  • Review cycle time quarterly in the AI community of practice to share learning across teams

Common Pitfalls

  • Measuring only successful experiments — abandoned experiments that consumed significant time should be counted to avoid survivorship bias
  • Not capturing the time spent waiting for data access, labelling, or infrastructure provisioning as part of cycle time, masking systemic blockers
  • Conflating experiment scope with cycle time — a longer experiment is not inherently slower if the scope justifies the investment
  • Optimising for cycle time at the expense of experiment quality, producing fast but unreliable production deployments

Signals of Success

  • The majority of completed experiments reach production within two sprints
  • The team can articulate where cycle time is spent and has a specific plan to reduce the longest phase
  • No experiment has been in-flight for more than eight weeks without a deliberate scope justification documented and reviewed by the team lead
  • Cycle time has decreased meaningfully in the past six months as a result of deliberate process improvement

Related Measures

  • [[Model Deployment Lead Time]]
  • [[ML Pipeline Reliability Score]]
  • [[AI-Attributed Outcome Achievement Rate]]

Aligned Industry Research

  • Ries — The Lean Startup (Crown Business, 2011). The Build-Measure-Learn loop that underpins Lean Startup methodology is directly applicable to AI experimentation. Ries argues that organisations that minimise cycle time outperform those optimising for the quality of individual cycles, as the learning rate more than compensates for the lower quality of any single iteration.

  • Humble & Farley — Continuous Delivery (Addison-Wesley, 2010). The foundational arguments for short feedback cycles in software delivery apply with equal force to AI systems. The authors' demonstration that long cycle times are primarily caused by large batch sizes rather than individual task complexity provides a useful diagnostic framework for AI teams.
