Standard: AI experiments are designed to produce learning within sprint-scale timeframes
Purpose and Strategic Importance
This standard requires that AI experiments be scoped, designed, and resourced so that they yield actionable learning — a confirmed hypothesis, a rejected assumption, or a clear next step — within a sprint-scale timeframe, typically two weeks. It supports the policy of prototyping and validating before building at scale by preventing open-ended research phases that consume resources without producing decisions. Sprint-scale experiments create the cadence needed for rapid learning and confident investment at scale.
Strategic Impact
- Creates a predictable learning cadence that aligns AI discovery work with product and engineering sprint rhythms
- Prevents prolonged, inconclusive experiments from blocking investment decisions and slowing roadmap progress
- Forces clarity about what a given experiment is trying to learn, reducing the frequency of experiments that answer the wrong question
- Builds a culture of structured hypothesis testing that improves the quality of AI investment decisions over time
- Enables faster pivoting when experiments reveal that an assumed approach is not viable, preserving budget for better-suited alternatives
Risks of Not Having This Standard
- AI experiments run for months without producing a clear decision, consuming budget and blocking downstream work
- Teams pursue interesting technical problems rather than the learning needed to progress the product roadmap
- Investment decisions are delayed because there is always "one more experiment" needed before a conclusion can be drawn
- Experiment results are inconclusive because the scope was too broad, the hypothesis too vague, or the timeframe too short to generate signal
- Engineering teams become frustrated when experimental work never converges to a deployable outcome
CMMI Maturity Model
Level 1 – Initial
| Category | Description |
| --- | --- |
| People & Culture | Experiments are open-ended; "we'll know it when we see it" is the implicit success criterion |
| Process & Governance | No experiment design standard; researchers and engineers define their own timelines informally |
| Technology & Tools | Experiment tracking is ad hoc; results are stored in personal notebooks and not shared systematically |
| Measurement & Metrics | No measurement of experiment duration or learning yield; experiments are assessed retrospectively, if at all |
Level 2 – Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams define a hypothesis and expected outcome before starting each experiment |
| Process & Governance | A simple experiment brief template captures hypothesis, success criteria, data requirements, and timeframe (see the sketch after this table) |
| Technology & Tools | Experiment results are logged in a shared tool; hypotheses and outcomes are linked for retrospective review |
| Measurement & Metrics | Experiment completion within the planned timeframe is tracked; teams review overrun experiments in retrospectives |
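As an illustration, the sketch below models such a brief as a small Python dataclass. The field names, the 14-day default, and the `is_sprint_scale` helper are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ExperimentBrief:
    """One-page brief agreed before an AI experiment starts. Fields are illustrative."""
    hypothesis: str               # a falsifiable statement the experiment will test
    success_criteria: list[str]   # observable signals that confirm or reject it
    data_requirements: list[str]  # datasets and access needed before day one
    start_date: date
    decision_gate: date           # the date a go/no-go decision will be made
    owner: str

    def is_sprint_scale(self, max_days: int = 14) -> bool:
        """True if the decision gate falls within the sprint-scale timebox."""
        return (self.decision_gate - self.start_date).days <= max_days


# Example: a brief that respects the two-week timebox.
brief = ExperimentBrief(
    hypothesis="A fine-tuned small model matches the incumbent's F1 on ticket triage",
    success_criteria=["F1 >= 0.85 on the held-out triage set by the decision gate"],
    data_requirements=["5k labelled triage tickets", "read access to the ticket store"],
    start_date=date(2025, 6, 2),
    decision_gate=date(2025, 6, 13),
    owner="ml-platform",
)
assert brief.is_sprint_scale()
```

Keeping the decision gate as a first-class field makes the timebox checkable at the moment the brief is written, rather than discovered in a retrospective.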
Level 3 – Defined
| Category | Description |
| --- | --- |
| People & Culture | Sprint-scale experiment design is a standard skill; teams are coached to decompose large research questions into time-boxed experiments |
| Process & Governance | All AI experiments require a brief covering hypothesis, sprint-scale success criteria, and a decision gate that will be reached within two weeks |
| Technology & Tools | MLflow, Weights & Biases, or equivalent experiment tracking tools are in use; all experiments are logged with inputs, outputs, and conclusions (see the sketch after this table) |
| Measurement & Metrics | Percentage of experiments that produce a clear learning outcome within the planned timeframe is tracked and reported |
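As a minimal sketch of what "logged with inputs, outputs, and conclusions" can look like, the example below uses the MLflow Python API; the experiment name, tag keys (`hypothesis`, `conclusion`), and the 0.85 threshold are illustrative assumptions, and Weights & Biases offers equivalent logging calls.

```python
import mlflow

# Illustrative conventions: the experiment name, tag keys, and threshold
# below are assumptions, not part of the standard itself.
mlflow.set_experiment("triage-small-model")

with mlflow.start_run(run_name="sprint-25-06-a"):
    # Inputs: the hypothesis under test and the run's parameters.
    mlflow.set_tag("hypothesis", "Fine-tuned small model matches incumbent F1 on triage")
    mlflow.log_params({"base_model": "distilbert-base-uncased", "epochs": 3})

    f1 = 0.87  # placeholder for the real held-out evaluation result

    # Outputs and conclusion: the decision-gate outcome lives next to the
    # metrics, so retrospectives can link hypothesis to result in one place.
    mlflow.log_metric("f1_heldout", f1)
    mlflow.set_tag("conclusion", "confirmed" if f1 >= 0.85 else "rejected")
```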
Level 4 – Quantitatively Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams are accountable for experiment yield rate; low-yield experiment patterns are investigated in retrospectives |
| Process & Governance | Experiment portfolios are reviewed weekly; experiments that have not produced learning within their timeframe are cancelled or reshaped |
| Technology & Tools | Automated experiment monitoring flags experiments that are running over time or producing no measurable signal (see the sketch after this table) |
| Measurement & Metrics | Experiment hypothesis accuracy rate, learning yield rate, and time-to-decision are measured and reviewed quarterly |
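One way to implement this flagging, sketched against the MLflow conventions used above: query recent runs and surface any that have passed the timebox without a logged conclusion. The `tags.conclusion` column and the 14-day threshold are assumptions carried over from the Level 3 sketch.

```python
import mlflow
import pandas as pd

SPRINT_DAYS = 14  # assumed timebox; tune to the team's cadence

# search_runs returns a pandas DataFrame; the "tags.conclusion" column
# assumes the logging convention sketched under Level 3 above.
runs = mlflow.search_runs(experiment_names=["triage-small-model"])

age_days = (pd.Timestamp.now(tz="UTC") - runs["start_time"]).dt.days

# Over time: past the timebox with no conclusion logged yet.
stalled = runs[(age_days > SPRINT_DAYS) & runs["tags.conclusion"].isna()]

for run_id in stalled["run_id"]:
    # In practice this would raise a ticket or post to the team channel.
    print(f"Flag for the weekly portfolio review: run {run_id}")
```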
Level 5 – Optimising
| Category | Description |
| --- | --- |
| People & Culture | Teams share experiment design patterns and failure modes organisationally; a library of effective AI experiment templates is maintained |
| Process & Governance | Experiment design standards are continuously refined based on which templates produce the highest learning yield |
| Technology & Tools | Automated experiment design assistance suggests appropriate scope and success criteria based on problem type and available data |
| Measurement & Metrics | Experiment efficiency metrics (learning per sprint invested) are used to optimise the balance between exploration and exploitation in the AI portfolio |
Key Measures
- Percentage of AI experiments that produced a clear, documented learning outcome within their stated timeframe
- Mean experiment duration from hypothesis definition to learning outcome
- Ratio of experiments that confirmed their hypothesis versus those that rejected it (as an indicator of hypothesis quality)
- Number of AI experiments that were cancelled mid-sprint due to unclear scope or unresolvable blockers
- Proportion of AI investment decisions in the quarter that were informed by experiments completed within a sprint-scale timeframe
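To make the first two measures concrete, here is a minimal sketch of computing learning yield rate and mean duration from a quarter's experiment log; the record fields (`planned_days`, `actual_days`, `outcome`) are assumed, not prescribed.

```python
# A quarter's experiment log; the field names are illustrative assumptions.
experiments = [
    {"planned_days": 14, "actual_days": 10, "outcome": "confirmed"},
    {"planned_days": 14, "actual_days": 21, "outcome": "inconclusive"},
    {"planned_days": 10, "actual_days": 9,  "outcome": "rejected"},
]

# A clear learning outcome is a confirmed or rejected hypothesis, delivered
# within the planned timeframe.
learned_on_time = [
    e for e in experiments
    if e["outcome"] in ("confirmed", "rejected") and e["actual_days"] <= e["planned_days"]
]

yield_rate = len(learned_on_time) / len(experiments)
mean_duration = sum(e["actual_days"] for e in experiments) / len(experiments)
print(f"Learning yield rate: {yield_rate:.0%}; mean duration: {mean_duration:.1f} days")
```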