Standard: AI experiments are designed to produce learning within sprint-scale timeframes
Purpose and Strategic Importance
This standard requires that AI experiments be scoped, designed, and resourced so that they yield actionable learning — a confirmed hypothesis, a rejected assumption, or a clear next step — within a sprint-scale timeframe, typically two weeks. It supports the policy of prototyping and validating before building at scale by preventing open-ended research phases that consume resources without producing decisions. Sprint-scale experiments create the cadence needed for rapid learning and confident investment at scale.
Strategic Impact
- Creates a predictable learning cadence that aligns AI discovery work with product and engineering sprint rhythms
- Prevents prolonged, inconclusive experiments from blocking investment decisions and slowing roadmap progress
- Forces clarity about what a given experiment is trying to learn, reducing the frequency of experiments that answer the wrong question
- Builds a culture of structured hypothesis testing that improves the quality of AI investment decisions over time
- Enables faster pivoting when experiments reveal that an assumed approach is not viable, preserving budget for better-suited alternatives
Risks of Not Having This Standard
- AI experiments run for months without producing a clear decision, consuming budget and blocking downstream work
- Teams pursue interesting technical problems rather than the learning needed to progress the product roadmap
- Investment decisions are delayed because there is always "one more experiment" needed before a conclusion can be drawn
- Experiment results are inconclusive because the scope was too broad, the hypothesis too vague, or the timeframe too short to generate signal
- Engineering teams become frustrated when experimental work never converges to a deployable outcome
CMMI Maturity Model
Level 1 – Initial
| Category | Description |
| --- | --- |
| People & Culture | Experiments are open-ended; "we'll know it when we see it" is the implicit success criterion |
| Process & Governance | No experiment design standard; researchers and engineers define their own timelines informally |
| Technology & Tools | Experiment tracking is ad hoc; results are stored in personal notebooks and not shared systematically |
| Measurement & Metrics | No measurement of experiment duration or learning yield; experiments are assessed retrospectively, if at all |
Level 2 – Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams define a hypothesis and expected outcome before starting each experiment |
| Process & Governance | A simple experiment brief template captures hypothesis, success criteria, data requirements, and timeframe (see the sketch after this table) |
| Technology & Tools | Experiment results are logged in a shared tool; hypotheses and outcomes are linked for retrospective review |
| Measurement & Metrics | Experiment completion within the planned timeframe is tracked; teams review overrun experiments in retrospectives |
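As an illustration, the sketch below models such a brief as a small Python dataclass. The field names, the 14-day default, and the `is_sprint_scale` helper are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ExperimentBrief:
    """One-page brief agreed before an AI experiment starts. Fields are illustrative."""
    hypothesis: str               # a falsifiable statement the experiment will test
    success_criteria: list[str]   # observable signals that confirm or reject it
    data_requirements: list[str]  # datasets and access needed before day one
    start_date: date
    decision_gate: date           # the date a go/no-go decision will be made
    owner: str

    def is_sprint_scale(self, max_days: int = 14) -> bool:
        """True if the decision gate falls within the sprint-scale timebox."""
        return (self.decision_gate - self.start_date).days <= max_days


# Example: a brief that respects the two-week timebox.
brief = ExperimentBrief(
    hypothesis="A fine-tuned small model matches the incumbent's F1 on ticket triage",
    success_criteria=["F1 >= 0.85 on the held-out triage set by the decision gate"],
    data_requirements=["5k labelled triage tickets", "read access to the ticket store"],
    start_date=date(2025, 6, 2),
    decision_gate=date(2025, 6, 13),
    owner="ml-platform",
)
assert brief.is_sprint_scale()
```

Keeping the decision gate as a first-class field makes the timebox checkable at the moment the brief is written, rather than discovered in a retrospective.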
Level 3 – Defined
| Category | Description |
| --- | --- |
| People & Culture | Sprint-scale experiment design is a standard skill; teams are coached to decompose large research questions into time-boxed experiments |
| Process & Governance | All AI experiments require a brief covering hypothesis, sprint-scale success criteria, and a decision gate that will be reached within two weeks |
| Technology & Tools | MLflow, Weights & Biases, or equivalent experiment tracking tools are in use; all experiments are logged with inputs, outputs, and conclusions (see the sketch after this table) |
| Measurement & Metrics | Percentage of experiments that produce a clear learning outcome within the planned timeframe is tracked and reported |
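As a minimal sketch of what "logged with inputs, outputs, and conclusions" can look like, the example below uses the MLflow Python API; the experiment name, tag keys (`hypothesis`, `conclusion`), and the 0.85 threshold are illustrative assumptions, and Weights & Biases offers equivalent logging calls.

```python
import mlflow

# Illustrative conventions: the experiment name, tag keys, and threshold
# below are assumptions, not part of the standard itself.
mlflow.set_experiment("triage-small-model")

with mlflow.start_run(run_name="sprint-25-06-a"):
    # Inputs: the hypothesis under test and the run's parameters.
    mlflow.set_tag("hypothesis", "Fine-tuned small model matches incumbent F1 on triage")
    mlflow.log_params({"base_model": "distilbert-base-uncased", "epochs": 3})

    f1 = 0.87  # placeholder for the real held-out evaluation result

    # Outputs and conclusion: the decision-gate outcome lives next to the
    # metrics, so retrospectives can link hypothesis to result in one place.
    mlflow.log_metric("f1_heldout", f1)
    mlflow.set_tag("conclusion", "confirmed" if f1 >= 0.85 else "rejected")
```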
Level 4 – Quantitatively Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams are accountable for experiment yield rate; low-yield experiment patterns are investigated in retrospectives |
| Process & Governance | Experiment portfolios are reviewed weekly; experiments that have not produced learning within their timeframe are cancelled or reshaped |
| Technology & Tools | Automated experiment monitoring flags experiments that are running over time or producing no measurable signal (see the sketch after this table) |
| Measurement & Metrics | Experiment hypothesis accuracy rate, learning yield rate, and time-to-decision are measured and reviewed quarterly |
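One way to implement this flagging, sketched against the MLflow conventions used above: query recent runs and surface any that have passed the timebox without a logged conclusion. The `tags.conclusion` column and the 14-day threshold are assumptions carried over from the Level 3 sketch.

```python
import mlflow
import pandas as pd

SPRINT_DAYS = 14  # assumed timebox; tune to the team's cadence

# search_runs returns a pandas DataFrame; the "tags.conclusion" column
# assumes the logging convention sketched under Level 3 above.
runs = mlflow.search_runs(experiment_names=["triage-small-model"])

age_days = (pd.Timestamp.now(tz="UTC") - runs["start_time"]).dt.days

# Over time: past the timebox with no conclusion logged yet.
stalled = runs[(age_days > SPRINT_DAYS) & runs["tags.conclusion"].isna()]

for run_id in stalled["run_id"]:
    # In practice this would raise a ticket or post to the team channel.
    print(f"Flag for the weekly portfolio review: run {run_id}")
```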
Level 5 – Optimising
| Category | Description |
| --- | --- |
| People & Culture | Teams share experiment design patterns and failure modes organisationally; a library of effective AI experiment templates is maintained |
| Process & Governance | Experiment design standards are continuously refined based on which templates produce the highest learning yield |
| Technology & Tools | Automated experiment design assistance suggests appropriate scope and success criteria based on problem type and available data |
| Measurement & Metrics | Experiment efficiency metrics (learning per sprint invested) are used to optimise the balance between exploration and exploitation in the AI portfolio |
Key Measures
- Percentage of AI experiments that produced a clear, documented learning outcome within their stated timeframe
- Mean experiment duration from hypothesis definition to learning outcome
- Ratio of experiments that confirmed their hypothesis versus those that rejected it (as an indicator of hypothesis quality)
- Number of AI experiments that were cancelled mid-sprint due to unclear scope or unresolvable blockers
- Proportion of AI investment decisions in the quarter that were informed by experiments completed within a sprint-scale timeframe
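To make the first two measures concrete, here is a minimal sketch of computing learning yield rate and mean duration from a quarter's experiment log; the record fields (`planned_days`, `actual_days`, `outcome`) are assumed, not prescribed.

```python
# A quarter's experiment log; the field names are illustrative assumptions.
experiments = [
    {"planned_days": 14, "actual_days": 10, "outcome": "confirmed"},
    {"planned_days": 14, "actual_days": 21, "outcome": "inconclusive"},
    {"planned_days": 10, "actual_days": 9,  "outcome": "rejected"},
]

# A clear learning outcome is a confirmed or rejected hypothesis, delivered
# within the planned timeframe.
learned_on_time = [
    e for e in experiments
    if e["outcome"] in ("confirmed", "rejected") and e["actual_days"] <= e["planned_days"]
]

yield_rate = len(learned_on_time) / len(experiments)
mean_duration = sum(e["actual_days"] for e in experiments) / len(experiments)
print(f"Learning yield rate: {yield_rate:.0%}; mean duration: {mean_duration:.1f} days")
```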