Standard: AI experiments are designed to produce learning within sprint-scale timeframes

Purpose and Strategic Importance

This standard requires that AI experiments are scoped, designed, and resourced so that they yield actionable learning — a confirmed hypothesis, a rejected assumption, or a clear next step — within a sprint-scale timeframe, typically two weeks. It supports the policy of prototyping and validating before building at scale by preventing open-ended research phases that consume resources without producing decisions. Sprint-scale experiments create the cadence needed for rapid learning and confident investment at scale.

Strategic Impact

  • Creates a predictable learning cadence that aligns AI discovery work with product and engineering sprint rhythms
  • Prevents prolonged, inconclusive experiments from blocking investment decisions and slowing roadmap progress
  • Forces clarity about what a given experiment is trying to learn, reducing the frequency of experiments that answer the wrong question
  • Builds a culture of structured hypothesis testing that improves the quality of AI investment decisions over time
  • Enables faster pivoting when experiments reveal that an assumed approach is not viable, preserving budget for better-suited alternatives

Risks of Not Having This Standard

  • AI experiments run for months without producing a clear decision, consuming budget and blocking downstream work
  • Teams pursue interesting technical problems rather than the learning needed to progress the product roadmap
  • Investment decisions are delayed because there is always "one more experiment" needed before a conclusion can be drawn
  • Experiment results are inconclusive because the scope was too broad, the hypothesis too vague, or the timeframe too short to generate signal
  • Engineering teams become frustrated when experimental work never converges to a deployable outcome

CMMI Maturity Model

Level 1 – Initial

  • People & Culture - Experiments are open-ended; "we'll know it when we see it" is the implicit success criterion
  • Process & Governance - No experiment design standard; researchers and engineers define their own timelines informally
  • Technology & Tools - Experiment tracking is ad hoc; results are stored in personal notebooks and not shared systematically
  • Measurement & Metrics - No measurement of experiment duration or learning yield; experiments are assessed retrospectively if at all

Level 2 – Managed

  • People & Culture - Teams define a hypothesis and expected outcome before starting each experiment
  • Process & Governance - A simple experiment brief template captures hypothesis, success criteria, data requirements, and timeframe (a sketch of such a brief follows this table)
  • Technology & Tools - Experiment results are logged in a shared tool; hypotheses and outcomes are linked for retrospective review
  • Measurement & Metrics - Experiment completion within the planned timeframe is tracked; teams review overrun experiments in retrospectives
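
To make the Level 2 brief concrete, the sketch below shows one way such a template could be captured in code. It is illustrative only: the Python dataclass, its field names, the two-week timebox, and the example values are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

@dataclass
class ExperimentBrief:
    """Hypothetical experiment brief; field names mirror the Level 2 template."""
    title: str
    hypothesis: str                     # the assumption the experiment will confirm or reject
    success_criteria: str               # the observable result that counts as confirmation
    data_requirements: list[str] = field(default_factory=list)
    start_date: date = field(default_factory=date.today)
    timebox_days: int = 14              # sprint-scale: roughly two weeks

    @property
    def decision_gate(self) -> date:
        """Date by which a learning outcome (confirm, reject, or next step) is due."""
        return self.start_date + timedelta(days=self.timebox_days)

# Example usage with placeholder values
brief = ExperimentBrief(
    title="Intent classification with a hosted LLM",
    hypothesis="A hosted LLM can classify support-ticket intent with at least 85% accuracy",
    success_criteria="Accuracy >= 0.85 on a 500-ticket holdout set before the decision gate",
    data_requirements=["500 labelled historical tickets", "current intent taxonomy"],
)
print(f"Decision gate: {brief.decision_gate}")
```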

Level 3 – Defined

  • People & Culture - Sprint-scale experiment design is a standard skill; teams are coached to decompose large research questions into time-boxed experiments
  • Process & Governance - All AI experiments require a brief covering hypothesis, sprint-scale success criteria, and a decision gate that will be reached within two weeks
  • Technology & Tools - MLflow, Weights & Biases, or equivalent experiment tracking tools are in use; all experiments are logged with inputs, outputs, and conclusions (an example follows this table)
  • Measurement & Metrics - Percentage of experiments that produce a clear learning outcome within the planned timeframe is tracked and reported
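
As one illustration of the Level 3 tooling expectation, the sketch below logs a hypothesis, inputs, outputs, and a conclusion against an MLflow run. The experiment name, tag keys, and values are assumptions chosen for illustration; only the MLflow calls themselves (set_experiment, start_run, log_param, log_metric, set_tag) are standard API.

```python
import mlflow

# Assumes an MLflow tracking server is configured, e.g. via MLFLOW_TRACKING_URI.
mlflow.set_experiment("support-ticket-intent-classification")

with mlflow.start_run(run_name="hosted-llm-baseline"):
    # Hypothesis and decision gate captured as tags (key names are illustrative)
    mlflow.set_tag("hypothesis", "Hosted LLM reaches >= 85% intent accuracy on the labelled sample")
    mlflow.set_tag("decision_gate", "2025-07-18")

    # Inputs: parameters that define the experiment
    mlflow.log_param("model", "hosted-llm-v1")
    mlflow.log_param("eval_set_size", 500)

    # Outputs: the measurable signal the success criterion refers to
    accuracy = 0.87  # placeholder for the real evaluation result
    mlflow.log_metric("intent_accuracy", accuracy)

    # Conclusion: the learning outcome that feeds the investment decision
    mlflow.set_tag("conclusion", "confirmed" if accuracy >= 0.85 else "rejected")
```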

Level 4 – Quantitatively Managed

  • People & Culture - Teams are accountable for experiment yield rate; low-yield experiment patterns are investigated in retrospectives
  • Process & Governance - Experiment portfolios are reviewed weekly; experiments that have not produced learning within their timeframe are cancelled or reshaped
  • Technology & Tools - Automated experiment monitoring flags experiments that are running over time or producing no measurable signal (a monitoring sketch follows this table)
  • Measurement & Metrics - Experiment hypothesis accuracy rate, learning yield rate, and time-to-decision are measured and reviewed quarterly
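
The automated monitoring described at Level 4 could be approximated with a small scheduled job against the tracking server. The sketch below, assuming the MLflow setup and tag names from the previous example, flags runs that have passed a two-week timebox without a logged conclusion, or that have logged no metrics at all; the timebox and tag names are assumptions.

```python
from datetime import datetime, timedelta, timezone

from mlflow.tracking import MlflowClient

client = MlflowClient()
timebox = timedelta(days=14)          # sprint-scale timebox; adjust to the team's cadence
now = datetime.now(timezone.utc)

experiment = client.get_experiment_by_name("support-ticket-intent-classification")
if experiment is not None:
    for run in client.search_runs([experiment.experiment_id]):
        started = datetime.fromtimestamp(run.info.start_time / 1000, tz=timezone.utc)
        overdue = now - started > timebox
        if overdue and "conclusion" not in run.data.tags:
            print(f"OVERDUE: run {run.info.run_id} started {started:%Y-%m-%d} with no conclusion logged")
        if not run.data.metrics:
            print(f"NO SIGNAL: run {run.info.run_id} has logged no metrics")
```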

Level 5 – Optimising

  • People & Culture - Teams share experiment design patterns and failure modes organisationally; a library of effective AI experiment templates is maintained
  • Process & Governance - Experiment design standards are continuously refined based on which templates produce the highest learning yield
  • Technology & Tools - Automated experiment design assistance suggests appropriate scope and success criteria based on problem type and available data
  • Measurement & Metrics - Experiment efficiency metrics (learning per sprint invested) are used to optimise the balance between exploration and exploitation in the AI portfolio

Key Measures

  • Percentage of AI experiments that produced a clear, documented learning outcome within their stated timeframe
  • Mean experiment duration from hypothesis definition to learning outcome
  • Ratio of experiments that confirmed their hypothesis versus those that rejected it (as an indicator of hypothesis quality)
  • Number of AI experiments that were cancelled mid-sprint due to unclear scope or unresolvable blockers
  • Proportion of AI investment decisions in the quarter that were informed by experiments completed within a sprint-scale timeframe (a computation sketch for these measures follows this list)
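
As a rough illustration, the sketch below computes four of the measures above from a set of experiment records. The record structure is hypothetical (in practice these fields would come from the experiment tracking tool), and the fifth measure, the proportion of investment decisions informed by sprint-scale experiments, would additionally require a record of the decisions themselves.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ExperimentRecord:
    """Hypothetical record of one experiment; fields are illustrative."""
    started: date
    decision_gate: date
    concluded: Optional[date]          # None if still open or abandoned
    outcome: Optional[str]             # "confirmed", "rejected", or None
    cancelled_mid_sprint: bool = False

def key_measures(records: list[ExperimentRecord]) -> dict[str, float]:
    completed = [r for r in records if r.concluded is not None and r.outcome is not None]
    on_time = [r for r in completed if r.concluded <= r.decision_gate]
    confirmed = sum(1 for r in completed if r.outcome == "confirmed")
    rejected = sum(1 for r in completed if r.outcome == "rejected")
    return {
        # Percentage of experiments with a clear learning outcome within their timeframe
        "pct_clear_outcome_on_time": 100 * len(on_time) / len(records) if records else 0.0,
        # Mean duration from hypothesis definition to learning outcome, in days
        "mean_duration_days": (
            sum((r.concluded - r.started).days for r in completed) / len(completed)
            if completed else 0.0
        ),
        # Ratio of confirmed to rejected hypotheses, as an indicator of hypothesis quality
        "confirmed_to_rejected_ratio": confirmed / rejected if rejected else float("inf"),
        # Count of experiments cancelled mid-sprint
        "cancelled_mid_sprint": float(sum(r.cancelled_mid_sprint for r in records)),
    }
```
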
Associated Policies
  • Prototype and validate before building at scale
Associated Practices
  • Experiment Tracking and Management
  • Hyperparameter Tuning Practices
  • AI Prototyping and PoC
