Standards

06

As organisations deploy AI systems to support operational decisions, customer interactions, and strategic planning, the reliability and governance of the underlying data ecosystem become a safety-critical concern. AI models do not fail loudly — they produce outputs that appear plausible while being systematically biased, outdated, or incomplete. When those outputs drive hiring decisions, credit assessments, medical triage, or operational risk management, the consequences of poor data governance are not merely technical: they manifest as discriminatory outcomes, regulatory breaches, or decisions made on a false picture of reality. This standard establishes that data ecosystems must be governed with the same rigour applied to production software — with defined ownership, lineage tracking, quality SLAs, access controls, and retention policies that ensure data is trustworthy at the point of use.
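The controls named above — ownership, freshness SLAs, completeness thresholds, retention — can be enforced as runtime checks before data reaches a model. The following is a minimal sketch, not an implementation the standard prescribes; the `DatasetPolicy` and `DatasetSnapshot` structures, field names, and thresholds are all illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Illustrative policy record: one entry per governed dataset.
@dataclass
class DatasetPolicy:
    owner: str                 # accountable team, not an individual
    max_staleness: timedelta   # freshness SLA
    min_completeness: float    # required fraction of non-null values
    retention_days: int        # how long records may be kept

# Illustrative measurement of a dataset at the point of use.
@dataclass
class DatasetSnapshot:
    name: str
    last_updated: datetime
    completeness: float        # measured fraction of non-null values

def fit_for_use(snapshot: DatasetSnapshot, policy: DatasetPolicy) -> list[str]:
    """Return a list of SLA violations; an empty list means the data is usable."""
    violations = []
    age = datetime.now(timezone.utc) - snapshot.last_updated
    if age > policy.max_staleness:
        violations.append(f"{snapshot.name}: freshness SLA breached by {age - policy.max_staleness}")
    if snapshot.completeness < policy.min_completeness:
        violations.append(
            f"{snapshot.name}: completeness {snapshot.completeness:.2%} "
            f"below SLA {policy.min_completeness:.2%}"
        )
    return violations
```

A pipeline would call `fit_for_use` as a gate: data that violates its SLAs is quarantined for the owning team rather than silently consumed, which is what makes failure loud rather than plausible-looking.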

Good data governance is frequently mischaracterised as a compliance burden — a set of controls imposed on innovation. In practice, well-governed data ecosystems accelerate innovation by making data discoverable, trustworthy, and reusable, reducing the time teams spend questioning data provenance or remediating quality issues before they can act. Federated governance models, informed by data mesh principles, allow domain teams to own and operate their data products within a framework of organisational standards, balancing autonomy with accountability. This standard supports that balance — enabling teams to move quickly while ensuring that the data underpinning AI-driven decisions is reliable, auditable, and fit for purpose across its full lifecycle.

03

Artificial intelligence and machine learning capabilities are only as good as the data they are trained on, fine-tuned with, and operate against at runtime. Organisations frequently invest in AI tooling and model capability while underestimating the foundational requirement: that internal data must be clean, well-catalogued, semantically described, and accessible through reliable interfaces before AI can deliver consistent value. Without this foundation, AI initiatives stall during the data preparation phase, produce unreliable outputs due to inconsistent inputs, or fail to reach production at all. This standard establishes the expectation that data quality, structure, and accessibility are treated as first-class engineering concerns — prerequisites to AI investment rather than afterthoughts.
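The foundation described above — well-catalogued, semantically described, accessible data — is often operationalised as a catalogue entry per dataset. As a minimal sketch (the `CatalogueEntry` structure, its fields, and the naive search are illustrative assumptions, not part of this standard):

```python
from dataclasses import dataclass

# Illustrative catalogue entry: the metadata that makes a dataset
# discoverable and semantically meaningful to humans and AI agents.
@dataclass
class CatalogueEntry:
    name: str                   # stable, discoverable identifier
    description: str            # semantic description of what the data means
    owner: str                  # accountable team
    endpoint: str               # reliable access interface, e.g. an internal API path
    tags: tuple[str, ...] = ()

def search(catalogue: list[CatalogueEntry], term: str) -> list[CatalogueEntry]:
    """Naive discovery: match the term against names, descriptions, and tags."""
    term = term.lower()
    return [
        entry for entry in catalogue
        if term in entry.name.lower()
        or term in entry.description.lower()
        or any(term in tag.lower() for tag in entry.tags)
    ]
```

Real deployments use dedicated catalogue tooling rather than an in-memory list, but the principle is the same: a dataset that cannot be found and understood from its metadata alone is not yet ready to feed an AI system.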

The organisations that extract the most value from AI are those that treat their internal data as a strategic product. This means building data catalogues that make assets discoverable, defining data contracts between systems so that schemas and quality guarantees are explicit, creating API-accessible data products that AI agents and analytical pipelines can consume reliably, and establishing semantic layers that allow AI models to reason about business concepts rather than raw technical fields. By meeting this standard, engineering and data teams create a compounding asset — a data foundation that not only enables current AI use cases but accelerates future ones, reduces the cost of onboarding new models, and prevents the accumulation of data debt that eventually makes AI initiatives unviable.
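The data contracts mentioned above — explicit schemas and quality guarantees between producing and consuming systems — can be sketched as a check a consuming pipeline runs before ingesting a batch. This is a hedged illustration under stated assumptions: the `DataContract` structure, field names, and thresholds are hypothetical, and production teams would typically use dedicated contract tooling rather than hand-rolled validation.

```python
from dataclasses import dataclass

# Illustrative data contract: a schema plus quality guarantees that a
# producing team publishes and a consuming pipeline verifies on ingest.
@dataclass
class DataContract:
    schema: dict[str, type]            # field name -> expected Python type
    required: set[str]                 # fields that must always be present
    max_null_fraction: float = 0.01    # tolerated share of missing values

def validate_batch(records: list[dict], contract: DataContract) -> list[str]:
    """Check a batch of records against the contract; return any violations."""
    violations = []
    nulls = 0
    checked = 0
    for i, record in enumerate(records):
        missing = contract.required - record.keys()
        if missing:
            violations.append(f"record {i}: missing required fields {sorted(missing)}")
        for name, expected in contract.schema.items():
            value = record.get(name)
            checked += 1
            if value is None:
                nulls += 1
            elif not isinstance(value, expected):
                violations.append(
                    f"record {i}: {name} is {type(value).__name__}, expected {expected.__name__}"
                )
    if checked and nulls / checked > contract.max_null_fraction:
        violations.append(f"null fraction {nulls / checked:.2%} exceeds contract limit")
    return violations
```

The design point is that the guarantees are explicit and machine-checkable: when a producer changes a schema or quality degrades, the contract check fails at the boundary between systems instead of surfacing later as silently degraded model outputs.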