
Standard: Internal data is structured, accessible, and usable to enable AI-driven insights and automation

Purpose and Strategic Importance

Artificial intelligence and machine learning capabilities are only as good as the data they are trained on, fine-tuned with, and operate against at runtime. Organisations frequently invest in AI tooling and model capability while underestimating the foundational requirement: that internal data must be clean, well-catalogued, semantically described, and accessible through reliable interfaces before AI can deliver consistent value. Without this foundation, AI initiatives stall during the data preparation phase, produce unreliable outputs due to inconsistent inputs, or fail to reach production at all. This standard establishes the expectation that data quality, structure, and accessibility are treated as first-class engineering concerns — prerequisites to AI investment rather than afterthoughts.

The organisations that extract the most value from AI are those that treat their internal data as a strategic product. This means building data catalogues that make assets discoverable, defining data contracts between systems so that schemas and quality guarantees are explicit, creating API-accessible data products that AI agents and analytical pipelines can consume reliably, and establishing semantic layers that allow AI models to reason about business concepts rather than raw technical fields. By meeting this standard, engineering and data teams create a compounding asset — a data foundation that not only enables current AI use cases but accelerates future ones, reduces the cost of onboarding new models, and prevents the accumulation of data debt that eventually makes AI initiatives unviable.
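One way to make the data-contract idea above concrete is to model a contract as a plain, versioned structure that a consumer can validate records against. The sketch below is illustrative only: the product name, owner, SLA values, and field names are hypothetical, and a real implementation would typically live in a schema registry or contract repository rather than inline code.

```python
from dataclasses import dataclass

# Hypothetical sketch of a data contract as a plain Python structure.
# All product names, owners, SLA values, and fields are illustrative.

@dataclass(frozen=True)
class FieldSpec:
    name: str
    dtype: str           # e.g. "string", "int64", "timestamp"
    nullable: bool = False
    description: str = ""

@dataclass(frozen=True)
class DataContract:
    product: str               # name of the data product
    version: str               # semantic version of the schema
    owner: str                 # accountable team or role
    freshness_sla_hours: int   # maximum acceptable staleness
    completeness_sla: float    # minimum fraction of populated required fields
    schema: tuple = ()

contract = DataContract(
    product="customer-profile",
    version="1.2.0",
    owner="crm-domain-team",
    freshness_sla_hours=24,
    completeness_sla=0.99,
    schema=(
        FieldSpec("customer_id", "string", description="Stable business key"),
        FieldSpec("segment", "string", nullable=True),
        FieldSpec("updated_at", "timestamp"),
    ),
)

def validate_record(contract: DataContract, record: dict) -> list:
    """Return a list of contract violations for a single record."""
    violations = []
    for spec in contract.schema:
        if record.get(spec.name) is None and not spec.nullable:
            violations.append(f"missing required field: {spec.name}")
    return violations
```

Because the contract carries an explicit version and owner, consumers can detect incompatible changes and route quality issues to an accountable team rather than debugging upstream systems blind.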

Strategic Impact

  • AI and machine learning initiatives reach production faster and with higher confidence because data preparation effort is reduced — teams spend less time cleaning and wrangling data and more time building and validating models that deliver business value.
  • Automation initiatives based on structured, accessible data are more reliable and maintainable, as downstream AI agents and pipelines are insulated from upstream schema changes through versioned data contracts and stable API interfaces.
  • The organisation can reuse data products across multiple AI use cases, reducing duplication of effort and ensuring consistent feature engineering across models, which improves model comparability and reduces the risk of conflicting predictions.
  • Leadership and product teams gain confidence in AI-generated insights because the data lineage is transparent — stakeholders can trace outputs back to source data, understand transformations applied, and validate that models are reasoning from accurate, current information.

Risks of Not Having This Standard

  • AI initiatives repeatedly stall in the data preparation and cleaning phase, with teams spending the majority of project time on data discovery and quality remediation rather than on model development or value delivery.
  • Models trained or fine-tuned on poorly structured or inconsistently labelled internal data produce unreliable outputs, eroding stakeholder trust in AI capabilities and making it difficult to secure investment for future initiatives.
  • Fragmented, siloed data stores result in AI models with incomplete views of the business domain, leading to predictions and recommendations that are systematically biased by missing context or unrepresentative training samples.
  • Without data contracts and API-accessible interfaces, AI pipelines are tightly coupled to source system schemas, meaning that routine system changes break production AI workloads and create operational fragility.
  • The organisation accumulates data debt — legacy systems with undocumented schemas, inconsistent naming conventions, and no ownership — that compounds over time and eventually makes certain classes of AI use case technically infeasible without significant remediation investment.

CMMI Maturity Model

Level 1 – Initial

People & Culture: Data is owned by individual systems or teams with no shared understanding of what data exists, where it lives, or what it represents at an organisational level.
Process & Governance: There are no data contracts, data catalogues, or formal processes for managing data quality; AI projects discover data availability and quality issues on a per-project basis.
Technology & Tools: Data is accessed directly from source databases or flat files with no abstraction layer, semantic definition, or versioning, making AI integration brittle and labour-intensive.
Measurement & Metrics: There is no measurement of data quality, completeness, or accessibility; the extent of data problems is unknown until an AI project is in flight.

Level 2 – Managed

People & Culture: Individual teams have begun documenting their key data assets and schemas, but there is no cross-organisational catalogue and data discovery still relies heavily on tribal knowledge.
Process & Governance: Some high-value data sources have informal quality checks and ownership assigned, but data contracts between systems are not consistently defined or enforced.
Technology & Tools: Basic ETL pipelines and data warehouses are in place for reporting purposes, and some data products are queryable via SQL or basic APIs, though these are not designed for AI consumption.
Measurement & Metrics: Basic data quality metrics such as null rates and row counts are tracked for some datasets, but there is no standardised framework for assessing fitness-for-purpose for AI use cases.
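The basic metrics named at this level, null rates and row counts, need no special tooling to get started. A minimal sketch over a list-of-dicts dataset, with hypothetical column names:

```python
# Minimal sketch of Level 2 quality metrics: row counts and null rates.
# The dataset and column names are illustrative examples.

def row_count(rows):
    return len(rows)

def null_rate(rows, column):
    """Fraction of rows where `column` is absent or None."""
    if not rows:
        return 0.0
    nulls = sum(1 for r in rows if r.get(column) is None)
    return nulls / len(rows)

orders = [
    {"order_id": 1, "customer_id": "a", "discount": None},
    {"order_id": 2, "customer_id": None, "discount": 0.1},
    {"order_id": 3, "customer_id": "b", "discount": None},
]

count = row_count(orders)                  # 3 rows
discount_nulls = null_rate(orders, "discount")  # 2 of 3 rows have a null discount
```

Even metrics this simple, tracked over time per dataset, reveal whether quality is degrading before an AI project discovers it in flight.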

Level 3 – Defined

People & Culture: A data product mindset is established, with teams treating data as a deliverable with consumers; data owners are accountable for the quality and accessibility of their domain's data assets.
Process & Governance: Data contracts are defined between producing and consuming systems, specifying schema, quality SLAs, and change notification obligations; a central data catalogue is maintained and kept current.
Technology & Tools: Data products are exposed through versioned APIs or semantic layers that abstract source system complexity, enabling AI pipelines to consume data without direct coupling to source schemas.
Measurement & Metrics: Data quality dimensions (accuracy, completeness, timeliness, and consistency) are measured and reported for all data products designated as AI-ready.
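Two of the dimensions named at this level, completeness and timeliness, can be scored per batch with nothing more than required-field checks and a freshness threshold. The field names and the 24-hour SLA below are illustrative assumptions, not prescribed values:

```python
from datetime import datetime, timedelta, timezone

# Hedged sketch: scoring completeness and timeliness for a batch of records.
# Required fields and the freshness SLA are hypothetical examples.

REQUIRED_FIELDS = ("customer_id", "email", "updated_at")
FRESHNESS_SLA = timedelta(hours=24)

def completeness(rows):
    """Fraction of records with every required field populated."""
    if not rows:
        return 1.0
    complete = sum(
        1 for r in rows if all(r.get(f) is not None for f in REQUIRED_FIELDS)
    )
    return complete / len(rows)

def timeliness(rows, now):
    """Fraction of records updated within the freshness SLA."""
    if not rows:
        return 1.0
    fresh = sum(1 for r in rows if now - r["updated_at"] <= FRESHNESS_SLA)
    return fresh / len(rows)

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
rows = [
    {"customer_id": "a", "email": "a@example.com",
     "updated_at": now - timedelta(hours=2)},
    {"customer_id": "b", "email": None,
     "updated_at": now - timedelta(hours=30)},
]
# Each score here is 0.5: one of the two records is complete, one is fresh.
```

Reported per data product, scores like these make "AI-ready" an auditable claim rather than an assertion.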

Level 4 – Quantitatively Managed

People & Culture: Data quality and accessibility are treated as engineering KPIs; product teams include data readiness as part of their definition of done for features that produce or transform data.
Process & Governance: Data contracts are versioned and enforced via automated pipelines; SLA breaches on data quality trigger incident processes analogous to production service incidents.
Technology & Tools: A feature store or semantic data layer provides AI teams with pre-engineered, reusable features derived from internal data products, reducing duplication and ensuring consistency across models.
Measurement & Metrics: The full cost of data preparation per AI project is tracked and benchmarked, and improvements in data infrastructure are measured by reduction in time-to-data-ready for new AI initiatives.
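Automated contract enforcement, as described in the governance row above, amounts to a pipeline step that fails fast when an incoming batch breaks the contract the consumer was built against, instead of silently propagating bad data. A hedged sketch, with all product and column names invented for illustration:

```python
# Sketch of automated contract enforcement in a pipeline step. A breach
# raises an exception that can trigger an incident process, mirroring how
# a production service failure would. All names are illustrative.

class ContractViolation(Exception):
    pass

EXPECTED = {
    "product": "customer-profile",
    "version": "1.2",                       # major.minor the consumer targets
    "required_columns": {"customer_id", "updated_at"},
}

def enforce_contract(batch_metadata, columns):
    """Raise ContractViolation if a batch breaks the consumer's contract."""
    if batch_metadata["product"] != EXPECTED["product"]:
        raise ContractViolation("wrong data product")
    major = batch_metadata["version"].split(".")[0]
    if major != EXPECTED["version"].split(".")[0]:
        raise ContractViolation(
            f"incompatible major version: {batch_metadata['version']}"
        )
    missing = EXPECTED["required_columns"] - set(columns)
    if missing:
        raise ContractViolation(f"missing columns: {sorted(missing)}")

# A minor-version bump with all required columns present passes silently;
# a major-version bump or a dropped column raises before any data flows.
enforce_contract(
    {"product": "customer-profile", "version": "1.3.0"},
    ["customer_id", "updated_at", "segment"],
)
```

Treating a raised `ContractViolation` as an incident, rather than a data-team curiosity, is what makes the SLA analogy to production services hold in practice.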

Level 5 – Optimising

People & Culture: The organisation operates a federated data mesh model where domain teams are empowered and accountable for producing high-quality, AI-ready data products, with central governance providing standards and tooling.
Process & Governance: Data contracts evolve continuously based on consumer feedback and AI model performance signals, with governance processes that balance agility with consistency and compliance requirements.
Technology & Tools: Data platforms provide real-time, streaming data products alongside batch interfaces, enabling AI models to operate on current data and reduce latency between events and intelligent actions.
Measurement & Metrics: Data ecosystem health is measured end-to-end, from source system quality through to AI model performance, providing a feedback loop that connects data investment decisions to measurable AI outcomes.

Key Measures

  • Percentage of internally produced data products with a defined data contract, including schema, quality SLA, owner, and change notification process, targeting full coverage for all AI-designated sources.
  • Data preparation time as a proportion of total AI project delivery time, tracked per initiative to demonstrate the return on investment from data infrastructure improvement.
  • Data catalogue coverage — the proportion of data assets discoverable via the central catalogue relative to the total number of known data sources across the organisation.
  • Data quality SLA compliance rate per data product, measured across accuracy, completeness, and timeliness dimensions for all AI-ready datasets.
  • Mean time to integrate a new data source into an AI pipeline, used as a proxy for the accessibility and API-readiness of the organisation's data ecosystem.
  • Number of AI initiatives blocked or delayed due to data quality or accessibility issues per quarter, tracked over time to demonstrate the impact of data foundation investment.
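Two of the measures above, catalogue coverage and SLA compliance rate, reduce to straightforward ratios once the underlying inventories exist. A sketch with hypothetical inputs:

```python
# Illustrative computation of two key measures: catalogue coverage and
# data-quality SLA compliance rate. All source names and thresholds are
# hypothetical examples.

def catalogue_coverage(catalogued, known_sources):
    """Proportion of known data sources discoverable via the catalogue."""
    known = set(known_sources)
    if not known:
        return 1.0
    return len(set(catalogued) & known) / len(known)

def sla_compliance_rate(checks):
    """checks: (measured_value, sla_threshold) pairs; a check is compliant
    when the measured value meets or exceeds its threshold."""
    if not checks:
        return 1.0
    return sum(1 for value, sla in checks if value >= sla) / len(checks)

coverage = catalogue_coverage(
    catalogued=["orders", "customers"],
    known_sources=["orders", "customers", "payments", "tickets"],
)
# coverage is 0.5: two of four known sources are catalogued.

compliance = sla_compliance_rate([(0.995, 0.99), (0.97, 0.99), (1.0, 0.999)])
# compliance is 2/3: the second check misses its threshold.
```

The harder part of both measures is the denominator: maintaining an honest inventory of known sources and active SLA checks is itself a sign of the maturity this standard describes.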