Standard: Systems provide real-time visibility into performance, behaviour, and health
Purpose and Strategic Importance
Systems that cannot be observed cannot be reliably operated or improved. This standard establishes that structured logging, distributed tracing, and metrics must be treated as core engineering requirements — designed in from the outset, not retrofitted after incidents expose gaps. When teams have genuine observability, they can understand system behaviour in production without needing to reproduce problems locally, dramatically reducing mean time to detect and recover from failures.
Observability is the foundation of confident delivery. The three pillars — logs, metrics, and traces — provide complementary lenses on system health: logs capture discrete events, metrics surface aggregated trends, and traces reveal the path of requests through distributed systems. Together, they empower engineers to ask novel questions about system behaviour, support informed capacity planning, and give product and operations teams the shared situational awareness needed to act with confidence.
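To make the correlation between pillars concrete, the sketch below (Python standard library only; the `checkout` logger name and the field names are hypothetical, not prescribed by this standard) emits a structured JSON log line carrying the same trace and span identifiers a tracing backend would use to stitch a request's path together, alongside a latency measurement of the kind a metrics pipeline would aggregate.

```python
import json
import logging
import time
import uuid

# Hypothetical structured logger: every event is a single JSON object,
# so log aggregators can index fields rather than parse free text.
logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

def log_event(event: str, trace_id: str, span_id: str, **fields) -> None:
    """Emit one structured log line correlated to a distributed trace."""
    log.info(json.dumps({
        "timestamp": time.time(),
        "event": event,
        "trace_id": trace_id,   # shared with the distributed trace
        "span_id": span_id,     # identifies this unit of work
        **fields,
    }))

# Simulated request: in practice the trace_id would arrive via request
# headers (e.g. W3C traceparent) rather than being generated here.
trace_id = uuid.uuid4().hex
span_id = uuid.uuid4().hex[:16]

start = time.perf_counter()
# ... handle the request ...
elapsed_ms = (time.perf_counter() - start) * 1000

log_event("order_placed", trace_id, span_id,
          status="ok", duration_ms=round(elapsed_ms, 2))
```

Because the log event, the latency figure, and the trace share identifiers, an engineer can pivot from an aggregate spike on a dashboard to the exact requests and log lines behind it.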
Strategic Impact
- Engineers can diagnose and resolve production incidents faster, reducing customer impact and shortening mean time to recovery across all critical services.
- Teams gain the confidence to deploy frequently, knowing they can detect regressions or degradations in real time before they compound into major failures.
- Observability data drives informed architectural and capacity decisions, replacing speculation with evidence and reducing costly over-provisioning or under-provisioning.
- Shared visibility across engineering, operations, and product teams reduces information silos, accelerates root cause analysis, and aligns everyone on system performance against business objectives.
Risks of Not Having This Standard
- Production issues go undetected until customers report them, resulting in extended outages, reputational damage, and eroded user trust.
- Engineers spend significant time and effort reproducing issues locally rather than diagnosing them directly in the production environment where they actually occurred.
- Incident response is slowed by a lack of correlated data, causing teams to work from assumptions and guesswork rather than observable evidence.
- Capacity and performance problems escalate silently until they reach critical thresholds, removing the opportunity for proactive intervention.
- Teams accumulate hidden operational risk as system complexity grows faster than the team's ability to understand and reason about system behaviour.
CMMI Maturity Model
Level 1 – Initial
| Category | Description |
| --- | --- |
| People & Culture | Observability is not a shared concern and visibility into production systems is reactive and ad hoc. |
| Process & Governance | There are no standards for logging, monitoring, or alerting and practices vary entirely by individual choice. |
| Technology & Tools | Basic application logging exists in some services but is unstructured, inconsistent, and rarely reviewed. |
| Measurement & Metrics | System health is assessed only through user reports or manual checks with no automated monitoring in place. |
Level 2 – Managed
| Category | Description |
| --- | --- |
| People & Culture | Some teams have begun investing in monitoring after experiencing painful production incidents. |
| Process & Governance | Basic monitoring and alerting standards are discussed but applied inconsistently across teams and services. |
| Technology & Tools | Infrastructure metrics and basic application logs are collected but not correlated or easily queryable. |
| Measurement & Metrics | Uptime and error rate dashboards exist for some services but coverage is partial and thresholds are guessed. |
Level 3 – Defined
| Category | Description |
| --- | --- |
| People & Culture | Teams treat observability as a first-class engineering concern and new services are built with telemetry included by default. |
| Process & Governance | Logging, metrics, and tracing standards are documented and applied consistently, with observability reviewed as part of design. |
| Technology & Tools | Structured logging, distributed tracing, and metrics collection are in place across all production services (a minimal sketch follows this table). |
| Measurement & Metrics | Teams monitor against agreed SLIs, track error budgets, and review observability coverage in service reviews. |
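What "metrics collection in place" can look like at this level: a minimal sketch assuming the widely used prometheus_client Python library; the metric names, label, and port are illustrative, not prescribed by this standard.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; real services would follow the naming
# convention agreed in the team's observability standard.
REQUESTS = Counter(
    "checkout_requests_total",
    "Total checkout requests handled",
    ["status"],
)
LATENCY = Histogram(
    "checkout_request_duration_seconds",
    "Checkout request latency in seconds",
)

def handle_request() -> None:
    with LATENCY.time():                        # records request duration
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real work
    REQUESTS.labels(status="ok").inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a scraper to collect
    while True:
        handle_request()
```

Exposing metrics over a scrape endpoint like this keeps instrumentation inside the service, while collection, dashboards, and alerting live in shared platform tooling.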
Level 4 – Quantitatively Managed
| Category | Description |
| --- | --- |
| People & Culture | Engineers use observability data proactively to identify degradation trends before they become user-visible incidents. |
| Process & Governance | SLOs are formally defined, reviewed regularly, and observability gaps are treated as engineering work to be prioritised. |
| Technology & Tools | Correlated dashboards link logs, metrics, and traces to enable rapid root cause analysis across distributed service boundaries. |
| Measurement & Metrics | MTTR, error budget burn rates, and P99 latency trends are tracked and used to inform delivery and architecture decisions. |
Level 5 – Optimising
| Category | Description |
| --- | --- |
| People & Culture | Observability insights continuously inform system design improvements and are shared across teams as organisational learning. |
| Process & Governance | Observability standards evolve in response to system complexity, new delivery patterns, and emerging tooling capabilities. |
| Technology & Tools | Intelligent alerting, anomaly detection, and predictive analytics surface issues before users or on-call engineers notice them (illustrated after this table). |
| Measurement & Metrics | Observability investment is connected to business outcomes, with data demonstrating reduced incident cost and improved reliability. |
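This standard does not prescribe an anomaly detection technique. One minimal illustration is a rolling z-score over a latency series, sketched below; the window size and threshold are arbitrary assumptions, and production systems would typically rely on the platform's built-in detectors.

```python
from collections import deque
from statistics import mean, stdev

def make_detector(window: int = 60, threshold: float = 3.0):
    """Return a function that flags samples far from the recent baseline.

    window    -- recent samples forming the baseline (assumed value)
    threshold -- z-score beyond which a sample is anomalous (assumed value)
    """
    history = deque(maxlen=window)

    def observe(value: float) -> bool:
        anomalous = False
        if len(history) >= 2:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalous = True
        history.append(value)
        return anomalous

    return observe

# Usage: feed each new latency sample; flag deviations before users notice.
detect = make_detector()
for latency_ms in [12, 14, 11, 13, 12, 15, 13, 140]:
    if detect(latency_ms):
        print(f"anomaly: {latency_ms} ms deviates from recent baseline")
```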
Key Measures
- Mean time to detect (MTTD) and mean time to recover (MTTR) across all production services tracked and trending over time
- Percentage of production services meeting defined coverage thresholds for structured logs, metrics, and distributed traces
- Error budget consumption rate against defined SLOs for all critical user-facing and internal services (a worked sketch follows this list)
- Number of production incidents resolved using observability tooling alone, without requiring local environment reproduction
- Alert signal-to-noise ratio, measured as the proportion of alerts that prompt genuine, actionable investigation versus those dismissed as false positives
- Percentage of new services that reach production with observability instrumentation in place from day one
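To make error budget consumption concrete, the arithmetic below assumes a hypothetical 99.9% availability SLO over a 30-day window; the request counts and elapsed days are invented for illustration.

```python
# Hypothetical SLO: 99.9% of requests succeed over a 30-day window.
slo_target = 0.999
window_days = 30

# Invented counts for illustration, e.g. pulled from a metrics store.
total_requests = 10_000_000
failed_requests = 4_200

# The error budget is the failure allowance the SLO permits.
budget = (1 - slo_target) * total_requests   # 10,000 failures allowed
consumed = failed_requests / budget          # fraction of budget spent

# Burn rate > 1.0 means the budget will be exhausted before the
# window ends if the current failure rate continues.
days_elapsed = 10
burn_rate = consumed / (days_elapsed / window_days)

print(f"budget consumed: {consumed:.1%}")   # 42.0%
print(f"burn rate: {burn_rate:.2f}")        # 1.26 -> burning too fast
```

A burn rate above 1.0 signals that, at the current failure rate, the budget will run out before the SLO window ends, which is the trigger for prioritising reliability work over new features.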