Standard: Systems provide real-time visibility into performance, behaviour, and health
Purpose and Strategic Importance
Systems that cannot be observed cannot be reliably operated or improved. This standard establishes that structured logging, distributed tracing, and metrics must be treated as core engineering requirements — designed in from the outset, not retrofitted after incidents expose gaps. When teams have genuine observability, they can understand system behaviour in production without needing to reproduce problems locally, dramatically reducing mean time to detect and recover from failures.
Observability is the foundation of confident delivery. The three pillars — logs, metrics, and traces — provide complementary lenses on system health: logs capture discrete events, metrics surface aggregated trends, and traces reveal the path of requests through distributed systems. Together, they empower engineers to ask novel questions about system behaviour, support informed capacity planning, and give product and operations teams the shared situational awareness needed to act with confidence.
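To make the correlation between pillars concrete, the sketch below (Python standard library only; the `checkout` logger name and the field names are hypothetical, not prescribed by this standard) emits a structured JSON log line carrying the same trace and span identifiers a tracing backend would use to stitch a request's path together, alongside a latency measurement of the kind a metrics pipeline would aggregate.

```python
import json
import logging
import time
import uuid

# Hypothetical structured logger: every event is a single JSON object,
# so log aggregators can index fields rather than parse free text.
logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

def log_event(event: str, trace_id: str, span_id: str, **fields) -> None:
    """Emit one structured log line correlated to a distributed trace."""
    log.info(json.dumps({
        "timestamp": time.time(),
        "event": event,
        "trace_id": trace_id,   # shared with the distributed trace
        "span_id": span_id,     # identifies this unit of work
        **fields,
    }))

# Simulated request: in practice the trace_id would arrive via request
# headers (e.g. W3C traceparent) rather than being generated here.
trace_id = uuid.uuid4().hex
span_id = uuid.uuid4().hex[:16]

start = time.perf_counter()
# ... handle the request ...
elapsed_ms = (time.perf_counter() - start) * 1000

log_event("order_placed", trace_id, span_id,
          status="ok", duration_ms=round(elapsed_ms, 2))
```

Because the log event, the latency figure, and the trace share identifiers, an engineer can pivot from an aggregate spike on a dashboard to the exact requests and log lines behind it.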
Strategic Impact
- Engineers can diagnose and resolve production incidents faster, reducing customer impact and shortening mean time to recovery across all critical services.
- Teams gain the confidence to deploy frequently, knowing they can detect regressions or degradations in real time before they compound into major failures.
- Observability data drives informed architectural and capacity decisions, replacing speculation with evidence and reducing costly over-provisioning or under-provisioning.
- Shared visibility across engineering, operations, and product teams reduces information silos, accelerates root cause analysis, and aligns everyone on system performance against business objectives.
Risks of Not Having This Standard
- Production issues go undetected until customers report them, resulting in extended outages, reputational damage, and eroded user trust.
- Engineers spend significant time and effort reproducing issues locally rather than diagnosing them directly in the production environment where they actually occurred.
- Incident response is slowed by a lack of correlated data, causing teams to work from assumptions and guesswork rather than observable evidence.
- Capacity and performance problems escalate silently until they reach critical thresholds, removing the opportunity for proactive intervention.
- Teams accumulate hidden operational risk as system complexity grows faster than the team's ability to understand and reason about system behaviour.
CMMI Maturity Model
Level 1 – Initial
| Category | Description |
| --- | --- |
| People & Culture | Observability is not a shared concern and visibility into production systems is reactive and ad hoc. |
| Process & Governance | There are no standards for logging, monitoring, or alerting and practices vary entirely by individual choice. |
| Technology & Tools | Basic application logging exists in some services but is unstructured, inconsistent, and rarely reviewed. |
| Measurement & Metrics | System health is assessed only through user reports or manual checks with no automated monitoring in place. |
Level 2 – Managed
| Category | Description |
| --- | --- |
| People & Culture | Some teams have begun investing in monitoring after experiencing painful production incidents. |
| Process & Governance | Basic monitoring and alerting standards are discussed but applied inconsistently across teams and services. |
| Technology & Tools | Infrastructure metrics and basic application logs are collected but not correlated or easily queryable. |
| Measurement & Metrics | Uptime and error rate dashboards exist for some services but coverage is partial and thresholds are guessed. |
Level 3 – Defined
| Category | Description |
| --- | --- |
| People & Culture | Teams treat observability as a first-class engineering concern and new services are built with telemetry included by default. |
| Process & Governance | Logging, metrics, and tracing standards are documented and applied consistently, with observability reviewed as part of design. |
| Technology & Tools | Structured logging, distributed tracing, and metrics collection are in place across all production services (a minimal sketch follows this table). |
| Measurement & Metrics | Teams monitor against agreed SLIs, track error budgets, and review observability coverage in service reviews. |
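What "metrics collection in place" can look like at this level: a minimal sketch assuming the widely used prometheus_client Python library; the metric names, label, and port are illustrative, not prescribed by this standard.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; real services would follow the naming
# convention agreed in the team's observability standard.
REQUESTS = Counter(
    "checkout_requests_total",
    "Total checkout requests handled",
    ["status"],
)
LATENCY = Histogram(
    "checkout_request_duration_seconds",
    "Checkout request latency in seconds",
)

def handle_request() -> None:
    with LATENCY.time():                        # records request duration
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real work
    REQUESTS.labels(status="ok").inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a scraper to collect
    while True:
        handle_request()
```

Exposing metrics over a scrape endpoint like this keeps instrumentation inside the service, while collection, dashboards, and alerting live in shared platform tooling.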
Level 4 – Quantitatively Managed
| Category | Description |
| --- | --- |
| People & Culture | Engineers use observability data proactively to identify degradation trends before they become user-visible incidents. |
| Process & Governance | SLOs are formally defined, reviewed regularly, and observability gaps are treated as engineering work to be prioritised. |
| Technology & Tools | Correlated dashboards link logs, metrics, and traces to enable rapid root cause analysis across distributed service boundaries. |
| Measurement & Metrics | MTTR, error budget burn rates, and P99 latency trends are tracked and used to inform delivery and architecture decisions. |
Level 5 – Optimising
| Category | Description |
| --- | --- |
| People & Culture | Observability insights continuously inform system design improvements and are shared across teams as organisational learning. |
| Process & Governance | Observability standards evolve in response to system complexity, new delivery patterns, and emerging tooling capabilities. |
| Technology & Tools | Intelligent alerting, anomaly detection, and predictive analytics surface issues before users or on-call engineers notice them (illustrated after this table). |
| Measurement & Metrics | Observability investment is connected to business outcomes, with data demonstrating reduced incident cost and improved reliability. |
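This standard does not prescribe an anomaly detection technique. One minimal illustration is a rolling z-score over a latency series, sketched below; the window size and threshold are arbitrary assumptions, and production systems would typically rely on the platform's built-in detectors.

```python
from collections import deque
from statistics import mean, stdev

def make_detector(window: int = 60, threshold: float = 3.0):
    """Return a function that flags samples far from the recent baseline.

    window    -- recent samples forming the baseline (assumed value)
    threshold -- z-score beyond which a sample is anomalous (assumed value)
    """
    history = deque(maxlen=window)

    def observe(value: float) -> bool:
        anomalous = False
        if len(history) >= 2:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalous = True
        history.append(value)
        return anomalous

    return observe

# Usage: feed each new latency sample; flag deviations before users notice.
detect = make_detector()
for latency_ms in [12, 14, 11, 13, 12, 15, 13, 140]:
    if detect(latency_ms):
        print(f"anomaly: {latency_ms} ms deviates from recent baseline")
```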
Key Measures
- Mean time to detect (MTTD) and mean time to recover (MTTR) across all production services tracked and trending over time
- Percentage of production services meeting defined coverage thresholds for structured logs, metrics, and distributed traces
- Error budget consumption rate against defined SLOs for all critical user-facing and internal services (a worked sketch follows this list)
- Number of production incidents resolved using observability tooling alone, without requiring local environment reproduction
- Alert signal-to-noise ratio, measured as the proportion of alerts that prompt genuine, actionable investigation versus those dismissed as false positives
- Percentage of new services that reach production with observability instrumentation in place from day one
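To make error budget consumption concrete, the arithmetic below assumes a hypothetical 99.9% availability SLO over a 30-day window; the request counts and elapsed days are invented for illustration.

```python
# Hypothetical SLO: 99.9% of requests succeed over a 30-day window.
slo_target = 0.999
window_days = 30

# Invented counts for illustration, e.g. pulled from a metrics store.
total_requests = 10_000_000
failed_requests = 4_200

# The error budget is the failure allowance the SLO permits.
budget = (1 - slo_target) * total_requests   # 10,000 failures allowed
consumed = failed_requests / budget          # fraction of budget spent

# Burn rate > 1.0 means the budget will be exhausted before the
# window ends if the current failure rate continues.
days_elapsed = 10
burn_rate = consumed / (days_elapsed / window_days)

print(f"budget consumed: {consumed:.1%}")   # 42.0%
print(f"burn rate: {burn_rate:.2f}")        # 1.26 -> burning too fast
```

A burn rate above 1.0 signals that, at the current failure rate, the budget will run out before the SLO window ends, which is the trigger for prioritising reliability work over new features.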