Standard: Service Availability (Uptime)
Description
Service Availability (Uptime) measures the percentage of time a system or service is operational and accessible to users as expected. It is a core indicator of system reliability and customer experience, commonly tracked using service-level indicators (SLIs) and expressed as a percentage (e.g. “four nines” = 99.99%).
Availability targets are usually set in service-level objectives (SLOs) and bear directly on user trust, regulatory compliance, and business continuity.
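To make the "nines" shorthand concrete, the minimal sketch below converts an availability target into the downtime it permits over a reporting window. The function name `allowed_downtime` and the 30-day default window are illustrative assumptions, not part of this standard.

```python
from datetime import timedelta


def allowed_downtime(target_pct: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Return the maximum downtime an availability target permits over a window.

    Hypothetical helper for illustration; the 30-day window is an assumption.
    """
    if not 0 <= target_pct <= 100:
        raise ValueError("target_pct must be between 0 and 100")
    # The unavailable fraction (1 - target) of the window is the downtime allowance.
    return window * (1 - target_pct / 100)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% over 30 days allows {allowed_downtime(target)} of downtime")
```

Under these assumptions, a 99.99% target over 30 days leaves roughly 4 minutes 19 seconds of downtime, while 99.9% leaves about 43 minutes.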
How to Use
What to Measure
- The ratio of time a service is available vs. the total time it is expected to be available.
- The measurement can be scoped by system, region, product area, or business-critical functionality.
Availability (%) = [(Total Time - Downtime) / Total Time] × 100
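As one possible way to apply the formula, the sketch below computes availability from a list of recorded outage intervals. It assumes outages are available as (start, end) pairs, clips them to the reporting window, and merges overlapping intervals so the same minute of downtime is not counted twice. The function name `availability_pct` and the sample timestamps are hypothetical.

```python
from datetime import datetime


def availability_pct(window_start, window_end, outages):
    """Availability (%) = (total time - downtime) / total time x 100.

    `outages` is a list of (start, end) datetime pairs (an assumed input shape).
    """
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window_end must be after window_start")

    # Keep only outages that touch the window, clipped to its boundaries.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in outages
        if end > window_start and start < window_end
    )

    # Merge overlapping intervals and sum the merged durations.
    downtime = 0.0
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                downtime += (cur_end - cur_start).total_seconds()
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        downtime += (cur_end - cur_start).total_seconds()

    return (total - downtime) / total * 100


# Illustrative data: two overlapping outages (45 minutes merged) plus a 1-hour outage.
outages = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    (datetime(2024, 1, 5, 10, 15), datetime(2024, 1, 5, 10, 45)),
    (datetime(2024, 1, 20, 2, 0), datetime(2024, 1, 20, 3, 0)),
]
result = availability_pct(datetime(2024, 1, 1), datetime(2024, 1, 31), outages)
print(f"Availability: {result:.3f}%")
```

Merging intervals matters when several monitors report the same incident; summing them naively would overstate downtime.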
Instrumentation Tips
- Use uptime monitoring tools (e.g. Pingdom, Datadog, Prometheus) and SLIs to track availability in real time.
- Define meaningful failure states (e.g. slow response, partial outage) to reflect user-perceived availability.
- Align measurements with your error budget and SLA expectations.
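The tips above can be sketched as a request-based SLI, where a request counts as "good" only if it neither failed server-side nor responded too slowly. This is a minimal illustration under stated assumptions: the `Request` shape, the 1000 ms latency threshold, and the choice to treat 4xx responses as good are all hypothetical and should be replaced with definitions that match your service.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # observed response latency


def is_good(req, latency_threshold_ms=1000.0):
    """Assumed failure states: 5xx responses and responses slower than the threshold."""
    return req.status < 500 and req.latency_ms <= latency_threshold_ms


def sli_and_budget(requests, slo_pct, latency_threshold_ms=1000.0):
    """Return (availability SLI in %, fraction of the error budget consumed)."""
    if not requests:
        raise ValueError("no requests in the window")
    if not 0 < slo_pct < 100:
        raise ValueError("slo_pct must be strictly between 0 and 100")
    good = sum(is_good(r, latency_threshold_ms) for r in requests)
    bad = len(requests) - good
    sli = good / len(requests) * 100
    # The error budget is the share of requests the SLO allows to be bad.
    allowed_bad = len(requests) * (1 - slo_pct / 100)
    return sli, bad / allowed_bad


sample = (
    [Request(200, 120.0)] * 997   # fast successes
    + [Request(503, 80.0)] * 2    # server errors
    + [Request(200, 4500.0)]      # succeeded, but too slow to count as available
)
sli, consumed = sli_and_budget(sample, slo_pct=99.5)
print(f"SLI: {sli:.2f}%  error budget consumed: {consumed:.0%}")
```

With this sample, three of 1,000 requests are bad against a 99.5% SLO that tolerates five, so about 60% of the error budget is consumed. A ping-style check would have counted the slow request as available.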
Why It Matters
- Customer trust: Downtime directly impacts user experience and brand reputation.
- Business continuity: High availability is essential for revenue-critical services.
- Operational focus: Guides investment in resilience, failover, and redundancy.
- Engineering alignment: Links technical health to user and business outcomes.
Best Practices
- Define and publish SLOs that reflect real user needs.
- Use synthetic checks and real-user monitoring to measure perceived uptime.
- Set clear alerting thresholds and track alert precision.
- Conduct incident reviews for any downtime that exceeds the error budget.
- Continuously validate availability monitoring with failover testing.
Common Pitfalls
- Measuring only endpoint ping checks, which ignores user experience and partial outages.
- Inflating availability stats by excluding maintenance or hidden downtime.
- Tracking uptime manually or inconsistently across systems.
- Lacking ownership or visibility into third-party service availability.
Signals of Success
- Availability meets or exceeds SLOs over time.
- Downtime causes are well-understood and resolved quickly.
- Availability trends are part of engineering dashboards and OKRs.
- Users experience stable, consistent access to your services.
Related Metrics
- [[Error Budget Consumption]]
- [[Incident Frequency]]
- [[Mean Time to Detect (MTTD)]]
- [[Mean Time to Recovery (MTTR)]]
- [[Service Level Objective (SLO) Compliance]]