Ragan McGill

Practice: Health Checks & Readiness Probes

Purpose and Strategic Importance

Health Checks and Readiness Probes enable systems to automatically detect, isolate, and recover from degraded states. They provide essential signals for orchestration platforms, load balancers, and monitoring tools to manage traffic, trigger failovers, and prevent cascading failures.

Used well, they contribute to higher availability, faster recovery, and safer deployments by ensuring systems are only exposed to traffic when fully operational and ready.

Description of the Practice

Liveness Checks verify that an application is still running; the orchestrator uses them to restart failed services.
Readiness Probes determine whether an application is ready to serve traffic; they are used to hold back routing or load balancing until it is.
Startup Probes help distinguish slow-starting applications from stuck ones.
These checks are usually HTTP endpoints or command-based checks integrated into a container orchestrator (e.g. Kubernetes).
A healthy endpoint returns a 200 status (Kubernetes accepts any code from 200 to 399 as success); a failure is signalled by a 4xx/5xx status or a timeout.
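
To make the success and failure rules concrete, here is a minimal sketch of how a probing client might interpret an HTTP health endpoint. The function names (status_is_healthy, probe) and the default timeout are illustrative assumptions, not part of any orchestrator's API; the 200 to 399 success range mirrors the rule Kubernetes applies to HTTP probes.

```python
import urllib.error
import urllib.request


def status_is_healthy(status: int) -> bool:
    """Treat any status code from 200 to 399 as a passing probe."""
    return 200 <= status < 400


def probe(url: str, timeout_s: float = 1.0) -> bool:
    """Return True if the endpoint answers with a success status in time.

    Hypothetical helper for illustration: a 4xx/5xx answer, a refused
    connection, or a timeout all count as a failed probe.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return status_is_healthy(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is still available.
        return status_is_healthy(err.code)
    except OSError:
        # Covers refused connections, DNS failures, and socket timeouts.
        return False
```

A call such as probe("http://localhost:8080/ready") would then yield a single pass/fail signal, which is the shape of input an orchestrator or load balancer acts on.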

How to Practise It (Playbook)

1. Getting Started

Implement lightweight health endpoints (e.g. /health, /ready) in your application.
Define what constitutes “healthy”, such as DB connectivity, the state of service dependencies, and cache state.
Configure liveness and readiness probes in your deployment spec (e.g. livenessProbe, readinessProbe in Kubernetes).
Test behaviour locally and in staging environments to tune thresholds and intervals.
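
The steps above can be sketched with Python's standard library alone. This is an illustrative outline, not a production implementation: check_database and check_cache are hypothetical placeholders for real dependency checks, and the port and JSON body shape are assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def check_database() -> bool:
    """Hypothetical placeholder; replace with a cheap real check."""
    return True


def check_cache() -> bool:
    """Hypothetical placeholder; replace with a cheap real check."""
    return True


READINESS_CHECKS: Dict[str, Callable[[], bool]] = {
    "database": check_database,
    "cache": check_cache,
}


def _safe(check: Callable[[], bool]) -> bool:
    """Run one check; an exception counts as a failed check."""
    try:
        return bool(check())
    except Exception:
        return False


def evaluate(path: str, checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Map a request path to an HTTP status and a JSON-serialisable body."""
    if path == "/health":
        # Liveness: the process is up and able to answer at all.
        return 200, {"status": "alive"}
    if path == "/ready":
        # Readiness: every listed dependency check must pass.
        results = {name: _safe(check) for name, check in checks.items()}
        ok = all(results.values())
        return (200 if ok else 503), {
            "status": "ready" if ok else "not ready",
            "checks": results,
        }
    return 404, {"error": "not found"}


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = evaluate(self.path, READINESS_CHECKS)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args) -> None:
        # Silence per-request access logs; probes fire frequently.
        pass


def serve(port: int = 8080) -> None:
    """Blocking call that serves /health and /ready on the given port."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

Keeping the decision logic in evaluate, separate from the HTTP handler, makes it easy to unit-test the definition of “healthy” locally before pointing livenessProbe and readinessProbe at the two paths.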

2. Scaling and Maturing

Include synthetic checks (e.g. mock user requests) in readiness validation.
Adjust probe thresholds to avoid premature restarts or false alarms.
Track probe failures in observability dashboards to identify patterns or performance regressions.
Coordinate probes with feature toggles, canary deployments, or startup routines.
Include readiness criteria in operational runbooks and incident triage.
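
One way to reason about threshold tuning is to model how consecutive probe results flip a service between healthy and unhealthy. The sketch below is a simplified model, not the orchestrator's actual implementation; the class name ProbeState is an assumption, and its two parameters loosely mirror the failureThreshold and successThreshold settings in Kubernetes.

```python
from dataclasses import dataclass


@dataclass
class ProbeState:
    """Tracks probe results and applies consecutive-result thresholds."""

    failure_threshold: int = 3  # consecutive failures before marking unhealthy
    success_threshold: int = 1  # consecutive successes before marking healthy
    healthy: bool = True
    _fail_streak: int = 0
    _ok_streak: int = 0

    def record(self, ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if ok:
            self._ok_streak += 1
            self._fail_streak = 0
            if not self.healthy and self._ok_streak >= self.success_threshold:
                self.healthy = True
        else:
            self._fail_streak += 1
            self._ok_streak = 0
            if self.healthy and self._fail_streak >= self.failure_threshold:
                self.healthy = False
        return self.healthy
```

With failure_threshold=3, two isolated failures followed by a success leave the service marked healthy, which is the behaviour that avoids premature restarts and false alarms. Feeding recorded probe results from staging through a model like this can help when choosing values.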

3. Team Behaviours to Encourage

Treat probes as contracts: they define operational expectations clearly.
Validate readiness before exposing services to traffic, not just before deployment.
Include probe failure handling in incident reviews.
Keep endpoints lean, secure, and free of expensive logic.
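
One possible way to keep an endpoint lean when a dependency check is not cheap is to cache its result for a short time, so frequent probes do not each trigger the expensive call. The sketch below is an assumption-laden illustration: CachedCheck, ttl_s, and the injectable clock are hypothetical names, and the right TTL depends on how stale a health signal the service can tolerate.

```python
import time
from typing import Callable, Optional


class CachedCheck:
    """Wraps a check so it runs at most once per TTL window."""

    def __init__(
        self,
        check: Callable[[], bool],
        ttl_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check = check
        self._ttl_s = ttl_s
        self._clock = clock
        self._last_result: Optional[bool] = None
        self._expires_at = 0.0

    def __call__(self) -> bool:
        now = self._clock()
        if self._last_result is None or now >= self._expires_at:
            try:
                self._last_result = bool(self._check())
            except Exception:
                # A check that raises is reported as unhealthy.
                self._last_result = False
            self._expires_at = now + self._ttl_s
        return self._last_result
```

The trade-off is that a cached result can lag reality by up to ttl_s seconds, so the TTL should stay well below the probe's failure window.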

4. Watch Out For…

Probes that return “healthy” even when downstream services the application depends on are failing.
Health checks that depend on expensive or unstable operations.
Restart loops caused by overly aggressive thresholds.
Lack of observability into probe status across environments.
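
The restart-loop risk can be estimated with simple arithmetic. The sketch below uses a deliberately simplified model: it assumes probes fire at the initial delay and then once per period, and it ignores probe timeouts and scheduling jitter. The function names are hypothetical; the parameters loosely correspond to the initialDelaySeconds, periodSeconds, and failureThreshold settings in Kubernetes.

```python
def liveness_kill_time(initial_delay_s: float, period_s: float,
                       failure_threshold: int) -> float:
    """Approximate time at which a never-passing liveness probe forces a restart.

    Simplified model: probes run at initial_delay_s + k * period_s for
    k = 0, 1, 2, ... and the restart happens on the failure_threshold-th
    consecutive failure.
    """
    return initial_delay_s + (failure_threshold - 1) * period_s


def risks_restart_loop(startup_s: float, initial_delay_s: float,
                       period_s: float, failure_threshold: int) -> bool:
    """True if the app needs longer to start than the probe budget allows."""
    return startup_s > liveness_kill_time(initial_delay_s, period_s,
                                          failure_threshold)
```

For example, with a 10 second initial delay, a 5 second period, and a threshold of 3, this model gives a budget of about 20 seconds; a service that takes 45 seconds to start would be restarted before it ever becomes healthy, which is the situation a startup probe or a more generous threshold is meant to prevent.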

5. Signals of Success

Services self-heal via orchestration based on health signals.
Incidents caused by unready services are reduced.
Deployments pause or roll back based on accurate readiness state.
Teams confidently rely on probe signals during rollouts and upgrades.
Health checks are treated as a critical part of the service contract.