Practice: Health Checks & Readiness Probes

Purpose and Strategic Importance

Health Checks and Readiness Probes enable systems to automatically detect, isolate, and recover from degraded states. They provide essential signals for orchestration platforms, load balancers, and monitoring tools to manage traffic, trigger failovers, and prevent cascading failures.

Used well, they contribute to higher availability, faster recovery, and safer deployments by ensuring systems are only exposed to traffic when fully operational and ready.


Description of the Practice

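  • Liveness Checks verify that an application is still running and responsive - the orchestrator restarts instances that fail them.
  • Readiness Probes determine whether an application is ready to serve traffic - instances that fail are kept out of routing and load balancing until they pass.
  • Startup Probes give slow-starting applications time to initialise before liveness checking begins, which helps distinguish a slow start from a stuck application.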
  • These are often HTTP endpoints or command-based checks integrated into container orchestration (e.g. Kubernetes).
  • Healthy endpoints typically return a 200 status. For HTTP probes, Kubernetes treats any status from 200 to 399 as success; any other status, or a timeout, counts as a failure.
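
The split between liveness and readiness can be sketched with a small HTTP server. This is a minimal illustration using only the Python standard library, not a production implementation. The paths /health and /ready, the `ready` flag, and the `make_server` helper are assumptions made for the example.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical application state: set once startup work (config load,
# connection pools, cache warm-up) has finished.
ready = threading.Event()


class ProbeHandler(BaseHTTPRequestHandler):
    """Serves /health (liveness) and /ready (readiness)."""

    def do_GET(self):
        if self.path == "/health":
            # Liveness: the process is up and able to answer HTTP.
            self._reply(200, {"status": "alive"})
        elif self.path == "/ready":
            # Readiness: report 200 only once startup has completed.
            if ready.is_set():
                self._reply(200, {"status": "ready"})
            else:
                self._reply(503, {"status": "not ready"})
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, code, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging; probes fire every few seconds.
        pass


def make_server(host="127.0.0.1", port=8080):
    """Build (but do not start) the probe server; port 0 picks a free port.

    Inside a container the host would usually need to be "0.0.0.0" so the
    orchestrator can reach the endpoints.
    """
    return ThreadingHTTPServer((host, port), ProbeHandler)
```

To try it, call `make_server().serve_forever()` and request /ready before and after calling `ready.set()`: the response changes from 503 to 200 while /health returns 200 throughout.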

How to Practise It (Playbook)

1. Getting Started

  • Implement lightweight health endpoints (e.g. /health, /ready) in your application.
  • Define what constitutes “healthy” - such as DB connectivity, service dependencies, cache state.
  • Configure liveness and readiness probes in your deployment spec (e.g. livenessProbe, readinessProbe in Kubernetes).
  • Test behaviour locally and in staging environments to tune thresholds and intervals.
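
One way to make the definition of "healthy" explicit is to express each dependency as a named check and derive the readiness status from the combined results. The sketch below assumes hypothetical checks (`database_reachable`, `cache_warmed`) that stand in for real ones; the function name `evaluate_readiness` is also illustrative.

```python
from typing import Callable, Dict, Tuple

# A check is a zero-argument callable that returns True when the
# dependency is usable. Raising an exception counts as a failure.
Check = Callable[[], bool]


def evaluate_readiness(checks: Dict[str, Check]) -> Tuple[int, Dict[str, str]]:
    """Run every check and return an HTTP status plus a per-check report."""
    report: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            report[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing check must not crash the probe
            report[name] = f"error: {type(exc).__name__}"
    all_ok = all(result == "ok" for result in report.values())
    return (200 if all_ok else 503), report


# Hypothetical stand-ins for real dependency checks.
def database_reachable() -> bool:
    return True   # e.g. run "SELECT 1" with a short timeout


def cache_warmed() -> bool:
    return False  # e.g. confirm the cache has been populated
```

Returning the per-check report alongside the status code makes a failing probe easier to triage, since the response body names the dependency that failed.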

2. Scaling and Maturing

  • Include synthetic checks (e.g. mock user requests) in readiness validation.
  • Adjust probe thresholds to avoid premature restarts or false alarms.
  • Track probe failures in observability dashboards to identify patterns or performance regressions.
  • Coordinate probes with feature toggles, canary deployments, or startup routines.
  • Include readiness criteria in operational runbooks and incident triage.
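
Threshold tuning is easier to reason about with a small model of how consecutive results change a probe's state. The class below is a simplified sketch of the failure-threshold and success-threshold idea, not the orchestrator's actual implementation; `ProbeState` and its field names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProbeState:
    """Simplified model of consecutive-result thresholds for one probe."""
    failure_threshold: int = 3   # consecutive failures before "unhealthy"
    success_threshold: int = 1   # consecutive successes before "healthy"
    healthy: bool = True
    _failures: int = 0
    _successes: int = 0

    def record(self, success: bool) -> bool:
        """Record one probe result and return the current health state."""
        if success:
            self._successes += 1
            self._failures = 0
            if self._successes >= self.success_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self._failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy
```

With a failure threshold of 3, a single slow response does not flip the state; only three failures in a row do. The trade-off is detection time, which grows roughly with the probe interval multiplied by the failure threshold.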

3. Team Behaviours to Encourage

  • Treat probes as contracts - they define operational expectations clearly.
  • Validate readiness before exposing services to traffic, not just before deployment.
  • Include probe failure handling in incident reviews.
  • Keep endpoints lean, secure, and free of expensive logic.
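
Validating readiness before exposing a service can be as simple as polling a probe until it passes or a deadline expires. This is a hedged sketch: `wait_until_ready` and its parameters are hypothetical, and `probe` stands for whatever readiness check the team already has (for example, an HTTP call to a readiness endpoint).

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool],
                     timeout_seconds: float = 60.0,
                     interval_seconds: float = 2.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `probe` until it returns True or the timeout elapses."""
    deadline = clock() + timeout_seconds
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # a probe error is treated the same as "not ready yet"
        if clock() >= deadline:
            return False
        sleep(interval_seconds)
```

A rollout script could call this after starting a new instance and only switch traffic when it returns True. The `clock` and `sleep` parameters exist so the function can be tested without real waiting.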

4. Watch Out For…

  • Probes that return “healthy” even when downstream services are failing.
  • Health checks that depend on expensive or unstable operations.
  • Restart loops caused by overly aggressive thresholds.
  • Lack of observability into probe status across environments.
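
One possible mitigation for expensive checks is to cache the last result for a short time, so frequent probes do not repeat the costly operation. The wrapper below is an illustrative sketch; `CachedCheck` and the 5-second default are assumptions, and the class is not thread-safe as written.

```python
import time
from typing import Callable, Optional


class CachedCheck:
    """Wrap a check so it runs at most once per `ttl_seconds`."""

    def __init__(self, check: Callable[[], bool], ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_result: Optional[bool] = None
        self._last_run: float = 0.0

    def __call__(self) -> bool:
        now = self._clock()
        stale = self._last_result is None or now - self._last_run >= self._ttl
        if stale:
            try:
                self._last_result = bool(self._check())
            except Exception:
                self._last_result = False  # treat a crashing check as failed
            self._last_run = now
        return self._last_result
```

The cost of caching is staleness: a dependency that fails just after a check can be reported as healthy for up to `ttl_seconds`, so the TTL should stay small relative to the probe interval and failure threshold.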

5. Signals of Success

  • Services self-heal via orchestration based on health signals.
  • Incidents caused by unready services are reduced.
  • Deployments pause or roll back based on accurate readiness state.
  • Teams confidently rely on probe signals during rollouts and upgrades.
  • Health checks are treated as a critical part of the service contract.

Associated Standards

  • Systems recover quickly and fail safely
  • Operational readiness is tested before every major release
  • Operational tasks are automated before they become recurring toil
  • Policy enforcement is automated across environments
  • Developer workflows are fast and frictionless
  • Failure modes are proactively tested
