Ragan McGill

Practice: Application Performance Monitoring (APM)

Purpose and Strategic Importance

Application Performance Monitoring (APM) provides continuous insight into the performance, availability, and health of applications. It enables teams to detect anomalies, diagnose bottlenecks, and maintain service quality, even as systems scale and evolve.

Effective APM empowers engineering teams to act with confidence, reduce mean time to resolution (MTTR), and optimise customer experience by making data-driven decisions grounded in live telemetry.

Description of the Practice

APM tools collect and analyse real-time data about application behaviour, system resources, and user interactions.
Key metrics include response time (latency), error rate, throughput, memory usage, and database calls.
Tools like New Relic, Datadog, AppDynamics, Dynatrace, and OpenTelemetry-based platforms offer rich dashboards, alerts, and insights.
APM often includes tracing, log correlation, synthetic monitoring, and service maps for comprehensive observability.
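
To make the kind of data an APM agent gathers more concrete, the following is a minimal sketch using only the Python standard library. It is not how any particular vendor's agent works; `MetricsRecorder`, `measure`, and `summary` are hypothetical names chosen for illustration. The sketch records the duration and error count of each named operation and derives an error rate and average response time from them.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class MetricsRecorder:
    """Hypothetical in-process recorder standing in for an APM agent."""

    def __init__(self):
        self.durations_ms = defaultdict(list)  # operation name -> list of durations
        self.errors = defaultdict(int)         # operation name -> error count

    @contextmanager
    def measure(self, operation):
        """Time the wrapped block and count it as an error if it raises."""
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.errors[operation] += 1
            raise  # never swallow the application's own exception
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.durations_ms[operation].append(elapsed_ms)

    def summary(self, operation):
        """Return call count, error count, error rate, and average duration."""
        samples = self.durations_ms.get(operation, [])
        errors = self.errors.get(operation, 0)
        count = len(samples)
        return {
            "count": count,
            "errors": errors,
            "error_rate": errors / count if count else 0.0,
            "avg_ms": sum(samples) / count if count else 0.0,
        }
```

A call site would wrap the work it wants measured, for example `with recorder.measure("db.query"): ...`. A real agent would also export these samples to a backend and attach trace context, which this sketch leaves out.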

How to Practise It (Playbook)

1. Getting Started

Integrate your application with an APM agent or SDK to start collecting baseline metrics.
Monitor core services and customer-facing endpoints for availability and performance.
Set up visual dashboards and basic alerts for key indicators (e.g. 95th percentile latency, error rate thresholds).
Educate the team on how to interpret metrics and use the APM interface.
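
The basic alerts mentioned above (95th percentile latency and an error rate threshold) can be sketched as follows. This is an illustration under stated assumptions rather than the behaviour of any specific APM product: `RequestSample`, `evaluate`, and the default limits of 500 ms and 1% are hypothetical, the percentile uses the nearest-rank method, and only HTTP 5xx responses are counted as errors.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    duration_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def error_rate(requests):
    """Fraction of requests that returned a 5xx status (assumed definition of 'error')."""
    if not requests:
        return 0.0
    return sum(1 for r in requests if r.status >= 500) / len(requests)


def evaluate(requests, p95_limit_ms=500.0, error_rate_limit=0.01):
    """Return a list of human-readable alert messages for one evaluation window."""
    if not requests:
        return []
    alerts = []
    p95 = percentile([r.duration_ms for r in requests], 95)
    if p95 > p95_limit_ms:
        alerts.append(f"p95 latency {p95:.0f} ms exceeds {p95_limit_ms:.0f} ms")
    rate = error_rate(requests)
    if rate > error_rate_limit:
        alerts.append(f"error rate {rate:.2%} exceeds {error_rate_limit:.2%}")
    return alerts
```

Percentiles are generally preferred over averages for latency alerts because a small number of very slow requests can hide behind a healthy-looking mean.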

2. Scaling and Maturing

Define and track Service-Level Indicators (SLIs) and Service-Level Objectives (SLOs) across services.
Correlate APM data with logs and traces to accelerate root cause analysis.
Instrument business-critical workflows (e.g. checkout, login, search) end to end, so that performance is measured where users experience it.
Use anomaly detection or ML-based alerting to surface emerging issues early.
Continuously evolve monitoring coverage based on system changes and incident learnings.
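
As a worked illustration of tracking an SLI against an SLO, the sketch below computes an availability SLI and the share of the error budget still unspent. The function names and the 99.9% target used in the usage note are assumptions for the example, not values prescribed by this practice.

```python
def availability_sli(good_events, total_events):
    """SLI as the fraction of events that were 'good' (e.g. successful requests)."""
    if total_events == 0:
        return 1.0  # no traffic means nothing has failed
    return good_events / total_events


def error_budget_remaining(good_events, total_events, slo_target):
    """Fraction of the error budget left for the window.

    1.0 means the budget is untouched, 0.0 means it is fully spent,
    and a negative value means the SLO has been breached.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    allowed_failures = (1 - slo_target) * total_events
    if allowed_failures == 0:
        return 1.0  # no events in the window, so no budget has been consumed
    actual_failures = total_events - good_events
    return 1 - actual_failures / allowed_failures
```

For example, with a 99.9% target and 100,000 requests in the window, roughly 100 failures are allowed. If 50 requests failed, `error_budget_remaining(99_950, 100_000, 0.999)` returns approximately 0.5, meaning about half of the budget remains.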

3. Team Behaviours to Encourage

Review APM dashboards during stand-ups, retros, or post-incident reviews.
Use performance data to guide technical debt reduction and system tuning.
Promote shared ownership of performance across all teams, not just ops or platform teams.
View metrics as indicators for learning, not just triggers for escalation.

4. Watch Out For…

Alert fatigue from noisy or poorly scoped thresholds.
Over-monitoring without prioritisation; focus on the signals that matter most.
Blind spots: missing telemetry from critical paths or third-party dependencies.
Teams not engaging with APM tools due to complexity or lack of context.
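
One common way to reduce alert fatigue from noisy thresholds is to fire only when a breach is sustained across several consecutive samples, rather than on every spike. The sketch below shows that idea in isolation; the class name `SustainedThresholdAlert` and the default of three consecutive breaches are assumptions for illustration, and a suitable value depends on how often the metric is sampled.

```python
class SustainedThresholdAlert:
    """Fire once when a metric stays above a threshold for N consecutive samples."""

    def __init__(self, threshold, required_breaches=3):
        if required_breaches < 1:
            raise ValueError("required_breaches must be at least 1")
        self.threshold = threshold
        self.required_breaches = required_breaches
        self._streak = 0  # consecutive samples seen above the threshold

    def observe(self, value):
        """Record one sample; return True only at the moment the alert should fire."""
        if value > self.threshold:
            self._streak += 1
        else:
            self._streak = 0  # any healthy sample resets the streak
        # Fire exactly once per sustained breach, not on every later sample.
        return self._streak == self.required_breaches
```

With a threshold of 500 ms, a single 600 ms sample followed by a healthy one produces no alert, while three 600 ms samples in a row fire it once.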

5. Signals of Success

Teams respond to issues faster with clear insights into root causes.
Application performance trends are visible and acted on proactively.
Customer experience improves through reduced latency and downtime.
Metrics drive continuous improvement and system resilience.
APM becomes a valued tool embedded in daily engineering practice.