Practice: Custom Metrics Instrumentation
Purpose and Strategic Importance
Custom Metrics Instrumentation enables teams to capture business-specific and system-specific telemetry that off-the-shelf metrics don't provide. These metrics help surface meaningful performance insights, uncover user behaviour patterns, and support fine-grained monitoring for critical workflows.
By instrumenting what truly matters, teams build smarter alerts, detect issues faster, and make better decisions based on precise, relevant data.
Description of the Practice
- Custom metrics are numeric time-series measurements that developers explicitly instrument in code to monitor important aspects of system behaviour or business value.
- They include counters, gauges, histograms, and timers related to domain-specific events (e.g. “orders placed,” “checkout errors,” “email sends per region”).
- They are collected via tools like Prometheus, OpenTelemetry, StatsD, or hosted monitoring platforms (e.g. AWS CloudWatch, Azure Monitor, Datadog).
- Metrics are exported, visualised, and queried for insights, trend detection, and anomaly alerting.
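The metric types listed above can be illustrated with a minimal, standard-library-only sketch. The class names and bucket bounds are assumptions made for illustration, not the API of any particular client library; in practice a Prometheus, OpenTelemetry, or StatsD client would provide equivalents. A timer is typically modelled as a histogram of durations, so it is not shown separately.

```python
from bisect import bisect_left


class Counter:
    """Monotonically increasing count, e.g. "orders placed"."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Point-in-time value that can go up or down, e.g. queue depth."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Distribution of observations, bucketed by inclusive upper bound."""

    def __init__(self, bounds=(0.1, 0.5, 1.0, 5.0)) -> None:
        # Hypothetical bucket bounds, in seconds.
        self.bounds = tuple(sorted(bounds))
        # One slot per bound plus a final overflow slot.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0

    def observe(self, value: float) -> None:
        # bisect_left returns the index of the first bound >= value.
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value


orders_placed = Counter()
orders_placed.inc()

checkout_seconds = Histogram()
for duration in (0.05, 0.3, 0.5, 7.2):
    checkout_seconds.observe(duration)
```

The counter answers "how many", the gauge answers "how much right now", and the histogram answers "how is it distributed", which is why latency-style measurements usually belong in a histogram rather than a gauge.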
How to Practise It (Playbook)
1. Getting Started
- Identify critical workflows or KPIs that are not visible in default metrics.
- Instrument application code to emit custom metrics at key events and state changes.
- Use structured metric naming and tags (e.g. service, environment, region) for aggregation and filtering.
- Export metrics to a central observability platform with a clear retention policy.
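As a rough sketch of the naming and tagging steps above, the snippet below validates metric names against a simple convention and attaches default tags to every emitted value. `MetricRegistry`, the name pattern, and the tag values are all hypothetical; the suffix rule is loosely modelled on Prometheus-style naming and is an assumption, not a requirement of this practice. Exporting to an observability platform is deliberately left out.

```python
import re
from collections import defaultdict

# Assumed convention: snake_case names ending in a unit or "_total".
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*_(total|seconds|bytes)$")


class MetricRegistry:
    """In-memory counter store keyed by metric name and tag set."""

    def __init__(self, **default_tags: str) -> None:
        self.default_tags = default_tags
        self.counters = defaultdict(float)

    def increment(self, name: str, amount: float = 1.0, **tags: str) -> None:
        if not NAME_PATTERN.match(name):
            raise ValueError(f"metric name violates convention: {name!r}")
        # Sort the tags so the same tag set always maps to the same series.
        merged = {**self.default_tags, **tags}
        self.counters[(name, tuple(sorted(merged.items())))] += amount

    def total(self, name: str, **match: str) -> float:
        """Sum every series of `name` whose tags include `match`."""
        return sum(
            value
            for (series_name, tags), value in self.counters.items()
            if series_name == name and match.items() <= dict(tags).items()
        )


metrics = MetricRegistry(service="checkout", environment="prod")


def place_order(region: str) -> None:
    # ... business logic would run here ...
    metrics.increment("orders_placed_total", region=region)


for region in ("eu-west", "eu-west", "us-east"):
    place_order(region)

eu_orders = metrics.total("orders_placed_total", region="eu-west")
all_orders = metrics.total("orders_placed_total")
```

Because every series carries the same default tags, the same metric can be filtered by region or aggregated across the whole service without changing the instrumentation.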
2. Scaling and Maturing
- Pair custom metrics with alerting rules, dashboards, and annotations (e.g. deployments, incidents).
- Build SLIs from custom metrics (e.g. “successful payments per minute”) to support SLOs.
- Create business-level observability - not just infrastructure metrics - to link technical health to outcomes.
- Document all metrics: what they mean, where they come from, and how to interpret them.
- Continuously prune unused metrics to manage cost and reduce noise.
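To make the SLI and SLO step concrete, here is a small sketch that derives a success-ratio SLI from two counters and reports how much of the error budget remains. The names `SliWindow` and `error_budget_remaining`, the example counts, and the 99.9% target are illustrative assumptions, not values from this practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliWindow:
    """Counter deltas over one evaluation window, e.g. the last 30 days."""

    good_events: int   # e.g. successful payments
    total_events: int  # e.g. attempted payments

    @property
    def sli(self) -> float:
        # Assumption: a window with no traffic counts as meeting the objective.
        if self.total_events == 0:
            return 1.0
        return self.good_events / self.total_events


def error_budget_remaining(window: SliWindow, slo_target: float) -> float:
    """Fraction of the error budget left; negative means the SLO is breached."""
    allowed_failures = (1 - slo_target) * window.total_events
    actual_failures = window.total_events - window.good_events
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else 0.0
    return 1 - actual_failures / allowed_failures


payments = SliWindow(good_events=9_995, total_events=10_000)
payments_sli = payments.sli
budget_left = error_budget_remaining(payments, slo_target=0.999)
```

With these example numbers, 5 failures against roughly 10 allowed leaves about half the budget, which is the kind of figure an alerting rule or dashboard panel can track over time.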
3. Team Behaviours to Encourage
- Think beyond infrastructure - measure what customers care about.
- Collaborate with product, ops, and business teams on what to measure.
- Use metrics in sprint reviews, post-incident analysis, and decision-making forums.
- Keep metrics consistent, portable, and easy to understand.
4. Watch Out For…
- Metric explosion - too many dimensions, unbounded tag values (e.g. user IDs), or duplicates driving up cardinality and cost.
- Lack of standardisation - inconsistent naming, units, or labelling.
- Instrumenting only technical components, ignoring user or business metrics.
- Emitting metrics but never using them in operational or strategic discussions.
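One way to spot metric explosion early is to count the distinct tag combinations (time series) each metric produces and compare that against a budget. The sketch below assumes samples are available as `(metric_name, tags)` pairs; the function names, sample data, and the budget of 100 series are hypothetical.

```python
from collections import defaultdict


def series_per_metric(samples):
    """Count distinct tag combinations seen for each metric name."""
    seen = defaultdict(set)
    for name, tags in samples:
        # frozenset makes the tag set hashable and order-independent.
        seen[name].add(frozenset(tags.items()))
    return {name: len(combos) for name, combos in seen.items()}


def over_budget(samples, max_series):
    """Return the metric names whose series count exceeds the budget."""
    counts = series_per_metric(samples)
    return sorted(name for name, count in counts.items() if count > max_series)


# Bounded tag: a handful of regions, however many events are emitted.
samples = [("email_sends_total", {"region": r}) for r in ("eu", "us", "ap")] * 10
# Unbounded tag: one new series per user.
samples += [("checkout_errors_total", {"user_id": str(i)}) for i in range(1000)]

cardinality = series_per_metric(samples)
offenders = over_budget(samples, max_series=100)
```

Here the region-tagged metric stays at three series regardless of volume, while the metric tagged by user ID grows with the user base and is flagged as over budget.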
5. Signals of Success
- Teams have clear, reliable metrics that reflect user and system behaviour.
- Alerting based on metrics leads to timely, actionable responses.
- Product and engineering decisions are informed by real-time usage patterns.
- Incidents are resolved faster with clearer visibility into root causes.
- Custom metrics become part of delivery best practices, not an afterthought.