Ragan McGill

Practice: Custom Metrics Instrumentation

Purpose and Strategic Importance

Custom Metrics Instrumentation enables teams to capture business-specific and system-specific telemetry that off-the-shelf metrics don't provide. These metrics help surface meaningful performance insights, uncover user behaviour patterns, and support fine-grained monitoring for critical workflows.

By instrumenting what truly matters, teams build smarter alerts, detect issues faster, and make better decisions based on precise, relevant data.

Description of the Practice

Custom metrics are numeric time-series data points that developers explicitly emit from code to monitor important aspects of system behaviour or business value.
They include counters, gauges, histograms, and timers related to domain-specific events (e.g. “orders placed,” “checkout errors,” “email sends per region”).
They are collected via tools such as Prometheus, OpenTelemetry, StatsD, or cloud monitoring and APM platforms (e.g. AWS CloudWatch, Azure Monitor, Datadog).
Metrics are exported, visualised, and queried for insights, trend detection, and anomaly alerting.
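
To make the four metric types concrete, the sketch below models them with the Python standard library only. It is an illustration of the concepts, not the API of Prometheus, OpenTelemetry, or StatsD; the class names, the `timer` helper, and the example metrics (`orders_placed`, `queue_depth`, `checkout_seconds`) are assumptions chosen for this example. Note that the histogram here stores per-bucket counts, whereas some real backends store cumulative ones.

```python
import time
from bisect import bisect_left
from contextlib import contextmanager


class Counter:
    """Monotonically increasing total, e.g. orders placed."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


class Gauge:
    """Point-in-time value that can rise or fall, e.g. queue depth."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)


class Histogram:
    """Distribution of observations, bucketed by inclusive upper bound."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One slot per bound plus a final overflow slot.
        self.bucket_counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left returns the index of the first bound >= value.
        self.bucket_counts[bisect_left(self.bounds, value)] += 1
        self.total += value
        self.count += 1


@contextmanager
def timer(histogram):
    """Timer: record how long the wrapped block takes, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        histogram.observe(time.perf_counter() - start)


# Hypothetical domain metrics for illustration.
orders_placed = Counter()
queue_depth = Gauge()
checkout_seconds = Histogram([0.1, 0.5, 1.0])

orders_placed.inc()
orders_placed.inc()
queue_depth.set(7)
checkout_seconds.observe(0.3)  # falls in the <= 0.5 bucket
checkout_seconds.observe(2.0)  # falls in the overflow bucket
```

Wrapping a code path in `with timer(checkout_seconds):` records its duration as one more histogram observation. In practice a client library would provide these types and handle export.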

How to Practise It (Playbook)

1. Getting Started

Identify critical workflows or KPIs that are not visible in default metrics.
Instrument application code to emit custom metrics at key events and state changes.
Use structured metric naming and tags (e.g. service, environment, region) for aggregation and filtering.
Export metrics to a central observability platform with a clear retention policy.
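
The steps above can be sketched as a counter that enforces a fixed set of tags and renders its values as text lines for export. This is a stdlib-only illustration: `LabelledCounter` is a hypothetical class, the output is only loosely modelled on the Prometheus text format (it does not escape label values), and the metric name and tag values are assumptions.

```python
from collections import defaultdict


class LabelledCounter:
    """Counter keyed by a fixed, declared set of label names."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # Reject missing or unexpected labels so every series has the same shape.
        if set(labels) != set(self.label_names):
            raise ValueError(
                f"expected labels {self.label_names}, got {sorted(labels)}"
            )
        key = tuple(labels[n] for n in self.label_names)
        self._values[key] += amount

    def render(self):
        """Return one 'name{label="value",...} number' line per series."""
        lines = []
        for key, value in sorted(self._values.items()):
            pairs = ",".join(
                f'{n}="{v}"' for n, v in zip(self.label_names, key)
            )
            lines.append(f"{self.name}{{{pairs}}} {value:g}")
        return lines


# Emit at the key event: one increment per order placed.
orders = LabelledCounter(
    "orders_placed_total", ["service", "environment", "region"]
)
orders.inc(service="checkout", environment="prod", region="eu-west")
orders.inc(service="checkout", environment="prod", region="eu-west")
orders.inc(service="checkout", environment="prod", region="us-east")

exposition = orders.render()
```

Declaring the label names up front is a deliberate choice: a typo such as `regoin="eu-west"` fails loudly at the call site instead of silently creating a new series that no dashboard aggregates.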

2. Scaling and Maturing

Pair custom metrics with alerting rules, dashboards, and annotations (e.g. deployments, incidents).
Build SLIs from custom metrics (e.g. “successful payments per minute”) to support SLOs.
Create business-level observability - not just infrastructure metrics - to link technical health to outcomes.
Document all metrics: what they mean, where they come from, and how to interpret them.
Continuously prune unused metrics to manage cost and reduce noise.
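
As a rough sketch of how an SLI can be derived from custom counters and compared with an SLO, the functions below compute a success ratio and the share of error budget remaining. The function names, the payment counts, and the 99.5% target are assumptions made for the example, not recommended values.

```python
def success_ratio(success_count, total_count):
    """SLI: fraction of attempts that succeeded; None when there was no traffic."""
    if total_count == 0:
        return None
    return success_count / total_count


def error_budget_remaining(sli, slo_target):
    """Share of the error budget left: 1.0 is untouched, 0.0 or below is spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


# Hypothetical counter values read over one evaluation window.
payments_succeeded = 9_990
payments_attempted = 10_000

sli = success_ratio(payments_succeeded, payments_attempted)
budget = error_budget_remaining(sli, slo_target=0.995)
```

With these numbers the SLI is 0.999 against a target of 0.995, so roughly 80% of the error budget is still available. Returning `None` for a window with no traffic avoids reporting either a perfect or a failed SLI when nothing was measured.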

3. Team Behaviours to Encourage

Think beyond infrastructure - measure what customers care about.
Collaborate with product, ops, and business teams on what to measure.
Use metrics in sprint reviews, post-incident analysis, and decision-making forums.
Keep metrics consistent, portable, and easy to understand.

4. Watch Out For…

Metric explosion - too many dimensions or duplicates driving up cardinality and cost.
Lack of standardisation - inconsistent naming, units, or labelling.
Instrumenting only technical components, ignoring user or business metrics.
Writing metrics but not using them in operational or strategic discussions.
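
Two of these pitfalls lend themselves to simple automated checks. The sketch below estimates the worst-case number of time series a metric can produce from its tag values, and flags names that break a naming rule. The naming pattern is a hypothetical house convention and the tag values are invented for illustration; teams should substitute their own.

```python
import re
from math import prod

# Hypothetical convention: lowercase snake_case ending in a unit or _total.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*_(total|seconds|bytes|ratio)$")


def worst_case_series(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


def naming_violations(names):
    """Return the metric names that do not follow NAME_PATTERN."""
    return [name for name in names if not NAME_PATTERN.match(name)]


bounded = {
    "service": ["checkout", "catalogue", "email"],
    "environment": ["dev", "staging", "prod"],
    "region": ["eu-west", "eu-north", "us-east", "us-west", "ap-south"],
}
series_bounded = worst_case_series(bounded)

# Adding an unbounded tag such as a user ID multiplies the series count.
unbounded = dict(bounded, user_id=[f"user-{i}" for i in range(10_000)])
series_unbounded = worst_case_series(unbounded)

bad_names = naming_violations(
    ["orders_placed_total", "CheckoutErrors", "email_send_seconds"]
)
```

Here the bounded tags give at most 45 series, while adding a 10,000-value `user_id` tag raises the ceiling to 450,000, which is the kind of growth that drives up cost. A check like this could run in code review or CI before a new metric ships.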

5. Signals of Success

Teams have clear, reliable metrics that reflect user and system behaviour.
Alerting based on metrics leads to timely, actionable responses.
Product and engineering decisions are informed by real-time usage patterns.
Incidents are resolved faster with clearer visibility into root causes.
Custom metrics become part of delivery best practices, not an afterthought.