FinOps for Engineering Managers

Cloud costs are engineering decisions. Engineers who do not see the bill make expensive choices.

FinOps is the practice of bringing financial accountability to cloud spending. For engineering managers, it means understanding where your cloud costs come from, which teams and services drive them, and how to create a culture of cost-conscious engineering without slowing delivery.

Purpose

This playbook gives engineering managers a working understanding of FinOps - what it is, how to implement the basics, and how to build a team culture where engineers care about cloud costs without becoming paralysed by them.

Cloud spending is a direct consequence of engineering decisions. The architecture someone chooses, the environment they leave running, the data they store and for how long - all of these produce a bill. If engineers never see the bill, they cannot make good trade-off decisions. If they see the bill but have no context, they make fearful ones. This playbook helps you give engineers both the bill and the context to act on it.


When to Use This Playbook

  • Your cloud costs are growing faster than your team or your product usage
  • You have been asked to reduce cloud spend but do not know where to start
  • Your team has no visibility of what they cost to run
  • You are setting up cost accountability for a new team or service
  • You are preparing for a FinOps conversation with your finance or engineering leadership
  • Your organisation is moving from a centralised cloud account to team-level cost accountability

Before You Start

Gather the following:

  • Access to your cloud provider's cost console (AWS Cost Explorer, Azure Cost Management, Google Cloud Billing reports)
  • Your organisation's cloud tagging strategy (if one exists)
  • A rough picture of your team's services and the environments they run in
  • Your allocated cloud budget for the current financial year
  • Contact for your cloud platform or FinOps team (if one exists in your organisation)

Ask these questions first:

  • Do we have team-level cost visibility today, or is everything in one account?
  • What tagging standards exist for our cloud resources?
  • Who owns the cloud account(s) that my team's services run in?
  • Is there a showback or chargeback model in place (i.e. are teams shown what their cloud use costs, or charged for it internally)?
  • What cost alerting is currently configured?

What FinOps Is and Is Not

What FinOps Is

FinOps is the practice of bringing financial accountability to variable cloud spend. It is about:

  • Making cloud costs visible at the team and service level
  • Giving engineers the information they need to make cost-aware decisions
  • Creating shared accountability between engineering, finance, and product
  • Optimising unit economics - cost per transaction, cost per user, cost per feature - rather than just cutting spend

The FinOps Foundation defines it as "an operational framework and cultural practice" - it is as much about behaviour as it is about tooling.

What FinOps Is Not

  • A cost-cutting programme disguised as a culture change
  • A reason to second-guess every engineering decision on cost grounds
  • The job of finance alone
  • Penny-pinching at the expense of delivery speed or system reliability
  • A one-time project with a defined end date

The goal is not to spend as little as possible. The goal is to spend appropriately - to understand what you are getting for what you are spending, and to make deliberate trade-offs.


The Three Phases of FinOps Maturity

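The FinOps Foundation describes three phases: Inform, Optimise, and Operate. The Foundation presents them as an iterative lifecycle and measures maturity separately (Crawl, Walk, Run). This playbook uses the phases as a rough maturity ladder, because most engineering teams start at Inform and gradually move through Optimise to Operate.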

Phase 1 - Inform

You can see your costs. You know roughly what you spend, by service or by team. Cost data is available and reviewed at least monthly.

Signals you are here:

  • You can pull a cloud cost report for your team's services
  • You review cloud costs in your monthly budget review
  • Engineers know cloud costs exist but do not routinely think about them

What to do at this phase:

  • Get cost visibility set up (tagging, dashboards, reports)
  • Establish a monthly cost review cadence
  • Make sure every engineer knows they can see the bill and how to access it

Phase 2 - Optimise

You are actively looking for and acting on optimisation opportunities. You have a process for reviewing costs, identifying waste, and making changes.

Signals you are here:

  • You regularly review right-sizing and reserved instance coverage
  • Teams have a backlog of cost-reduction work alongside feature work
  • Engineers flag expensive patterns during code and architecture review

What to do at this phase:

  • Build cost review into your sprint or delivery cadence
  • Track a unit cost metric (cost per active user, cost per API call, etc.)
  • Run a regular "cost walk" - a structured review of spend by service

Phase 3 - Operate

Cost awareness is embedded in your engineering culture. Cost trade-offs are part of design and delivery decisions. You forecast cloud costs with confidence.

Signals you are here:

  • Cost impact is considered in technical design discussions as a matter of course
  • You can forecast cloud spend accurately alongside your team's roadmap
  • Cost incidents (unexpected cost spikes) are detected and resolved quickly
  • Engineers propose cost optimisations proactively

Most teams spend most of their time in Phases 1 and 2. Getting to Phase 3 takes time and requires sustained management attention.


How to Read a Cloud Cost Dashboard

Cloud provider dashboards are powerful but can be overwhelming. Here is what to focus on.

AWS Cost Explorer

  • Service breakdown - which AWS services are driving your costs (EC2, RDS, S3, data transfer, etc.)
  • Linked accounts - if your organisation uses multiple accounts, which account is spending what
  • Tag breakdown - if resources are tagged by team or service, you can filter to your team
  • Month-over-month trend - is your spend growing, stable, or declining?
  • Savings Plans and Reserved Instance coverage - what percentage of your compute is covered by commitments vs on-demand pricing

Azure Cost Management

  • Cost analysis - spend by resource group, tag, service, or subscription
  • Budgets - whether alerts have been configured for your team's spend
  • Recommendations - Azure's built-in suggestions for right-sizing and reserved instance purchase

Google Cloud Billing Reports

  • Project-level spend - each GCP project has its own billing data
  • Labels - GCP's equivalent of tags, used for team-level cost allocation
  • Committed use discounts - what share of your compute usage is covered by commitments rather than billed at on-demand rates

What to look for in any dashboard:

  1. What is my total spend this month vs last month? Is the trend up or down?
  2. What are my top 5 cost drivers? Are they expected?
  3. Is there anything I do not recognise?
  4. What is my forecast for the end of the month?
  5. Are any cost alerts firing?
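
The first two questions can be answered from any billing export. The following is a minimal Python sketch, assuming a hypothetical export of (service, month, cost) rows; the figures, function names, and row shape are invented for illustration and do not correspond to any provider's export format.

```python
from collections import defaultdict

# Hypothetical billing export rows: (service, month, cost). Figures are invented.
ROWS = [
    ("EC2", "2024-04", 4200.0), ("EC2", "2024-05", 4700.0),
    ("RDS", "2024-04", 1800.0), ("RDS", "2024-05", 1850.0),
    ("S3", "2024-04", 600.0), ("S3", "2024-05", 900.0),
]


def monthly_totals(rows):
    """Sum cost per month across all services."""
    totals = defaultdict(float)
    for _service, month, cost in rows:
        totals[month] += cost
    return dict(totals)


def top_drivers(rows, month, n=5):
    """Return the n most expensive services in a month, highest first."""
    in_month = [(service, cost) for service, m, cost in rows if m == month]
    return sorted(in_month, key=lambda pair: pair[1], reverse=True)[:n]


def percent_change(previous, current):
    """Month-over-month change as a percentage of the previous month."""
    return (current - previous) / previous * 100


totals = monthly_totals(ROWS)
change = percent_change(totals["2024-04"], totals["2024-05"])
print(f"Month-over-month change: {change:.1f}%")
print(top_drivers(ROWS, "2024-05", n=2))
```

In this invented data the total rises by roughly 13%, but S3 grows by half on its own, which is the kind of line worth asking about even though it is not the largest.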

Key Cloud Cost Drivers in Engineering

Understanding what drives costs helps you know where to look when costs increase.

Cost Driver | What Causes It | What to Watch For
Compute (EC2, VMs, GKE nodes) | Running instances 24/7, oversized instances, autoscaling not configured | Non-production environments running overnight and at weekends
Storage (S3, Blob, GCS) | Data accumulation, unused snapshots, wrong storage tier | S3 lifecycle policies not configured, RDS snapshots never deleted
Data transfer / egress | Moving data between regions, between services, to the internet | Architectures that move large volumes of data across availability zones
Managed databases (RDS, Cloud SQL) | Instance size, multi-AZ, backup retention, provisioned IOPS | Development databases running on production-tier instances
Managed services (OpenSearch, MSK, etc.) | Always-on managed services with no right-sizing | Services provisioned for peak load and never reviewed
Serverless (Lambda, Cloud Functions) | Invocation volume and execution time | Inefficient functions with high memory allocation and long runtimes
Logging and observability | Log ingestion volume, retention period, metric resolution | Debug-level logging running in production

In most engineering teams, the most common source of waste is non-production environments left running when nobody is using them - particularly outside business hours and at weekends.


How to Tag Resources for Cost Allocation

Tags (or labels in GCP) are key-value metadata attached to cloud resources. They are the primary mechanism for allocating cloud costs to teams and services.

Why tagging matters

Without tags, your cloud bill is a single number. With tags, you can see how much each team, service, and environment costs. This is the foundation of FinOps.

Minimum viable tagging schema

Tag Key | Example Value | Purpose
team | payments-platform | Which team owns this resource
service | checkout-api | Which service or product this belongs to
environment | production | Deployment environment (production / staging / dev / sandbox)
cost-centre | TECH-123 | Finance cost centre for allocation
managed-by | terraform | How the resource was provisioned

How to enforce tagging

  • Build tag requirements into your Infrastructure as Code (IaC) templates - a Terraform module that does not include the standard tags should fail validation
  • Use AWS Config rules, Azure Policy, or GCP Organisation Policies to alert on or prevent untagged resource creation
  • Run a quarterly tag compliance report and assign remediation to the owning team
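
Whichever enforcement mechanism you use, the underlying check is the same: compare a resource's tags against the agreed schema. The sketch below is one way to express that check in Python, for example inside a CI step or a compliance report. The function name and the set of allowed environment values are assumptions made for illustration; the tag keys come from the minimum schema above.

```python
# Tag keys from the minimum viable schema. The allowed environment values are
# an assumption for this sketch; use whatever your platform team has agreed.
REQUIRED_TAGS = {"team", "service", "environment", "cost-centre", "managed-by"}
ALLOWED_ENVIRONMENTS = {"production", "staging", "dev", "sandbox"}


def tag_problems(tags):
    """Return a list of human-readable problems with a resource's tags."""
    present = set(tags)
    problems = [f"missing tag: {key}" for key in sorted(REQUIRED_TAGS - present)]
    problems += [
        f"empty tag: {key}"
        for key in sorted(REQUIRED_TAGS & present)
        if not tags[key].strip()
    ]
    environment = tags.get("environment", "").strip()
    if environment and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unknown environment: {environment}")
    return problems


example = {
    "team": "payments-platform",
    "service": "checkout-api",
    "environment": "production",
    "cost-centre": "TECH-123",
    "managed-by": "terraform",
}
print(tag_problems(example))                        # no problems expected
print(tag_problems({"team": "payments-platform"}))  # four missing tags
```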

Starting from zero

If your team's resources are not tagged today, do not try to fix everything at once:

  1. Agree the minimum tag schema with your platform team
  2. Apply tags to your highest-cost resources first
  3. Build tagging into your next IaC refactor
  4. Track tag coverage as a metric and improve it over time
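
Tag coverage can be measured by resource count or by spend, and the two can differ a lot. The sketch below, using a hypothetical inventory with invented names and costs, shows why tagging the highest-cost resources first pays off quickly.

```python
def tag_coverage(resources, required=("team", "service")):
    """Return (share of resources tagged, share of spend tagged) as fractions."""
    def is_tagged(resource):
        return all(resource["tags"].get(key) for key in required)

    tagged = [r for r in resources if is_tagged(r)]
    total_cost = sum(r["monthly_cost"] for r in resources)
    by_count = len(tagged) / len(resources)
    by_cost = sum(r["monthly_cost"] for r in tagged) / total_cost
    return by_count, by_cost


# Hypothetical inventory; ids, costs, and tags are invented for illustration.
RESOURCES = [
    {"id": "db-1", "monthly_cost": 900.0,
     "tags": {"team": "payments-platform", "service": "checkout-api"}},
    {"id": "vm-1", "monthly_cost": 80.0, "tags": {}},
    {"id": "vm-2", "monthly_cost": 15.0, "tags": {"team": "payments-platform"}},
    {"id": "bucket-1", "monthly_cost": 5.0, "tags": {}},
]

by_count, by_cost = tag_coverage(RESOURCES)
print(f"Tagged by count: {by_count:.0%}, tagged by spend: {by_cost:.0%}")
```

Here only one resource in four is fully tagged, yet 90% of the spend is already attributable, because the one tagged resource is the expensive one.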

How to Set and Track Budgets by Team and Service

Setting a cloud budget

Your cloud budget should be derived from your team's roadmap, not from last year's actuals. Ask:

  • What services will we be running?
  • What traffic or usage growth do we expect?
  • Are we launching anything new that will significantly change our cost profile?
  • Are there any planned optimisations that will reduce costs?

Build a bottom-up estimate: list each service, estimate its monthly cost, and sum them. Compare this to your top-down budget allocation. If there is a gap, make it visible early.
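
As a worked illustration of the bottom-up comparison, here is a short Python sketch. The service names, estimates, and budget figure are all hypothetical.

```python
# Hypothetical monthly estimates per service, in the team's billing currency.
ESTIMATES = {"checkout-api": 3200.0, "search": 1500.0, "data-pipeline": 2300.0}
TOP_DOWN_MONTHLY_BUDGET = 6000.0  # hypothetical allocation from finance


def budget_gap(estimates, top_down_budget):
    """Return (bottom-up total, gap); a positive gap means the estimate exceeds the budget."""
    total = sum(estimates.values())
    return total, total - top_down_budget


total, gap = budget_gap(ESTIMATES, TOP_DOWN_MONTHLY_BUDGET)
print(f"Bottom-up estimate: {total:.0f}, gap vs budget: {gap:+.0f}")
```

With these invented numbers the bottom-up estimate is 7,000 against an allocation of 6,000, so there is a 1,000 per month gap to raise before the year starts rather than after the first overspend.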

Tracking against budget

  • Configure cloud budget alerts to fire at 80% and 100% of your monthly budget
  • Review cost dashboards weekly - not just when the finance report arrives
  • Track a cost-per-unit metric alongside raw spend. Raw spend going up is fine if it reflects growth. Cost per active user going up signals inefficiency.
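
The unit-cost point in the last bullet is simple arithmetic. A minimal sketch follows, with a hypothetical function name and illustrative growth figures (spend up 40% while active users are up 25%).

```python
def unit_cost_change(spend_growth, usage_growth):
    """Fractional change in cost per unit, given fractional growth in spend and usage."""
    return (1 + spend_growth) / (1 + usage_growth) - 1


change = unit_cost_change(spend_growth=0.40, usage_growth=0.25)
print(f"Cost per active user changed by {change:+.1%}")
```

In this example cost per active user rises by about 12%, which is the number worth investigating. Had users also grown by 40%, the same increase in raw spend would have left unit cost flat.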

Budget alerts - example thresholds to configure

Alert | Threshold | Action
Warning | 80% of monthly budget by the 20th of the month | Review dashboard, identify cause
Critical | 100% of monthly budget | Escalate to manager, identify immediate actions
Anomaly | Daily spend more than 2x the trailing 7-day average | Investigate immediately - likely a misconfiguration
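
Cloud providers' budget tools can usually apply thresholds like these for you. If you want to reason about them or test them against historical data, the rules above can be sketched in Python as follows. The function name and input shapes are assumptions, and the warning rule reflects one reading of the table: 80% reached on or before the 20th.

```python
from statistics import mean


def budget_alerts(spend_to_date, monthly_budget, day_of_month, daily_spend):
    """Return the names of the alerts that should fire.

    daily_spend is a list of daily totals, oldest first, ending with today.
    """
    alerts = []
    used = spend_to_date / monthly_budget
    if used >= 1.0:
        alerts.append("critical")
    elif used >= 0.8 and day_of_month <= 20:
        alerts.append("warning")
    if len(daily_spend) >= 8:
        trailing_average = mean(daily_spend[-8:-1])  # the 7 days before today
        if daily_spend[-1] > 2 * trailing_average:
            alerts.append("anomaly")
    return alerts


# Invented figures: 82% of budget used by the 18th, with steady daily spend.
print(budget_alerts(8200, 10000, 18, [300] * 7 + [320]))
# Invented figures: over budget on the 27th, and today's spend has tripled.
print(budget_alerts(10500, 10000, 27, [300] * 7 + [900]))
```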

How to Have the Cost Conversation with Engineers

This is the most important part and the most commonly mishandled.

What not to do

  • Do not present engineers with a cost report and ask them to cut 20%. This creates fear without context.
  • Do not treat every cost increase as a problem. Cost grows with scale and feature delivery.
  • Do not make individuals feel personally responsible for a cost spike they did not cause.
  • Do not hold cost conversations only when there is a problem.

What to do instead

Frame it as unit economics, not penny-pinching:

"Our cloud spend has grown by 40% this quarter. Our active users have grown by 25%. So our cost per user has gone up. I want us to understand why and whether it is appropriate or whether there is waste we can remove."

Make cost visible as a normal part of engineering:

Add a "cloud cost" section to your team's weekly metrics review. It should sit alongside deployment frequency, incident rate, and lead time - not be treated as a special topic that only comes up when something is wrong.

Involve engineers in the analysis:

"I have pulled our cost breakdown by service. Can you take a look at the RDS line? It seems high relative to our usage. I want to understand it before I make any assumptions."

Recognise good cost behaviour:

When an engineer finds and eliminates waste, make it visible. "Tom found that our development RDS instances were running at production size 24/7. Switching them to a smaller instance with automated shutdown outside business hours saves us £800 a month. That is the kind of thinking I want us to normalise."


Common Quick Wins

These are the optimisation actions that consistently deliver results with low risk.

1. Shut down non-production environments outside business hours

Typical saving: 60-70% of non-production compute costs

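Most development, staging, and QA environments run 24 hours a day, 7 days a week. They are typically used 9am-6pm Monday to Friday, which is 45 of the 168 hours in a week. Automating shutdown outside those hours removes up to about 73% of their instance-hours; allowing for start-up buffers and occasional out-of-hours use, expect their compute cost to fall by roughly two thirds.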

Implementation: Use AWS Instance Scheduler, Azure Automation, or a simple cron job running in your CI/CD platform to stop and start instances on a schedule.
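
To sanity-check the saving for your own schedule, the arithmetic can be sketched as below. The function name and the 2-hour daily buffer are assumptions for illustration. The calculation covers instance-hours only; attached storage is typically still billed while an instance is stopped.

```python
HOURS_PER_WEEK = 24 * 7


def shutdown_saving(hours_per_day, days_per_week, buffer_hours_per_day=0.0):
    """Fraction of always-on instance-hours avoided by running only on a schedule."""
    running = (hours_per_day + buffer_hours_per_day) * days_per_week
    return 1 - running / HOURS_PER_WEEK


print(f"{shutdown_saving(9, 5):.0%}")     # 9am-6pm, Monday to Friday -> 73%
print(f"{shutdown_saving(9, 5, 2):.0%}")  # with a hypothetical 2-hour daily buffer -> 67%
```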

2. Delete unused resources

Typical saving: 5-15% of total cloud spend

Old snapshots, unused Elastic IPs, orphaned load balancers, forgotten test environments - these accumulate silently. Run a cleanup sprint every quarter.

3. Right-size oversized instances

Typical saving: 20-40% of compute costs

Instances are often provisioned at a size that made sense at launch but is no longer appropriate. Check CPU and memory utilisation. If instances are consistently below 30% utilisation, they are probably oversized. Most cloud providers offer right-sizing recommendations in their cost consoles.
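
If you export utilisation data yourself, the 30% rule of thumb can be applied with a few lines of Python. This sketch reads "consistently below" as every sample in the review period being under the threshold. The data shape, field names, and instance ids are hypothetical.

```python
def oversized(instances, threshold=30.0):
    """Return ids of instances whose CPU and memory both stay below the threshold (%)."""
    return [
        inst["id"]
        for inst in instances
        if max(inst["cpu_percent"]) < threshold
        and max(inst["memory_percent"]) < threshold
    ]


# Hypothetical utilisation samples (percent) over a review period; values are invented.
INSTANCES = [
    {"id": "api-1", "cpu_percent": [12, 18, 25], "memory_percent": [20, 22, 28]},
    {"id": "api-2", "cpu_percent": [15, 70, 20], "memory_percent": [30, 35, 33]},
    {"id": "batch-1", "cpu_percent": [5, 8, 6], "memory_percent": [55, 60, 58]},
]

print(oversized(INSTANCES))  # only api-1 stays under 30% on both measures
```

Checking both CPU and memory matters: batch-1 looks idle on CPU alone but would likely struggle on a smaller instance because of its memory use.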

4. Move to reserved instances or savings plans for stable workloads

Typical saving: 30-40% vs on-demand pricing

If you have compute that will run continuously for 12 months or more, committed use discounts (Reserved Instances, Savings Plans, Committed Use Discounts) offer significant savings with low risk.
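
"Low risk" depends on the workload actually running for most of the term. Under a simplified model, in which the commitment is billed for every hour of the term at a discounted rate while on-demand is billed only for the hours used, the break-even point is easy to compute. The function names and the 35% discount are illustrative assumptions, not a provider's pricing.

```python
def break_even_utilisation(discount):
    """Fraction of the term a workload must run for a commitment to beat on-demand."""
    return 1 - discount


def commitment_saving(discount, utilisation):
    """Saving vs on-demand as a fraction of the on-demand bill (negative means a loss)."""
    committed_cost = 1 - discount  # paid for every hour, relative to the on-demand rate
    on_demand_cost = utilisation   # paid only for the hours actually used
    return (on_demand_cost - committed_cost) / on_demand_cost


discount = 0.35  # illustrative discount, within the 30-40% range quoted above
print(f"Break-even utilisation: {break_even_utilisation(discount):.0%}")
print(f"Saving at full utilisation: {commitment_saving(discount, 1.0):.0%}")
print(f"Saving at 50% utilisation: {commitment_saving(discount, 0.5):.0%}")
```

In this model a 35% discount only pays off if the workload runs for more than about 65% of the term, which is why commitments suit stable, always-on workloads and not environments you plan to shut down out of hours.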

5. Review logging and observability costs

Typical saving: 10-30% of observability spend

Log ingestion and retention are often the fastest-growing cost lines in engineering budgets. Review log retention periods, move from debug to info logging in production, and check whether all metrics need to be at 1-minute resolution.

6. Enable S3 lifecycle policies

Typical saving: 20-50% of storage costs

Data stored in S3 or equivalent object storage accumulates without limit unless lifecycle policies move it to cheaper tiers or delete it after a defined retention period. Most teams do not have lifecycle policies configured.
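
For S3 specifically, a lifecycle policy is a small piece of configuration. The sketch below builds one as a Python dictionary with a single tier transition and an expiry. The helper name, prefix, and day counts are hypothetical. The dictionary is intended to match the shape that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument, but treat that as an assumption and check the current API reference, including any minimum object ages the provider imposes, before applying it.

```python
def lifecycle_configuration(prefix, transition_after_days, delete_after_days):
    """Build a lifecycle configuration: move objects to a cheaper tier, then expire them."""
    if not 0 < transition_after_days < delete_after_days:
        raise ValueError("expected 0 < transition_after_days < delete_after_days")
    return {
        "Rules": [
            {
                "ID": f"tier-and-expire-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": transition_after_days, "StorageClass": "STANDARD_IA"},
                ],
                "Expiration": {"Days": delete_after_days},
            }
        ]
    }


# Hypothetical policy: logs move to infrequent access after 30 days, deleted after a year.
config = lifecycle_configuration("logs/", transition_after_days=30, delete_after_days=365)
print(config["Rules"][0]["ID"])
```

Keeping the policy in code, alongside the rest of your infrastructure definitions, means the retention period is a reviewed decision and not a default nobody chose.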


How to Build Cost Awareness into Engineering Culture

Cultural change is slow. These practices, sustained over time, shift the default.

Make costs visible in the right places:

  • Cloud cost dashboard on the team's Confluence page or internal wiki
  • Cost metrics in the team's weekly metrics review
  • Cost impact in architecture decision records (ADRs)

Build cost into your development process:

  • Add "what is the cost implication of this change?" to your pull request template
  • Include cost review in architecture and design reviews for new services
  • Include environment cost as a metric when a service is launched

Run a quarterly cost walk:

A cost walk is a structured 60-90 minute session where the team reviews their cloud costs together, service by service. The goal is to build understanding and find optimisation opportunities.

Format:

  1. Pull up the cost dashboard with the last 90 days of data (15 minutes)
  2. Walk through each service - what is running, what it costs, whether it makes sense (45 minutes)
  3. Capture a list of identified optimisations, each with an owner and a timeline (15 minutes)
  4. Add the optimisations to the team backlog

Celebrate savings:

Put cost savings on the same level as feature delivery. A £2,000/month saving is real, recurring value. Make it visible.


What Good Looks Like

A team with good FinOps practice:

  • Has tagged all resources with team and service identifiers
  • Reviews cloud costs weekly as part of normal team metrics
  • Has cost alerts configured and responds to them
  • Knows their unit cost metric and tracks it over time
  • Runs non-production environments only during business hours
  • Reviews right-sizing recommendations quarterly
  • Has committed use discount coverage above 60% for stable workloads
  • Can forecast their cloud spend for the next quarter with reasonable accuracy
  • Treats cost as a quality dimension of engineering - not a separate concern

Common Failures

Treating FinOps as a finance project. Cloud costs are an engineering output. Finance can report them but cannot change them. Engineers need to own this.

Starting with tools rather than visibility. Before you invest in third-party FinOps tooling, make sure you understand what your cloud provider's native cost tools tell you. Most teams do not need additional tools at Phase 1.

Making cost the primary engineering concern. Cost is one dimension of engineering quality, alongside reliability, security, and speed of delivery. Teams that optimise cost at the expense of reliability are making a bad trade.

Running cost reduction programmes rather than building cost culture. A sprint dedicated to cost reduction produces a one-time saving. Embedding cost awareness in normal engineering practice produces ongoing savings and prevents future waste.

Ignoring data transfer costs. Compute and storage are visible. Data transfer is often not reviewed until it becomes a large and unexpected line item. Architectures that move data between regions or out to the internet need cost analysis at design time.

Giving up on tagging because it is hard. It is hard. Do it anyway. Untagged resources make cost accountability impossible.


Checklist

Getting Started Checklist

  • Cloud cost console access confirmed for manager and team
  • Current monthly cloud spend understood
  • Top 5 cost drivers identified
  • Tagging strategy defined and documented
  • Tag compliance assessed - percentage of resources tagged
  • Cost alerts configured at 80% and 100% of monthly budget
  • Anomaly detection enabled in cost console
  • Monthly cloud cost review cadence established

Monthly Cost Review Checklist

  • Cloud cost dashboard reviewed (spend vs budget, trend)
  • Cost-per-unit metric updated
  • Any cost anomalies investigated and explained
  • Right-sizing recommendations reviewed
  • Non-production environment schedule confirmed
  • Cost optimisation backlog reviewed and prioritised

Quarterly Cost Walk Checklist

  • Cost walk session scheduled (60-90 minutes)
  • Last 90 days of cost data pulled by service
  • Each service reviewed - cost vs usage
  • Optimisation opportunities identified and captured
  • Optimisations added to team backlog with owners
  • Summary of findings shared with team and manager

Cost Culture Checklist

  • Cloud cost metrics included in team's weekly review
  • Cost impact included in PR and design review process
  • Environment shutdown automation in place for non-production
  • Last cost savings recognised and communicated to team
  • FinOps maturity phase assessed and target phase agreed