Kubernetes FinOps
Make every pod, namespace, and node carry its own cost and tie it to a business owner.
I. Why this matters
Kubernetes hides cost behind shared nodes. Without per namespace allocation, a single team can quietly burn six figures a month while the platform team only sees a flat node bill. Kubernetes FinOps makes the hidden visible. It binds every pod to a cost owner, every namespace to a budget, every node to a workload class, and every idle resource to a removal plan.
II. Principles
- Allocate before you optimize. Cost without an owner has no force behind it.
- Right size from observed usage, not from request defaults that engineers copy and paste.
- Bin pack on commitment classes. Reserved and Savings Plan capacity should run the steady load; Spot and preemptible should run the elastic tail.
- Treat idleness as a defect. An idle pod is an unfiled bug, not a nuance.
- Make namespaces budget aware. Quotas without dollars do not change behaviour.
- GPUs deserve their own scheduling lane. Mixing CPU and GPU workloads on the same nodes wastes both.
- Publish every monthly bill back to the team that caused it. Sunlight is the cheapest control.
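The first principle, allocate before you optimize, can be sketched concretely: split each node's hourly cost across its pods in proportion to their CPU and memory requests, which is a simplified version of the model tools like OpenCost use. The node price, pod names, and 50/50 CPU-memory weighting below are illustrative assumptions, not prescriptions.

```python
def allocate_node_cost(node_hourly_cost, pods, cpu_weight=0.5, mem_weight=0.5):
    """Split one node's hourly cost across its pods, proportional to
    CPU and memory requests (a simplified OpenCost-style model).
    pods: list of dicts with 'name', 'cpu' (cores), 'mem' (GiB) requests."""
    total_cpu = sum(p["cpu"] for p in pods) or 1
    total_mem = sum(p["mem"] for p in pods) or 1
    costs = {}
    for p in pods:
        share = cpu_weight * p["cpu"] / total_cpu + mem_weight * p["mem"] / total_mem
        costs[p["name"]] = round(node_hourly_cost * share, 4)
    return costs

# Hypothetical node at $0.40/hour running two pods.
pods = [
    {"name": "api", "cpu": 1.0, "mem": 2.0},
    {"name": "worker", "cpu": 3.0, "mem": 6.0},
]
print(allocate_node_cost(0.40, pods))  # {'api': 0.1, 'worker': 0.3}
```

Once every pod has a dollar figure, summing by namespace label is a plain group-by, and the "flat node bill" problem disappears.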
III. KPIs
IV. The playbook spine
- Stand up cost allocation. Install a per pod cost engine (OpenCost or equivalent) and pipe the export into BigQuery alongside your billing export.
- Mandate three labels at admission. Add a Gatekeeper or Kyverno policy that rejects any pod missing cost-center, owner, and product labels.
- Right size with observed usage. Run a Vertical Pod Autoscaler in recommendation mode for two weeks, then apply the recommendations in a PR.
- Set namespace ResourceQuota and LimitRange. Match the quota to the agreed budget; alert at 80 percent.
- Move stateless workloads to Spot or preemptible node pools. Use a separate node pool with taints and tolerations so only stateless workloads land there.
- Schedule a weekly idle scan. Walk every Deployment, flagging HPA minimums and replica counts held above zero with no traffic; open a ticket per offender.
- Publish a monthly per team bill. One page per team: cost, efficiency, top three drivers. Send to the team and to their VP. Repeat every month.
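The admission step above is normally written as a Gatekeeper or Kyverno policy; its core check is simple enough to sketch in Python. The label names come from the playbook; the pod dictionary mirrors the shape of a Kubernetes pod manifest.

```python
REQUIRED_LABELS = {"cost-center", "owner", "product"}

def admit(pod):
    """Reject any pod missing the mandatory cost labels.
    Returns (allowed, message), like an admission webhook response."""
    labels = pod.get("metadata", {}).get("labels", {}) or {}
    missing = sorted(REQUIRED_LABELS - labels.keys())
    if missing:
        return False, f"denied: missing labels {', '.join(missing)}"
    return True, "allowed"

# Hypothetical pod that carries only one of the three required labels.
pod = {"metadata": {"labels": {"owner": "payments-team"}}}
print(admit(pod))  # (False, 'denied: missing labels cost-center, product')
```

In production the same check lives in the policy engine so the pod is rejected at admission time rather than audited after the fact.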
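The weekly idle scan reduces to a filter over Deployments: anything keeping replicas running with no traffic over the lookback window gets a ticket. The record fields below (`replicas`, `requests_7d`, `age_days`) are a hypothetical summary of what you would pull from the Kubernetes API and your metrics store.

```python
def find_idle(deployments, min_age_days=7):
    """Flag deployments that keep replicas running but served no
    traffic over the lookback window; one ticket per offender."""
    return [
        d["name"]
        for d in deployments
        if d["replicas"] > 0
        and d["requests_7d"] == 0
        and d["age_days"] >= min_age_days
    ]

deployments = [
    {"name": "checkout", "replicas": 3, "requests_7d": 120_000, "age_days": 400},
    {"name": "legacy-report", "replicas": 2, "requests_7d": 0, "age_days": 90},
    {"name": "new-service", "replicas": 1, "requests_7d": 0, "age_days": 2},
]
print(find_idle(deployments))  # ['legacy-report']
```

The age guard matters: a two-day-old service with no traffic yet is not a defect, so it is excluded rather than ticketed.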
V. Common failures
- Treating node cost as the unit. Nodes are a packaging detail. Allocate at pod and namespace.
- Setting CPU and memory limits equal to requests for every workload. This eliminates burst headroom and inflates the bill.
- Running HPA on CPU only when the workload is memory bound. The autoscaler never fires and the cluster overprovisions.
- Letting GPU nodes run at 20 percent utilization for weeks. Without a queue and a per job tracker, GPU spend hides inside compute.
- Allowing kube-system, monitoring, and ingress overhead to grow unchecked. Platform overhead can quietly hit 25 percent of node spend.
- Cordoning Spot nodes after one preemption incident and never going back. The wrong workload was on Spot, not the wrong strategy.
- Producing a beautiful chargeback dashboard that no engineering manager reads.
VI. Recommended tooling
Vendor neutral. For graded vendor comparisons see the Matrix.
Cost allocation engine (OpenCost, Kubecost-class)
HPA and VPA tuner
Namespace policy enforcer (Gatekeeper, Kyverno)
Idle workload scanner
GPU job scheduler
Spot interruption handler
Bin packing optimizer
Per team chargeback report generator
VII. Related IFO4 playbooks
- tag-enforcement-at-provisioning . coming soon
- reserved-instance-portfolio . coming soon
- carbon-aware-ml-training . Tier A
VIII. FAQ
Should I use Spot for stateful workloads?
Generally no. Spot is appropriate for stateless or restartable jobs. Stateful databases and singletons should run on committed capacity. Mixed configurations are technically possible but rarely worth the operational cost.
How do I allocate shared services like ingress and DNS?
Allocate shared services proportional to namespace egress or pod count, then publish the allocation method on the chargeback report so teams can audit it. Hidden allocation methods erode trust faster than imperfect ones.
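A proportional allocation is a one-liner worth making explicit, since publishing the method is the point. A sketch using pod count as the allocation key; the namespace names and the $900 shared bill are hypothetical.

```python
def allocate_shared(shared_cost, pods_per_namespace):
    """Spread shared-service cost (ingress, DNS, monitoring) across
    namespaces in proportion to pod count."""
    total = sum(pods_per_namespace.values())
    return {
        ns: round(shared_cost * count / total, 2)
        for ns, count in pods_per_namespace.items()
    }

# Hypothetical month: $900 of ingress and DNS across three namespaces.
print(allocate_shared(900.0, {"payments": 30, "search": 15, "ml": 45}))
# {'payments': 300.0, 'search': 150.0, 'ml': 450.0}
```

Swapping the key from pod count to egress bytes changes only the input dictionary, which is exactly why the method belongs on the chargeback report where teams can audit it.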
OpenCost or a paid tool?
OpenCost is the right starting point. It is open source, vendor neutral, and produces the same cost model that paid tools wrap. Move to a paid tool when you need multi cluster aggregation, optimization recommendations, or business unit reporting beyond what BigQuery joins can give you.
How do I size limits without breaking things?
Use VPA in recommendation mode for two weeks, then apply the p95 recommendation as the request and 1.5x p95 as the limit. Re-run quarterly.
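That rule of thumb is mechanical once you have the usage samples. A sketch using a nearest-rank p95 over observed memory usage; the sample values are hypothetical, and in practice the percentile comes from VPA's recommendation rather than your own arithmetic.

```python
import math

def size_from_usage(samples_mib, limit_factor=1.5):
    """request = p95 of observed usage, limit = 1.5x p95 (nearest-rank p95)."""
    ordered = sorted(samples_mib)
    p95 = ordered[math.ceil(0.95 * len(ordered)) - 1]
    return {"request_mib": p95, "limit_mib": int(p95 * limit_factor)}

# Two weeks of hypothetical memory samples for one container, in MiB.
samples = [210, 220, 230, 240, 250, 260, 270, 280, 400, 410]
print(size_from_usage(samples))  # {'request_mib': 410, 'limit_mib': 615}
```

Note how the two outlier samples pull the request up: p95 sizing deliberately keeps headroom for the observed worst case rather than the average.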
What about Karpenter or Cluster Autoscaler?
Both work. Karpenter typically packs better and reacts faster on AWS; cluster autoscaler is the default elsewhere. The decision is operational, not financial. Either way, ensure the autoscaler is allowed to scale down aggressively and that idle nodes are torn down within ten minutes.
How do I justify GPU spend to the CFO?
Track cost per training run and cost per million inference tokens. Tie those numbers to a specific revenue model output. The CFO understands cost per outcome; they do not need to understand A100 versus H100.
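The translation from raw GPU spend to those two unit costs is simple division, but writing it down keeps the report honest. The monthly figures below are hypothetical, and the training/inference split is assumed to come from your per-job tracker.

```python
def cost_per_outcome(training_spend, runs, inference_spend, tokens):
    """Express GPU spend as cost per training run and cost per
    million inference tokens -- the units a CFO can act on."""
    return {
        "per_training_run": round(training_spend / runs, 2),
        "per_million_tokens": round(inference_spend / (tokens / 1_000_000), 4),
    }

# Hypothetical month: $48,000 on training across 40 runs, and
# $30,000 on inference serving 2.5 billion tokens.
print(cost_per_outcome(48_000, 40, 30_000, 2_500_000_000))
# {'per_training_run': 1200.0, 'per_million_tokens': 12.0}
```

Trend these two numbers month over month; a rising cost per million tokens with flat traffic is the signal that GPU utilization, not unit price, is the problem.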
Do I need a chargeback or a showback model?
Start with showback. Show every team their bill and their efficiency for two months. Then move to chargeback only if the showback alone has not changed behaviour. Chargeback creates accounting friction; showback is enough most of the time.