AI Financial Operations
Treat tokens, models, and prompts as financial primitives. Forecast, attribute, and govern the spend.
I. Why this matters
AI cost is unlike any cost line a CFO has seen. The unit is a token, the unit price is volatile, the model can be changed by the vendor, and the consumer is often a product feature that has no historical baseline. Without a real cost per outcome model, AI spend looks like an open faucet. AI Financial Operations defines the unit, sets the budget per outcome, and gives every product team a per request bill.
II. Principles
- Cost per outcome, not cost per call. A summarization endpoint that costs 2 cents per call is fine if it replaces a 5 dollar human task; it is a disaster if it replaces a free database lookup. A worked example follows this list.
- Attribute every token to a feature flag and a customer segment. If you cannot answer "who consumed those tokens", you cannot govern.
- Cache aggressively at the prompt level. Prompt caching alone typically removes 30 to 60 percent of input cost on production workloads.
- Route to the cheapest model that meets the quality bar. A small model that is good enough is always cheaper than a large model that is excellent.
- Treat fine tuning as a buy versus build decision. Compute the break even tokens before signing off.
- Forecast monthly with a confidence interval. AI workloads breathe with product launches; a single point forecast will mislead.
- Govern PII and IP at the prompt boundary. Cost discipline and data discipline share the same logging substrate.
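A minimal sketch of the cost per outcome arithmetic. The prices, volumes, and the summarization feature are illustrative assumptions, not vendor quotes or benchmarks:

```python
# Cost per outcome, not cost per call: a worked example with placeholder numbers.

INPUT_PRICE_PER_1K = 0.003   # dollars per 1,000 input tokens (assumed)
OUTPUT_PRICE_PER_1K = 0.015  # dollars per 1,000 output tokens (assumed)

def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    """Wholesale cost of a single model call."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

def cost_per_outcome(monthly_spend: float, outcomes: int) -> float:
    """The number a CFO can act on: spend divided by units of work completed."""
    return monthly_spend / outcomes

# Hypothetical summarization feature: 120,000 calls produce 100,000 usable summaries.
call_cost = cost_per_call(input_tokens=1800, output_tokens=400)   # ~$0.011 per call
monthly_spend = call_cost * 120_000
print(f"cost per call:    ${call_cost:.4f}")
print(f"cost per outcome: ${cost_per_outcome(monthly_spend, 100_000):.4f}")
# Judge the result against the task it replaces: a 5 dollar human summary makes
# this a bargain; a free database lookup makes it a disaster.
```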
III. KPIs
IV. The playbook spine
- Define the unit of work. For each AI feature pick the unit (call, document, image, decision) and write down the formula for cost per unit.
- Wrap every model call in a logged proxy. Log feature flag, customer segment, model, input tokens, output tokens, latency, quality outcome, cost. A sketch of this proxy follows this list.
- Land the data in BigQuery. One row per call. Partition by date. Cluster by feature.
- Build the monthly cost per outcome dashboard. One row per feature: spend, calls, outcomes, cost per outcome, week over week trend.
- Install prompt caching at the gateway. Hash by stable prompt prefix. Measure cache hit rate weekly.
- Set a model routing policy. Default to the smallest model that passes a published quality eval; escalate only on flagged classes.
- Negotiate commit pricing. Once you have three months of stable consumption, sign a committed use discount or token reservation against the dominant model.
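A minimal sketch of the logged proxy and the cache key from steps 2 and 6. The `call_model` client, the `PRICE_TABLE` entries, the model names, and the row field names are all placeholder assumptions; the record is shaped to land one row per call in a warehouse table like the BigQuery one described above:

```python
import hashlib
import time
import uuid
from dataclasses import dataclass, asdict

# Placeholder per-1K-token prices per model; real prices come from your vendor contract.
PRICE_TABLE = {
    "small-model": {"input": 0.0005, "output": 0.0015},
    "large-model": {"input": 0.0030, "output": 0.0150},
}

ROWS: list[dict] = []  # in production, stream these rows into the warehouse instead

@dataclass
class CallRecord:
    """One row per call, ready to land in the warehouse (partition by date, cluster by feature)."""
    call_id: str
    ts: float
    feature_flag: str
    customer_segment: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    quality_outcome: str
    cost_usd: float
    prompt_prefix_hash: str  # cache key for the gateway's prompt cache

def prompt_prefix_hash(prompt: str, prefix_chars: int = 2000) -> str:
    """Hash the stable prefix of the prompt; identical prefixes can share a cache entry."""
    return hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).hexdigest()

def logged_call(call_model, prompt: str, *, feature_flag: str, customer_segment: str,
                model: str, quality_outcome: str = "unknown") -> CallRecord:
    """Wrap a vendor call (call_model is a stand-in for your client) and emit one attributed row."""
    start = time.monotonic()
    response = call_model(model=model, prompt=prompt)  # assumed to return token counts + text
    latency_ms = (time.monotonic() - start) * 1000
    prices = PRICE_TABLE[model]
    cost = (response["input_tokens"] / 1000) * prices["input"] + \
           (response["output_tokens"] / 1000) * prices["output"]
    record = CallRecord(
        call_id=str(uuid.uuid4()),
        ts=time.time(),
        feature_flag=feature_flag,
        customer_segment=customer_segment,
        model=model,
        input_tokens=response["input_tokens"],
        output_tokens=response["output_tokens"],
        latency_ms=latency_ms,
        quality_outcome=quality_outcome,
        cost_usd=cost,
        prompt_prefix_hash=prompt_prefix_hash(prompt),
    )
    ROWS.append(asdict(record))
    return record
```

The quality outcome often arrives after the call (a ticket marked resolved, a summary accepted); leaving it as a field on the same row lets you backfill it and keep cost and quality joined in one table.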
V. Common failures
- Logging tokens but not features. The bill is correct, but you cannot tell which product caused it.
- Defaulting to the most capable model for every call. The "use the best model" reflex costs more than it returns 80 percent of the time.
- Skipping eval when you switch models. Cost dropped, quality silently dropped further.
- Letting prompt drift balloon the input. A 200 token prompt that grows to 2000 over six months quietly multiplies cost.
- Not capping retries. A degraded model with 5 retries becomes 5x more expensive overnight.
- Treating fine tuning as a free win. Fine tuning amortizes only above a usage threshold; below it, the labor and infra cost dwarfs any price reduction.
- Letting PII land in logs because the cost team owns the proxy and the privacy team does not.
VI. Recommended tooling
Vendor neutral. For graded vendor comparisons see the Matrix.
- AI gateway and proxy
- Prompt cache layer
- Model router and ensemble manager
- Token attribution and chargeback
- Eval and quality monitoring
- PII redaction and policy enforcement
- Forecast and anomaly detection
- Negotiation and commit management
VII. Related IFO4 playbooks
- ai-cost-per-inference (coming soon)
- Tier A: Carbon-Aware ML Training
VIII. FAQ
Is "cost per token" enough?
No. Cost per token is the wholesale price. The number a CFO can act on is cost per outcome (per ticket resolved, per document summarized, per decision automated).
How do I forecast a brand new feature?
Use a Fermi estimate based on expected DAU times calls per session times tokens per call. Then put a confidence interval on it that is wide enough to be honest. Refit weekly until you have one month of real data.
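A minimal sketch of that Fermi estimate with an honest interval. The DAU, usage, and blended price figures are purely illustrative assumptions to be refit weekly:

```python
# Fermi forecast for a brand new feature: expected DAU x calls per session x tokens per call.

BLENDED_PRICE_PER_1K = 0.004  # assumed blended input+output price per 1,000 tokens

def monthly_token_forecast(dau: float, calls_per_session: float, tokens_per_call: float) -> float:
    return dau * calls_per_session * tokens_per_call * 30  # 30 days in the month

# Point estimate plus a deliberately wide low/high band.
scenarios = {
    "low":   monthly_token_forecast(dau=20_000, calls_per_session=1.0, tokens_per_call=1_200),
    "point": monthly_token_forecast(dau=50_000, calls_per_session=2.0, tokens_per_call=2_000),
    "high":  monthly_token_forecast(dau=90_000, calls_per_session=3.5, tokens_per_call=3_000),
}
for name, tokens in scenarios.items():
    print(f"{name:>5}: {tokens / 1e9:.2f}B tokens  ~${tokens / 1000 * BLENDED_PRICE_PER_1K:,.0f}/month")
```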
Should I self host?
Self hosting is rarely cheaper at small or medium scale once you account for engineering time, GPU underutilization, and on call. It becomes interesting above roughly two billion tokens per month or where data residency makes API consumption infeasible.
How do I justify the gateway investment?
Show the prompt cache hit rate it unlocks (typical first three months: 30 to 50 percent of input tokens removed) and the model routing savings (typically 20 to 40 percent). Both numbers are large, and the decision is reversible if the gateway underperforms.
Do I need a separate AI budget line?
Yes. Burying AI cost inside compute or SaaS prevents the CFO from doing capacity planning. Carve out a line, even if it sits inside R and D.
How do I know when to fine tune?
Fine tune when the input prompt is large and stable, the volume is high, and a smaller fine tuned model can replace a larger model on the same eval. Run the math: training cost amortized over expected volume must beat the difference in inference cost.
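A minimal sketch of that break-even arithmetic. The training cost and per-token prices are placeholder assumptions, not quotes:

```python
# Break-even volume for fine tuning: the one-off training and labor cost must be
# recovered by the per-token saving of serving the cheaper fine tuned model.

def break_even_tokens(training_cost_usd: float,
                      base_price_per_1k: float,
                      finetuned_price_per_1k: float) -> float:
    """Tokens you must serve before the fine tune pays for itself."""
    saving_per_1k = base_price_per_1k - finetuned_price_per_1k
    if saving_per_1k <= 0:
        raise ValueError("No inference saving: fine tuning cannot break even on price alone.")
    return training_cost_usd / saving_per_1k * 1000

# Hypothetical: $18,000 of training + labor, replacing a $0.010/1K model with a $0.004/1K one.
tokens = break_even_tokens(18_000, base_price_per_1k=0.010, finetuned_price_per_1k=0.004)
print(f"break-even volume: {tokens / 1e9:.1f}B tokens")  # ~3.0B tokens
# Compare this against the feature's forecast volume: below it, the labor and infra
# cost dwarfs the price reduction, exactly as the failure list above warns.
```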
How do I handle model deprecation?
Build the eval before you build the feature. When the vendor announces deprecation, run the eval against the replacement model and renegotiate the routing policy. Keep at least one fallback model approved at all times.