AI Financial Operations
Treat tokens, models, and prompts as financial primitives. Forecast, attribute, and govern the spend.
I. Why this matters
AI cost is unlike any cost line a CFO has seen. The unit is a token, the unit price is volatile, the model can be changed by the vendor, and the consumer is often a product feature that has no historical baseline. Without a real cost per outcome model, AI spend looks like an open faucet. AI Financial Operations defines the unit, sets the budget per outcome, and gives every product team a per request bill.
II. Principles
- Cost per outcome, not cost per call. A summarization endpoint that costs 2 cents per call is fine if it replaces a 5 dollar human task; it is a disaster if it replaces a free database lookup. A worked example follows this list.
- Attribute every token to a feature flag and a customer segment. If you cannot answer "who consumed those tokens", you cannot govern.
- Cache aggressively at the prompt level. Prompt caching alone typically removes 30 to 60 percent of input cost on production workloads.
- Route to the cheapest model that meets the quality bar. A small model that is good enough is always cheaper than a large model that is excellent.
- Treat fine tuning as a buy versus build decision. Compute the break even tokens before signing off.
- Forecast monthly with a confidence interval. AI workloads breathe with product launches; a single point forecast will mislead.
- Govern PII and IP at the prompt boundary. Cost discipline and data discipline share the same logging substrate.
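A minimal sketch of the cost per outcome arithmetic. The prices, volumes, and the summarization feature are illustrative assumptions, not vendor quotes or benchmarks:

```python
# Cost per outcome, not cost per call: a worked example with placeholder numbers.

INPUT_PRICE_PER_1K = 0.003   # dollars per 1,000 input tokens (assumed)
OUTPUT_PRICE_PER_1K = 0.015  # dollars per 1,000 output tokens (assumed)

def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    """Wholesale cost of a single model call."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

def cost_per_outcome(monthly_spend: float, outcomes: int) -> float:
    """The number a CFO can act on: spend divided by units of work completed."""
    return monthly_spend / outcomes

# Hypothetical summarization feature: 120,000 calls produce 100,000 usable summaries.
call_cost = cost_per_call(input_tokens=1800, output_tokens=400)   # ~$0.011 per call
monthly_spend = call_cost * 120_000
print(f"cost per call:    ${call_cost:.4f}")
print(f"cost per outcome: ${cost_per_outcome(monthly_spend, 100_000):.4f}")
# Judge the result against the task it replaces: a 5 dollar human summary makes
# this a bargain; a free database lookup makes it a disaster.
```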
III. KPIs
IV. The playbook spine
- Define the unit of work. For each AI feature pick the unit (call, document, image, decision) and write down the formula for cost per unit.
- Wrap every model call in a logged proxy. Log feature flag, customer segment, model, input tokens, output tokens, latency, quality outcome, cost. A sketch of this proxy follows this list.
- Land the data in BigQuery. One row per call. Partition by date. Cluster by feature.
- Build the monthly cost per outcome dashboard. One row per feature: spend, calls, outcomes, cost per outcome, week over week trend.
- Install prompt caching at the gateway. Hash by stable prompt prefix. Measure cache hit rate weekly.
- Set a model routing policy. Default to the smallest model that passes a published quality eval; escalate only on flagged classes.
- Negotiate commit pricing. Once you have three months of stable consumption, sign a committed use discount or token reservation against the dominant model.
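A minimal sketch of the logged proxy and the cache key from steps 2 and 6. The `call_model` client, the `PRICE_TABLE` entries, the model names, and the row field names are all placeholder assumptions; the record is shaped to land one row per call in a warehouse table like the BigQuery one described above:

```python
import hashlib
import time
import uuid
from dataclasses import dataclass, asdict

# Placeholder per-1K-token prices per model; real prices come from your vendor contract.
PRICE_TABLE = {
    "small-model": {"input": 0.0005, "output": 0.0015},
    "large-model": {"input": 0.0030, "output": 0.0150},
}

ROWS: list[dict] = []  # in production, stream these rows into the warehouse instead

@dataclass
class CallRecord:
    """One row per call, ready to land in the warehouse (partition by date, cluster by feature)."""
    call_id: str
    ts: float
    feature_flag: str
    customer_segment: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    quality_outcome: str
    cost_usd: float
    prompt_prefix_hash: str  # cache key for the gateway's prompt cache

def prompt_prefix_hash(prompt: str, prefix_chars: int = 2000) -> str:
    """Hash the stable prefix of the prompt; identical prefixes can share a cache entry."""
    return hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).hexdigest()

def logged_call(call_model, prompt: str, *, feature_flag: str, customer_segment: str,
                model: str, quality_outcome: str = "unknown") -> CallRecord:
    """Wrap a vendor call (call_model is a stand-in for your client) and emit one attributed row."""
    start = time.monotonic()
    response = call_model(model=model, prompt=prompt)  # assumed to return token counts + text
    latency_ms = (time.monotonic() - start) * 1000
    prices = PRICE_TABLE[model]
    cost = (response["input_tokens"] / 1000) * prices["input"] + \
           (response["output_tokens"] / 1000) * prices["output"]
    record = CallRecord(
        call_id=str(uuid.uuid4()),
        ts=time.time(),
        feature_flag=feature_flag,
        customer_segment=customer_segment,
        model=model,
        input_tokens=response["input_tokens"],
        output_tokens=response["output_tokens"],
        latency_ms=latency_ms,
        quality_outcome=quality_outcome,
        cost_usd=cost,
        prompt_prefix_hash=prompt_prefix_hash(prompt),
    )
    ROWS.append(asdict(record))
    return record
```

The quality outcome often arrives after the call (a ticket marked resolved, a summary accepted); leaving it as a field on the same row lets you backfill it and keep cost and quality joined in one table.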
V. Common failures
- Logging tokens but not features. The bill is correct, but you cannot tell which product caused it.
- Defaulting to the most capable model for every call. The "use the best model" reflex costs more than it returns 80 percent of the time.
- Skipping eval when you switch models. Cost dropped, quality silently dropped further.
- Letting prompt drift balloon the input. A 200 token prompt that grows to 2000 over six months quietly multiplies cost.
- Not capping retries. A degraded model with 5 retries becomes 5x more expensive overnight.
- Treating fine tuning as a free win. Fine tuning amortizes only above a usage threshold; below it, the labor and infra cost dwarfs any price reduction.
- Letting PII land in logs because the cost team owns the proxy and the privacy team does not.
VI. Recommended tooling
Vendor neutral. For graded vendor comparisons see the Matrix.
- AI gateway and proxy
- Prompt cache layer
- Model router and ensemble manager
- Token attribution and chargeback
- Eval and quality monitoring
- PII redaction and policy enforcement
- Forecast and anomaly detection
- Negotiation and commit management
VII. Related IFO4 playbooks
- ai-cost-per-inference (coming soon)
- Tier A: Carbon-Aware ML Training
VIII. FAQ
Is "cost per token" enough?
No. Cost per token is the wholesale price. The number a CFO can act on is cost per outcome (per ticket resolved, per document summarized, per decision automated).
How do I forecast a brand new feature?
Use a Fermi estimate based on expected DAU times calls per session times tokens per call. Then put a confidence interval on it that is wide enough to be honest. Refit weekly until you have one month of real data.
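A minimal sketch of that Fermi estimate with an honest interval. The DAU, usage, and blended price figures are purely illustrative assumptions to be refit weekly:

```python
# Fermi forecast for a brand new feature: expected DAU x calls per session x tokens per call.

BLENDED_PRICE_PER_1K = 0.004  # assumed blended input+output price per 1,000 tokens

def monthly_token_forecast(dau: float, calls_per_session: float, tokens_per_call: float) -> float:
    return dau * calls_per_session * tokens_per_call * 30  # 30 days in the month

# Point estimate plus a deliberately wide low/high band.
scenarios = {
    "low":   monthly_token_forecast(dau=20_000, calls_per_session=1.0, tokens_per_call=1_200),
    "point": monthly_token_forecast(dau=50_000, calls_per_session=2.0, tokens_per_call=2_000),
    "high":  monthly_token_forecast(dau=90_000, calls_per_session=3.5, tokens_per_call=3_000),
}
for name, tokens in scenarios.items():
    print(f"{name:>5}: {tokens / 1e9:.2f}B tokens  ~${tokens / 1000 * BLENDED_PRICE_PER_1K:,.0f}/month")
```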
Should I self host?
Self hosting is rarely cheaper at small or medium scale once you account for engineering time, GPU underutilization, and on call. It becomes interesting above roughly two billion tokens per month or where data residency makes API consumption infeasible.
How do I justify the gateway investment?
Show the prompt cache hit rate it unlocks (typical first three months: 30 to 50 percent of input tokens removed) and the model routing savings (typically 20 to 40 percent). Both numbers are large, and the decision is reversible if the gateway underperforms.
Do I need a separate AI budget line?
Yes. Burying AI cost inside compute or SaaS prevents the CFO from doing capacity planning. Carve out a line, even if it sits inside R and D.
How do I know when to fine tune?
Fine tune when the input prompt is large and stable, the volume is high, and a smaller fine tuned model can replace a larger model on the same eval. Run the math: training cost amortized over expected volume must beat the difference in inference cost.
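A minimal sketch of that break-even arithmetic. The training cost and per-token prices are placeholder assumptions, not quotes:

```python
# Break-even volume for fine tuning: the one-off training and labor cost must be
# recovered by the per-token saving of serving the cheaper fine tuned model.

def break_even_tokens(training_cost_usd: float,
                      base_price_per_1k: float,
                      finetuned_price_per_1k: float) -> float:
    """Tokens you must serve before the fine tune pays for itself."""
    saving_per_1k = base_price_per_1k - finetuned_price_per_1k
    if saving_per_1k <= 0:
        raise ValueError("No inference saving: fine tuning cannot break even on price alone.")
    return training_cost_usd / saving_per_1k * 1000

# Hypothetical: $18,000 of training + labor, replacing a $0.010/1K model with a $0.004/1K one.
tokens = break_even_tokens(18_000, base_price_per_1k=0.010, finetuned_price_per_1k=0.004)
print(f"break-even volume: {tokens / 1e9:.1f}B tokens")  # ~3.0B tokens
# Compare this against the feature's forecast volume: below it, the labor and infra
# cost dwarfs the price reduction, exactly as the failure list above warns.
```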
How do I handle model deprecation?
Build the eval before you build the feature. When the vendor announces deprecation, run the eval against the replacement model and renegotiate the routing policy. Keep at least one fallback model approved at all times.