TL;DR
- Most enterprise AI dashboards measure the wrong six numbers. Token volume, vendor mix, per-seat licenses, pilot count, accuracy percentage, and a synthetic 'AI ROI' line item — every one of them misleads a board.
- The right six are cost per verified outcome, route win rate, cache leverage ratio, locality hit rate, direct allocation percentage, and quality-adjusted margin. Each is a single number. Each is actionable. Each moves quarter over quarter if the platform is working.
- The cost per verified outcome is the only north-star metric that survives contact with a CFO. Tokens per million is a procurement input; cost per verified outcome is the operating result.
- Three archetypes — Buyer, Operator, Compounder — explain why two organizations with identical AI budgets can produce order-of-magnitude-different economics thirteen weeks later. The gap is operating discipline, not technology.
- Screenshot the scorecard, paste it into the next board deck, and the AI conversation in the room stops being about pilots and starts being about performance.
A CFO messaged me on LinkedIn a few weeks after The CEO's Guide to Token Economics went up. She did not want a full consultation; she wanted to send me one picture. It arrived in the reply: a photograph of a single page pulled out of her board deck, twelve bullet points long, titled "AI Dashboard." Her caption was one line. "I cannot defend a single number on this page."
Total token volume. Vendor mix across five providers. Per-seat license counts for three different copilot products. The number of active pilots. An aggregate "accuracy" score that nobody on the earnings-prep call had been able to explain. And, at the bottom, an "AI ROI" line item so synthetic it might as well have been hand-drawn.
On the follow-up call a week later, I asked her what she wanted instead. She said: "I want six numbers I can read in thirty seconds and defend to my board for thirty minutes."
That is this post.
The CEO's Guide to Token Economics argued that cost per verified outcome is the north-star metric. Data Gravity Meets Token Economics argued that placement is the other half of the discipline. Designing the AI Control Plane argued that the platform has to own the schema that makes any of this measurable. This post is the scorecard — the specific six metrics the control plane should emit, the trajectory they move along when the platform is actually being run, and the one-page replacement for the AI slide that currently confuses your board.
It is written for a CFO, a CEO, and a board member. Nothing in this post requires a PhD to read. Everything in this post requires a decision to implement.
The wrong six
Before the right six, the six that almost every enterprise is currently reporting — and the reason each one misleads.
Total token volume. Tells you how much AI was consumed. Tells you nothing about value. A platform that doubled token volume while outcomes stayed flat is losing money faster, not winning more work.
Vendor mix by spend. Tells the procurement team where the contracts sit. Tells the board nothing about whether the money is being well spent. Spend concentrated with one vendor could mean discipline or capture; spend diffused across seven could mean optionality or chaos. The metric cannot tell you which.
Per-seat license count. Measures tool adoption at a surface level. Ignores the fact that the heaviest users often consume ten to fifty times the tokens per seat that the median user does, and that a license without an agent behind it is a cost without a workload.
Pilot count. The classic vanity metric. Pilots are cheap. Scale is expensive. A board that is reading pilot counts is reading a number that peaks eighteen months before the bill does.
Aggregate accuracy percentage. Almost always a misleading blend across workloads with wildly different quality floors. A 92% accuracy on a mix of "summarize this email" and "redline this contract" tells you nothing about whether either workload is working. This is an engineering metric published as a business metric.
AI ROI. Usually calculated as a ratio: the estimated value of FTE hours saved, divided by AI spend. The numerator is guessed. The denominator is soft. The quotient is a line item the CFO learns to distrust inside one quarter. If you cannot explain a number cold to a skeptical audit partner, stop using it.
The problem with all six is the same: they measure the appearance of doing AI, not the discipline of operating it. The six that matter measure the discipline.
The right six
The enterprise token scorecard
Six numbers the CFO should be able to read in thirty seconds. Every one is actionable. Every one moves quarter over quarter if the platform is working.
Cost per verified outcome (the first card on the scorecard):
Definition: fully loaded cost — tokens, context, rework, review, placement premium — divided by the count of verified outcomes the workload produced. The north-star metric in enterprise AI.
Why it matters: the only number that tells an executive whether AI is creating value. Tokens per million is a procurement input; cost per verified outcome is the operating result. If you can move one metric, move this one.
How to move it: cache more, route better, shrink rework, tighten evals. Every one of the other five KPIs contributes to this number. Fix the leaks; this moves.
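To make the division concrete with deliberately invented numbers: suppose a contract-review workload consumes $18,000 in tokens, $7,000 in context assembly and placement premium, and $8,000 in human review and rework over a quarter, and delivers 1,100 redlines that clear its eval floor. The fully loaded cost is $33,000, and the cost per verified outcome is $33,000 / 1,100 = $30. Every figure in that sentence is illustrative; the point is that the numerator includes everything it took to get the outcome, not just the token bill.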
Screenshot this scorecard. Paste it into the next board deck. Replace the page of bullet-point AI pilot updates with these six numbers and a trend line, and the AI conversation in the room changes.
One visual. Six numbers. Click any of them to read the definition, the reason it matters, and the specific operating moves that improve it. Every single one is calculable today from the run ledger the control plane is already emitting — if the control plane is a real platform rather than a pile of plumbing.
Walk through them briefly.
Cost per verified outcome. The north star. Fully loaded cost — tokens, context, rework, human review, placement premium — divided by the count of outcomes the workload actually delivered at the workload's quality floor. This is the number the CEO asks about. This is the number the CFO defends to the board. Every other metric on the scorecard is subordinate to it.
Route win rate. The share of requests the router sent to a utility-tier model without an eval-driven escalation. If this number is under 50%, the platform is paying frontier prices for middle-tier work. If it is above 70%, the router is doing its job. No other metric moves cost per outcome faster.
Cache leverage ratio. The share of input tokens served from cache. Major providers discount cached input to 10% of standard price. A mature operator sits above 40%. Most enterprises are in single digits because no team owns system prompts and policy blocks as a shared asset rather than as per-app strings.
Locality hit rate. The share of regulated or sovereignty-bound workloads that ran in the correct residency zone, by policy. A number below 90% is a regulatory disclosure waiting to happen. Close to 100% is a defensible answer to any auditor.
Direct allocation percentage. The share of AI spend allocated to a named business-domain owner in FOCUS-compliant records. When this number is high, every domain is on the hook for the spend it generates. When it is low, AI spend hides in shared infrastructure and nobody owns outcomes.
Quality-adjusted margin. Margin on AI-delivered work, weighted by the share of outputs that passed the workload's eval floor. The metric that keeps the platform honest. You cannot book savings on work that had to be redone — and this is the single number that prevents the platform from doing so.
None of these are exotic. Every one of them should be emitting from the INFERENCE_RUN table described in the architecture post. If any of them is not — if the CIO says "we do not collect that field" — that answer is itself a finding.
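To make the six definitions concrete, here is a minimal sketch of how they could be computed from a run ledger. Everything in it is an assumption for illustration: the record fields (workload, cached_input_tokens, escalated, domain_owner, and so on) are hypothetical stand-ins for whatever the INFERENCE_RUN schema actually records, and the arithmetic is deliberately simplified, not a reference implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Run:
    # One row of a hypothetical run ledger. Field names are illustrative
    # stand-ins, not the actual INFERENCE_RUN schema.
    workload: str
    input_tokens: int
    cached_input_tokens: int      # portion of input_tokens served from cache
    fully_loaded_cost: float      # tokens + context + rework + review + placement premium
    utility_tier: bool            # request was served by a utility-tier model
    escalated: bool               # eval-driven escalation to a frontier model
    regulated: bool               # workload is regulated or sovereignty-bound
    in_policy_region: bool        # ran in the residency zone policy requires
    domain_owner: Optional[str]   # named business-domain owner, if allocated
    passed_eval: bool             # output cleared the workload's quality floor
    booked_margin: float          # margin booked on this run's output

def scorecard(runs: list[Run]) -> dict[str, float]:
    """Compute the six KPIs for one reporting period. Simplified on purpose."""
    n = max(len(runs), 1)
    total_cost = sum(r.fully_loaded_cost for r in runs)
    total_margin = sum(r.booked_margin for r in runs)
    regulated = [r for r in runs if r.regulated]
    return {
        # 1. North star: fully loaded cost per outcome that cleared the eval floor.
        "cost_per_verified_outcome": total_cost / max(sum(r.passed_eval for r in runs), 1),
        # 2. Share of requests served on utility tier without an escalation.
        "route_win_rate": sum(r.utility_tier and not r.escalated for r in runs) / n,
        # 3. Share of input tokens served from cache.
        "cache_leverage_ratio": sum(r.cached_input_tokens for r in runs)
        / max(sum(r.input_tokens for r in runs), 1),
        # 4. Share of regulated runs that landed in the correct residency zone.
        "locality_hit_rate": (sum(r.in_policy_region for r in regulated) / len(regulated))
        if regulated else 1.0,
        # 5. Share of spend carried by a named business-domain owner.
        "direct_allocation_pct": sum(r.fully_loaded_cost for r in runs if r.domain_owner)
        / max(total_cost, 1e-9),
        # 6. One simple reading of quality adjustment: weight total margin by the pass rate.
        "quality_adjusted_margin": total_margin * (sum(r.passed_eval for r in runs) / n),
    }
```

In practice these would be grouped by workload and by week so the scorecard and the trend lines fall out of the same ledger; the point of the sketch is only that none of the six requires anything more exotic than the rows the control plane already emits.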
The trend line, not the snapshot
One number in any quarter is an artifact. Six numbers moving together over a quarter is a pattern. The board should be reading both.
Thirteen weeks, six panels
The compressed version of the scorecard. One line, six sparklines, one quarter of trajectory. This is what belongs on the monthly board reporting rail.
[Six sparkline panels: cost per outcome, route win rate, cache leverage, locality hit, direct allocation, QA margin.]
This is the compressed view. Six sparklines. Thirteen weeks. One glance.
The shape of the lines matters more than any individual number. Cost per outcome sloping down. Route win rate, cache leverage, locality hit, and direct allocation all sloping up. Quality-adjusted margin clawing forward. A platform that is healthy across all six moves in a coordinated way — because the metrics are reinforcing. Cache leverage compounds into route win rate. Route win rate compounds into cost per outcome. Direct allocation compounds into accountability, which compounds into quality-adjusted margin.
A platform that is sick across the scorecard shows the opposite — cost trending up, cache and route flat, allocation sliding, margin quietly giving way. That pattern is the diagnostic. The organization does not have a model problem or a vendor problem. It has an operating-discipline problem. The scorecard is how you see it before the invoice tells you.
Three archetypes, thirteen weeks
The gap between two enterprises with identical AI budgets does not appear in a single number. It appears in the trajectory.
What getting better looks like
Three maturity archetypes, thirteen weeks of trajectory. Toggle the archetype to see how each KPI evolves when a platform is actually being run.
The gap between the Buyer line and the Compounder line is not a technology gap. It is an operating discipline gap, and it compounds quarter over quarter.
Three archetypes, each a composite of the operators I've been trading notes with over the last year. Toggle between them. Toggle between the KPIs. Watch how the lines diverge.
The Buyer. Cost trends up quarter over quarter. Route mix stays stuck on the biggest model. Cache leverage never gets out of single digits. Direct allocation stays near zero because spend sits in a shared-infrastructure bucket. Board conversations about AI remain anecdotal — "we launched three pilots," "adoption is up," "engineering likes it" — because the numbers that would support or contradict those anecdotes do not exist. The Buyer is spending confidently and learning slowly.
The Operator. Cost is flat to down. Route mix shifts decisively toward the utility tier as eval specs tighten. Cache leverage climbs into the 30s. Direct allocation clears 70%. The scorecard is on the monthly board reporting rail. The Operator is running a platform; the economics move the way a platform's economics are supposed to move.
The Compounder. The most interesting archetype. Cost per outcome drops fastest of the three — not linearly, but with an acceleration, because each improvement reinforces the next. Route mix is fully automated. Cache leverage clears 50% as compiled context becomes an owned asset. Allocation is near-perfect. The economics improve faster than the business grows. This is the flywheel the Rent vs Own series is about — the organization has stopped renting everything and started owning the pieces that compound.
The technology is the same for all three. The models are the same. The vendors are the same. The difference is not what they buy. It is how they run.
What should be on the next board slide
Replace the current slide.
Five lines. One image. The image is the scorecard above. The five lines are the ones most CEOs cannot defend today, which is precisely why they are the ones that establish credibility:
- Cost per verified outcome this quarter, and the trend. The north-star number and its trajectory. If it is not trending down against a rising outcome volume, something is wrong with the platform.
- Share of AI spend with a named domain owner. Direct allocation percentage. Tells the audit committee that AI is being governed, not merely consumed.
- Route mix and cache leverage. The efficiency story in a sentence. "Seventy-one percent of requests ran on utility tier with thirty-eight percent cache leverage, saving $X relative to the default route" is a defensible sentence that means something (a worked example of that arithmetic follows this list).
- Locality hit rate, with the regulation it is measured against. The compliance story, preempted. Auditors like boards that know this number without being asked.
- Quality-adjusted margin, with the eval specs that produced it. The value-capture story, audit-proof. Savings booked on outputs that passed a published quality floor cannot be argued with later.
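To show the shape of that savings arithmetic with invented numbers: assume two billion input tokens in a month, a frontier rate of $3.00 per million input tokens, a utility-tier rate of $0.30 per million, cached input billed at 10% of the applicable rate, and a token split that mirrors the request split. All-frontier, the input bill is $6,000. With 71% of the volume on utility tier, it falls to roughly $2,170. With 38% of input tokens served from cache on top of that, it falls to roughly $1,430, under a quarter of the default-route bill, before output tokens and the rest of the fully loaded cost are counted. The rates and volumes are illustrative; the structure of the calculation is the point.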
One slide. Six KPIs on the scorecard. Five defensible sentences. If the CIO cannot produce that slide today, the conversation with the board can be reframed as "here is what we will be able to produce by next quarter." That conversation is the point.
The CEO-to-CIO questions, again
Five questions an executive should feel able to ask the CIO cold. The answers are diagnostic.
- What is our cost per verified outcome this quarter? If the answer is a token volume, the platform is not instrumented. If the answer is a number with a trend arrow, you are operating.
- Which workloads improved on the scorecard this quarter, and which got worse? Delta-by-workload is the signal. An average is a lie; the shape of the distribution is the truth.
- Who owns each of the six KPIs? Scorecards without owners are decorations. The CIO should be able to name a human for each number.
- What would it take to move the worst KPI one band in ninety days? Tests whether the CIO actually believes the scorecard is actionable. A thoughtful answer involves specific workloads, specific trade-offs, and specific commitments.
- What is our route win rate on the top ten workloads by volume? The question shows you know where the leverage is. The answer tells you whether the CIO does.
You do not need to be able to write the queries. You need to be able to read them when the CIO sends them.
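For a sense of what reading one of those queries feels like, here is a hedged sketch of the kind of artifact that might come back for the second question, the delta by workload. It assumes ledger rows with the same illustrative fields as the earlier sketch plus a quarter label; none of the names are the real schema.

```python
from collections import defaultdict

def cost_per_outcome_by_workload(runs, quarter):
    """Cost per verified outcome, per workload, for one quarter.
    `runs` are ledger rows with illustrative fields: workload, quarter,
    fully_loaded_cost, passed_eval."""
    cost, outcomes = defaultdict(float), defaultdict(int)
    for r in runs:
        if r.quarter != quarter:
            continue
        cost[r.workload] += r.fully_loaded_cost
        outcomes[r.workload] += r.passed_eval
    return {w: cost[w] / max(outcomes[w], 1) for w in cost}

def delta_by_workload(runs, prev_quarter, this_quarter):
    """Which workloads improved and which got worse, sorted best to worst.
    A negative delta means cost per verified outcome fell: the workload improved."""
    prev = cost_per_outcome_by_workload(runs, prev_quarter)
    curr = cost_per_outcome_by_workload(runs, this_quarter)
    return sorted(
        ((w, curr[w] - prev[w]) for w in curr if w in prev),
        key=lambda pair: pair[1],
    )
```

If the CIO can send back something with this shape, and the rows behind it, the second question has an answer. If not, that gap is itself the finding.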
The leadership move
Stop measuring what is easy. Start measuring what is governable.
The organizations that are going to win at enterprise AI over the next five years are not the ones with the most pilots, the most vendors, or the biggest per-seat license footprint. They are the ones whose scorecard reads like an operating system — every KPI a lever, every lever owned by a human, every human held to a trend line.
The difference between the Buyer and the Compounder is not luck. It is not budget. It is not model choice. It is a decision made in one room, at one board meeting, by one CEO: we will run AI against a scorecard, and the scorecard will be the scorecard.
Everything else follows.
This is the measurement post in the executive token-economics thread. The frame is in The CEO's Guide to Token Economics. The placement dimension is in Data Gravity Meets Token Economics. The architecture that emits these numbers is in Designing the AI Control Plane. The portfolio decisions the scorecard unlocks are in the Rent vs Own series. The technical substrate lives in the three-part Token Economy series. Every post in the thread meets in the scorecard — because you cannot run what you cannot measure.