Projected vs verified savings: how vendors fudge LLM cost numbers

ReasonRank Engineering · 2026-07-02

Every LLM cost tool shows you a dollar figure. Almost none of them will tell you where it came from. Having built (and audited) this math ourselves, here are the four places a savings number goes quietly wrong — each one something we had to fix in our own pipeline before we'd show the number to a customer.

1. The apples-to-oranges baseline

The most common fudge: compare the new model's production cost against the old model's eval-suite cost. Eval suites have short, curated prompts; your production traffic has 4k-token contexts and retries. The comparison flatters whichever side uses fewer tokens per call.

Honest version: production versus production. Snapshot the old model's trailing production cost-per-call before the switch, and compare it to the new model's production cost-per-call after. An eval-to-production comparison is a projection — useful, but it must never be labeled "verified."
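
A minimal Python sketch of that labeling rule. The Call record, its field names, and the dollar amounts are hypothetical, made up for illustration; the only thing taken from the text above is the rule that a comparison counts as "verified" only when both sides are production traffic.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Call:
    model: str      # model name recorded on the trace (hypothetical field)
    cost_usd: float # billed cost of this one call
    source: str     # "production" or "eval" (hypothetical field)


def cost_per_call(calls):
    """Average cost of a call, in USD."""
    return mean(c.cost_usd for c in calls)


def savings_per_call(baseline, post):
    """Old cost-per-call minus new cost-per-call."""
    return cost_per_call(baseline) - cost_per_call(post)


def savings_label(baseline, post):
    """'verified' only if every call on both sides is production traffic."""
    both_production = all(c.source == "production" for c in baseline + post)
    return "verified" if both_production else "projected"


# Illustrative numbers only.
old_prod = [Call("old-model", 0.040, "production"), Call("old-model", 0.060, "production")]
new_prod = [Call("new-model", 0.020, "production"), Call("new-model", 0.030, "production")]
new_eval = [Call("new-model", 0.004, "eval"), Call("new-model", 0.006, "eval")]

print(savings_label(old_prod, new_prod), savings_per_call(old_prod, new_prod))
print(savings_label(old_prod, new_eval), savings_per_call(old_prod, new_eval))
```

The eval-side comparison produces the bigger saving, which is exactly why it has to carry the "projected" label.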

2. Savings "verified" on three calls

If a tool reports realized savings, ask: on how many calls? A cost-per-call average over a handful of traces is noise wearing a suit. We refuse to report realized savings until at least 20 post-switch calls have run on the new model. That floor is intentionally low, but it still filters out the most embarrassing failure mode, and the sample count ships with the number so you can apply your own standard.
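
A sketch of that floor in Python. The 20-call minimum comes from the paragraph above; the function and field names are hypothetical, and computing realized dollars as per-call savings times the number of post-switch calls is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Optional

MIN_POST_SWITCH_CALLS = 20  # floor stated in the text above


@dataclass
class RealizedSavings:
    amount_usd: Optional[float]  # None means "withheld, sample too small"
    sample_count: int            # always reported, even when withheld


def realized_savings(baseline_cost_per_call, post_switch_costs):
    """Report realized savings only once enough post-switch calls exist."""
    n = len(post_switch_costs)
    if n < MIN_POST_SWITCH_CALLS:
        return RealizedSavings(amount_usd=None, sample_count=n)
    post_cost_per_call = sum(post_switch_costs) / n
    # Assumption: realized dollars = per-call savings x post-switch calls.
    amount = (baseline_cost_per_call - post_cost_per_call) * n
    return RealizedSavings(amount_usd=amount, sample_count=n)


too_few = realized_savings(0.05, [0.02] * 3)
enough = realized_savings(0.05, [0.02] * 20)
print(too_few)
print(enough)
```

Returning the sample count in both branches is the point: a reader with a stricter standard than 20 can discard the figure themselves.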

3. Nobody checked the switch actually happened

The quietest lie of all. A customer clicks "apply" and the dashboard starts counting savings, but the code still pins the old model. If total spend then drops because traffic dipped, the tool happily attributes the drop to a switch that never shipped.

Honest version: match post-switch traces against the new model's name, compute the share of traffic that actually moved, and withhold every realized dollar until the switch is detected in live traffic. Our UI shows this as an explicit "on hold — traffic still on the old model" state, with instructions to fix it, because silently counting is worse than admitting the switch didn't ship.
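
The detection step might look roughly like this in Python. All names here are hypothetical. Two assumptions to flag: model names are matched by exact string equality (real traces may carry versioned names that need normalizing), and any nonzero share counts as "detected", since the threshold is a policy choice this sketch does not settle.

```python
def moved_share(trace_models, new_model):
    """Fraction of post-switch traces that actually ran on the new model."""
    if not trace_models:
        return 0.0
    moved = sum(1 for name in trace_models if name == new_model)
    return moved / len(trace_models)


def savings_state(trace_models, new_model):
    """Hold every realized dollar until the new model shows up in traffic."""
    share = moved_share(trace_models, new_model)
    if share == 0.0:
        return "on hold: traffic still on the old model"
    return f"switch detected on {share:.0%} of traffic"


# Illustrative traces: model names as recorded on each post-switch call.
never_shipped = ["old-model"] * 4
half_moved = ["old-model", "old-model", "new-model", "new-model"]

print(savings_state(never_shipped, "new-model"))
print(savings_state(half_moved, "new-model"))
```

Reporting the share rather than a yes/no also covers partial rollouts, where only some call sites were updated.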

4. Prices that were true last spring

Cost math inherits the freshness of its price table. Providers reprice models several times a year; a savings projection computed from stale prices is fiction with decimal places. Two honest mechanisms: show the as-of date next to every dollar figure, and fail loudly when the table goes stale. (Our CI has a tripwire test that fails the build when prices haven't been re-verified in 60 days — the whole product's credibility rests on that table.)
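
A tripwire of that kind can be very small. The sketch below is an illustration under assumptions: the function names and the dates are hypothetical, and it treats a table as stale once it is more than 60 days old (whether day 60 itself passes is a choice made here, not something stated above).

```python
from datetime import date

MAX_PRICE_AGE_DAYS = 60  # re-verification window stated in the text above


def price_age_days(verified_on, today):
    """Whole days since the price table was last re-verified."""
    return (today - verified_on).days


def check_prices_fresh(verified_on, today=None):
    """Raise AssertionError (failing the test run) if the table is stale."""
    if today is None:
        today = date.today()
    age = price_age_days(verified_on, today)
    if age > MAX_PRICE_AGE_DAYS:
        raise AssertionError(
            f"price table last verified {verified_on.isoformat()}, "
            f"{age} days ago; limit is {MAX_PRICE_AGE_DAYS}"
        )


# Illustrative dates: 31 days old passes, 90 days old trips the wire.
check_prices_fresh(date(2026, 1, 1), today=date(2026, 2, 1))
try:
    check_prices_fresh(date(2026, 1, 1), today=date(2026, 4, 1))
except AssertionError as err:
    print("build would fail:", err)
```

Called from a test with the table's real verification date, the check turns a forgotten price update into a red build instead of a wrong dollar figure.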

The test to apply to any vendor (including us)

Ask for one screenshot: a savings figure with its evidentiary basis attached. Not a dashboard total — one number, with the baseline, the sample, and the detection state visible. Tools that have the receipts show them by default.
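
One way to picture "a number with its receipts" is a record that cannot be rendered without them. This is a hypothetical shape, not a description of any vendor's schema; the field names and the figures are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SavingsEvidence:
    savings_usd: float
    baseline_kind: str     # "production" or "eval"
    sample_count: int      # post-switch calls behind the figure
    switch_detected: bool  # new model seen in live traffic

    def render(self):
        """Format the figure together with its evidentiary basis."""
        verified = self.baseline_kind == "production" and self.switch_detected
        label = "verified" if verified else "projected"
        detected = "yes" if self.switch_detected else "no"
        return (
            f"${self.savings_usd:.2f} {label} "
            f"(baseline: {self.baseline_kind}, n={self.sample_count}, "
            f"switch detected: {detected})"
        )


print(SavingsEvidence(12.5, "production", 240, True).render())
print(SavingsEvidence(12.5, "eval", 240, True).render())
```

Because the label is derived from the evidence fields rather than stored alongside them, a figure cannot say "verified" while its own baseline says "eval".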

You can check the price-freshness half of this right now: our model price pages carry their as-of date on every figure, and the cost calculator uses the identical table the product uses in production.