The 10-minute LLM spend audit

ReasonRank Engineering · 2026-07-02

Most teams can't answer "what does each of our agents cost per month?" — not because it's hard, but because nobody has spent the ten minutes. Here is the procedure. No signup required for any of it.

Minute 0–2: pull the three numbers per workload

From your provider dashboard (OpenAI usage page, Anthropic console, or your gateway logs), for each distinct workload — support triage, extraction, summarization, whatever — grab:

Calls per month (requests, not tokens)
Average input tokens per call
Average output tokens per call

If you can't split by workload, that's finding #1: you're flying blind at exactly the resolution where the money decisions live. Tag your traffic per agent — a single metadata field — before anything else.
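If you are working from gateway logs rather than a dashboard, the three numbers can be derived with a short aggregation. The following is a minimal sketch, assuming each log record is a dict with hypothetical `workload`, `input_tokens`, and `output_tokens` fields; your gateway's actual field names will differ.

```python
from collections import defaultdict

def summarize_by_workload(records):
    """Return {workload: (calls, avg_input_tokens, avg_output_tokens)}."""
    totals = defaultdict(lambda: [0, 0, 0])  # calls, input tokens, output tokens
    for rec in records:
        # Untagged traffic gets its own bucket so it stays visible.
        bucket = totals[rec.get("workload") or "untagged"]
        bucket[0] += 1
        bucket[1] += rec["input_tokens"]
        bucket[2] += rec["output_tokens"]
    return {
        name: (calls, tokens_in / calls, tokens_out / calls)
        for name, (calls, tokens_in, tokens_out) in totals.items()
    }

# Hypothetical log rows; the field names are assumptions, not a provider schema.
sample = [
    {"workload": "support-triage", "input_tokens": 1200, "output_tokens": 200},
    {"workload": "support-triage", "input_tokens": 800, "output_tokens": 100},
    {"workload": None, "input_tokens": 500, "output_tokens": 50},
]
summary = summarize_by_workload(sample)
```

A large "untagged" bucket in the result is the finding #1 described above in numeric form.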

Minute 2–5: price each workload at list

Drop each workload's three numbers into the LLM cost calculator with the model it currently runs. You now have a monthly dollar figure per workload. Two patterns show up almost every time:

One workload dominates. LLM spend is usually power-law distributed across agents. The top one or two are the only ones worth optimizing this quarter.
A flagship model doing commodity work. Running classification, routing, triage, and extraction tasks on a frontier model is the single most common form of LLM overspend.
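The arithmetic behind the monthly figure is simple enough to check by hand. Here is a minimal sketch, assuming prices are quoted in dollars per million tokens with separate input and output rates. The volumes and prices below are placeholders for illustration, not real list prices for any model.

```python
def monthly_cost(calls, avg_in, avg_out, price_in_per_mtok, price_out_per_mtok):
    """List-price monthly cost in dollars; prices are dollars per million tokens."""
    input_cost = calls * avg_in * price_in_per_mtok / 1_000_000
    output_cost = calls * avg_out * price_out_per_mtok / 1_000_000
    return input_cost + output_cost

# Placeholder volumes and prices (2.00 in, 8.00 out per million tokens).
workloads = {
    "support-triage": monthly_cost(500_000, 1_000, 200, 2.00, 8.00),
    "extraction": monthly_cost(100_000, 2_000, 100, 2.00, 8.00),
    "summarization": monthly_cost(20_000, 3_000, 400, 2.00, 8.00),
}

# Sort descending so the dominant workload is at the top of the list.
ranked = sorted(workloads.items(), key=lambda kv: kv[1], reverse=True)
for name, cost in ranked:
    print(f"{name}: ~${cost:,.0f}/mo")
```

With these made-up inputs the first workload accounts for most of the total, which is the shape the first pattern describes.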

Minute 5–8: the one honest question per agent

For each of your top workloads, ask: "What would break if this ran on the current small tier?" Not "would it be worse" — what specifically breaks. If nobody can name a failure mode, that's not evidence it's safe; it's evidence nobody has tested it. Add the workload to the switch-candidates list.

In the calculator, compare your current model against the current small tier (the cross-provider comparison pages — for example claude-haiku-4-5 vs gpt-5.4-mini — are built for exactly this). The delta at your volume is the monthly price of not answering the question.
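The delta itself is the same cost formula evaluated twice. The sketch below assumes the workload's call volume and token counts stay the same after a switch, which may not hold if the smaller model produces longer or shorter outputs. The two price pairs are placeholders, not the list prices of any named model.

```python
def monthly_cost(calls, avg_in, avg_out, prices):
    """prices is an (input, output) pair in dollars per million tokens."""
    price_in, price_out = prices
    return calls * (avg_in * price_in + avg_out * price_out) / 1_000_000

def switch_delta(calls, avg_in, avg_out, current_prices, candidate_prices):
    """Monthly dollars saved at list price if the candidate replaced the current model."""
    return (monthly_cost(calls, avg_in, avg_out, current_prices)
            - monthly_cost(calls, avg_in, avg_out, candidate_prices))

# Placeholder price pairs for illustration only.
CURRENT_TIER = (2.00, 8.00)
SMALL_TIER = (0.50, 2.00)

delta = switch_delta(500_000, 1_000, 200, CURRENT_TIER, SMALL_TIER)
print(f"Potential savings at list: ~${delta:,.0f}/mo")
```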

Minute 8–10: write down the number

The audit's output is one sentence per workload:

"Support triage: ~$3,900/mo on gpt-5.4; switching to gpt-5.4-mini would save ~$2,900/mo at list prices — IF quality holds. Untested."

That "IF quality holds. Untested." clause is the whole point. You've converted vague overspend anxiety into a ranked list of testable hypotheses, each with a dollar value attached.
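If you keep the audit in a script, the one-sentence output and the ranking can be generated from the cost figures. This is a minimal sketch: the function name and argument layout are hypothetical, and the first row reuses the support-triage figures from the example sentence (the ~$1,000/mo candidate cost is implied by $3,900 minus $2,900). The second row is invented for illustration.

```python
def audit_line(workload, current_model, candidate_model, current_cost, candidate_cost):
    """Format one audit sentence; costs are monthly dollars at list price."""
    savings = current_cost - candidate_cost
    return (
        f"{workload}: ~${current_cost:,.0f}/mo on {current_model}; "
        f"switching to {candidate_model} would save ~${savings:,.0f}/mo "
        f"at list prices, IF quality holds. Untested."
    )

# (workload, current model, candidate model, current cost, candidate cost)
hypotheses = [
    ("Extraction", "gpt-5.4", "gpt-5.4-mini", 800, 300),  # invented row
    ("Support triage", "gpt-5.4", "gpt-5.4-mini", 3900, 1000),
]

# Rank by dollar value of the hypothesis, largest savings first.
ranked = sorted(hypotheses, key=lambda h: h[3] - h[4], reverse=True)
lines = [audit_line(*h) for h in ranked]
for line in lines:
    print(line)
```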

What the ten minutes can't tell you

List-price math answers "what would a switch be worth." It cannot answer "does quality hold" (that takes your test cases and a real statistical gate), and it cannot answer "did the savings actually materialize" (that takes production verification). Those two steps are what ReasonRank automates — the audit above is what makes them worth running. If your ranked list has a workload worth more than a few hundred dollars a month, connect your traffic and test the hypothesis properly.