June 10, 2026

I Scored an F on My Own AI Coding Audit

My tool said I was wasting $1,746 a month. Then I looked closer and found out how easy it is to fool yourself with a number.

I build a tool that scores how efficiently you use AI coding assistants. So one afternoon I pointed it at my own last 30 days. I figured I would land somewhere in the B range. I cache aggressively, I reach for Sonnet on simple tasks, I keep my context windows tight.

The grade came back: F. Twenty out of a hundred.

The summary said I had spent about $6,400 across 73,000 API calls, and that roughly $1,746 of it, about 27 percent, was recoverable waste. (Those dollars are tokens priced at public API rates: real money on pay-as-you-go, or the equivalent value of what you used on a flat subscription.) My first reaction was the obvious one. A quarter of my spend, gone to nothing. That is a lot of money.

My second reaction, after I actually read the report, was more useful. Most of that $1,746 was not waste at all. And the gap between those two reactions is the whole point of this post.

Where the number came from

The $1,746 was not one big leak. It was a stack of findings, and one of them dwarfed the rest:

71 “low-worth expensive” sessions: $1,550. Sessions with real spend but weak delivery signals (no commits, lots of retries).
348 context-heavy sessions: $101. Sessions where input and cache tokens dwarfed the output.
6 high-cost outliers: $92.
Everything else (re-reading files, an unused MCP server, 41 skills I never invoke, reading build folders): under $5 combined.

So 89 percent of the “waste” was one bucket: those 71 sessions. If that bucket was real, I had a real problem. If it was not, the headline number was fiction. So I opened it.
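Those shares are easy to check by hand. Here is a minimal Python sketch using the figures from the list above. The "everything else" line is only stated as under $5, so the sketch assumes it is whatever remains of the $1,746 headline, and total spend is the rounded "about $6,400".

```python
# Findings as reported above, in dollars.
HEADLINE_WASTE = 1746
TOTAL_SPEND = 6400  # "about $6,400", so the share of spend is approximate

findings = {
    "low-worth expensive sessions": 1550,
    "context-heavy sessions": 101,
    "high-cost outliers": 92,
}
# Assumption: "everything else" is the remainder of the headline
# (the report only says "under $5 combined").
findings["everything else"] = HEADLINE_WASTE - sum(findings.values())


def share(part: float, whole: float) -> int:
    """Return part/whole as a whole-number percentage."""
    return round(100 * part / whole)


for name, dollars in findings.items():
    print(f"{name}: ${dollars} ({share(dollars, HEADLINE_WASTE)}% of flagged waste)")

print(f"flagged waste as a share of spend: {share(HEADLINE_WASTE, TOTAL_SPEND)}%")
```

One bucket at 89 percent means the headline stands or falls with that bucket, which is why it was the first thing to open.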

The biggest “waste” was an experiment I run on purpose

The five most expensive flagged sessions:

$884, 263 retries
$424, 59 retries
$422, 52 retries
$397, 600 retries
$349, 139 retries

Three of those five, including the $884 and the $397, were the same project: a memory-evaluation harness I run as a benchmark. It loops an agent over a large dataset by design. It is supposed to retry. It is supposed to be expensive. It produces no git commits because it is an eval, not a feature. Every signal that made the tool flag it as “low-worth” is exactly what I want that job to do.

That is not waste. That is the experiment. If I deleted those sessions to improve my score, I would be deleting the work.

Once I set the eval harness aside, the “$1,550 of low-worth sessions” shrank to a couple of genuinely runaway debugging sessions, the kind where I kept pasting errors back in at midnight instead of stopping to think. Real waste, worth fixing, but a few hundred dollars, not fifteen hundred.

The lesson: a score flags candidates, it does not judge them

An automated efficiency score cannot tell the difference between money you meant to spend and money you wasted. It sees a session that cost $884 with 263 retries and no commits, and it flags it. It cannot know that the session was a benchmark doing precisely what benchmarks do.

This is true of every metric that grades you. Code coverage flags untested lines, but some of those lines are trivial getters. A linter flags every rule violation, but some of them are deliberate. The tool surfaces candidates. You decide which ones are real. The moment you treat the headline number as a verdict instead of a starting point, you start optimizing for the score instead of the work.

So the honest version of my result is not “I wasted $1,746.” It is “the tool flagged $1,746, I looked, and roughly $1,300 of it was an eval harness I run on purpose. The recoverable part was real but small, and it lived in a different place than the headline pointed.”
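The split between flagging and judging can be sketched in a few lines of Python. This is not how the tool works internally. The thresholds, the Session fields, the project names, and the $120 session are all made up for illustration; only the $884 and $397 eval sessions come from the report.

```python
from dataclasses import dataclass


@dataclass
class Session:
    project: str
    cost_usd: float
    retries: int
    commits: int


# Hypothetical thresholds, chosen only for illustration.
MIN_COST_USD = 100.0
MIN_RETRIES = 50


def is_candidate(session: Session) -> bool:
    """Flag real spend with weak delivery signals: no commits and many retries."""
    return (
        session.cost_usd >= MIN_COST_USD
        and session.commits == 0
        and session.retries >= MIN_RETRIES
    )


def review(sessions: list[Session], intentional_projects: set[str]) -> tuple[list[Session], list[Session]]:
    """Split flagged sessions into (intended spend, possible waste).

    The flag is automatic; intentional_projects is the human judgment.
    """
    flagged = [s for s in sessions if is_candidate(s)]
    intended = [s for s in flagged if s.project in intentional_projects]
    possible_waste = [s for s in flagged if s.project not in intentional_projects]
    return intended, possible_waste


sessions = [
    Session("memory-eval", 884.0, 263, 0),  # eval harness session from the report
    Session("memory-eval", 397.0, 600, 0),  # eval harness session from the report
    Session("web-app", 120.0, 75, 0),       # made-up runaway debugging session
    Session("web-app", 35.0, 2, 4),         # made-up cheap session with commits: never flagged
]
intended, possible_waste = review(sessions, intentional_projects={"memory-eval"})
print(sum(s.cost_usd for s in intended), sum(s.cost_usd for s in possible_waste))
```

The heuristic alone flags all three expensive sessions. Only the hand-maintained set of intentional projects separates the benchmark from the midnight debugging.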

What was actually worth fixing

The genuine waste was not in the big dramatic sessions. It was in the quiet $101 bucket: context-heavy sessions. Runs where I had 40 million tokens of input and cache flowing in against a hundred thousand tokens of output. The model was re-reading a mountain of context to produce a molehill of code.
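To put a number on that mountain-to-molehill ratio, here is a small Python sketch using the figures above. The function name is mine, not something the tool exposes.

```python
def input_to_output_ratio(input_and_cache_tokens: int, output_tokens: int) -> float:
    """Tokens read (input plus cache) per token written."""
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return input_and_cache_tokens / output_tokens


# Figures from the paragraph above: 40 million tokens in, a hundred thousand out.
ratio = input_to_output_ratio(40_000_000, 100_000)
print(f"{ratio:.0f} tokens read per token written")  # prints "400 tokens read per token written"
```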

Three changes did most of the work:

Start fresh more often. A lot of my context bloat came from extending old sessions. The conversation accumulated files and failed attempts, and every new request dragged all of it along. Starting a new session with only the current goal and the relevant files cut the per-request context sharply.
Stop runaway debugging sooner. Both genuinely wasteful sessions followed the same shape: I kept retrying past the point where the model had anything new to offer. Now I treat a string of failed retries as a signal to read the code myself, not to paste the error in again.
Scope the MCP servers. One server was loaded into dozens of projects that never called it, adding its schema to every request. Project-scoping it was a 30-second change. It saves almost nothing per session, but across hundreds of sessions it adds up, and it cost nothing to fix.
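Why starting fresh helps so much comes down to simple arithmetic. The Python sketch below assumes a deliberately crude model: every request resends the whole conversation, and each request adds the same number of tokens to it. The request counts and token sizes are hypothetical, and the model ignores caching, which lowers the price of repeated tokens but does not stop them from being sent.

```python
def cumulative_input_tokens(requests: int, tokens_added_per_request: int) -> int:
    """Total input tokens sent when every request resends the whole history.

    Request i carries i * tokens_added_per_request tokens, so the total is
    tokens_added_per_request * requests * (requests + 1) / 2.
    """
    return tokens_added_per_request * requests * (requests + 1) // 2


# Hypothetical numbers: 40 requests, each adding 5,000 tokens of files and failed attempts.
one_long_session = cumulative_input_tokens(40, 5_000)
four_fresh_sessions = 4 * cumulative_input_tokens(10, 5_000)

print(one_long_session, four_fresh_sessions)
```

Under these assumptions the single long session sends 4.1 million input tokens and the four fresh ones send 1.1 million for the same 40 requests, because total input grows with the square of session length.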

Why this is worth your time anyway

You might read all this as a reason to ignore efficiency scores. It is the opposite. The score was wrong about the size of my problem, but it pointed me at the right sessions. Without it I never would have noticed the context-heavy pattern, because cost is invisible while you work. There is no meter ticking in the corner of your editor. The bill arrives weeks later as one lump sum, with no breakdown.

The value was not the grade. It was being forced to look at my own sessions and ask, one by one, “did I mean to spend this?” Most of the time the answer was yes. A few times it was no, and those few times were worth finding.

If you want to run the same audit on your own usage, it is one command and nothing leaves your machine:

npx codeburn optimize

Read the findings, then do the part the tool cannot: separate the money you meant to spend from the money you did not. The grade is just where the conversation starts.