How much does it cost in tokens to let Claude Code build a realistic TypeScript backend, and how does that cost vary by framework?
To put numbers on the question, we ran the same claude-sonnet-4-6 agent on the same backend project (an HTTP API with persistence, a pub/sub event, a daily cron, and distributed tracing) across five frameworks (Encore, Express, Fastify, Hono, and NestJS) and three runs (baseline, library-augmented, and rubric-graded), tracking token spend at every step. This article collects the cost-only view of the benchmark. The methodology, the rubric, and the rest of the findings are in "Are TypeScript backend frameworks ready for AI agents?", and the raw transcripts are at github.com/encoredev/ai-backend-benchmark.
Each framework runs on its own VM with the same Postgres setup and the same claude-sonnet-4-6 model running through Claude Code. The agent works through three linked tasks (HTTP API and persistence, pub/sub and cron, tracing and production-readiness). In Runs 1 and 2 we repeat each framework three times; Run 3 was a single repeat per framework because we were looking for a directional read on whether grading the rubric in the tests moved the agent, not another full sweep. We capture every artifact, including the per-task token usage Claude Code emits, and we compute cost at the published Sonnet 4.6 rate. Numbers below are medians across the three repeats unless stated otherwise; Run 3 figures are single-repeat costs.
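As a rough sketch of that bookkeeping, the snippet below turns per-task token counts into a dollar cost per repeat and takes the median across repeats. The `TaskUsage` field names and the `Rate` shape are assumptions made for illustration, not the format Claude Code emits, and the sketch ignores cache-related token categories. The rate values must be supplied by the caller from the published price list; none are hard-coded here.

```typescript
// Hypothetical per-task usage record; field names are assumptions.
interface TaskUsage {
  inputTokens: number;
  outputTokens: number;
}

// Prices in USD per million tokens, supplied by the caller.
interface Rate {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Cost of one repeat: sum the cost of every task in that repeat.
function repeatCost(tasks: TaskUsage[], rate: Rate): number {
  return tasks.reduce(
    (sum, t) =>
      sum +
      (t.inputTokens / 1_000_000) * rate.inputPerMTok +
      (t.outputTokens / 1_000_000) * rate.outputPerMTok,
    0,
  );
}

// Median of a non-empty list; averages the two middle values for even lengths.
function median(values: number[]): number {
  if (values.length === 0) {
    throw new Error("median of empty list");
  }
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Median cost across the repeats of one framework in one run.
function medianRunCost(repeats: TaskUsage[][], rate: Rate): number {
  return median(repeats.map((tasks) => repeatCost(tasks, rate)));
}
```

Reporting the median rather than the mean keeps a single expensive repeat from dominating the headline figure for a framework.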
| Framework | Total token cost across all three runs |
|---|---|
| Encore | $6.29 |
| NestJS | $12.69 |
Encore and NestJS are the two endpoints we report explicitly. Express, Fastify, and Hono sit between them; their full per-framework totals are derivable from the per-run transcripts in the benchmark repo.
In Run 1, Hono was the cheapest framework of the benchmark at $1.55 per run, Encore came in at $1.96, and NestJS was the most expensive at a median of $2.61, with one repeat running to $4.45 (the highest single repeat on the baseline run).
In Run 2 every non-Encore framework regressed. With pg-boss, drizzle-kit, and pino pre-installed in the starter, the agent burned turns trying to integrate the libraries cleanly and the cost on every framework other than Encore went up. None of the four landed a first-try-green run across three repeats.
In Run 3, with the production-readiness rubric written into the test suite and a higher per-task turn budget, the costs spread further. Fastify hit every rubric check at $4.60 per run, Encore hit every check at $2.58 per run, and NestJS finished at 30 of 36 checks at $5.95 per run.
Encore's per-run cost moved little across the three runs ($1.96 in Run 1, $2.58 in Run 3), because the production-readiness checks the test suite added in Run 3 were already encoded in the primitives the agent had used in Run 1. A retryPolicy: { maxRetries: 3 } on the existing Subscription satisfied the DLQ check; switching from console.log to encore.dev/log satisfied the structured-logging check; the numbered SQL migrations the agent had already written satisfied the versioned-migrations check.
The other four frameworks paid two distinct kinds of cost on top of the baseline. In Run 2 they paid library-integration cost, with the agent burning turns wiring pg-boss, drizzle-kit, and pino into the project and debugging when the wiring drifted. In Run 3 they paid rubric-iteration cost, with the agent making further changes to land the production-readiness checks on top of the library integration. NestJS additionally paid module-wiring cost in both Run 2 and Run 3, because cross-cutting changes in NestJS involve maintaining providers arrays and decorators that the agent gets wrong more often than it gets right.
The per-run differences in the benchmark project into a real monthly bill when a team runs this kind of agent-driven workflow daily. A team of 10 engineers running roughly one full backend feature per engineer per day at $2 per run lands at around $440 per month (10 engineers × $2 × about 22 working days); the same team on a framework that costs $4 per run lands at around $880. The more expensive framework is also the one whose first-draft code is less likely to be safe to deploy without further intervention.
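The projection above is simple multiplication, and a small helper makes it easy to plug in your own team size and per-run cost. This is a minimal sketch: the function name and parameters are hypothetical, and the default of 22 working days per month is an assumption inferred from the article's own figures ($440 / (10 engineers × $2) = 22).

```typescript
// Assumed number of working days per month (inferred, not stated in the benchmark).
const WORKING_DAYS_PER_MONTH = 22;

// Projected monthly agent spend in USD for a team.
function monthlyCost(
  engineers: number,
  runsPerEngineerPerDay: number,
  costPerRunUsd: number,
  workingDays: number = WORKING_DAYS_PER_MONTH,
): number {
  return engineers * runsPerEngineerPerDay * costPerRunUsd * workingDays;
}

console.log(monthlyCost(10, 1, 2)); // 440
console.log(monthlyCost(10, 1, 4)); // 880
```

Because the formula is linear, doubling the per-run cost doubles the monthly bill, and the same holds for team size or runs per day.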
Clone the repo, run the benchmark on your own infrastructure, or recompute against a different model: github.com/encoredev/ai-backend-benchmark.