05/26/26

Best TypeScript Backend Framework for AI Agents (2026)

Five frameworks, the same coding agent, the same tests, and a 36-check production-readiness rubric. On which framework does an AI agent ship production-ready code in one pass?


Which TypeScript backend framework is best to use with an AI coding agent like Claude Code?

To answer that, we took a single agent, gave it the same realistic backend project (an HTTP API with persistence, a pub/sub event, a daily cron, and distributed tracing), and ran it on five frameworks (Encore, Express, Fastify, Hono, and NestJS) using the same prompts, the same model, the same Postgres setup, and identically configured VMs. We then read the diffs and scored every framework against the same production-readiness rubric. The short answer: on every framework other than Encore, the agent built the laziest solution that would still pass the tests, and Encore was the only framework where the first draft was actually safe to deploy.

The full benchmark, prompts, starters, and transcripts are at github.com/encoredev/ai-backend-benchmark, and the long-form writeup of the methodology and findings lives in "Are TypeScript backend frameworks ready for AI agents?"

Headline numbers

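| Framework | Run 1 (baseline) tests | Run 3 rubric (out of 36) | Run 3 cost | Total cost across three runs |
| --- | --- | --- | --- | --- |
| Encore | 31/31 | 36/36 | $2.58 | $6.29 |
| Fastify | 31/31 | 36/36 | $4.60 | ~$10 |
| Express | 31/31 | 35/36 | not reported | ~$9 |
| Hono | 31/31 | 29/36 | not reported | ~$8 |
| NestJS | 31/31 | 30/36 | $5.95 | $12.69 |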

Every framework hit a green test suite on the baseline run. The picture diverged when we looked at the code underneath those green test suites.

How we tested it

Each framework gets its own VM with the same Postgres setup and the same claude-sonnet-4-6 model running through Claude Code. The agent works through three linked tasks: t1 (HTTP API and persistence), then t2 (extend t1 with pub/sub and cron, starting from t1's working directory), then t3 (extend t2 with distributed tracing and production-readiness, starting from t2's working directory). The tests are plain black-box HTTP probes run with vitest, and they are identical for every framework. The agent never reads the tests. Each starter is whatever the framework's scaffolding command produces, which means Encore's CLAUDE.md and MCP server are in scope: we wanted to compare the five frameworks as they arrive today, not versions stripped down to a normalized minimum.

We ran the benchmark three times to make sure the result was not an artifact of one specific test design. Run 1 was the baseline with each framework's default scaffold and three repeats per framework. Run 2 was the same prompts but with pg-boss (durable jobs and cron), drizzle-kit (migrations), and pino (structured logs) pre-installed into the four non-Encore starters with a README explaining each, three repeats per framework. Run 3 was Run 2's starters plus five production-readiness tests in the suite, one repeat per framework on a higher per-task turn budget.

The five production-readiness checks

The Run 3 rubric we added to the test suite asks for five things every backend of this shape should do in production:

  1. Multi-instance-safe scheduling. The daily aggregation cron must use a shared coordination layer (Encore CronJob, pg-boss schedule, BullMQ repeatable jobs, @nestjs/schedule with a coordination layer). Raw setInterval or setTimeout inside the application process is not acceptable.
  2. Pub/sub retry policy. The notifications subscription must declare a finite retry policy. After the configured maxRetries, failed messages must move to a dead-letter queue or equivalent terminal failed state.
  3. Failed-messages observability endpoint. GET /notifications/_failed returns the records that exhausted retries; a POISON_TEST payload must end up there after enough failures.
  4. Versioned schema migrations. All schema changes must be expressed as numbered migration files using a migration tool (Encore service migrations, Knex, Drizzle, TypeORM). CREATE TABLE IF NOT EXISTS at app startup is not acceptable.
  5. Structured logging. Application logs must be valid JSON with at minimum a level and a timestamp field. No console.log for application events.
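
As a rough illustration of what check 5 asks for, the sketch below validates a single log line. It is not the benchmark's actual probe: the function name `isStructuredLogLine` is hypothetical, and the set of accepted timestamp keys (`timestamp`, `time`, `ts`) is an assumption, since the rubric only says "a timestamp field".

```typescript
// Hypothetical checker, not the benchmark's real test code.
// Assumption: any of these keys counts as "a timestamp field".
const TIMESTAMP_KEYS = ["timestamp", "time", "ts"];

export function isStructuredLogLine(line: string): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return false; // free-form console.log output typically fails here
  }
  // Only a JSON object can carry named fields.
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return false;
  }
  const record = parsed as Record<string, unknown>;
  const hasLevel = "level" in record;
  const hasTimestamp = TIMESTAMP_KEYS.some((key) => key in record);
  return hasLevel && hasTimestamp;
}
```

Under these assumptions, a line such as `{"level":30,"time":1716700000000,"msg":"order created"}` passes, while `order created 42` does not.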

What the agent built on each framework

On the baseline run every framework hit 31 of 31 tests, but the diffs underneath those green test suites diverged sharply.

Express, Fastify, Hono, NestJS

On the four non-Encore frameworks the agent converged on near-identical implementations of every async primitive, differing only in file layout. Pub/sub was a Postgres queue table polled on a timer:

// Publisher: enqueue the event as a row in the application database.
await pool.query(
  `INSERT INTO event_queue (event_type, payload, status) VALUES ($1, $2, 'pending')`,
  ['order-created', { order_id: id }]
);

// Consumer: poll the queue table every 500 ms and claim one pending row.
setInterval(async () => {
  const { rows } = await client.query(`
    SELECT * FROM event_queue
    WHERE status = 'pending' AND next_retry_at <= NOW()
    ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED
  `);
  // process, mark done, or bump retry counter
}, 500);

The daily cron was a setTimeout chain inside the application process. The schema was CREATE TABLE IF NOT EXISTS plus the occasional ALTER TABLE ... ADD COLUMN IF NOT EXISTS at boot, with no migration history. For tracing the agent added a request_id column to the orders table and threaded the id through every function argument.
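
To make the tracing approach concrete, the sketch below contrasts explicit threading (every function in the call chain takes a `requestId` argument) with ambient context via Node's built-in `AsyncLocalStorage`. All function names are hypothetical, and neither half is taken from the benchmark transcripts; it only illustrates the difference in shape.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Style 1: explicit threading. Every function signature carries requestId.
function saveOrderExplicit(orderId: number, requestId: string): string {
  return `order=${orderId} request_id=${requestId}`;
}

function createOrderExplicit(orderId: number, requestId: string): string {
  return saveOrderExplicit(orderId, requestId);
}

// Style 2: ambient context. The id is set once at the request boundary
// and read wherever it is needed, without changing signatures in between.
const requestContext = new AsyncLocalStorage<{ requestId: string }>();

function saveOrderAmbient(orderId: number): string {
  const requestId = requestContext.getStore()?.requestId ?? "unknown";
  return `order=${orderId} request_id=${requestId}`;
}

function createOrderAmbient(orderId: number): string {
  return saveOrderAmbient(orderId);
}

// Hypothetical request boundary, for example an HTTP middleware.
export function handleRequest(orderId: number, requestId: string): string {
  return requestContext.run({ requestId }, () => createOrderAmbient(orderId));
}
```

Both styles produce the same log line for the same request. The difference is that the explicit style touches every signature on the path, which is the change the agent made across the codebase.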

Encore

On Encore the same agent given the same prompts reached for the framework's primitives:

// orders/events.ts
import { Topic } from "encore.dev/pubsub";

export const orderCreated = new Topic<OrderCreatedEvent>("order-created", {
  deliveryGuarantee: "at-least-once",
});

// notifications/notifications.ts
import { Subscription } from "encore.dev/pubsub";
import { orderCreated } from "../orders/events";

new Subscription(orderCreated, "send-notification", {
  handler: async (event) => { /* ... */ },
});

// jobs/jobs.ts
import { CronJob } from "encore.dev/cron";

const _ = new CronJob("daily-aggregation", {
  every: "24h",
  endpoint: runDailyAggregation,
});

Schema migrations went into numbered SQL files (migrations/1_create_orders.up.sql, migrations/2_add_request_id.up.sql) that Encore tracks in a _migrations table and applies in order on every deploy. For the tracing task the agent did not thread request_id through function arguments because Encore's runtime propagates the correlation id automatically across the typed cross-service call, the subscription handler, and the cron invocation.
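
The numbered-migration convention can be sketched as a small ordering function. This is an illustration under stated assumptions, not Encore's implementation: `pendingMigrations` is a hypothetical helper, and the filename pattern is inferred from the two example filenames above.

```typescript
// Hypothetical helper: picks the not-yet-applied migrations and orders them
// by numeric prefix. Not how Encore actually implements migration tracking.
interface Migration {
  version: number;
  file: string;
}

// Assumed filename shape: <number>_<name>.up.sql
const MIGRATION_NAME = /^(\d+)_.+\.up\.sql$/;

export function pendingMigrations(files: string[], appliedVersions: number[]): Migration[] {
  const applied = new Set(appliedVersions);
  return files
    .map((file) => {
      const match = MIGRATION_NAME.exec(file);
      return match ? { version: Number(match[1]), file } : null;
    })
    .filter((m): m is Migration => m !== null) // skip files that are not migrations
    .filter((m) => !applied.has(m.version)) // skip versions already recorded
    .sort((a, b) => a.version - b.version); // numeric sort, so 10 comes after 2
}
```

The point of the version record is that the database knows which schema it is on, which CREATE TABLE IF NOT EXISTS at boot cannot tell you.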

Why those implementations aren't equivalent

| What the agent built on the four non-Encore frameworks | Production weakness | What the agent built on Encore |
| --- | --- | --- |
| Postgres queue polled by setInterval | The application database doubles as the event bus, and there is no dead-letter destination, so a poison message retries forever and can block everything behind it. | Topic with at-least-once delivery, retries and a platform-managed DLQ configured at the framework level. |
| setTimeout-chain cron in the application process | Fires once per replica. Three replicas run the daily aggregation three times per tick. The agent's INSERT ... ON CONFLICT DO UPDATE saved this app; any non-idempotent cron would fire three times for real. | CronJob declared at the framework level, invoked once per tick across the fleet by an external scheduler. |
| CREATE TABLE IF NOT EXISTS at boot | No migration history. Column renames and NOT-NULL backfills have no version to roll back to. | Numbered SQL migrations tracked in _migrations, applied in order on every deploy. |

None of these guarantees show up in Run 1's test suite, which is why the test suite alone was not a satisfying answer to the question.
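
The per-replica cron problem from the comparison above can be shown with a small simulation. It is a toy model, not code from the benchmark: the function names are hypothetical, and the `Set` stands in for whatever shared store a real coordination layer would use.

```typescript
// Simulation only: counts how often a daily job body runs for one tick.
type Job = () => void;

// In-process timers: every replica owns its own timer, so every replica fires.
export function tickInProcess(replicas: number, job: Job): void {
  for (let i = 0; i < replicas; i++) job();
}

// Shared coordination: replicas compete for one claim per tick and only the
// first one runs the job. The Set is a stand-in for a shared store.
export function tickCoordinated(
  replicas: number,
  tickId: string,
  claims: Set<string>,
  job: Job,
): void {
  for (let i = 0; i < replicas; i++) {
    if (claims.has(tickId)) continue; // another replica already claimed this tick
    claims.add(tickId);
    job();
  }
}

let inProcessRuns = 0;
tickInProcess(3, () => { inProcessRuns++; });

let coordinatedRuns = 0;
tickCoordinated(3, "2026-05-26", new Set<string>(), () => { coordinatedRuns++; });

console.log({ inProcessRuns, coordinatedRuns }); // { inProcessRuns: 3, coordinatedRuns: 1 }
```

A black-box HTTP test against a single instance sees the same result in both cases, which is why this difference stayed invisible in Run 1.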

What pre-installed libraries did to the other four frameworks

In Run 2 we added pg-boss, drizzle-kit, and pino to each non-Encore starter, with a README explaining what each library was for, and ran the same test suite as Run 1. Every framework other than Encore regressed. None of them landed a first-try-green run across three repeats. Task 1 was largely fine; the agent fell apart on tasks 2 and 3, where it had to integrate the new libraries with the application code it inherited from t1. On Express and Fastify the agent registered a pg-boss scheduled job without first creating the queue, and the server crashed at boot with Queue daily-aggregation not found. On NestJS the agent imported pg-boss into a NotificationsService but did not register the wrapping PgBossService in the module's providers array. On Hono the failures spread across all three pre-installed libraries.
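
The Express and Fastify crash is an ordering problem: the queue has to exist before a schedule is attached to it. The sketch below reproduces that with an in-memory stand-in. The `QueueClient` interface is a minimal assumption whose method names mirror pg-boss's `createQueue` and `schedule`; it is not the pg-boss type definition, so check the signatures against the version you have installed. The cron expression is also an assumption.

```typescript
// Minimal interface for illustration only, not the real pg-boss types.
interface QueueClient {
  createQueue(name: string): Promise<void>;
  schedule(name: string, cron: string): Promise<void>;
}

// In-memory stand-in that fails the same way the article describes.
export class FakeQueueClient implements QueueClient {
  private queues = new Set<string>();
  readonly schedules = new Map<string, string>();

  async createQueue(name: string): Promise<void> {
    this.queues.add(name);
  }

  async schedule(name: string, cron: string): Promise<void> {
    if (!this.queues.has(name)) throw new Error(`Queue ${name} not found`);
    this.schedules.set(name, cron);
  }
}

// Create the queue first, then attach the schedule to it.
export async function registerDailyAggregation(client: QueueClient): Promise<void> {
  const queue = "daily-aggregation";
  await client.createQueue(queue);
  await client.schedule(queue, "0 0 * * *"); // assumed: once a day at midnight
}
```

Calling `schedule` on a fresh `FakeQueueClient` without `createQueue` rejects with the same message the agent's server hit at boot.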

What writing the rubric into the tests did

Adding the libraries in Run 2 was not enough on its own, so in Run 3 we made production-readiness part of what the test suite graded directly. The agent now had an explicit target to iterate against and we raised the per-task turn cap from 80 to 200.

Encore reached 36/36 with one primitive per check. The rubric additions landed as small changes against the same primitives the Encore agent had used in Run 1: a retryPolicy: { maxRetries: 3 } on the existing Subscription, encore.dev/log for structured logging, and Encore service migrations for the schema-versioning check. Fastify was the cleanest non-Encore result: the agent pulled pg-boss for pub/sub and cron, drizzle-kit for migrations, pino for logs, and hit green on every check at $4.60 against Encore's $2.58. Express came one test short, having implemented every check except migrations. Hono and NestJS had wider regressions: Hono tracing failed across the run because the agent had keyed spans on order_id rather than request_id, and NestJS additionally shipped a TypeScript error on the unmodified portion of the starter that broke the typecheck probe.
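
What a finite retry policy with a terminal failed state means in practice can be sketched in a few lines. This is a simplified, synchronous model with hypothetical names, not Encore's or pg-boss's implementation, and it assumes that `maxRetries` counts retries after the first attempt.

```typescript
// Sketch of a finite retry policy with a dead-letter list.
interface RetryPolicy {
  maxRetries: number;
}

// Returns the number of attempts made. Messages that exhaust their retries
// are appended to deadLetters instead of being retried forever.
export function deliver<T>(
  message: T,
  handler: (message: T) => void,
  policy: RetryPolicy,
  deadLetters: T[],
): number {
  // Assumption: one initial attempt plus up to maxRetries retries.
  const maxAttempts = policy.maxRetries + 1;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      handler(message);
      return attempt; // delivered successfully
    } catch {
      // Try again; a real system would back off between attempts.
    }
  }
  deadLetters.push(message); // retries exhausted: terminal failed state
  return maxAttempts;
}
```

Under that assumption, with `maxRetries: 3` a handler that always throws on a POISON_TEST payload is called four times, after which the payload sits in `deadLetters`, which is the kind of record GET /notifications/_failed is expected to return.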

Cost

Across all three runs combined, Encore came in cheapest at $6.29 of token spend, NestJS came in most expensive at $12.69, and the other three sat between them. The shape of the cost line matters as much as the totals. Encore's per-run cost barely moved across the three runs because the production-readiness checks were already encoded in the primitives the agent used in Run 1. The other four climbed in Run 2 (library integration) and again in Run 3 (rubric iteration). For a team running this kind of agent-driven workflow daily, the per-run difference projects to a real monthly bill.
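
As a rough illustration of how the per-run difference compounds, the sketch below projects the reported Run 3 costs linearly. The Run 3 figures come from the headline table (Express and Hono are omitted because their Run 3 cost is not reported); the workload of 10 three-task sequences per day over 22 working days is a purely hypothetical assumption, and a single repeat per framework does not establish that these costs are typical.

```typescript
// Run 3 cost in USD per full three-task sequence, from the headline table.
const run3CostUsd = { encore: 2.58, fastify: 4.6, nestjs: 5.95 };

// Linear projection, rounded to cents. sequencesPerDay and workdays are
// hypothetical inputs, not figures from the benchmark.
export function monthlyCostUsd(perSequenceUsd: number, sequencesPerDay: number, workdays: number): number {
  return Math.round(perSequenceUsd * sequencesPerDay * workdays * 100) / 100;
}

// Cost of a framework relative to Encore, rounded to two decimals.
export function premiumOverEncore(perSequenceUsd: number): number {
  return Math.round((perSequenceUsd / run3CostUsd.encore) * 100) / 100;
}

for (const [framework, cost] of Object.entries(run3CostUsd)) {
  console.log(framework, monthlyCostUsd(cost, 10, 22), premiumOverEncore(cost));
}
```

Under those assumptions the projection is about $568 per month for Encore, $1,012 for Fastify, and $1,309 for NestJS, with Run 3 premiums of roughly 1.8x and 2.3x over Encore.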

What the comparison actually shows

The framework's defaults shape what the agent ships, more than any prompt or rubric we layered on top. Pre-installing the right libraries in Run 2 did not close the gap; if anything the gap widened, because the agent burned turns on integration bugs that the rubric did not directly grade. Writing the rubric into the tests in Run 3 closed the gap on Fastify and most of the way on Express, but only at a higher token cost and only on frameworks whose libraries the agent could compose without losing the thread.

If you are choosing a TypeScript backend framework in 2026 and AI is writing a meaningful share of your code, weight what the framework hands the agent out of the box more heavily than you used to, and weight router ergonomics less. Integrated primitives beat library compositions for an agent: frameworks that gave the agent a single primitive for durable events, cron, and migrations finished with fewer broken tests, lower token cost, and production semantics already in place.

When each framework is the right choice

Encore is the right choice if AI is writing a meaningful share of your backend code, if you want production guarantees on the first agent pass, and if you care about per-iteration token cost across a team workflow.

Fastify is the right choice if you want the closest non-Encore experience from an AI agent and you accept a token premium of roughly 1.8x in Run 3 ($4.60 against $2.58) to reach the same production checks, or if you already run Fastify in production with a tuned plugin stack you do not want to rebuild.

Express is reasonable for small APIs where a human will do the production hardening, or for existing Express codebases with deep internal middleware you do not want to replace.

Hono is the right choice for edge-only or serverless-first APIs where production-readiness lives at the platform layer (Cloudflare Workers, Vercel, Bun).

NestJS is the right choice if your team has deep Angular or NestJS muscle memory and is willing to absorb a roughly 2x token premium when an agent extends the codebase, or if you have an existing NestJS monolith you are not planning to migrate away from.

Reproduce the benchmark

Clone the repo, point it at your own framework, or rewrite the rubric to match your own definition of production-ready: github.com/encoredev/ai-backend-benchmark.
