May 20, 2026

Are TypeScript backend frameworks ready for AI agents?

We set out to run one benchmark across five TypeScript backend frameworks. Reading the diffs sent us into two more runs, and the picture changed each time.


How well does an AI coding agent build TypeScript backends across popular frameworks?

We took the same agent (Claude Code), gave it the same realistic backend tasks, and ran it on five frameworks (Encore, Express, Fastify, Hono, NestJS). Same prompts, same model, same Postgres setup, same VM. We captured every artifact of every run and scored them all on one rubric.

After the first run every framework's tests passed. The post we were going to publish was roughly "lighter frameworks finished cheaper, NestJS paid a ceremony tax."

Then we read the diffs. On four of the five frameworks, the agent had built the laziest solution that would still pass the tests: a Postgres table polled by setInterval for the durable queue, and CREATE TABLE IF NOT EXISTS at boot instead of any migration system. The test suite wasn't telling us whether the code was shippable, so we did two more runs to find out.

run 1 · same code, two yardsticks
After the first run, every framework's tests passed, but only Encore's code held up against a production-readiness rubric.
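| Framework | Functional tests passed (31-test suite) | Production-ready (same Run 1 code, scored against the 5-check rubric) |
| --- | --- | --- |
| Express | 31/31 | 20% |
| Fastify | 31/31 | 20% |
| Hono | 31/31 | 20% |
| NestJS | 31/31 | 20% |
| Encore | 31/31 | 100% |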

Left bars show the 31-test functional suite (everyone hit it). Right bars score the same Run 1 code against a 5-check production-readiness rubric: versioned migrations, multi-instance-safe cron, retry policy with a DLQ, a failed-message endpoint, and structured logging. Encore's framework primitives already encode the rubric, so the agent reaches it as a side effect of using the framework. The other four don't, and the next two runs try to close that gap.

Every framework reached a green test suite in Run 1, but only Encore's output was production-ready when we scored it against a 5-check rubric (versioned migrations, multi-instance-safe cron, retry + DLQ, failed-message endpoint, structured logging). We spent the next two runs trying to get the other four there too, first by pre-installing the libraries the agent should have used, then by writing the rubric directly into the test suite.
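The percentages quoted for the rubric are simple arithmetic over five pass/fail checks. A minimal sketch of that scoring, assuming each check is weighted equally; the field names and the choice of which single check passes in the example are illustrative, not taken from the benchmark repo:

```typescript
// The five checks named in the post; the field names are illustrative.
interface RubricResult {
  versionedMigrations: boolean;
  multiInstanceSafeCron: boolean;
  retryWithDlq: boolean;
  failedMessageEndpoint: boolean;
  structuredLogging: boolean;
}

// Percentage of checks passed, each check weighted equally (an assumption).
function readinessScore(result: RubricResult): number {
  const checks = Object.values(result) as boolean[];
  const passed = checks.filter(Boolean).length;
  return Math.round((passed / checks.length) * 100);
}

// Hypothetical example: one of five checks passing. Which check passes
// here is arbitrary and only serves the arithmetic.
const oneOfFive: RubricResult = {
  versionedMigrations: false,
  multiInstanceSafeCron: false,
  retryWithDlq: false,
  failedMessageEndpoint: false,
  structuredLogging: true,
};

console.log(readinessScore(oneOfFive)); // 20
```

With this weighting, every missed check costs 20 points, which is why the scores in this post only take the values 0%, 20%, 40%, 60%, 80%, and 100%.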

The rest of the post is what we found, run by run. Everything is reproducible: full repo, prompts, starters, and transcripts at github.com/encoredev/ai-backend-benchmark.

How we tested it

what the agent has to build
A small but realistic backend.

Two services with a typed cross-service call, a durable event, a daily cron, a Postgres-backed store, and request-id tracing across the lot. The three tasks layer these on in order: t1 builds the API and the typed call, t2 adds the event, subscriber, and cron, t3 adds tracing.

Diagram: an HTTP client sends POST /orders with a Bearer token to the Orders service (POST / GET /orders, t1). Orders makes a typed call to the Payments service (charge(idempotency_key), t1) and publishes the order-created event to Notifications (subscriber + cron, t2). The daily cron writes order_totals. Postgres holds orders, notifications, order_totals, and spans.

t1 builds the orders + payments API and the typed call
t2 adds the order-created event, subscriber, and daily cron
t3 adds X-Request-Id tracing across every service boundary

Each framework gets its own exe.dev VM with an identical environment. The agent works through three tasks back to back: t1, then t2 (which starts from t1's workdir), then t3 (from t2's). That linkage matters: failure modes that only show up when extending existing code would not surface in a single-task version, and we would have missed the Run 2 regression entirely.

The tests are plain black-box HTTP assertions against the live server, run with vitest, and they're the same probes against every framework. The prompts the agent sees are framework-agnostic, the tests don't read the source tree, and every artifact (prompts, test files, transcripts, diffs) lives in the repo so anyone can re-run on a different framework or rewrite the tests to suit their own definition of production-ready. We used claude-sonnet-4-6 because that's what Claude Code defaults to today; whether a different model would change the framework ordering is its own benchmark.

We didn't set out to run three. Run 1 was the whole plan; Runs 2 and 3 are what we built after we read Run 1's diffs and went looking for ways to close the gap. Each run answers a question of the form "what if we did X to make the non-Encore frameworks perform better?":

  1. Run 1, the baseline. Each framework with its default scaffold, three repeats per framework.
  2. Run 2, what if we add the right libraries to the project and tell the agent to use them? pg-boss, drizzle-kit, pino, and friends added to the four non-Encore starters with a README explaining what each library is for, against the same test suite as Run 1, three repeats per framework.
  3. Run 3, what if the tests also checked for production-readiness? Same starters as Run 2, but the test suite now adds five checks for production-readiness (versioned migrations, multi-instance-safe cron, retry + DLQ, failed-message endpoint, structured logging). One repeat per framework, since we wanted a directional read on whether grading the rubric in the tests would move the agent, not another full sweep.

Each starter is whatever its scaffolding command produces. That includes Encore's .mcp.json and CLAUDE.md, since the comparison we wanted is "framework as it arrives in 2026" rather than a normalized minimum. Teams who have tuned rules and skills packs on top of Express will see different numbers. The five production-readiness checks in Run 3 are a small first set, not a full checklist; nothing here on graceful shutdown, pool sizing, secrets, or real distributed tracing backends.

Run 1: clean starters

Every framework hit 31/31 tests. Hono was the cheapest at $1.55 per run, NestJS the most expensive at a $2.61 median with one outlier repeat that ran to $4.45, Encore in the middle at $1.96. First-try-green ratios sat around 2/3 across the board, with Fastify and Encore hitting 3/3. The headline numbers weren't telling much of a story, but the diffs underneath them were.

run 1 · per-framework results · medians across 3 repeats
Every framework reached green. The numbers underneath look unremarkable.
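| Framework | Tests passed (out of 31) | First-try green (out of 3 repeats) | Total turns (across t1, t2, t3) | Total cost (USD) |
| --- | --- | --- | --- | --- |
| Express | 31/31 | 2/3 | 142 | $1.71 |
| Fastify | 31/31 | 3/3 | 133 | $1.60 |
| Hono | 31/31 | 2/3 | 134 | $1.55 |
| NestJS | 31/31 | 2/3 | 234 | $2.61 |
| Encore | 31/31 | 3/3 | 152 | $1.96 |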

What the agent built on Express, Fastify, Hono, and NestJS

On those four frameworks, the agent converged on near-identical implementations of every async primitive the prompt asked for, in four different file layouts.

For pub/sub durability, it created a Postgres queue table and polled it on a timer.

    // publisher
    await pool.query(
      `INSERT INTO event_queue (event_type, payload, status) VALUES ($1, $2, 'pending')`,
      ['order-created', { order_id: id }]
    );

    // worker, run on setInterval(..., 500) or recursive setTimeout
    const { rows } = await client.query(`
      SELECT * FROM event_queue
      WHERE status = 'pending' AND next_retry_at <= NOW()
      ORDER BY id LIMIT 1
      FOR UPDATE SKIP LOCKED
    `);
    if (!rows.length) { await client.query('ROLLBACK'); return; }
    try {
      await processEvent(rows[0]);
      await client.query(`UPDATE event_queue SET status='done' WHERE id=$1`, [rows[0].id]);
    } catch {
      await client.query(
        `UPDATE event_queue SET attempts=attempts+1, next_retry_at=NOW()+INTERVAL '5 seconds' WHERE id=$1`,
        [rows[0].id]
      );
    }

For the daily aggregation cron, the agent wrote a setTimeout or setInterval running inside the application process, scheduled at app startup.

    function startScheduler() {
      const scheduleNext = () => {
        const now = new Date();
        const next = new Date(Date.UTC(
          now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1
        ));
        setTimeout(() => {
          runDailyAggregation().catch(console.error);
          scheduleNext();
        }, next.getTime() - now.getTime());
      };
      scheduleNext();
    }

For schema, the agent called CREATE TABLE IF NOT EXISTS plus the occasional ALTER TABLE ... ADD COLUMN IF NOT EXISTS on app startup, with no migration tracking, so the schema was whatever db.ts happened to execute at boot. For tracing, it added a request_id column to the orders table, created a separate spans table, and threaded the request id through every function argument.

The four converged on the same overall shape but differed on small choices that mattered later. Fastify and Hono keyed spans on order_id rather than adding a request_id column to the orders table, which is the reason Hono's t3 failed in Run 3 when probed with an unknown order id.

What the agent built on Encore

Same agent, same prompt, different defaults. On Encore the agent declared the async work using the framework's primitives.

    // orders/events.ts
    export const orderCreated = new Topic<OrderCreatedEvent>("order-created", {
      deliveryGuarantee: "at-least-once",
    });

    // notifications/notifications.ts
    new Subscription(orderCreated, "send-notification", {
      handler: async (event) => { /* ... */ },
    });

    // jobs/jobs.ts
    const _ = new CronJob("daily-aggregation", {
      every: "24h",
      endpoint: runDailyAggregation,
    });

Schema migrations went into numbered SQL files per service (migrations/1_create_orders.up.sql, migrations/2_add_request_id.up.sql), which Encore tracks and applies in order. The typed cross-service call to payments came through ~encore/clients, which gives a compile-time contract on the request and response shapes.
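What "tracks and applies in order" buys over CREATE TABLE IF NOT EXISTS at boot can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not Encore's implementation: the Migration shape, the Set standing in for a tracking table, and the SQL strings are all hypothetical.

```typescript
// A migration as it might be loaded from a numbered file such as
// migrations/1_create_orders.up.sql (file loading is omitted here).
interface Migration {
  version: number;
  name: string;
  sql: string;
}

// Stand-in for a tracking table: the set of versions already applied.
type AppliedVersions = Set<number>;

// Returns the migrations that still need to run, in ascending version order.
function pendingMigrations(all: Migration[], applied: AppliedVersions): Migration[] {
  return all
    .filter((m) => !applied.has(m.version))
    .sort((a, b) => a.version - b.version);
}

// Applies pending migrations through a caller-supplied executor and
// records each version so it never runs twice.
function migrate(
  all: Migration[],
  applied: AppliedVersions,
  exec: (sql: string) => void
): number[] {
  const ran: number[] = [];
  for (const m of pendingMigrations(all, applied)) {
    exec(m.sql);
    applied.add(m.version);
    ran.push(m.version);
  }
  return ran;
}

// Hypothetical SQL, deliberately listed out of order.
const migrations: Migration[] = [
  { version: 2, name: "add_request_id", sql: "ALTER TABLE orders ADD COLUMN request_id TEXT" },
  { version: 1, name: "create_orders", sql: "CREATE TABLE orders (id SERIAL PRIMARY KEY)" },
];

const applied: AppliedVersions = new Set();
const executed: string[] = [];
const firstDeploy = migrate(migrations, applied, (sql) => executed.push(sql));
const secondDeploy = migrate(migrations, applied, (sql) => executed.push(sql));
```

The first deploy runs versions 1 and 2 in that order; the second runs nothing. The recorded version set is the history that the boot-time approach lacks.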

For t3's tracing the agent didn't thread request_id through function arguments the way it did on the other four frameworks. Encore's runtime propagates the request's correlation id automatically across the typed call to payments, the Subscription handler, and the CronJob invocation, so the agent just imported encore.dev/log, called log.info(...) from the handlers, and read the runtime's tracing surface to populate /orders/:id/trace.
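The difference between threading an id through every argument and having it propagate implicitly can be illustrated with Node's built-in AsyncLocalStorage. This is a sketch of the general technique only; it is not how Encore's runtime works internally, and the function names and in-memory spans array are hypothetical.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

interface Span {
  service: string;
  operation: string;
  started_at: string;
  request_id: string;
}

// Ambient per-request context; nothing below takes a requestId argument.
const requestContext = new AsyncLocalStorage<{ requestId: string }>();
const spans: Span[] = [];

function recordSpan(service: string, operation: string): void {
  const ctx = requestContext.getStore();
  spans.push({
    service,
    operation,
    started_at: new Date().toISOString(),
    request_id: ctx ? ctx.requestId : "unknown",
  });
}

function chargePayment(): void {
  recordSpan("payments", "charge");
}

function createOrder(): void {
  recordSpan("orders", "create");
  chargePayment();
}

// Entry point: use the incoming X-Request-Id if present, otherwise generate one.
function handleRequest(headerRequestId?: string): string {
  const requestId = headerRequestId ?? `req-${Date.now()}`;
  requestContext.run({ requestId }, createOrder);
  return requestId;
}

handleRequest("req-123");
```

Both spans carry req-123 even though neither createOrder nor chargePayment accepts it as a parameter. The same store is also visible across awaits inside the run callback, which is what makes the pattern usable for real handlers.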

Why these implementations aren't equivalent in production

Both pass the test suite, but they behave very differently the moment you deploy them.

| Design the agent reached for | Production weakness | Encore primitive |
| --- | --- | --- |
| Postgres queue polled by setInterval | Application DB doubles as the event bus. No dead-letter destination, so a poison message retries forever and can block everything behind it. | new Topic({ deliveryGuarantee: "at-least-once" }). Real durable topic with retries and a DLQ configured at the framework level. |
| setInterval cron in the application process | Fires once per replica. Three replicas run the daily aggregation three times per tick. The agent's INSERT ... ON CONFLICT DO UPDATE saved it here, but any non-idempotent cron would fire three times for real. | new CronJob declares the schedule at the framework level; an external scheduler (Encore Cloud's platform, or a Kubernetes CronJob in self-hosted setups) invokes the endpoint once per tick, so it fires once across the fleet rather than once per replica. |
| CREATE TABLE IF NOT EXISTS at boot | No migration history. Column renames and NOT-NULL backfills have no version to roll back to and nothing to coordinate with a deploy. | Numbered SQL migrations tracked in a _migrations table, applied in order on every deploy. |

None of these guarantees show up in Run 1's test suite, which is why the test suite alone wasn't a satisfying answer to the question we set out to ask.
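The once-per-replica problem is easy to see in a toy model. The sketch below uses in-memory stand-ins only; the digest job, the day key, and the totals are hypothetical and chosen to contrast an idempotent upsert with a side effect that has no natural key.

```typescript
// In-memory stand-ins for two kinds of daily job (both hypothetical).
const orderTotals = new Map<string, number>(); // keyed by day, like an upsert target
let emailsSent = 0; // a side effect with no natural key

// Idempotent: writing the same total for the same day overwrites,
// like INSERT ... ON CONFLICT DO UPDATE.
function aggregateDay(day: string, total: number): void {
  orderTotals.set(day, total);
}

// Not idempotent: every invocation has a visible effect.
function sendDailyDigest(): void {
  emailsSent += 1;
}

// One tick of an in-process timer, fired independently by every replica.
function tickPerReplica(replicas: number, job: () => void): void {
  for (let i = 0; i < replicas; i++) job();
}

// One tick of an external scheduler that invokes the job once for the fleet.
function tickOncePerFleet(job: () => void): void {
  job();
}

tickPerReplica(3, () => aggregateDay("2026-05-20", 1250));
tickPerReplica(3, sendDailyDigest);
const afterPerReplica = { rows: orderTotals.size, emails: emailsSent };

emailsSent = 0;
tickOncePerFleet(sendDailyDigest);
const afterOncePerFleet = { emails: emailsSent };
```

With three replicas the upsert still leaves one row, but the digest goes out three times; with a single fleet-wide invocation it goes out once. A black-box test against one server instance cannot tell these two designs apart.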

Run 2: what if we add the right libraries to the project?

If the gap was just that the agent didn't know which libraries to use, the fix was to put them in the project for it. We built four library-augmented starter variants with production-grade equivalents to Encore's primitives pre-installed (pg-boss for durable jobs and cron, drizzle-kit/knex/typeorm for migrations, pino for structured logs), and a README explaining what each library was for. Same harness, same model, same prompts, same tasks, same Encore starter.

run 2 · per-framework results · medians across 3 repeats
With libraries pre-installed, the four non-Encore frameworks all regressed.
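| Framework | Tests passed (out of 31) | First-try green (out of 3 repeats) | Total turns (across t1, t2, t3) | Total cost (USD) |
| --- | --- | --- | --- | --- |
| Express | 25/31 | 0/3 | 190 | $2.55 |
| Fastify | 26/31 | 0/3 | 220 | $2.81 |
| Hono | 28/31 | 0/3 | 224 | $2.35 |
| NestJS | 24/31 | 0/3 | 289 | $4.13 |
| Encore | 31/31 | 3/3 | 150 | $1.75 |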

Every framework other than Encore regressed. None of them landed a single first-try-green run across three repeats. Task 1 was largely fine; the agent fell apart on tasks 2 and 3, where it had to integrate the new libraries with the application code it inherited from t1.

run 2 · library integrations, by framework
When the agent had the right tools sitting in the project, it still couldn't wire them in.
| Library | Express | Fastify | Hono | NestJS |
| --- | --- | --- | --- | --- |
| pg-boss (durable queue + cron) | createQueue missing | createQueue missing | wiring spread thin | DI provider not registered |
| ORM (knex / drizzle-kit / typeorm) | not central in Run 2 | drizzle-kit | drizzle integration | typeorm |
| Structured logging (pino / nestjs-pino) | pino | pino | pino integration | nestjs-pino, partial |

Legend: clean · partial · failed in Run 2 · not central to Run 2

Every framework tripped on `pg-boss` in some way. Express and Fastify skipped the `boss.createQueue('name')` call that `pg-boss` v10 added as a requirement; NestJS imported the library but forgot to register its wrapping service in the module's `providers` array; Hono spread its failures across all four libraries. ORM and logging integrations landed more often than the queue did, but the queue is the load-bearing piece for the durable-event tests in t2.

The specific failure for Express and Fastify was that the agent wrote pg-boss code that registered a scheduled job without first creating the queue. pg-boss v10 requires queues to be explicitly created via boss.createQueue('name') before jobs can be sent to them, scheduled, or worked. The agent didn't account for that, so t2's server crashed at boot.

    error: Queue daily-aggregation not found
    detail: 'Key (name)=(daily-aggregation) is not present in table "queue".'
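The ordering rule behind that error can be modelled without a database. The QueueRegistry class below is a hypothetical in-memory stand-in that mimics only the one constraint described above; it is not the pg-boss API. Real pg-boss methods are asynchronous and need a Postgres connection, and the cron expression is an assumed daily schedule.

```typescript
// Hypothetical in-memory stand-in that mimics only the rule described
// above: a queue must exist before anything is scheduled on it.
class QueueRegistry {
  private queues = new Set<string>();
  private schedules = new Map<string, string>();

  createQueue(name: string): void {
    this.queues.add(name);
  }

  schedule(name: string, cron: string): void {
    if (!this.queues.has(name)) {
      throw new Error(`Queue ${name} not found`);
    }
    this.schedules.set(name, cron);
  }

  scheduledCron(name: string): string | undefined {
    return this.schedules.get(name);
  }
}

// What the agent's boot sequence amounted to: schedule with no queue.
function bootWithoutCreate(boss: QueueRegistry): string {
  try {
    boss.schedule("daily-aggregation", "0 0 * * *");
    return "ok";
  } catch (err) {
    return (err as Error).message;
  }
}

// The missing step: create the queue first, then schedule.
function bootWithCreate(boss: QueueRegistry): string {
  boss.createQueue("daily-aggregation");
  boss.schedule("daily-aggregation", "0 0 * * *");
  return "ok";
}

const broken = bootWithoutCreate(new QueueRegistry());
const fixed = bootWithCreate(new QueueRegistry());
```

In this model the first boot sequence fails with the same "not found" message and the second succeeds, so the whole fix is one extra call placed before the schedule.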

On NestJS it was a related but distinct problem: the agent imported pg-boss into a NotificationsService but didn't register the wrapping PgBossService in the module's providers array. On Hono the failures spread across all four pre-installed libraries. The pattern across all four frameworks was the same: the agent reached for the right library and couldn't land the integration cleanly under the 80-turn budget the linked tasks gave it.

Run 3: what if the tests also checked for production-readiness?

Adding the libraries in Run 2 wasn't enough on its own, so in Run 3 we made production-readiness part of what the test suite graded directly.

The main change was to task 3, whose prompt was rewritten so the agent had to satisfy five production-readiness checks alongside the original tracing tests, each backed by an automated test in the suite. The agent now had an explicit target to iterate against instead of a standard it would have to infer.

On the final task we also let the agent run a "Ralph Wiggum" loop, iterating until the integration actually held together.

Everything else carried over from Run 2, library-augmented starters included.

run 3 · production-readiness rubric · 5 checks × 5 frameworks
What each framework's agent shipped, against the rubric.
| Check | Express | Fastify | Hono | NestJS | Encore |
| --- | --- | --- | --- | --- | --- |
| Versioned migrations (numbered files; CREATE TABLE IF NOT EXISTS at boot doesn't count) | CREATE TABLE IF NOT EXISTS at boot | drizzle-kit | drizzle-kit | typeorm | service migrations |
| Multi-instance-safe cron (fires once across the fleet, not once per replica) | pg-boss | pg-boss | pg-boss | pg-boss | CronJob |
| Retry policy + DLQ (finite maxRetries, exhausted messages land somewhere terminal) | pg-boss | pg-boss | pg-boss | pg-boss | Subscription |
| Failed-message endpoint (GET /notifications/_failed validated by poison-pill payload) |  |  |  |  |  |
| Structured logging (JSON with level + timestamp; console.log doesn't pass) | pino | pino | pino | nestjs-pino | encore.dev/log |

Encore reaches every check with a single framework primitive. The other four frameworks reach the same checks via a stack of libraries (pg-boss, drizzle-kit or typeorm, pino) glued together. Only Express's agent failed a check: it kept `CREATE TABLE IF NOT EXISTS` for the new schema even though `knex` was already in the project.

Encore reached 36/36 with one primitive per check. The rubric additions landed as small changes against the same primitives the Encore agent had used in Run 1: a retryPolicy on the existing Subscription (Encore's runtime moves exhausted messages to a platform-managed DLQ automatically), encore.dev/log calls for structured logging, and Encore service migrations for the schema-versioning check. The retry change was one line added to the existing subscription declaration:

notifications/notifications.ts
 

Fastify was the cleanest non-Encore result: the agent pulled pg-boss for pub/sub and cron, drizzle-kit for migrations, pino for logs, and hit green on every check. Total cost: $4.60 against Encore's $2.58, but the production semantics matched. Express came one test short, having implemented every check except migrations. Hono and NestJS had wider regressions: tracing failed across both, and NestJS additionally shipped a TypeScript error on the unmodified project that broke the typecheck probe (29/36 and 30/36 respectively).

production-readiness score · 5-check rubric · 0–100%
Encore holds at 100% across all three runs, while the other four only move in Run 3.
Chart: production-readiness score (0% to 100%) for Run 1 (clean starter), Run 2 (libraries pre-installed), and Run 3 (production rubric in tests). Run 3 end points: Express 80%, Fastify 100%, Hono 60%, NestJS 40%, Encore 100%.

Each line is one framework's production-readiness score across the three runs. Encore's primitives map directly to the rubric, so the agent gets there on every run. Run 2 didn't move the other four because adding libraries didn't get the agent to integrate them cleanly. Run 3 only moved them because we wrote the rubric into the test suite and let the agent iterate against it. Runs 1 and 2 are back-scored against the same rubric.

Three runs in one chart. Encore stays flat at 100% because its primitives already encode the rubric. The other four start near 20%, don't move in Run 2 because the agent couldn't integrate the libraries cleanly, and only move in Run 3 because we wrote the rubric directly into the test suite. Runs 1 and 2 are back-scored against the same rubric.

The full Run 3 task 3 prompt:

Extend the existing backend to add distributed tracing and bring the solution to production-readiness across all services. The current code in /work is the working solution to the previous tasks. All previously-passing tests must continue to pass.

Correlation id + tracing: every incoming HTTP request gets a correlation id (X-Request-Id header if present, otherwise generated). The id propagates through orders to payments, the order-created event, and the notifications subscriber. Each service records a span with { service, operation, started_at, request_id }. GET /orders/:id/trace returns { request_id, spans } for the given order, owner-only.

Production-readiness requirements (each verified by an automated check):

  1. Multi-instance-safe scheduling. The daily aggregation cron must use a shared coordination layer (Encore CronJob, pg-boss schedule, BullMQ repeatable jobs, @nestjs/schedule + a coordination layer). Raw setInterval or setTimeout inside the application process is not acceptable.
  2. Pub/sub retry policy. The notifications subscription must declare a finite retry policy. After the configured maxRetries, failed messages must move to a dead-letter queue (or equivalent terminal failed state).
  3. Failed-messages observability endpoint. GET /notifications/_failed (no auth) returns a JSON array of records that exhausted retries. Each record includes order_id (or orderId). If the subscriber receives a payload whose customer_id starts with POISON_TEST, the handler must throw or otherwise fail in a broker-retryable way; after exhausting retries, the failed message must appear at this endpoint.
  4. Versioned schema migrations. All schema changes must be expressed as numbered migration files using a migration tool (Encore service migrations, Knex, Drizzle, TypeORM). CREATE TABLE IF NOT EXISTS at app startup is not acceptable. The project must contain a migrations directory with at least two numbered files.
  5. Structured logging. Application logs must be valid JSON with at minimum a level (or severity) field and a timestamp field. Do not use console.log for application events. Acceptable: encore.dev/log, Pino, Fastify's Pino logger, NestJS Logger, nestjs-pino.

Endpoint scope: previous tasks plus /orders/:id/trace and /notifications/_failed. No /test, /debug, /admin, /internal back-doors.
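Requirements 2 and 3 of the prompt describe a small state machine: try, retry a finite number of times, then park the message somewhere terminal where an endpoint can list it. A minimal in-memory sketch of that behavior, assuming no backoff and a synchronous handler; the deliver function, the FailedRecord shape, and the failedMessages array are hypothetical and do not come from any of the frameworks or libraries tested.

```typescript
interface OrderCreated {
  order_id: string;
  customer_id: string;
}

interface FailedRecord {
  order_id: string;
  attempts: number;
  error: string;
}

// Terminal store for messages that exhausted their retries; a
// failed-messages endpoint would serve this array as JSON.
const failedMessages: FailedRecord[] = [];

// Handler following the poison-pill rule from the prompt.
function notify(event: OrderCreated): void {
  if (event.customer_id.startsWith("POISON_TEST")) {
    throw new Error("poison payload");
  }
}

// Tries the handler once, then up to maxRetries more times. On
// exhaustion the message is recorded as failed instead of being
// retried forever.
function deliver(
  event: OrderCreated,
  handler: (e: OrderCreated) => void,
  maxRetries: number
): boolean {
  let lastError = "";
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    try {
      handler(event);
      return true;
    } catch (err) {
      lastError = (err as Error).message;
    }
  }
  failedMessages.push({
    order_id: event.order_id,
    attempts: maxRetries + 1,
    error: lastError,
  });
  return false;
}

const okDelivered = deliver({ order_id: "o-1", customer_id: "c-42" }, notify, 3);
const poisonDelivered = deliver({ order_id: "o-2", customer_id: "POISON_TEST_1" }, notify, 3);
```

The healthy message is delivered on the first attempt; the poison message fails four times (one try plus three retries) and ends up in failedMessages. That bounded path is what the hand-rolled polling queue from Run 1 lacked.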

Cost

cost per chain · USD · medians across repeats
Encore's cost stays roughly stable across the three runs while the other four climb.
Chart: cost per chain in USD ($0 to $6) across Run 1, Run 2, and Run 3. Run 3 values: NestJS $5.95, Fastify $4.60, Hono $3.76, Express $3.04, Encore $2.58.

NestJS's Run 3 cost ($5.95) is more than double Encore's ($2.58), with fewer tests passing. All four non-Encore frameworks saw costs rise in both Run 2 and Run 3; three of the four climbed hardest in Run 3, when the agent was iterating against the production-readiness rubric on top of the library integration work.

Across all three runs combined Encore came in cheapest at $6.29 of token spend while NestJS was the most expensive at $12.69, roughly twice as much for fewer passing tests. The shape of the line matters as much as the totals: a team running this kind of agent-driven workflow daily will compound the per-run gap into a predictable monthly bill, and Encore's flat line is the only one that doesn't widen with the workload.
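The combined totals can be rechecked from the per-run medians reported in this post. The sketch below keeps amounts in integer cents to avoid floating-point drift; the variable and function names are illustrative.

```typescript
// Median cost per run in US cents (Run 1, Run 2, Run 3), copied from
// the figures in this post.
const costCents: Record<string, [number, number, number]> = {
  Express: [171, 255, 304],
  Fastify: [160, 281, 460],
  Hono: [155, 235, 376],
  NestJS: [261, 413, 595],
  Encore: [196, 175, 258],
};

// Sum of the three runs, kept in integer cents.
function totalCents(framework: string): number {
  const runs = costCents[framework];
  if (!runs) throw new Error(`unknown framework: ${framework}`);
  return runs.reduce((sum, c) => sum + c, 0);
}

function formatUsd(cents: number): string {
  return `$${(cents / 100).toFixed(2)}`;
}

const encoreTotal = totalCents("Encore"); // 629
const nestTotal = totalCents("NestJS"); // 1269
const ratio = nestTotal / encoreTotal; // about 2.02

console.log(formatUsd(encoreTotal), formatUsd(nestTotal), ratio.toFixed(2));
```

The same sums give $7.30 for Express, $7.66 for Hono, and $9.01 for Fastify, so Encore is the cheapest across the three runs and NestJS costs roughly twice as much.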

What about deployment?

The benchmark stops at "tests pass on a local server," which is also where a real team would start the deployment work: broker, cron, migrations, secrets, IAM, networking. We haven't measured any of this yet. On paper, Encore's design says the source doesn't change to be deployed on different infrastructure: encore build docker myapp:latest produces an image and an infra.config.json swaps NATS for SQS or local Postgres for RDS. For the other four, everything around the application code is still a separate stack to build. The formal version is a Run 4 with a sandboxed cloud account, provisioning checks (broker exists, cron is multi-instance-safe at the cloud level, migrations applied against the deployed DB, secrets handled correctly), and a comparable budget across all five stacks. Until that exists, treat this section as design rather than measurement.

What the comparison actually shows

The framework's defaults shape what the agent ships, more than any prompt or rubric we layer on top. When the shortest path to a green test suite runs through framework primitives that already encode production-grade behavior, the agent's first draft is roughly production-grade. When the shortest path is a hand-rolled queue, a setInterval, or a CREATE TABLE IF NOT EXISTS, that's what the first draft is. Neither pre-installed libraries nor stricter tests closed that gap without considerable extra cost.

If you're a team choosing a backend framework today and AI is doing a meaningful share of your coding, the practical implication is to weight what the framework hands the agent out of the box more heavily than you used to, and to weight router ergonomics less. Integrated primitives beat library compositions for an agent: in our runs, frameworks that gave the agent a single primitive for durable events, cron, and migrations finished with fewer tests broken, lower token cost, and production semantics already in place. Frameworks that left those concerns to library glue paid the bill in token cost and broken tests, and still left the deployment work undone.

Encore happens to be built that way, and the broader infrastructure-from-code direction points the same way. That's where we expect the most interesting framework-level work over the next year to come from.

Encore

This blog is presented by Encore, the backend framework for building robust type-safe distributed systems with declarative infrastructure.
