How does an AI coding agent perform on Express compared with Encore when you give it the same realistic backend tasks?
We took Claude Code, pointed it at the same project (an HTTP API with persistence, a pub/sub event, a daily cron, and distributed tracing), and ran it on both frameworks using the same prompts, the same model, the same Postgres setup, and the same VM. We captured every artifact and scored both outputs against the same rubric. The work is part of our wider AI-readiness benchmark across five TypeScript frameworks; this article focuses on the Express side of that comparison.
After the first run both frameworks passed the test suite (31/31 each), and the obvious thing to publish was something like "Express is fine for AI coding, Encore is fine too." Then we read the diffs and the picture changed. On Express the agent had built the laziest solution that would still pass the tests: a Postgres table polled by setInterval for the durable queue, CREATE TABLE IF NOT EXISTS at boot in place of any migration system, and an in-process setTimeout chain for the daily cron. On Encore it had reached for the framework's Topic/Subscription, CronJob, and numbered SQL migrations and ended up with code that was roughly production-ready on the first try. We then ran two more passes against Express to see whether we could close the gap.
Everything is reproducible. Full repo, prompts, starters, and transcripts at github.com/encoredev/ai-backend-benchmark.
Each framework gets its own VM with the same Postgres setup and the same claude-sonnet-4-6 model running through Claude Code. The agent works through three linked tasks: t1 (HTTP API and persistence), then t2 (extend t1 with pub/sub and cron, starting from t1's working directory), then t3 (extend t2 with tracing and production-readiness, starting from t2's). That linkage matters because some failure modes only show up when the agent has to extend existing code.
The tests are plain black-box HTTP assertions, run with vitest, and they are the same probes against every framework. The agent never reads the source of the tests. The Express starter is whatever express-generator (TypeScript variant) produces. The Encore starter is what encore app create produces, which also includes Encore's CLAUDE.md and MCP server, since the comparison we wanted is between the two frameworks as they arrive today.
In Run 1 the agent hit 31/31 on Express and the diffs converged on three patterns that turn up on every non-Encore framework we tested.
For the durable pub/sub the agent created a Postgres table and polled it on a timer, with retry counters maintained in the same row:
```ts
// Enqueue side: the request handler writes the event into the queue table.
await pool.query(
  `INSERT INTO event_queue (event_type, payload, status) VALUES ($1, $2, 'pending')`,
  ['order-created', { order_id: id }]
);

// Worker side: poll every 500 ms and claim at most one due row per tick.
setInterval(async () => {
  const { rows } = await client.query(`
    SELECT * FROM event_queue
    WHERE status = 'pending' AND next_retry_at <= NOW()
    ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED
  `);
  // process, mark done, or bump retry counter
}, 500);
```
For the daily aggregation cron the agent scheduled a setTimeout chain running inside the application process at boot:
```ts
function startScheduler() {
  const scheduleNext = () => {
    const next = nextMidnightUTC();
    setTimeout(() => {
      runDailyAggregation().catch(console.error);
      scheduleNext();
    }, next.getTime() - Date.now());
  };
  scheduleNext();
}
```
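The scheduler above depends on a nextMidnightUTC helper that the excerpt does not show. A minimal sketch of what such a helper might look like, assuming it returns the first 00:00 UTC boundary strictly after the current time (the optional `now` parameter and the `msUntilNextMidnightUTC` wrapper are our additions, there to make the logic testable):

```typescript
// Hypothetical helper: the benchmark excerpt calls nextMidnightUTC() but does
// not show its body. This version assumes "next midnight" means the first
// 00:00:00.000 UTC instant strictly after `now`.
function nextMidnightUTC(now: Date = new Date()): Date {
  // Date.UTC normalises day overflow, so day + 1 rolls over month and year ends.
  return new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1)
  );
}

// Delay in milliseconds that the setTimeout chain would wait before firing.
function msUntilNextMidnightUTC(now: Date = new Date()): number {
  return nextMidnightUTC(now).getTime() - now.getTime();
}
```

Because the delay is computed inside each process, every replica that boots arms its own timer for the same midnight.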
For schema the agent called CREATE TABLE IF NOT EXISTS plus the occasional ALTER TABLE ... ADD COLUMN IF NOT EXISTS on startup, with no migration tracking. The schema was whatever the boot script happened to execute, with no version history to coordinate with deploys.
On Encore the same agent given the same prompts declared the async work using the framework's primitives. Pub/sub was a Topic with a typed Subscription:
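```ts
// orders/events.ts
import { Topic } from "encore.dev/pubsub";

export const orderCreated = new Topic<OrderCreatedEvent>("order-created", {
  deliveryGuarantee: "at-least-once",
});

// notifications/notifications.ts
import { Subscription } from "encore.dev/pubsub";
import { orderCreated } from "../orders/events";

new Subscription(orderCreated, "send-notification", {
  handler: async (event) => { /* ... */ },
});
```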
The cron was a CronJob:
```ts
import { CronJob } from "encore.dev/cron";

const _ = new CronJob("daily-aggregation", {
  every: "24h",
  endpoint: runDailyAggregation,
});
```
Schema migrations went into numbered SQL files (migrations/1_create_orders.up.sql, migrations/2_add_request_id.up.sql) which Encore tracks and applies in order on every deploy. For the tracing task in t3 the agent did not thread request_id through function arguments the way the Express agent did, because Encore's runtime propagates the correlation id automatically across the typed call to payments, the subscription handler, and the cron invocation; the agent just imported encore.dev/log and called log.info(...) from the handlers.
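To make the numbered-migration convention concrete, here is a minimal sketch of the ordering logic any such runner needs: parse the numeric prefix, sort numerically, and apply only versions newer than the last recorded one. It is not Encore's implementation; `parseMigration`, `pendingMigrations`, and their parameters are hypothetical names for illustration.

```typescript
interface Migration {
  version: number;
  file: string;
}

// Parse names like "2_add_request_id.up.sql"; anything else is ignored.
function parseMigration(file: string): Migration | null {
  const match = /^(\d+)_.+\.up\.sql$/.exec(file);
  return match ? { version: Number(match[1]), file } : null;
}

// Return the migrations still to run, in ascending numeric order.
// `lastApplied` is the highest version already recorded (0 for a fresh database).
function pendingMigrations(files: string[], lastApplied: number): Migration[] {
  return files
    .map(parseMigration)
    .filter((m): m is Migration => m !== null && m.version > lastApplied)
    .sort((a, b) => a.version - b.version);
}
```

The recorded version is the piece that CREATE TABLE IF NOT EXISTS at boot lacks: without it there is nothing to compare the files against, so a rename or backfill cannot be expressed as "run once, in order".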
Both implementations pass the same test suite in Run 1, but they behave very differently the moment you deploy them.
| What the agent built on Express | Production weakness | What the agent built on Encore |
|---|---|---|
| Postgres queue polled by setInterval | The application database doubles as the event bus, and there is no dead-letter destination, so a poison message retries forever and can block everything behind it. | Topic with at-least-once delivery, retries, and a platform-managed DLQ configured at the framework level. |
| setTimeout cron in the application process | Fires once per replica; three replicas run the daily aggregation three times per tick. The agent's INSERT ... ON CONFLICT DO UPDATE happened to make this idempotent, but any non-idempotent cron would fire three times for real. | CronJob declares the schedule at the framework level, and an external scheduler invokes the endpoint once per tick across the fleet. |
| CREATE TABLE IF NOT EXISTS at boot | No migration history; column renames and NOT-NULL backfills have no version to roll back to and nothing to coordinate with a deploy. | Numbered SQL migrations tracked in a _migrations table, applied in order on every deploy. |
None of these guarantees show up in the Run 1 test suite, which is why the test suite alone was not a satisfying answer to the question.
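The cron row in the table is easy to reproduce in miniature. The sketch below simulates three replicas each firing the same daily tick against a shared store, once with an upsert keyed by date (analogous to the agent's INSERT ... ON CONFLICT DO UPDATE) and once with a plain append. The in-memory stores and function names are stand-ins of ours, not code from the benchmark.

```typescript
type DailyTotal = { day: string; total: number };

// Idempotent write: keyed by day, so a repeated tick overwrites the same row.
function upsertTotal(store: Map<string, DailyTotal>, row: DailyTotal): void {
  store.set(row.day, row);
}

// Non-idempotent write: every tick appends a new row.
function appendTotal(store: DailyTotal[], row: DailyTotal): void {
  store.push(row);
}

// Simulate `replicas` processes whose in-process timers all fire for the same day.
function simulateTick(replicas: number, write: (row: DailyTotal) => void): void {
  for (let i = 0; i < replicas; i++) {
    write({ day: "2026-01-15", total: 42 });
  }
}

const upserted = new Map<string, DailyTotal>();
simulateTick(3, (row) => upsertTotal(upserted, row));

const appended: DailyTotal[] = [];
simulateTick(3, (row) => appendTotal(appended, row));

console.log(upserted.size);   // 1
console.log(appended.length); // 3
```

A black-box test that checks only for the presence of the day's aggregate passes in both cases, which is why this difference does not surface in Run 1.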
We ran two more passes to see whether we could get Express to the same place.
In Run 2 we pre-installed the production-grade equivalents of Encore's primitives (pg-boss for durable jobs and cron, drizzle-kit for migrations, pino for structured logs) into the Express starter, with a README explaining what each library was for. Express regressed: the agent wrote pg-boss code that registered a scheduled job without first creating the queue, and the server crashed at boot with Queue daily-aggregation not found / Key (name)=(daily-aggregation) is not present in table "queue". pg-boss v10 requires each queue to be created explicitly via boss.createQueue('name') before jobs can be sent to it, scheduled on it, or worked from it, and the agent's code did not account for that.
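The ordering that avoids this crash can be sketched as follows. To keep the example self-contained it is written against a small hand-written interface that mirrors the three pg-boss v10 methods involved (createQueue, schedule, work) rather than importing the library. The interface, the function name, and the midnight cron expression are assumptions for illustration, and the real method options should be checked against the pg-boss documentation.

```typescript
// Hand-written subset of the pg-boss v10 surface used here (an assumption,
// not the library's full type definitions).
interface QueueClient {
  createQueue(name: string): Promise<void>;
  schedule(name: string, cron: string): Promise<void>;
  work(name: string, handler: (jobs: unknown[]) => Promise<void>): Promise<unknown>;
}

const DAILY_QUEUE = "daily-aggregation";

// Create the queue first, then attach the schedule and the worker.
async function registerDailyAggregation(
  boss: QueueClient,
  runDailyAggregation: () => Promise<void>
): Promise<void> {
  await boss.createQueue(DAILY_QUEUE);           // must come before schedule/work
  await boss.schedule(DAILY_QUEUE, "0 0 * * *"); // cron expression: 00:00 every day
  await boss.work(DAILY_QUEUE, async () => {
    await runDailyAggregation();
  });
}
```

In an application this would be called once at boot, after the client has started, with the real pg-boss instance passed in as `boss`.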
In Run 3 we made production-readiness part of what the test suite graded directly, with five additional checks alongside the original tracing tests: multi-instance-safe scheduling, a finite retry policy with a dead-letter destination, a failed-messages observability endpoint, versioned schema migrations, and structured logging. On Express the agent landed four of the five and missed only the migrations check. Encore reached all five with a one-line change to the existing subscription:
```ts
new Subscription(orderCreated, "send-notification", {
  handler: async (event) => { /* ... */ },
  retryPolicy: { maxRetries: 3 },
});
```
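For comparison, the bookkeeping that a hand-rolled queue has to carry to pass the same check (a finite retry budget plus a dead-letter state) can be sketched as a pure state transition. The field names, the exponential backoff, and the `afterFailure` function are illustrative assumptions, not the code the agent produced.

```typescript
type Status = "pending" | "dead";

interface QueuedMessage {
  id: number;
  attempts: number;    // failed deliveries so far
  status: Status;
  nextRetryAt: number; // epoch milliseconds
}

interface RetryPolicy {
  maxRetries: number;  // retries allowed after the first failed attempt
  baseDelayMs: number; // delay before the first retry; doubles each time
}

// Compute the row's new state after a failed delivery attempt.
function afterFailure(msg: QueuedMessage, policy: RetryPolicy, now: number): QueuedMessage {
  const attempts = msg.attempts + 1;
  if (attempts > policy.maxRetries) {
    // Retry budget exhausted: park the message so it stops blocking the queue.
    return { ...msg, attempts, status: "dead" };
  }
  const delay = policy.baseDelayMs * 2 ** (attempts - 1);
  return { ...msg, attempts, status: "pending", nextRetryAt: now + delay };
}
```

Rows that end up in the dead state are what a failed-messages observability endpoint would list, and the poller's WHERE status = 'pending' filter skips them, so a poison message no longer retries forever.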
Across all three runs Encore's total token spend was $6.29. Express's spend climbed run over run as the agent iterated on library integration in Run 2 and on the production-readiness checks in Run 3. For a team running this kind of agent-driven workflow daily, the per-run difference shows up in the monthly bill.
Express remains a reasonable pick if you are writing a small API where a human will do the production hardening, or if you have an existing Express codebase with deep internal middleware and tooling you do not want to replace, or if you are building a prototype you will throw away. What the benchmark shows is that when an AI agent is writing a meaningful share of the code, the lack of framework-level primitives for durable events, scheduling, and migrations stops being something the human absorbs and becomes something the agent has to reinvent on every task.
Encore is built so that durable events, scheduling, and migrations are first-class framework concerns rather than library compositions, which means an AI agent reaches for them by default and the production-readiness checks land incidentally. For teams choosing a TypeScript backend in 2026 where AI is writing a meaningful share of the code, that is where we would point you.
Clone the repo, point it at your own framework, or rewrite the rubric to match your own definition of production-ready: github.com/encoredev/ai-backend-benchmark.