How does an AI coding agent perform on Hono compared with Encore when both get the same realistic backend tasks?
We took Claude Code, pointed it at the same project (an HTTP API with persistence, a pub/sub event, a daily cron, and distributed tracing), and ran it on both frameworks using the same prompts, the same model, the same Postgres setup, and the same VM. This article focuses on the Hono side of our wider AI-readiness benchmark across five TypeScript frameworks. Hono finished the baseline run cheapest of any framework at $1.55 per run, but by the time we graded the output against a production-readiness rubric the same framework had dropped to 29 of 36 checks, with tracing failing on a probe with an unknown order id.
The full repo, prompts, starters, and transcripts are at github.com/encoredev/ai-backend-benchmark.
Each framework gets its own VM with the same Postgres setup and the same claude-sonnet-4-6 model running through Claude Code. The agent works through three linked tasks: t1 (HTTP API and persistence), then t2 (extend t1 with pub/sub and cron), then t3 (extend t2 with tracing and production-readiness). The tests are plain black-box HTTP probes run with vitest, and they are the same against every framework. The Hono starter is what npm create hono@latest produces; the Encore starter is what encore app create produces. Of the five frameworks we benchmarked, Hono is the only non-Encore one that ships any agent-facing materials at all (an llms.txt); Encore additionally ships a CLAUDE.md, an MCP server, and a dedicated AI-integration docs page.
Hono's Run 1 was the cheapest of the benchmark at $1.55. The agent built minimalist Hono routes for the HTTP API, a Postgres queue table polled by setInterval for the durable pub/sub, a setInterval chain inside the worker process for the daily cron, and CREATE TABLE IF NOT EXISTS at boot for the schema. The cost win came from writing less code overall rather than from doing the right thing; Hono's small surface area kept the agent's transcript short, but the implementation choices were the same anti-patterns we saw on Express, Fastify, and NestJS.
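To make the in-process cron problem concrete, here is a minimal sketch of why a timer registered at boot fires once per replica. It is an illustration, not the agent's code: `FakeClock`, `startDailyCron`, and `simulateFleet` are hypothetical names, and the fake clock stands in for `setInterval` so the example runs deterministically.

```typescript
// Hypothetical sketch: each replica that boots registers its own timer,
// so a single "daily" tick runs the job once per replica.
type Job = () => void;

class FakeClock {
  private timers: Job[] = [];

  // Stand-in for setInterval(job, 24h); it only records the callback.
  setInterval(job: Job): void {
    this.timers.push(job);
  }

  // Simulate one 24-hour interval elapsing for every registered timer.
  tick(): void {
    for (const job of this.timers) {
      job();
    }
  }
}

// What a worker process would do at boot in the in-process design.
function startDailyCron(clock: FakeClock, job: Job): void {
  clock.setInterval(job);
}

// Boot `replicas` workers, let one day pass, and count job executions.
function simulateFleet(replicas: number): number {
  const clock = new FakeClock();
  let runs = 0;
  for (let i = 0; i < replicas; i++) {
    startDailyCron(clock, () => {
      runs += 1;
    });
  }
  clock.tick();
  return runs;
}
```

With one replica the job runs once per day; with three replicas `simulateFleet(3)` counts three executions for the same day, which is the duplication an external scheduler avoids.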
One Hono-specific choice mattered later. For the tracing task in t3 the agent keyed spans on order_id rather than adding a request_id column to the orders table, which is the design choice that broke when the Run 3 test suite probed /orders/:id/trace with an order id the system had never seen.
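The difference between the two keying choices can be sketched with an in-memory store. This is an assumption-laden illustration rather than the benchmark's schema: the `Span` shape and both store classes are hypothetical, and a real implementation would sit on a Postgres table.

```typescript
// Hypothetical span shape; orderId is null when no order exists yet.
interface Span {
  requestId: string;
  orderId: string | null;
  name: string;
}

// Design A: spans can only be stored and found through an order id.
class OrderKeyedSpans {
  private spans: Span[] = [];

  record(span: Span): void {
    // A request that never produced an order has no key, so its span is lost.
    if (span.orderId === null) return;
    this.spans.push(span);
  }

  forOrder(orderId: string): Span[] {
    return this.spans.filter((s) => s.orderId === orderId);
  }
}

// Design B: every span is stored under its request id; the order id is optional.
class RequestKeyedSpans {
  private spans: Span[] = [];

  record(span: Span): void {
    this.spans.push(span);
  }

  forRequest(requestId: string): Span[] {
    return this.spans.filter((s) => s.requestId === requestId);
  }

  forOrder(orderId: string): Span[] {
    return this.spans.filter((s) => s.orderId === orderId);
  }
}

// One request that created an order, and one that was rejected before any order existed.
const created: Span = { requestId: "req-1", orderId: "order-1", name: "POST /orders" };
const rejected: Span = { requestId: "req-2", orderId: null, name: "POST /orders (rejected)" };

const orderKeyed = new OrderKeyedSpans();
const requestKeyed = new RequestKeyedSpans();
for (const span of [created, rejected]) {
  orderKeyed.record(span);
  requestKeyed.record(span);
}
```

In this sketch the order-keyed store has no way to answer a question about `req-2`, while the request-keyed store can answer by either id.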
The same agent given the same prompts declared the async work using Encore's primitives. Pub/sub was a Topic with a typed Subscription and deliveryGuarantee: "at-least-once". The cron was a CronJob. Schema migrations went into numbered SQL files. For tracing the agent relied on Encore's runtime to propagate the correlation id across calls and handlers automatically, populating /orders/:id/trace from the runtime's tracing surface rather than from an ad hoc spans table.
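For readers who have not used these primitives, the declarations look roughly like the sketch below. It is not the agent's output: the topic, subscription, and cron names, the `OrderCreatedEvent` shape, and the `dailySummary` endpoint are all hypothetical, and the file only runs inside an Encore app where the `encore.dev` package is available.

```typescript
import { api } from "encore.dev/api";
import { CronJob } from "encore.dev/cron";
import { Subscription, Topic } from "encore.dev/pubsub";

// Hypothetical event shape; the benchmark's real payload may differ.
export interface OrderCreatedEvent {
  orderId: string;
}

// A typed topic with the delivery guarantee named in the article.
export const orderCreated = new Topic<OrderCreatedEvent>("order-created", {
  deliveryGuarantee: "at-least-once",
});

// A typed subscription; the handler receives an OrderCreatedEvent.
const _subscription = new Subscription(orderCreated, "process-order", {
  handler: async (event) => {
    // At-least-once delivery means this may run more than once per event,
    // so the work done here should be idempotent.
    console.log("processing order", event.orderId);
  },
});

// The endpoint the cron job invokes (hypothetical daily work).
export const dailySummary = api({}, async (): Promise<void> => {
  console.log("building daily summary");
});

// A cron job declared at the framework level rather than with setInterval.
const _cron = new CronJob("daily-summary", {
  title: "Daily order summary",
  every: "24h",
  endpoint: dailySummary,
});
```

Because the queue and schedule are declared rather than hand-built, the agent does not have to write polling loops or timers for them.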
In Run 2 we pre-installed pg-boss, drizzle-kit, and pino in the Hono starter with a README explaining each. Hono's regressions in Run 2 spread across all of the pre-installed libraries: the agent failed to land any of the integrations cleanly under the linked-task turn budget, with cascading failures from t2 onward. By Run 3, when we wrote the production-readiness rubric directly into the test suite, Hono finished at 29 of 36 checks, the worst rubric score of any framework other than NestJS. The proximate cause was the same order_id-keyed tracing design from earlier runs, which could not answer a probe for a request_id that lived on a request that never created an order.
Both frameworks pass the same test suite in Run 1, but they behave very differently the moment you deploy.
| What the agent built on Hono | Production weakness | What the agent built on Encore |
|---|---|---|
| Postgres queue polled by setInterval | The application database doubles as the event bus, and there is no dead-letter destination, so a poison message retries forever. | Topic with at-least-once delivery, retries and a platform-managed DLQ configured at the framework level. |
| setInterval cron in the worker process | Fires once per replica. | CronJob invoked once per tick across the fleet by an external scheduler. |
| CREATE TABLE IF NOT EXISTS at boot | No migration history. | Numbered SQL migrations tracked in _migrations, applied in order on every deploy. |
| Spans keyed on order_id | Lookups by request_id (or any other correlation surface that does not have an order yet) return nothing. | Runtime-propagated correlation id, populated from the framework's tracing surface. |
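The poison-message row is easiest to see in a small simulation. The sketch below is not how Encore or pg-boss implement retries; `drain`, `QueuedMessage`, and the in-memory arrays are hypothetical stand-ins for a polled queue table, used only to show what a retry cap with a dead-letter destination changes.

```typescript
interface QueuedMessage {
  id: string;
  payload: string;
  attempts: number;
}

interface DrainResult {
  delivered: string[];
  deadLetters: string[];
  pending: string[];
}

// Poll the queue up to maxPolls times. A message whose handler keeps throwing
// is moved to deadLetters once it has failed maxAttempts times.
function drain(
  queue: QueuedMessage[],
  handler: (payload: string) => void,
  maxAttempts: number,
  maxPolls: number,
): DrainResult {
  const delivered: string[] = [];
  const deadLetters: string[] = [];
  // Copy the messages so the caller's queue is not mutated.
  let pending = queue.map((m) => ({ id: m.id, payload: m.payload, attempts: m.attempts }));

  for (let poll = 0; poll < maxPolls && pending.length > 0; poll++) {
    const stillPending: QueuedMessage[] = [];
    for (const msg of pending) {
      try {
        handler(msg.payload);
        delivered.push(msg.id);
      } catch {
        msg.attempts += 1;
        if (msg.attempts >= maxAttempts) {
          deadLetters.push(msg.id);
        } else {
          stillPending.push(msg);
        }
      }
    }
    pending = stillPending;
  }

  return { delivered, deadLetters, pending: pending.map((m) => m.id) };
}

// A handler that always fails on one specific payload.
const handler = (payload: string): void => {
  if (payload === "poison") throw new Error("cannot process");
};

const queue: QueuedMessage[] = [
  { id: "a", payload: "ok", attempts: 0 },
  { id: "b", payload: "poison", attempts: 0 },
];

// No cap: the poison message is still pending after every poll.
const uncapped = drain(queue, handler, Infinity, 5);
// Cap of 3 attempts: the poison message ends up in the dead-letter list.
const capped = drain(queue, handler, 3, 5);
```

Without a cap, message `b` is retried on every poll and never leaves the queue; with a cap of three attempts it lands in `deadLetters` and the queue drains.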
Hono's small surface and edge-runtime defaults make it a good fit for stateless functions and request-response APIs where the platform layer handles durability and scheduling. The benchmark stressed exactly the seams Hono is not designed to cover (durable jobs, scheduling that has to survive across replicas, schema versioning, correlation across async handlers), and on a workload like that the price of being lightweight is that the agent has to invent everything that the framework does not provide.
Hono is a good pick for edge-only or serverless-first APIs where production-readiness lives at the platform layer (Cloudflare Workers, Vercel, Bun), or for stateless request-response APIs where you do not need durable queues, multi-instance cron, or framework-level tracing.
If you are building a TypeScript backend that needs durable events, multi-instance-safe scheduling, versioned migrations, and request-scoped tracing, and an AI agent is writing a meaningful share of the code, Encore's primitives let the agent reach for the right thing on the first pass.
Clone the repo, point it at your own framework, or rewrite the rubric to match your own definition of production-ready: github.com/encoredev/ai-backend-benchmark.