03/09/26

How to Build a RAG Pipeline with TypeScript and PostgreSQL

Embed documents, store vectors in pgvector, search by meaning, and generate answers with an LLM

Retrieval-Augmented Generation (RAG) is the pattern behind most AI features that answer questions about your own data. Instead of fine-tuning a model, you retrieve relevant documents at query time and include them in the prompt. The LLM generates an answer grounded in your actual content rather than relying solely on its training data.

The typical RAG architecture involves three external services: a database for documents, a vector database for embeddings, and an LLM API. But if you already run PostgreSQL, you can cut that to two. pgvector adds vector storage and similarity search directly to Postgres, so your documents and their embeddings live in the same table, in the same transaction. No sync pipeline, no orphaned vectors, no consistency issues.

This tutorial builds a complete RAG pipeline: ingest documents, generate embeddings, store them in pgvector, search by semantic similarity, and generate answers with an LLM. The backend runs on Encore, an open-source TypeScript framework where databases are declared in code and provisioned automatically.

What You'll Build

A backend service with four endpoints:

  1. POST /documents: ingest a document, generate its embedding, and store both in Postgres
  2. POST /search: find documents similar to a query using vector similarity
  3. POST /ask: the full RAG flow, searching for relevant documents and generating an answer with an LLM
  4. GET /documents: list all ingested documents

The architecture is straightforward: one Postgres database with pgvector, one embedding API (OpenAI), and one LLM API (OpenAI). No separate vector database, no sync pipeline.

Prerequisites

  • Node.js and npm
  • The Encore CLI installed
  • Docker running locally (Encore uses it to provision PostgreSQL)
  • An OpenAI API key

Step 1: Create the Encore App

encore app create rag-pipeline --example=ts/empty
cd rag-pipeline

Install the OpenAI SDK:

npm install openai

Set your OpenAI API key as a secret:

encore secret set --type dev,local,pr,prod OpenAIKey

Paste your API key when prompted.

Step 2: Create the Database Migration

Create the service directory and migration:

mkdir -p rag/migrations

Create the migration file at rag/migrations/1_create_documents.up.sql:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id BIGSERIAL PRIMARY KEY,
  title TEXT NOT NULL,
  content TEXT NOT NULL,
  embedding vector(1536),
  metadata JSONB DEFAULT '{}',
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- HNSW index for fast cosine similarity search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Index for metadata filtering
CREATE INDEX ON documents USING gin (metadata);

This creates a table where each document has a title, content, a 1536-dimension vector column (matching OpenAI's text-embedding-3-small output), and a JSONB metadata column for filtering. The HNSW index enables fast approximate nearest-neighbor search.

Step 3: Declare the Database and Secrets

Create rag/encore.service.ts:

import { Service } from "encore.dev/service";

export default new Service("rag");

Create rag/db.ts:

import { SQLDatabase } from "encore.dev/storage/sqldb";

export const db = new SQLDatabase("rag", {
  migrations: "./migrations",
});

Create rag/secrets.ts:

import { secret } from "encore.dev/config";

export const OpenAIKey = secret("OpenAIKey");

When you run encore run, Encore provisions a real PostgreSQL instance locally (using Docker under the hood), runs the migration (including enabling pgvector), and makes the database available to your service. No connection strings to manage, no environment variables to configure.

Step 4: Build the Embedding Helper

Create rag/embeddings.ts:

import OpenAI from "openai";
import { OpenAIKey } from "./secrets";

// Create the client lazily so the secret is resolved at first use, not at import time.
let client: OpenAI | undefined;

function getClient(): OpenAI {
  if (!client) {
    client = new OpenAI({ apiKey: OpenAIKey() });
  }
  return client;
}

export async function generateEmbedding(text: string): Promise<number[]> {
  const response = await getClient().embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return response.data[0].embedding;
}

export async function chatCompletion(
  systemPrompt: string,
  userMessage: string
): Promise<string> {
  const response = await getClient().chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userMessage },
    ],
    temperature: 0.3,
    max_tokens: 1000,
  });
  return response.choices[0].message.content ?? "";
}

generateEmbedding sends text to OpenAI's embedding model and returns the 1536-dimension vector. chatCompletion sends a system prompt and user message to the chat model and returns the response. Low temperature (0.3) keeps answers grounded in the provided context rather than getting creative.

Step 5: Build the Ingest Endpoint

Create rag/ingest.ts:

import { api } from "encore.dev/api";
import { db } from "./db";
import { generateEmbedding } from "./embeddings";

interface IngestRequest {
  title: string;
  content: string;
  metadata?: Record<string, string>;
}

interface Document {
  id: number;
  title: string;
  createdAt: string;
}

export const ingest = api(
  { expose: true, method: "POST", path: "/documents" },
  async (req: IngestRequest): Promise<Document> => {
    // Generate embedding from the document content
    const embedding = await generateEmbedding(req.content);
    const vectorStr = `[${embedding.join(",")}]`;
    const metadata = JSON.stringify(req.metadata ?? {});

    const doc = await db.queryRow<Document>`
      INSERT INTO documents (title, content, embedding, metadata)
      VALUES (${req.title}, ${req.content}, ${vectorStr}::vector, ${metadata}::jsonb)
      RETURNING id, title, created_at AS "createdAt"
    `;

    return doc!;
  }
);

export const listDocuments = api(
  { expose: true, method: "GET", path: "/documents" },
  async (): Promise<{ documents: Document[] }> => {
    const rows = await db.query<Document>`
      SELECT id, title, created_at AS "createdAt"
      FROM documents
      ORDER BY created_at DESC
    `;

    const documents: Document[] = [];
    for await (const row of rows) {
      documents.push(row);
    }

    return { documents };
  }
);

The ingest endpoint takes a title, content, and optional metadata. It generates an embedding from the content and stores everything in a single INSERT statement. The document and its vector are written atomically. No sync issues, no orphaned vectors.

Step 6: Build the Search Endpoint

Create rag/search.ts:

import { api } from "encore.dev/api";
import { db } from "./db";
import { generateEmbedding } from "./embeddings";

interface SearchRequest {
  query: string;
  limit?: number;
}

interface SearchResult {
  id: number;
  title: string;
  content: string;
  similarity: number;
}

export const search = api(
  { expose: true, method: "POST", path: "/search" },
  async (req: SearchRequest): Promise<{ results: SearchResult[] }> => {
    const embedding = await generateEmbedding(req.query);
    const vectorStr = `[${embedding.join(",")}]`;
    const limit = req.limit ?? 5;

    const rows = await db.query<SearchResult>`
      SELECT id, title, content,
             1 - (embedding <=> ${vectorStr}::vector) AS similarity
      FROM documents
      ORDER BY embedding <=> ${vectorStr}::vector
      LIMIT ${limit}
    `;

    const results: SearchResult[] = [];
    for await (const row of rows) {
      results.push(row);
    }

    return { results };
  }
);

The search endpoint generates an embedding from the query text, then uses pgvector's <=> operator to find the most similar documents by cosine distance. Subtracting the distance from 1 converts it into a similarity score where 1 means identical; in theory the score ranges from -1 to 1, but for text embeddings it typically falls between 0 and 1.

The query runs against the HNSW index, which returns approximate nearest neighbors in single-digit milliseconds for datasets up to millions of vectors. For exact results on smaller datasets, you can remove the index and pgvector will do a brute-force scan.
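To make the distance-to-similarity conversion concrete, here is what pgvector's cosine distance computes, sketched as a plain TypeScript function. This is illustrative only (the cosineDistance helper is not part of the pipeline; pgvector does this work inside Postgres):

```typescript
// Cosine distance, as computed by pgvector's <=> operator, and the
// 1 - distance similarity conversion used in the queries above.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const v = [1, 0, 0];
console.log(1 - cosineDistance(v, [1, 0, 0])); // identical vectors: similarity 1
console.log(1 - cosineDistance(v, [0, 1, 0])); // orthogonal vectors: similarity 0
console.log(1 - cosineDistance(v, [-1, 0, 0])); // opposite vectors: similarity -1
```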

Step 7: Build the RAG Endpoint

Create rag/ask.ts:

import { api } from "encore.dev/api";
import { db } from "./db";
import { generateEmbedding, chatCompletion } from "./embeddings";

interface AskRequest {
  question: string;
  limit?: number;
}

interface AskResponse {
  answer: string;
  sources: { id: number; title: string; similarity: number }[];
}

export const ask = api(
  { expose: true, method: "POST", path: "/ask" },
  async (req: AskRequest): Promise<AskResponse> => {
    // Step 1: Embed the question
    const embedding = await generateEmbedding(req.question);
    const vectorStr = `[${embedding.join(",")}]`;
    const limit = req.limit ?? 5;

    // Step 2: Find similar documents
    const rows = await db.query<{
      id: number;
      title: string;
      content: string;
      similarity: number;
    }>`
      SELECT id, title, content,
             1 - (embedding <=> ${vectorStr}::vector) AS similarity
      FROM documents
      ORDER BY embedding <=> ${vectorStr}::vector
      LIMIT ${limit}
    `;

    const contexts: { id: number; title: string; content: string; similarity: number }[] = [];
    for await (const row of rows) {
      contexts.push(row);
    }

    // Step 3: Assemble the prompt with retrieved context
    const contextText = contexts
      .map((doc, i) => `[Document ${i + 1}: ${doc.title}]\n${doc.content}`)
      .join("\n\n");

    const systemPrompt = [
      "You are a helpful assistant.",
      "Answer the user's question based on the provided documents.",
      "If the documents don't contain enough information",
      "to answer the question, say so.",
      "Cite the document titles when referencing information.",
      "",
      "Documents:",
      contextText,
    ].join("\n");

    // Step 4: Generate the answer
    const answer = await chatCompletion(systemPrompt, req.question);

    return {
      answer,
      sources: contexts.map(({ id, title, similarity }) => ({
        id,
        title,
        similarity,
      })),
    };
  }
);

This is the complete RAG pipeline in one endpoint:

  1. The question is embedded into a vector using the same model that embedded the documents.
  2. pgvector finds the most similar documents by cosine distance.
  3. The retrieved documents are injected into the system prompt as context.
  4. The LLM generates an answer grounded in those documents, with instructions to cite sources.

The response includes both the generated answer and the source documents with their similarity scores, so the caller can show attribution.

Step 8: Run and Test

Start the development server:

encore run

Encore provisions PostgreSQL locally, runs the migration (which enables pgvector and creates the table), and starts the service. Open the local development dashboard at http://localhost:9400 to test the endpoints interactively.

Ingest some documents

curl -X POST http://localhost:4000/documents \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Encore Framework Overview",
    "content": "Encore is an open-source TypeScript and Go backend framework. It uses infrastructure from code, where databases, caches, pub/sub topics, and cron jobs are declared as objects in application code and provisioned automatically. Request processing is handled in Rust, giving Node.js applications up to 9x the throughput of Express.js.",
    "metadata": {"category": "framework"}
  }'
curl -X POST http://localhost:4000/documents \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Vector Search with pgvector",
    "content": "pgvector is a PostgreSQL extension that adds vector storage and similarity search. It supports cosine distance, L2 distance, and inner product operations. With HNSW indexing, it handles millions of vectors with query times under 20ms and recall rates above 95 percent.",
    "metadata": {"category": "database"}
  }'
curl -X POST http://localhost:4000/documents \
  -H "Content-Type: application/json" \
  -d '{
    "title": "RAG Pipeline Architecture",
    "content": "A RAG pipeline has five steps: embed the question, search for similar vectors, retrieve the documents, assemble the prompt with context, and generate the answer. The vector search typically takes 5 to 50 milliseconds. The LLM generation takes 500ms to 3 seconds. The vector search is rarely the bottleneck.",
    "metadata": {"category": "architecture"}
  }'

Search by meaning

curl -X POST http://localhost:4000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "how fast is vector search?"}'

The search returns documents ranked by semantic similarity. A query about "vector search speed" matches the pgvector document and the RAG architecture document even though the exact words don't appear.

Ask a question

curl -X POST http://localhost:4000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is Encore and how does it handle infrastructure?"}'

The response includes a generated answer grounded in your ingested documents, plus the source documents with similarity scores.

Step 9: Add Metadata Filtering

One advantage of pgvector over dedicated vector databases is that you can use SQL for filtering. Add a filtered search endpoint to rag/search.ts:

interface FilteredSearchRequest {
  query: string;
  category?: string;
  limit?: number;
}

export const filteredSearch = api(
  { expose: true, method: "POST", path: "/search/filtered" },
  async (req: FilteredSearchRequest): Promise<{ results: SearchResult[] }> => {
    const embedding = await generateEmbedding(req.query);
    const vectorStr = `[${embedding.join(",")}]`;
    const limit = req.limit ?? 5;

    const rows = await db.query<SearchResult>`
      SELECT id, title, content,
             1 - (embedding <=> ${vectorStr}::vector) AS similarity
      FROM documents
      WHERE (${req.category}::text IS NULL OR metadata->>'category' = ${req.category})
      ORDER BY embedding <=> ${vectorStr}::vector
      LIMIT ${limit}
    `;

    const results: SearchResult[] = [];
    for await (const row of rows) {
      results.push(row);
    }

    return { results };
  }
);

Because embeddings and application data live in the same table, you can filter by any column (metadata fields, date ranges, joins against other tables) before running the similarity search. With a dedicated vector database, this filtering happens on limited metadata attached to each vector.

Step 10: Deploy to Production

Deploy the application to your own AWS or GCP account through Encore Cloud:

git add -A && git commit -m "RAG pipeline with pgvector"
git push encore

Encore provisions a managed PostgreSQL instance with pgvector enabled, runs the migration, and deploys the service. The same code that runs locally runs in production. Set the OpenAI secret for the production environment:

encore secret set --type prod OpenAIKey

For self-hosting, generate a Docker image:

encore build docker rag-pipeline:latest

Performance Considerations

Embedding latency. The OpenAI embedding API call takes 100-300ms per request. For batch ingestion, use the batch endpoint to embed multiple documents in a single API call. For search, the embedding call is the main latency contributor. The vector search itself takes 5-20ms.
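The batching idea can be sketched in plain TypeScript. The batchOf helper below is illustrative, not part of the OpenAI SDK, and the batch size of 100 is an arbitrary starting point; each resulting batch becomes the input array of a single embeddings.create call:

```typescript
// Group items into batches so each batch can be embedded with one
// API call (the embeddings endpoint accepts an array of inputs).
function batchOf<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// e.g. 250 documents at 100 per call means 3 API calls instead of 250
const docs = Array.from({ length: 250 }, (_, i) => `doc ${i}`);
console.log(batchOf(docs, 100).map((b) => b.length)); // batch sizes: 100, 100, 50
```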

Index tuning. The HNSW index has two parameters that affect the speed/accuracy tradeoff:

  • m (default 16): connections per layer. Higher values improve recall but use more memory and slow down inserts.
  • ef_construction (default 64): candidates during index build. Higher values improve recall at the cost of slower index creation.

For most workloads, the defaults are fine. If you need higher recall on larger datasets:

CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 24, ef_construction = 128);

Query tuning. Set hnsw.ef_search to control how many candidates the index considers at query time:

SET hnsw.ef_search = 100; -- default is 40

Higher values improve recall at the cost of query latency. For datasets under a million vectors, the default is usually sufficient.

When to add a dedicated vector database. pgvector handles millions of vectors with HNSW indexing on a properly sized Postgres instance. If your workload grows to hundreds of millions of vectors, requires auto-scaling search throughput, or needs per-tenant index isolation at scale, consider a dedicated vector database. For a comparison of options, see Best Vector Databases in 2026.

Next Steps

  • Add chunking. Long documents should be split into smaller chunks before embedding. A 10,000-word document produces a single embedding that averages out the meaning. Breaking it into 500-word chunks produces more specific embeddings that match targeted queries better.
  • Add hybrid search. Combine vector similarity with full-text search for better results. PostgreSQL's built-in tsvector works alongside pgvector in the same query.
  • Read the concepts. For a deeper understanding of how vector search works under the hood (distance metrics, index structures, and when you do and don't need a dedicated vector database), see You probably don't need a vector database.
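The chunking step above can be sketched in a few lines of TypeScript. The chunkWords helper and its size and overlap values are illustrative starting points, not tuned recommendations; each chunk would then be ingested through POST /documents as its own row:

```typescript
// Split a long document into ~500-word chunks with a small overlap so
// each chunk gets its own, more specific embedding.
function chunkWords(text: string, chunkSize = 500, overlap = 50): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += chunkSize - overlap) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // last chunk reached
  }
  return chunks;
}

const doc = Array.from({ length: 1200 }, (_, i) => `w${i}`).join(" ");
console.log(chunkWords(doc).map((c) => c.split(" ").length)); // chunk sizes: 500, 500, 300
```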

Have questions about building with pgvector? Join our Discord community where developers discuss RAG architecture and vector search daily.

Ready to build your next backend?

Encore is the Open Source framework for building robust type-safe distributed systems with declarative infrastructure.