Retrieval-Augmented Generation (RAG) is the pattern behind most AI features that answer questions about your own data. Instead of fine-tuning a model, you retrieve relevant documents at query time and include them in the prompt. The LLM generates an answer grounded in your actual content rather than relying solely on its training data.
The typical RAG architecture involves three external services: a database for documents, a vector database for embeddings, and an LLM API. But if you already run PostgreSQL, you can cut that to two. pgvector adds vector storage and similarity search directly to Postgres, so your documents and their embeddings live in the same table, in the same transaction. No sync pipeline, no orphaned vectors, no consistency issues.
This tutorial builds a complete RAG pipeline: ingest documents, generate embeddings, store them in pgvector, search by semantic similarity, and generate answers with an LLM. The backend runs on Encore, an open-source TypeScript framework where databases are declared in code and provisioned automatically.
The end result is a backend service with four endpoints: document ingestion, document listing, semantic search, and question answering.
The architecture is straightforward: one Postgres database with pgvector, one embedding API (OpenAI), and one LLM API (OpenAI). No separate vector database, no sync pipeline.
Install the Encore CLI if you don't have it:
curl -L https://encore.dev/install.sh | bash
Then create a new app:
encore app create rag-pipeline --example=ts/empty
cd rag-pipeline
Install the OpenAI SDK:
npm install openai
Set your OpenAI API key as a secret:
encore secret set --type dev,local,pr,prod OpenAIKey
Paste your API key when prompted.
Create the service directory and migration:
mkdir -p rag/migrations
Create the migration file at rag/migrations/1_create_documents.up.sql:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
title TEXT NOT NULL,
content TEXT NOT NULL,
embedding vector(1536),
metadata JSONB DEFAULT '{}',
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
-- HNSW index for fast cosine similarity search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
-- Index for metadata filtering
CREATE INDEX ON documents USING gin (metadata);
This creates a table where each document has a title, content, a 1536-dimension vector column (matching OpenAI's text-embedding-3-small output), and a JSONB metadata column for filtering. The HNSW index enables fast approximate nearest-neighbor search.
Create rag/encore.service.ts:
import { Service } from "encore.dev/service";
export default new Service("rag");
Create rag/db.ts:
import { SQLDatabase } from "encore.dev/storage/sqldb";
export const db = new SQLDatabase("rag", {
migrations: "./migrations",
});
Create rag/secrets.ts:
import { secret } from "encore.dev/config";
export const OpenAIKey = secret("OpenAIKey");
When you run encore run, Encore provisions a real PostgreSQL instance locally (using Docker under the hood), runs the migration (including enabling pgvector), and makes the database available to your service. No connection strings to manage, no environment variables to configure.
Create rag/embeddings.ts:
import OpenAI from "openai";
import { OpenAIKey } from "./secrets";
let client: OpenAI | undefined;
function getClient(): OpenAI {
if (!client) {
client = new OpenAI({ apiKey: OpenAIKey() });
}
return client;
}
export async function generateEmbedding(text: string): Promise<number[]> {
const response = await getClient().embeddings.create({
model: "text-embedding-3-small",
input: text,
});
return response.data[0].embedding;
}
export async function chatCompletion(
systemPrompt: string,
userMessage: string
): Promise<string> {
const response = await getClient().chat.completions.create({
model: "gpt-4o-mini",
messages: [
{ role: "system", content: systemPrompt },
{ role: "user", content: userMessage },
],
temperature: 0.3,
max_tokens: 1000,
});
return response.choices[0].message.content ?? "";
}
generateEmbedding sends text to OpenAI's embedding model and returns the 1536-dimension vector. chatCompletion sends a system prompt and user message to the chat model and returns the response. Low temperature (0.3) keeps answers grounded in the provided context rather than getting creative.
Create rag/ingest.ts:
import { api } from "encore.dev/api";
import { db } from "./db";
import { generateEmbedding } from "./embeddings";
interface IngestRequest {
title: string;
content: string;
metadata?: Record<string, string>;
}
interface Document {
id: number;
title: string;
createdAt: string;
}
export const ingest = api(
{ expose: true, method: "POST", path: "/documents" },
async (req: IngestRequest): Promise<Document> => {
// Generate embedding from the document content
const embedding = await generateEmbedding(req.content);
const vectorStr = `[${embedding.join(",")}]`;
const metadata = JSON.stringify(req.metadata ?? {});
const doc = await db.queryRow<Document>`
INSERT INTO documents (title, content, embedding, metadata)
VALUES (${req.title}, ${req.content}, ${vectorStr}::vector, ${metadata}::jsonb)
RETURNING id, title, created_at AS "createdAt"
`;
return doc!;
}
);
export const listDocuments = api(
{ expose: true, method: "GET", path: "/documents" },
async (): Promise<{ documents: Document[] }> => {
const rows = await db.query<Document>`
SELECT id, title, created_at AS "createdAt"
FROM documents
ORDER BY created_at DESC
`;
const documents: Document[] = [];
for await (const row of rows) {
documents.push(row);
}
return { documents };
}
);
The ingest endpoint takes a title, content, and optional metadata. It generates an embedding from the content and stores everything in a single INSERT statement. The document and its vector are written atomically. No sync issues, no orphaned vectors.
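The string built by joining the embedding values in square brackets is pgvector's text literal format; the ::vector cast parses it server-side. Since the same conversion appears in every endpoint that follows, it could be factored into a small helper (the name toVectorLiteral is illustrative, not part of the tutorial code):

```typescript
// Format a number[] as a pgvector text literal, e.g. "[0.1,0.2,0.3]".
// The resulting string is cast with ::vector inside the SQL query.
export function toVectorLiteral(embedding: number[]): string {
  return `[${embedding.join(",")}]`;
}
```

Each endpoint would then call toVectorLiteral(embedding) before interpolating the value into its query, instead of repeating the join inline.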
Create rag/search.ts:
import { api } from "encore.dev/api";
import { db } from "./db";
import { generateEmbedding } from "./embeddings";
interface SearchRequest {
query: string;
limit?: number;
}
interface SearchResult {
id: number;
title: string;
content: string;
similarity: number;
}
export const search = api(
{ expose: true, method: "POST", path: "/search" },
async (req: SearchRequest): Promise<{ results: SearchResult[] }> => {
const embedding = await generateEmbedding(req.query);
const vectorStr = `[${embedding.join(",")}]`;
const limit = req.limit ?? 5;
const rows = await db.query<SearchResult>`
SELECT id, title, content,
1 - (embedding <=> ${vectorStr}::vector) AS similarity
FROM documents
ORDER BY embedding <=> ${vectorStr}::vector
LIMIT ${limit}
`;
const results: SearchResult[] = [];
for await (const row of rows) {
results.push(row);
}
return { results };
}
);
The search endpoint generates an embedding from the query text, then uses pgvector's <=> operator to find the most similar documents by cosine distance. The 1 - distance conversion recovers the cosine similarity, where 1 means the vectors point in the same direction; in practice, similarity scores between text embeddings land between 0 and 1.
The query runs against the HNSW index, which returns approximate nearest neighbors in single-digit milliseconds for datasets up to millions of vectors. For exact results on smaller datasets, you can remove the index and pgvector will do a brute-force scan.
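To make the score concrete, the math behind the <=> operator can be sketched in a few lines of TypeScript (an illustration only; the real computation happens inside Postgres, over 1536-dimension vectors):

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), ranging from -1 to 1.
// pgvector's <=> operator returns the cosine *distance*, 1 - similarity,
// which is why the SQL computes `1 - (embedding <=> query)` to recover it.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const cosineDistance = (a: number[], b: number[]) =>
  1 - cosineSimilarity(a, b);
```

Vectors pointing the same way score a similarity of 1 (distance 0); orthogonal vectors score 0 (distance 1).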
Create rag/ask.ts:
import { api } from "encore.dev/api";
import { db } from "./db";
import { generateEmbedding, chatCompletion } from "./embeddings";
interface AskRequest {
question: string;
limit?: number;
}
interface AskResponse {
answer: string;
sources: { id: number; title: string; similarity: number }[];
}
export const ask = api(
{ expose: true, method: "POST", path: "/ask" },
async (req: AskRequest): Promise<AskResponse> => {
// Step 1: Embed the question
const embedding = await generateEmbedding(req.question);
const vectorStr = `[${embedding.join(",")}]`;
const limit = req.limit ?? 5;
// Step 2: Find similar documents
const rows = await db.query<{
id: number;
title: string;
content: string;
similarity: number;
}>`
SELECT id, title, content,
1 - (embedding <=> ${vectorStr}::vector) AS similarity
FROM documents
ORDER BY embedding <=> ${vectorStr}::vector
LIMIT ${limit}
`;
const contexts: { id: number; title: string; content: string; similarity: number }[] = [];
for await (const row of rows) {
contexts.push(row);
}
// Step 3: Assemble the prompt with retrieved context
const contextText = contexts
.map((doc, i) => `[Document ${i + 1}: ${doc.title}]\n${doc.content}`)
.join("\n\n");
const systemPrompt = [
"You are a helpful assistant.",
"Answer the user's question based on the provided documents.",
"If the documents don't contain enough information",
"to answer the question, say so.",
"Cite the document titles when referencing information.",
"",
"Documents:",
contextText,
].join("\n");
// Step 4: Generate the answer
const answer = await chatCompletion(systemPrompt, req.question);
return {
answer,
sources: contexts.map(({ id, title, similarity }) => ({
id,
title,
similarity,
})),
};
}
);
This is the complete RAG pipeline in one endpoint: embed the question, retrieve the most similar documents, assemble a prompt from the retrieved context, and generate the answer.
The response includes both the generated answer and the source documents with their similarity scores, so the caller can show attribution.
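Because the prompt-assembly step is pure string manipulation, it can be exercised without the database or the LLM. Here is a sketch using mock retrieval results (buildContext mirrors the formatting in the endpoint above; the helper name is illustrative):

```typescript
interface RetrievedDoc {
  title: string;
  content: string;
}

// Number each document and label it with its title, so the model
// can cite sources by name in its answer.
function buildContext(docs: RetrievedDoc[]): string {
  return docs
    .map((doc, i) => `[Document ${i + 1}: ${doc.title}]\n${doc.content}`)
    .join("\n\n");
}

const context = buildContext([
  { title: "Encore Framework Overview", content: "Encore is a backend framework." },
  { title: "Vector Search with pgvector", content: "pgvector adds similarity search." },
]);
```

The resulting string is what gets appended under the "Documents:" section of the system prompt.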
Start the development server:
encore run
Encore provisions PostgreSQL locally, runs the migration (which enables pgvector and creates the table), and starts the service. Open the local development dashboard at http://localhost:9400 to test the endpoints interactively.
curl -X POST http://localhost:4000/documents \
-H "Content-Type: application/json" \
-d '{
"title": "Encore Framework Overview",
"content": "Encore is an open-source TypeScript and Go backend framework. It uses infrastructure from code, where databases, caches, pub/sub topics, and cron jobs are declared as objects in application code and provisioned automatically. Request processing is handled in Rust, giving Node.js applications up to 9x the throughput of Express.js.",
"metadata": {"category": "framework"}
}'
curl -X POST http://localhost:4000/documents \
-H "Content-Type: application/json" \
-d '{
"title": "Vector Search with pgvector",
"content": "pgvector is a PostgreSQL extension that adds vector storage and similarity search. It supports cosine distance, L2 distance, and inner product operations. With HNSW indexing, it handles millions of vectors with query times under 20ms and recall rates above 95 percent.",
"metadata": {"category": "database"}
}'
curl -X POST http://localhost:4000/documents \
-H "Content-Type: application/json" \
-d '{
"title": "RAG Pipeline Architecture",
"content": "A RAG pipeline has five steps: embed the question, search for similar vectors, retrieve the documents, assemble the prompt with context, and generate the answer. The vector search typically takes 5 to 50 milliseconds. The LLM generation takes 500ms to 3 seconds. The vector search is rarely the bottleneck.",
"metadata": {"category": "architecture"}
}'
curl -X POST http://localhost:4000/search \
-H "Content-Type: application/json" \
-d '{"query": "how fast is vector search?"}'
The search returns documents ranked by semantic similarity. A query about "vector search speed" matches the pgvector document and the RAG architecture document even though the exact words don't appear.
curl -X POST http://localhost:4000/ask \
-H "Content-Type: application/json" \
-d '{"question": "What is Encore and how does it handle infrastructure?"}'
The response includes a generated answer grounded in your ingested documents, plus the source documents with similarity scores.
One advantage of pgvector over dedicated vector databases is that you can use SQL for filtering. Add a filtered search endpoint to rag/search.ts:
interface FilteredSearchRequest {
query: string;
category?: string;
limit?: number;
}
export const filteredSearch = api(
{ expose: true, method: "POST", path: "/search/filtered" },
async (req: FilteredSearchRequest): Promise<{ results: SearchResult[] }> => {
const embedding = await generateEmbedding(req.query);
const vectorStr = `[${embedding.join(",")}]`;
const limit = req.limit ?? 5;
const rows = await db.query<SearchResult>`
SELECT id, title, content,
1 - (embedding <=> ${vectorStr}::vector) AS similarity
FROM documents
WHERE (${req.category}::text IS NULL OR metadata->>'category' = ${req.category})
ORDER BY embedding <=> ${vectorStr}::vector
LIMIT ${limit}
`;
const results: SearchResult[] = [];
for await (const row of rows) {
results.push(row);
}
return { results };
}
);
Because embeddings and application data live in the same table, you can filter by any column (metadata fields, date ranges, joins against other tables) before running the similarity search. With a dedicated vector database, this filtering happens on limited metadata attached to each vector.
Deploy the application to your own AWS or GCP account through Encore Cloud:
git add -A && git commit -m "RAG pipeline with pgvector"
git push encore
Encore provisions a managed PostgreSQL instance with pgvector enabled, runs the migration, and deploys the service. The same code that runs locally runs in production. Set the OpenAI secret for the production environment:
encore secret set --type prod OpenAIKey
For self-hosting, generate a Docker image:
encore build docker rag-pipeline:latest
Embedding latency. The OpenAI embedding API call takes 100-300ms per request. For batch ingestion, pass multiple texts as an array in a single embeddings request instead of making one call per document. For search, the embedding call is the main latency contributor; the vector search itself takes 5-20ms.
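A batched embedding helper might look like the sketch below. The embedBatch parameter stands in for the real OpenAI call (whose input field accepts an array of strings); the default batch size of 100 is an arbitrary illustration, not an API limit:

```typescript
// Signature of a function that embeds one batch of texts, e.g. a wrapper
// around client.embeddings.create({ model: "...", input: batch }).
type EmbedBatch = (batch: string[]) => Promise<number[][]>;

// Embed many texts with one API call per batch instead of one per text.
// Output order matches input order, batch by batch.
async function generateEmbeddings(
  texts: string[],
  embedBatch: EmbedBatch,
  size = 100
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < texts.length; i += size) {
    vectors.push(...(await embedBatch(texts.slice(i, i + size))));
  }
  return vectors;
}
```

For N documents this reduces the number of round trips from N to N/size, which is usually the difference between minutes and seconds on a large ingest.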
Index tuning. The HNSW index has two parameters that affect the speed/accuracy tradeoff:
- m (default 16): connections per layer. Higher values improve recall but use more memory and slow down inserts.
- ef_construction (default 64): candidates considered during index build. Higher values improve recall at the cost of slower index creation.

For most workloads, the defaults are fine. If you need higher recall on larger datasets:
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 24, ef_construction = 128);
Query tuning. Set hnsw.ef_search to control how many candidates the index considers at query time:
SET hnsw.ef_search = 100; -- default is 40
Higher values improve recall at the cost of query latency. For datasets under a million vectors, the default is usually sufficient.
When to add a dedicated vector database. pgvector handles millions of vectors with HNSW indexing on a properly sized Postgres instance. If your workload grows to hundreds of millions of vectors, requires auto-scaling search throughput, or needs per-tenant index isolation at scale, consider a dedicated vector database. For a comparison of options, see Best Vector Databases in 2026.
Postgres full-text search with tsvector works alongside pgvector in the same query, so keyword and semantic matching can be combined without leaving the database.

Have questions about building with pgvector? Join our Discord community, where developers discuss RAG architecture and vector search daily.