RAG Explained: How Retrieval-Augmented Generation Actually Works


Blizine Admin

Monday, May 25, 2026·2 min read·1 views

Suraj Sharma, posted on May 25

#ai #rag #llm #machinelearning

The Two Phases of RAG

RAG (Retrieval-Augmented Generation) splits into two separate pipelines:

- Ingestion pipeline: runs once (or on a schedule) to process your documents
- Query pipeline: runs live for every user request

Why Not Just Send All Your Text to the LLM?

Three hard problems:

- Cost: millions of tokens per query = $$$
- Context limits: even 128K token windows can't hold an entire knowledge base
- Quality: LLMs get confused when buried in irrelevant text

RAG surgically extracts only the relevant 3–5 chunks needed for each question.

Why Store Vectors Instead of Just Doing Text Search?

Keywords only find exact word matches. Vectors capture meaning. These three phrases are completely different strings, but nearly identical vectors:

- "Refunds take 5 days"
- "money-back in a week"
- "reimbursement timeline: 5 business days"

They cluster close together in embedding space, which is exactly what we want.

The Ingestion Pipeline (Step by Step)

Why chunk? An LLM has a fixed context window (e.g. 128K tokens). Your knowledge base could be millions of tokens. You can't send it all. Chunking lets you retrieve only the 3–5 most relevant pieces and send those, keeping the prompt small and focused. Overlap prevents losing context at chunk boundaries.

Step 1: Chunking
Split documents into ~500-token pieces with overlap so no idea gets cut off at a boundary.

Step 2: Embedding
The embedding model (e.g. text-embedding-3-small) converts each chunk into a vector of ~1536 numbers.

Step 3: Storage
Both the vector and the original text are stored in the vector DB together. You need the text back when it's retrieved later.

The Query Pipeline (Step by Step)

Step 1: Embed the question
When a user asks a question, it goes through the exact same embedding model (critical: different models produce incompatible vector spaces).

Step 2: Similarity search
The resu
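
The ingestion and query steps described above can be sketched end to end in a few lines of Python. This is a minimal illustration under stated assumptions, not the article's implementation: `chunk_tokens`, `toy_embed`, `cosine`, and `InMemoryStore` are hypothetical names, the 50-token overlap default is an assumed value (the text only says "with overlap"), and cosine similarity is an assumed choice of similarity measure. Most importantly, `toy_embed` is a word-count stand-in for a real embedding model such as text-embedding-3-small: it only matches shared words and does not capture meaning the way real embeddings do.

```python
import math
import re
from collections import Counter


def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def toy_embed(text):
    """ASSUMPTION: stand-in for a real embedding model.

    Returns a unit-length sparse word-count vector. It matches shared words
    only, so paraphrases with no words in common will NOT land close together.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


class InMemoryStore:
    """Hypothetical vector store: keeps each vector next to its original text."""

    def __init__(self):
        self.rows = []  # list of (vector, original_text) pairs

    def add(self, text):
        self.rows.append((toy_embed(text), text))

    def search(self, question, k=3):
        # The question goes through the same embedding function used at ingestion.
        query_vec = toy_embed(question)
        ranked = sorted(self.rows, key=lambda row: cosine(query_vec, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


# Chunking: windows of 4 tokens, each sharing 1 token with the previous one.
print(chunk_tokens(list(range(10)), size=4, overlap=1))
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]

# Ingestion: embed each chunk and store vector and text together.
store = InMemoryStore()
for chunk in ["Refunds take 5 days",
              "Shipping is free over 50 dollars",
              "Our office is closed on Sundays"]:
    store.add(chunk)

# Query: embed the question, then return the most similar stored text.
print(store.search("How long do refunds take?", k=1))
# → ['Refunds take 5 days']
```

Storing the original text alongside each vector is what lets `search` return readable chunks rather than raw numbers. In a real system the stand-in embedding and the list scan would be replaced by an embedding model and a vector database.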

