What is RAG? Retrieval-Augmented Generation Explained Simply
RAG gives LLMs access to knowledge they weren't trained on. Here's how retrieval-augmented generation works, what breaks, and when to build one.
TL;DR
- RAG stands for Retrieval-Augmented Generation. It's a pattern, not a model: retrieve relevant documents, inject them as context, then generate an answer.
- LLMs without RAG can't access private data or recent events. RAG solves this without retraining the model.
- Three moving parts: a document store with embeddings, a retriever that does semantic search, and the LLM as the generator.
- RAG reduces hallucinations on knowledge-specific tasks but doesn't eliminate them. The model can still misread retrieved context.
- Use RAG when your knowledge changes. Use fine-tuning when you need the model to behave differently.
You saw “RAG” in a job posting last week. Or a GitHub README. Or someone said “we’re building a RAG pipeline” in a meeting and you nodded along. But when you searched for what it actually means, you got architecture diagrams and research papers that assumed you already understood the basics.
Here’s what RAG is, how it works at the level that matters for building with it, and when you’d use it versus something else.
The Problem RAG Solves
Every language model’s knowledge is frozen at training time. Anything after the training cutoff isn’t in the model’s head. And even before the cutoff, private data isn’t there at all: your company’s internal docs, your knowledge base, your product catalog, your support tickets.
This creates a specific failure mode. You ask the model about your refund policy. It doesn’t know it. But it doesn’t say that. It generates a plausible-sounding answer from patterns it learned elsewhere. The answer sounds confident. It’s wrong.
AI hallucinations happen because models predict plausible text, not true facts. RAG doesn’t fix the prediction mechanism. It works around the knowledge gap: give the model the relevant information before asking it to generate, and it can reason over facts it never saw during training.
That’s the one-sentence answer to “what is RAG”: retrieve relevant documents at query time, inject them as context, then generate an answer grounded in that context.
What RAG Actually Is
RAG stands for Retrieval-Augmented Generation. The three words describe the architecture:
- Retrieval: find documents relevant to the user’s query
- Augmented: add those documents to the prompt as context
- Generation: let the model generate an answer using that context
It’s not a model. It’s not a fine-tuning technique. It’s a pattern you can bolt onto any LLM: retrieve first, then generate.
The key thing to understand is that models are already good at reading and reasoning over text you give them in the context window. You can paste a policy document into a prompt, ask “what does this say about returns?”, and the model will extract the right answer. RAG automates this at scale. Instead of you pasting documents manually, a retrieval system finds and injects the right ones for each query, automatically.
How RAG Works: The Three Parts
Every RAG system has three components. Understanding each one tells you where things can go wrong.
The Document Store
First, you need your knowledge base: your PDFs, your documentation, your support articles. These get split into chunks (typically 200-500 tokens each) and converted into embeddings.
Embeddings are numerical representations of meaning. A sentence like “the refund window is 30 days” becomes a list of about 1,500 numbers that captures its semantic content. You store these vectors in a vector database, such as Pinecone, pgvector (if you’re already on Postgres), or Qdrant. The vector database’s job is to find documents semantically similar to a query, fast.
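Here’s a minimal sketch of that step. It’s not production code: the chunker splits on word counts as a rough proxy for tokens, and sentence-transformers stands in for whatever embedding model you actually pick (text-embedding-004, for example).

```python
# Minimal document-store sketch: split documents into chunks, embed each chunk.
# The resulting matrix of vectors is what a vector database would store.
from sentence_transformers import SentenceTransformer

def chunk(text: str, max_words: int = 300) -> list[str]:
    """Naive chunker: fixed-size word windows as a stand-in for token counts."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

documents = ["...your return policy...", "...your shipping FAQ..."]
chunks = [c for doc in documents for c in chunk(doc)]

# One vector per chunk; normalized so cosine similarity is a plain dot product.
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (number_of_chunks, embedding_dimension)
```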
The Retriever
When a user asks a question, the retriever converts that query into an embedding using the same model that embedded your documents. Then it runs a similarity search: find the top-K chunks whose embeddings are closest to the query.
This is semantic search, not keyword search. “Can I get a refund on worn shoes?” will retrieve your return policy even if that policy document never uses the word “refund.” The retriever matches meaning. The original RAG paper from Meta AI Research introduced this retrieval-then-generation approach in 2020, and the core mechanic hasn’t changed much since.
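Building on the chunks and embeddings from the sketch above, the retriever itself is only a few lines. Because the embeddings were normalized, cosine similarity reduces to a dot product:

```python
import numpy as np

def retrieve(query: str, k: int = 3) -> list[str]:
    """Embed the query with the same model, return the top-K most similar chunks."""
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = embeddings @ query_vec          # cosine similarity against every chunk
    top_k = np.argsort(scores)[::-1][:k]     # indices of the k highest scores
    return [chunks[i] for i in top_k]

print(retrieve("Can I get a refund on worn shoes?"))
```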
The Generator
The retrieved chunks get injected into the prompt as context. The model then answers the original question using those chunks. A typical prompt looks like:
You are a helpful assistant. Answer only based on the context provided below.
Context:
[Retrieved chunk 1: return policy section]
[Retrieved chunk 2: exceptions and edge cases]
Question: Can I return shoes I've worn once?
Notice “answer only based on the context provided.” That instruction is doing real work. It tells the model not to fill gaps with its own training knowledge. Without it, smaller models will blend retrieved facts with hallucinated ones.
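For completeness, here’s a sketch of how the retrieved chunks get stitched into that prompt. It reuses the `retrieve` function from the retriever sketch above, and `llm_complete` is a placeholder for whatever client your model provider gives you:

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Inject retrieved chunks as context, with the grounding instruction up front."""
    context = "\n\n".join(
        f"[Chunk {i + 1}]\n{c}" for i, c in enumerate(retrieved_chunks)
    )
    return (
        "You are a helpful assistant. Answer only based on the context provided below.\n"
        "If the answer isn't in the context, say \"I don't have that information.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

question = "Can I return shoes I've worn once?"
prompt = build_prompt(question, retrieve(question))
# answer = llm_complete(prompt)  # placeholder: call your LLM provider here
```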
What Actually Goes Wrong
RAG reduces hallucinations on knowledge-specific tasks. But it introduces three new failure modes, and I’ve seen each of these trip up production systems.
Retrieval misses the right document. Your document is in the store, but the similarity search doesn’t rank it high enough. This happened to me the first time I built a RAG system: the answer was there, but the retriever returned the wrong chunk because the document had been chunked too broadly. Chunk size matters more than most people expect. Five hundred tokens is a reasonable starting point, but test with 20-30 representative queries before finalizing.
The context window fills up. If you retrieve too many chunks, or chunks are too long, you exceed the model’s context limit. The model starts truncating or ignoring later context. Standard fix: retrieve 5-7 chunks maximum and keep each chunk under 500 tokens. Gemini 2.5 Pro supports 1M tokens, which gives you more headroom, but smaller models cut off much earlier.
The model ignores the context. You give the model a document, and it answers from its training data anyway. This happens with smaller, less instruction-following models. The fix is a strong, explicit instruction: “Answer only from the provided context. If the answer isn’t in the context, say ‘I don’t have that information.’” This should go in your system prompt, not just the user turn. More on how system instructions change model behavior in System Instructions: The God Mode of LLMs.
RAG vs Fine-Tuning
People often ask which to use. They solve different problems, so the question answers itself once you know what each one does.
RAG adds knowledge at query time. The model’s weights don’t change. You update your document store without touching the model. Retrieval misses are possible, but the knowledge stays current and auditable.
Fine-tuning bakes knowledge (or behavior) into the model’s weights. Updates require retraining. But the model internalizes patterns in a way RAG can’t: tone, specialized terminology, task-specific output formats.
| | RAG | Fine-tuning |
|---|---|---|
| Best for | Changing knowledge, private data, citations | New behavior, specialized tone, domain vocabulary |
| Updates | Add or edit docs in the store | Retrain the model |
| Latency overhead | Retrieval adds roughly 50-200ms | None at inference time |
| Cost to start | Embedding + vector DB setup | Training compute |
Use RAG when your knowledge changes frequently or when you need citations. Use fine-tuning when you need the model to behave differently, not just know different things. And most production systems eventually use both: a fine-tuned base model with a RAG layer on top for live knowledge.
Try It Yourself
TinkerLLM’s RAG learning unit walks you through building a simple retrieval pipeline and observing the retrieval-generation loop with a real model. You send queries, see which chunks get retrieved, and compare answers with and without context injection.
Module 1 (50 exercises) is free, no card needed. The RAG unit is in Module 2.
Open Lesson 16: RAG, Giving LLMs a Knowledge Base They Were Never Trained On →
FAQ
Do I need a vector database to build RAG?
Not at first. For small document sets, in-memory similarity search works fine and requires no infrastructure. FAISS from Meta runs entirely in memory with a few lines of Python. You move to a real vector database when your document set grows large enough that in-memory search slows down, or when you need multi-user access with persistence. Get retrieval working first, then add infrastructure when you actually need it.
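If you’re curious what those few lines look like: assuming the normalized `embeddings` matrix and `model` from the sketches earlier in this post, an exact in-memory FAISS index is roughly this (FAISS expects float32 arrays, and inner product equals cosine similarity on normalized vectors):

```python
import faiss
import numpy as np

index = faiss.IndexFlatIP(embeddings.shape[1])      # exact inner-product search
index.add(np.asarray(embeddings, dtype="float32"))

query = model.encode(["Can I get a refund on worn shoes?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), 3)  # top-3 chunk indices
```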
How is RAG different from just pasting documents into the prompt?
For short documents, pasting works. The difference is scale and precision. If you have 1,000 pages of documentation, you can’t paste all of it into every prompt. RAG retrieves only the relevant chunks, so you’re sending 500-1,500 tokens of context instead of 500,000. You also get semantic matching: the retriever finds conceptually relevant sections even when a user’s question is phrased differently from how the answer is written in your docs. That gap is where keyword search fails and embeddings win.
What embedding model should I use?
For English-only applications, my starting point is Google’s text-embedding-004 via the Gemini API embeddings endpoint. It’s free at reasonable volumes and performs well on semantic similarity tasks. For multilingual applications, BGE-M3 or Cohere’s embed-multilingual-v3 are worth testing. Don’t spend too long on this choice early. Chunk quality and chunk size usually matter more than the embedding model, and you can swap the embedding layer later without rebuilding the whole pipeline.
Can RAG eliminate hallucinations?
No. RAG reduces hallucinations on topics covered by your retrieved documents. But the model can still hallucinate when it misreads the retrieved context, when the retrieved chunks don’t actually contain the answer, or when it blends retrieved facts with its own training patterns. It’s also possible to hallucinate about the documents themselves: the model may claim a document says something it doesn’t. RAG shifts the problem, it doesn’t eliminate it. For high-stakes facts, verification against the source document is still necessary.
Is RAG only useful for chatbots?
No. RAG is a retrieval pattern, and it applies anywhere you want the model to work with specific, fresh, or private information. You can use it for document summarization (retrieve relevant sections, summarize them), for classification with examples (retrieve labeled examples as few-shot context), for code generation (retrieve relevant API docs before generating), or for search systems that need explainable results. The chatbot use case is the most visible. It’s far from the only one.
Stop reading about RAG. Try it. The first 50 exercises on TinkerLLM are free, no card needed.
Delivery lead at Kalvium Labs with a background in instructional design. Writes concept explainers and process posts. Thinks about how people actually learn before jumping to solutions.
Want to try this yourself?
Open the TinkerLLM playground and experiment with real models. 50 exercises free.
Start Tinkering