Language models have a problem. They are trained on enormous corpora of text, acquiring broad knowledge and impressive generative ability. But that knowledge is frozen at the training cutoff. Ask a model about events after that date and it will either say it does not know or, more dangerously, hallucinate plausible-sounding answers. Ask it about your company's internal documents and it has never seen them. Ask it about your product's current pricing and it will either apologize or invent something.

The obvious engineering solution: give the model access to current, specific, or proprietary information at the time it answers. Not by retraining it — that is expensive and slow. By providing relevant documents as input, alongside the query, so the model can ground its response in verified content rather than generating from statistical inference alone.

This is Retrieval Augmented Generation — RAG. It was formalized in a 2020 paper from Facebook AI Research, but the underlying idea is straightforward: combine a retrieval system with a language model, so the model generates answers grounded in retrieved evidence rather than solely in learned parameters.

RAG has become the dominant pattern for deploying language models in production applications that require factual accuracy, current information, or access to specific knowledge bases — enterprise knowledge management, customer support, research tools, and document Q&A. According to a 2023 survey by Gao et al. covering 162 RAG research papers, RAG had grown from a niche retrieval technique into a foundational architecture for knowledge-intensive AI applications across virtually every industry vertical.

"Retrieval augmented generation addresses the fundamental challenge of parametric memory in language models: it's not efficient to encode every fact the world needs into model parameters. Retrieval provides access to a virtually unlimited, updatable knowledge store." — Patrick Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020)


Key Definitions

Retrieval Augmented Generation (RAG) — An AI system architecture that combines a language model with a retrieval system. A query triggers retrieval of relevant documents from a knowledge base; those documents are provided to the language model alongside the query; the model generates a response grounded in the retrieved content. RAG reduces hallucinations and enables models to access current or proprietary information without retraining.

Retriever — The component of a RAG system responsible for finding relevant documents in the knowledge base for a given query. Retrievers may use sparse methods (keyword-based search), dense methods (embedding-based semantic search), or hybrid approaches combining both.

Generator — The language model component of a RAG system that produces the final response, conditioned on both the query and the retrieved documents. The generator's quality determines how well it uses the provided context to generate accurate, grounded responses.

Vector embedding — A numerical vector representation of text that captures its semantic meaning. Similar texts have embeddings that are geometrically close (high cosine similarity); dissimilar texts have embeddings that are far apart. Embeddings are produced by embedding models trained specifically for this purpose. They are the foundation of semantic search in modern RAG systems.
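The geometric intuition behind embedding similarity can be sketched with plain cosine similarity. The three-dimensional vectors below are toy values chosen for illustration; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" — hypothetical values for illustration only
car_maintenance = [0.7, 0.7, 0.2]
vehicle_upkeep = [0.8, 0.6, 0.1]
chocolate_cake = [0.1, 0.2, 0.9]

print(cosine_similarity(car_maintenance, vehicle_upkeep))  # high: related meanings
print(cosine_similarity(car_maintenance, chocolate_cake))  # low: unrelated meanings
```

Semantic search reduces to computing this similarity between a query embedding and every document embedding, then keeping the closest matches.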

Vector database — A database optimized for storing and querying high-dimensional vector embeddings. Enables similarity search: finding documents whose embeddings are closest to a query embedding. Used as the retrieval backend in most production RAG systems. Examples: Pinecone, Weaviate, Chroma, Milvus, pgvector.

Chunking — The process of dividing documents into smaller segments for indexing and retrieval. Because language models have limited context windows, documents too large to fit must be split. Chunking strategy — how large each chunk is, whether chunks overlap, whether chunks respect document structure — significantly affects retrieval quality.

Semantic search — Finding documents based on meaning rather than exact keyword match. A semantic search for "car maintenance" might retrieve documents about "vehicle upkeep" or "automobile service" even if those exact words do not appear in the query. Semantic search in RAG systems is enabled by embedding-based similarity search.

Sparse retrieval — Retrieval based on term frequency or keyword overlap, typically using algorithms like BM25 (Best Match 25). Sparse retrieval is fast and effective for queries where keywords matter — technical terms, proper nouns, specific phrases. It does not capture semantic similarity.
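BM25 itself is compact enough to sketch directly. The version below uses the common non-negative IDF variant; the corpus and query are toy data, and production systems would use a tuned search engine rather than this linear scan.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each document (a list of tokens) against the query terms with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency: how many documents contain each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat".split(),
    "bm25 ranks documents by keyword overlap".split(),
    "dense retrieval uses embeddings".split(),
]
print(bm25_scores(["bm25", "keyword"], docs))  # second document scores highest
```

Note what the scores show: documents without the query terms score exactly zero, which is why sparse retrieval cannot surface "vehicle upkeep" for a "car maintenance" query.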

Dense retrieval — Retrieval based on embedding similarity. A query and all documents are converted to embeddings; retrieval finds the most semantically similar documents. Dense retrieval captures meaning better than sparse retrieval for natural language queries, but may miss exact keyword matches.

Hybrid retrieval — Combining sparse and dense retrieval, often by interleaving or reranking results from both methods. Hybrid retrieval generally outperforms either approach alone, capturing both keyword specificity and semantic similarity.
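One common way to combine sparse and dense result lists is reciprocal rank fusion (RRF), which needs only each method's ranking, not its raw scores. This is a minimal sketch with hypothetical document IDs; k=60 is the constant suggested in the original RRF work (Cormack et al., 2009).

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of document IDs into one ranking.

    Each list is ordered best-first (e.g. one from BM25, one from
    dense retrieval). A document's fused score is the sum of
    1 / (k + rank) over every list it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["doc3", "doc1", "doc7"]   # BM25 order
dense = ["doc1", "doc5", "doc3"]    # embedding-similarity order
print(reciprocal_rank_fusion([sparse, dense]))  # doc1 and doc3 rise to the top
```

Documents that appear near the top of both lists outrank documents favored by only one method, which is the behavior hybrid retrieval is after.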

Reranking — A post-retrieval step that uses a separate, typically more powerful model to reorder retrieved documents by relevance to the query. Rerankers are usually cross-encoders (considering query and document jointly) and are more accurate than the initial retriever, but too slow to apply to the full knowledge base.

Context window — The maximum amount of text a language model can process in a single forward pass. Context windows limit how many retrieved documents can be provided to the generator. Larger context windows reduce the need for aggressive chunking and allow more documents to be included.

Grounding — The property of an AI-generated response that can be traced back to and verified against specific source documents. Grounded responses reduce hallucination risk and enable citation. RAG is the primary technique for achieving grounding in deployed systems.


The RAG Architecture: How It Works

A standard RAG system has two phases: indexing (offline) and querying (online).

Indexing Phase (Offline)

Before the system can answer queries, the knowledge base must be indexed:

  1. Document ingestion: Source documents (PDFs, web pages, databases, internal documents) are collected and preprocessed — extracting text from various formats, cleaning formatting artifacts.

  2. Chunking: Documents are split into segments sized to fit in the context window alongside the query and other chunks. Typical chunk sizes range from 256 to 1,024 tokens. Strategies include fixed-size chunking, sentence-aware chunking (splitting at sentence boundaries), paragraph-level chunking, and recursive chunking that tries to preserve document structure.

  3. Embedding: Each chunk is converted to a vector embedding using an embedding model. This is the most computationally expensive part of indexing and is done once per document, then reused for all queries.

  4. Storage: Chunk text and embeddings are stored in a vector database (or combined storage system) that supports efficient similarity search.
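The four indexing steps can be sketched end to end. Everything here is a deliberate stand-in: `embed` is a toy hash-based function in place of a real embedding model, chunking is by characters rather than tokens, and a plain list substitutes for a vector database.

```python
import hashlib

def embed(text):
    """Toy stand-in for an embedding model — NOT semantically meaningful."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:8]]

def chunk(text, size=40):
    """Naive fixed-size chunking by characters (real systems count tokens)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

index = []  # stand-in for a vector database: one record per chunk
document = "RAG systems index documents offline so that queries can be answered online."
for piece in chunk(document):  # steps 1-2: ingestion and chunking
    index.append({
        "embedding": embed(piece),  # step 3: embed each chunk
        "text": piece,
        "source": "doc-001",
    })  # step 4: store text + embedding together

print(len(index), "chunks indexed")
```

The key property to notice is that embedding happens once per chunk at index time; the stored vectors are then reused for every future query.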

Query Phase (Online)

When a user submits a query:

  1. Query embedding: The query is converted to an embedding using the same embedding model used during indexing.

  2. Retrieval: The vector database returns the top-K chunks whose embeddings are most similar to the query embedding. K is typically 3-10, depending on context window size and chunk size.

  3. Optional reranking: Retrieved chunks are reranked by a cross-encoder model to improve relevance ordering.

  4. Context construction: Retrieved chunks are assembled into a context string, typically with their source information (document title, page number, URL) included.

  5. Generation: The language model receives a prompt containing the query and the retrieved context, with instructions to answer based on the provided documents. The model generates a response grounded in the retrieved content.

  6. Optional citation: The response may include references to the specific source documents from which information was drawn.
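The online steps can be sketched the same way. This toy example hard-codes 2-d "embeddings" and stops short of step 5 — the assembled prompt would be sent to a language model, which is outside the sketch.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def retrieve(query_embedding, index, top_k=3):
    """Step 2: top-K chunks by embedding similarity."""
    ranked = sorted(index, key=lambda e: cosine(query_embedding, e["embedding"]),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(query, chunks):
    """Steps 4-5: assemble retrieved chunks and instructions into a grounded prompt."""
    context = "\n\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return ("Answer the question using ONLY the documents below. "
            "If they do not contain the answer, say so.\n\n"
            f"Documents:\n{context}\n\nQuestion: {query}")

# Toy index with pre-computed 2-d "embeddings" (real ones have hundreds of dims)
index = [
    {"embedding": [0.9, 0.1], "text": "Refunds are processed in 5 business days.",
     "source": "policy.pdf"},
    {"embedding": [0.1, 0.9], "text": "Our office is closed on public holidays.",
     "source": "faq.html"},
]
top = retrieve([0.8, 0.2], index, top_k=1)  # step 1's query embedding, hard-coded
print(build_prompt("How long do refunds take?", top))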


The Original 2020 Paper: What Lewis et al. Proved

The foundational RAG paper — Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Kuettler, H., Lewis, M., Yih, W., Rocktaeschel, T., Riedel, S., and Kiela, D. — was published at NeurIPS 2020 and demonstrated that retrieval augmentation systematically outperforms pure parametric models on knowledge-intensive tasks.

The team tested two RAG formulations: RAG-Sequence (retrieving one set of documents for the entire answer) and RAG-Token (potentially retrieving different documents for each generated token). They compared against state-of-the-art closed-book models on several benchmarks:

  • On Natural Questions, RAG achieved 44.5% exact match, compared to 29.8% for the best closed-book model at the time — a 49% relative improvement
  • On TriviaQA, RAG matched or exceeded retrieval-augmented specialists despite being a general-purpose architecture
  • On WebQuestions and CuratedTrec, RAG set new state-of-the-art results in the open-domain question answering category

Critically, the study also found that RAG-generated responses were rated as more specific, more diverse, and more factually grounded by human evaluators than pure generative models. This established the empirical baseline that motivated the enormous investment in RAG infrastructure that followed.


Chunking Strategies and Their Tradeoffs

Chunking is one of the most consequential design decisions in a RAG system. Poor chunking degrades retrieval quality even when the underlying documents are comprehensive. A 2023 analysis by Shi et al. found that retrieval failure — not model generation failure — accounted for the majority of incorrect answers in evaluated RAG systems.

The main strategies and their tradeoffs:

  • Fixed-size: chunks of equal token count. Simple and predictable in size, but may split mid-sentence or mid-argument.
  • Sentence-aware: split at sentence boundaries. Preserves linguistic units, but chunk sizes vary.
  • Paragraph-level: one chunk per paragraph. Preserves thematic coherence, but paragraphs vary widely in length.
  • Hierarchical: multiple granularities (paragraph, sentence). Allows flexible retrieval at different granularities, but requires more complex infrastructure.
  • Semantic: split where the topic changes. Preserves semantic coherence, but requires a semantic segmentation model.
  • Sliding window: fixed-size chunks with overlap between them. Reduces boundary artifacts, but produces more chunks to store and retrieve.

Overlapping chunks — where adjacent chunks share some tokens — help preserve context that would otherwise be cut at chunk boundaries. A typical overlap is 10-20% of chunk size. Research from LlamaIndex's 2023 performance benchmarks showed that overlapping chunking reduced "split context" failures by approximately 15-20% compared to hard boundaries, though at the cost of increased storage and retrieval latency.
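Sliding-window chunking with overlap is a few lines of code. The sketch below works on a token list (the list of integers stands in for a tokenized document) and uses a 64-token overlap, within the 10-20% range mentioned above.

```python
def sliding_window_chunks(tokens, size=512, overlap=64):
    """Fixed-size chunks where adjacent chunks share `overlap` tokens."""
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covered the end of the document
    return chunks

tokens = list(range(1200))  # stand-in for a tokenized document
chunks = sliding_window_chunks(tokens, size=512, overlap=64)
print([len(c) for c in chunks])  # [512, 512, 304]
```

Because each chunk begins 64 tokens before the previous one ended, a sentence cut at one chunk's boundary appears intact near the start of the next — the "split context" failure the overlap is meant to prevent.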

The Chunk Size Dilemma

Chunk size is a fundamental tension in RAG design. Smaller chunks:

  • Are more precisely matched to narrow queries
  • May lack sufficient context to answer multi-part questions
  • Result in more chunks to retrieve and more noise

Larger chunks:

  • Contain more context per retrieval unit
  • May be semantically diluted, reducing retrieval precision
  • Consume more of the limited context window

Practical experience and published benchmarks suggest 512 tokens as a reasonable default starting point, with fine-tuning based on the specific characteristics of the knowledge base and query distribution.


RAG vs. Fine-Tuning vs. Prompt Engineering

These three approaches are often framed as alternatives, but they are more complementary than competing:

Prompt engineering modifies the model's behavior through the instructions and context provided at inference time. It is the fastest to implement and requires no training, but it is limited to what can fit in the context window and cannot teach the model new information.

Fine-tuning modifies the model's weights through additional training on specific data. It is effective for teaching new reasoning patterns, adjusting response style, or improving performance in specific domains. It is not well-suited for providing access to large, changing knowledge bases: fine-tuning is expensive, cannot easily be updated as the knowledge base changes, and does not support attribution.

RAG provides specific documents at inference time. It is well-suited for large, changing, or proprietary knowledge bases; it supports attribution; it does not require modifying model weights; and it can be updated by adding new documents to the knowledge base without retraining.

A 2023 comparison study by Ovadia et al., "Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs," directly tested these approaches on factual knowledge tasks. Their finding: RAG consistently outperformed fine-tuning for knowledge retrieval tasks, while fine-tuning showed advantages primarily in stylistic and behavioral adaptation. Critically, fine-tuning degraded on knowledge that changed after the fine-tuning data cutoff — a problem RAG does not have by design.

"For applications where the core requirement is factual grounding in a specific knowledge base, RAG is usually the right starting point. For applications where the core requirement is behavioral modification — a different tone, domain-specific reasoning patterns, specialized task formats — fine-tuning is the right starting point." — Ovadia et al., Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs (2023)

Many production systems use both: RAG for knowledge grounding, fine-tuning for behavioral alignment.


Advanced RAG Techniques

The basic retrieve-then-generate pipeline has been substantially extended by the research community. These techniques address specific failure modes or improve performance for particular use cases.

Query Transformation

Raw user queries often differ significantly from how relevant information is phrased in source documents. Query transformation addresses this by rewriting or expanding the query before retrieval.

  • Hypothetical Document Embeddings (HyDE): Proposed by Gao et al. (2022), HyDE generates a hypothetical answer to the query and then uses that hypothetical answer's embedding for retrieval, rather than the query's embedding. The rationale: the hypothetical answer is in the same style and vocabulary as the documents being retrieved, making embedding similarity more reliable.
  • Multi-query retrieval: Generate multiple reformulations of the query and retrieve for each, then merge and deduplicate results. Reduces the brittleness of any single query formulation.
  • Step-back prompting: For complex queries, first ask the model to identify a more general principle underlying the query, then retrieve documents matching that principle. Improves retrieval for specialized or technical questions.
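The merge step of multi-query retrieval can be sketched independently of how the reformulations are produced (typically by prompting an LLM, which is outside this sketch). The result lists below are hypothetical (doc_id, score) pairs from three reformulations of the same user question.

```python
def merge_multi_query_results(result_lists):
    """Merge retrieval results from several query reformulations.

    Duplicates are collapsed, keeping the best score seen for each
    document; the merged list is ordered by that best score.
    """
    best = {}
    for results in result_lists:
        for doc_id, score in results:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

merged = merge_multi_query_results([
    [("doc1", 0.91), ("doc4", 0.72)],   # reformulation A
    [("doc4", 0.85), ("doc2", 0.60)],   # reformulation B
    [("doc1", 0.88), ("doc7", 0.55)],   # reformulation C
])
print(merged)  # doc1 and doc4 keep their best scores; doc2 and doc7 follow
```

A document that matches several reformulations survives even if one particular phrasing of the query would have missed it — exactly the brittleness the technique targets.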

Self-RAG: Adaptive Retrieval

Asai et al. (2023) introduced Self-RAG, a framework that trains models to decide when to retrieve (rather than always retrieving) and to critically evaluate the relevance and support quality of retrieved passages. The model generates special "reflection tokens" indicating whether retrieval is needed, whether retrieved passages are relevant, and whether the final response is well-supported by the retrieved evidence.

Self-RAG showed significant improvements over standard RAG on several benchmarks, particularly for questions that do not require retrieval (where standard RAG adds noise by inserting irrelevant context).

GraphRAG

Microsoft Research introduced GraphRAG (Edge et al., 2024), which constructs a knowledge graph from source documents and uses graph traversal alongside vector retrieval. GraphRAG significantly outperforms naive RAG on queries that require synthesis across multiple documents or understanding of relationships between entities — what the paper calls "global queries" about a corpus rather than local fact lookup.

In their evaluation of question answering over a large corpus of news articles and academic papers, GraphRAG achieved substantially better comprehensiveness and diversity scores than vector-only RAG, at the cost of significantly higher indexing time and computation.


Common Failure Modes

Retrieval Failures

The generator can only produce accurate responses if the retriever finds the right documents. Common retrieval failures:

  • False negatives: Relevant documents are in the knowledge base but are not retrieved, either because the embedding model does not capture the semantic relationship or because the chunks are too small or too large to match the query.
  • False positives: Irrelevant documents are retrieved and included in context, potentially misleading the generator or consuming context window capacity.
  • Query-document mismatch: The way users phrase queries often differs substantially from how information is phrased in documents. Embedding models that handle this translation imperfectly produce poor retrieval.

Model Faithfulness Failures

Even with relevant documents retrieved, the generator may not use them faithfully:

  • Hallucinating beyond the retrieved context: The model generates claims not supported by the provided documents, reverting to parametric memory.
  • Ignoring contradictions: The model generates responses that ignore retrieved documents that contradict its prior knowledge.
  • Lost-in-the-middle: Research by Liu et al. (2023) found that language models perform worse on information located in the middle of a long context compared to information at the beginning or end — important for RAG systems providing many retrieved documents. In their experiments with GPT-3.5-Turbo and Claude, performance dropped by as much as 20 percentage points when relevant information was placed in the middle of a 20-document context versus at the beginning or end.
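One simple mitigation suggested by the lost-in-the-middle finding is to reorder retrieved documents so the highest-ranked ones sit at the start and end of the context, pushing the weakest matches into the middle. This is a sketch of that reordering, not a technique from the Liu et al. paper itself.

```python
def reorder_for_long_context(ranked_docs):
    """Interleave a best-first ranking so top documents land at the
    start and end of the context, and the weakest land in the middle."""
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

ranked = ["d1", "d2", "d3", "d4", "d5"]  # d1 = most relevant
print(reorder_for_long_context(ranked))  # ['d1', 'd3', 'd5', 'd4', 'd2']
```

The most relevant document stays first, the second most relevant moves to the end, and the lowest-ranked retrievals end up in the positions the model attends to least.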

Chunking and Context Problems

  • Context fragmentation: A coherent argument or explanation spanning multiple paragraphs is split across chunks; no single chunk contains enough context for the retriever to match it to queries that require the full argument.
  • Context window overflow: More retrieved documents than fit in the context window, requiring truncation and losing potentially relevant content.

Evaluating RAG Systems: The RAGAS Framework

One of the most significant developments in RAG engineering practice was the introduction of systematic evaluation frameworks. RAGAS (Retrieval Augmented Generation Assessment), introduced by Es et al. (2023), provides automated metrics for assessing RAG pipeline quality across four dimensions:

  • Faithfulness: does the answer stick to the retrieved context? Checks whether every claim in the answer can be attributed to the context.
  • Answer relevancy: does the answer address the question? Measures semantic similarity between the answer and the question.
  • Context precision: are the retrieved documents relevant? Checks whether the retrieved documents actually contain information needed for the answer.
  • Context recall: were all relevant documents retrieved? Compares the retrieved context against ground-truth answers.

RAGAS-based evaluation revealed that in typical production RAG systems, context precision and faithfulness are often the binding constraints — the system retrieves too many low-quality documents, and the generator is not sufficiently strict about staying within the retrieved context.


Building Production RAG Systems

The gap between a prototype RAG system (query, retrieve, generate) and a production RAG system is substantial. Production systems require:

Evaluation pipelines: Systematic measurement of retrieval quality (precision, recall, mean reciprocal rank) and response quality (faithfulness to retrieved context, answer relevance, groundedness). Tools like RAGAS provide automated evaluation frameworks.
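The retrieval-quality metrics named above are straightforward to compute when ground-truth relevance labels are available. The retrieved list and relevant set below are toy data.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top-k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mean_reciprocal_rank(retrieved_lists, relevant_sets):
    """Average of 1/rank of the first relevant document across queries."""
    total = 0.0
    for retrieved, relevant in zip(retrieved_lists, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(retrieved_lists)

retrieved = ["d2", "d9", "d1", "d5"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, 3))         # 2 of the top-3 are relevant
print(recall_at_k(retrieved, relevant, 3))            # both relevant docs found
print(mean_reciprocal_rank([retrieved], [relevant]))  # first relevant doc at rank 1
```

Faithfulness and groundedness metrics are harder to automate because they require judging whether generated claims follow from the context, which is what frameworks like RAGAS use models to approximate.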

Monitoring: Tracking retrieval latency, response quality metrics, and failure modes in production traffic.

Document update handling: Strategies for re-embedding and re-indexing documents when they change, without requiring full re-indexing of the knowledge base.

Access control: Ensuring users can only retrieve documents they have permission to see — critical for enterprise deployments with multiple user roles and sensitive document categories.

Real-World Deployment Scale

By 2024, RAG had become a standard enterprise architecture at scale. Analysts at Gartner estimated that over 70% of enterprise AI deployments involving knowledge retrieval used some form of RAG pipeline. Vector database vendors reported substantial growth: Pinecone reported over 100,000 developers using its platform by early 2024, and Weaviate, Chroma, and pgvector (the open-source PostgreSQL extension for vector similarity search) saw similarly rapid adoption curves.

A key measure of RAG's production value: Microsoft's Copilot, Salesforce's Einstein AI, and ServiceNow's Now Assist all use RAG architectures at their core, combining proprietary enterprise data retrieval with large language model generation.

Latency Optimization

Production RAG must balance retrieval quality against latency. The full pipeline — query embedding, vector search, optional reranking, context assembly, generation — adds overhead compared to direct model inference. Optimization strategies include:

  • Approximate nearest neighbor (ANN) search: Vector databases use ANN algorithms (HNSW, IVF, PQ) to find approximate matches much faster than exact search, with negligible quality loss at practical scales
  • Embedding caching: Cache embeddings for frequently queried terms to avoid redundant embedding computation
  • Asynchronous retrieval: Initiate retrieval while beginning to stream the initial response, updating it with retrieved context
  • Tiered retrieval: Fast sparse retrieval for an initial candidate set, followed by dense reranking only on candidates
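Embedding caching, in the simplest case, is memoization keyed on the query text. The sketch below uses the standard library's `functools.lru_cache`; the embedding call is simulated with a toy function and a counter so the cache hit is observable, whereas a real system would call an embedding model or API here.

```python
from functools import lru_cache

calls = {"count": 0}  # tracks how often the "model" is actually invoked

@lru_cache(maxsize=10_000)
def cached_embed(text):
    """Memoize embeddings so repeated queries skip recomputation.

    The body is a toy stand-in for an expensive embedding-model call.
    """
    calls["count"] += 1
    return tuple(float(ord(c)) for c in text[:4])

cached_embed("reset my password")
cached_embed("reset my password")  # served from cache, no recomputation
print(calls["count"])              # the "model" ran only once
```

In production the cache would typically live in a shared store (e.g. Redis) rather than in-process memory, and would be keyed on normalized query text.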

The Future of RAG

Long Context vs. RAG

One active debate in the research community is whether improvements in language model context windows will eventually make RAG unnecessary. Models like Claude 3 (200K token context window) and Gemini 1.5 Pro (up to 1 million tokens) can, in principle, ingest entire knowledge bases directly in context without retrieval.

Research suggests the relationship is nuanced: RAG and long-context models are complementary, not competing. For very large knowledge bases (millions of documents), RAG remains essential because even 1M token windows cannot hold everything. For medium-sized knowledge bases with precise information needs, long-context models reduce retrieval errors. For knowledge bases that change frequently, RAG's updateability remains an advantage even at smaller scales.

"We find that both long-context models and RAG are imperfect at different failure modes. Long-context models can attend to any content they receive, but retrieving relevantly is still needed for very large corpora. RAG is efficient but relies entirely on retrieval quality. The best systems combine both." — Xu et al., Retrieval Meets Long Context Large Language Models (2023)

Multimodal RAG

Retrieval augmentation is expanding beyond text. Multimodal RAG systems retrieve images, audio, diagrams, and structured data alongside text, providing the generator with richer evidence. Applications include:

  • Medical imaging AI that retrieves similar cases from radiology archives before generating reports
  • Legal research systems that retrieve and reason across case documents, statutes, and precedent together
  • Manufacturing quality control systems that retrieve historical failure images alongside written specifications

For related concepts, see AI hallucinations explained, large language models explained, and AI prompt engineering guide.


References

  • Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Kuettler, H., Lewis, M., Yih, W., Rocktaeschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, 33. https://arxiv.org/abs/2005.11401
  • Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., & Wang, H. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv preprint arXiv:2312.10997. https://arxiv.org/abs/2312.10997
  • Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv preprint arXiv:2307.03172. https://arxiv.org/abs/2307.03172
  • Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W. (2020). Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of EMNLP 2020. https://arxiv.org/abs/2004.04906
  • Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389. https://doi.org/10.1561/1500000019
  • Es, S., James, J., Anke, L. E., & Schockaert, S. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint arXiv:2309.15217. https://arxiv.org/abs/2309.15217
  • Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv preprint arXiv:2310.11511. https://arxiv.org/abs/2310.11511
  • Gao, L., Ma, X., Lin, J., & Callan, J. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels. arXiv preprint arXiv:2212.10496. https://arxiv.org/abs/2212.10496
  • Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., & Larson, J. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv preprint arXiv:2404.16130. https://arxiv.org/abs/2404.16130
  • Ovadia, O., Brief, M., Mishaeli, M., & Elisha, O. (2023). Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs. arXiv preprint arXiv:2312.05934. https://arxiv.org/abs/2312.05934
  • Xu, P., Ping, W., Wu, X., McAfee, L., Zhu, C., Liu, Z., Subramanian, S., Bakhturina, E., Shoeybi, M., & Catanzaro, B. (2023). Retrieval Meets Long Context Large Language Models. arXiv preprint arXiv:2310.03025. https://arxiv.org/abs/2310.03025

Frequently Asked Questions

What is Retrieval Augmented Generation (RAG)?

Retrieval Augmented Generation (RAG) is an AI architecture that combines a language model with a retrieval system. When a query is received, a retriever searches a knowledge base for relevant documents. Those documents are injected into the language model's context alongside the query, and the model generates a response grounded in the retrieved content. RAG allows language models to answer questions about specific, up-to-date, or proprietary information without retraining.

Why use RAG instead of just asking the language model directly?

Language models have fixed training data with a knowledge cutoff date, and they hallucinate when generating facts from statistical inference rather than retrieved sources. RAG addresses both problems: it provides the model with specific, current, or proprietary documents at query time, grounding responses in verifiable sources and significantly reducing hallucinations. RAG also allows attribution — responses can be traced back to specific documents.

How does RAG retrieve relevant documents?

The most common retrieval approach in modern RAG systems uses vector embeddings and semantic search. Documents are converted to numerical vector embeddings using an embedding model; the query is converted to an embedding in the same space. Relevant documents are retrieved by finding those whose embeddings are most similar to the query embedding — typically using cosine similarity or approximate nearest neighbor search in a vector database. This captures semantic similarity rather than just keyword matching.

What is a vector database?

A vector database is a database optimized for storing and searching high-dimensional numerical vectors (embeddings). Unlike traditional databases that search by exact or fuzzy string match, vector databases support similarity search: finding the vectors closest to a query vector in high-dimensional space. Popular vector databases include Pinecone, Weaviate, Chroma, Milvus, and pgvector (an extension for PostgreSQL). They are the retrieval backbone of most production RAG systems.

What is the difference between RAG and fine-tuning?

Fine-tuning updates the model's weights by training it on specific data. RAG leaves the model weights unchanged and instead provides relevant documents at inference time. Fine-tuning is better for teaching the model new reasoning patterns, styles, or domain-specific behaviors. RAG is better for providing access to specific factual information, keeping responses current, and enabling attribution. RAG is also more economical than fine-tuning for frequently-changing knowledge bases.

What are the main failure modes of RAG systems?

Common RAG failure modes include: retrieval failures (the retriever fails to find relevant documents, or retrieves irrelevant ones); context window overflow (retrieved documents exceed the model's context limit); model faithfulness failures (the model ignores or contradicts retrieved documents); chunking errors (documents are split in ways that lose important context); and embedding mismatch (the embedding model fails to capture the semantic similarity between query and relevant documents).

When should you use RAG versus other approaches?

Use RAG when you need the model to access specific, up-to-date, or proprietary factual information — company documents, current data, specialized knowledge bases. Use fine-tuning when you need to modify the model's reasoning style, domain-specific behavior, or output format. Use prompt engineering and system prompts when you need to control response tone, format, or general behavior without additional knowledge. Many production systems combine RAG with fine-tuning.