
Building a RAG Pipeline — The Decisions That Actually Matter

Most RAG pipelines fail before the LLM gets involved. The chunking is wrong, the retrieval returns noise, or the context window is stuffed with irrelevant text. By the time the model generates an answer, the damage is done.

The frustrating part is that none of those failures are LLM problems. They are pipeline problems — upstream decisions made without enough thought. And because the output looks plausible rather than obviously broken, teams spend weeks tuning prompts when the issue is a chunk boundary from the indexing stage.

This post is about those upstream decisions. Not the code — the choices. What chunking strategy to start with, why hybrid retrieval beats pure vector search in most production cases, and how context assembly determines whether the model actually uses what you retrieved.

🔗 Foundation posts

RAG — Retrieval Augmented Generation Explained — what RAG is and why it exists — read this first if you are new to the concept
Vector Databases Explained — embeddings, cosine similarity and the same-model rule — foundational for the indexing sections below

The two phases every RAG pipeline has

Every RAG system operates in two distinct phases. Understanding this separation is the most important structural insight before you touch a single component.

The indexing pipeline runs once, or periodically when your documents change. It takes your source documents, processes them into a searchable form, and stores the results in a vector database.

The retrieval pipeline runs on every query. It takes the user’s question, finds the relevant chunks from the index, assembles them into a prompt, and sends that to the LLM to generate a grounded answer.

| Phase | When it runs | What it does |
| --- | --- | --- |
| Indexing | Once (then on document updates) | Load → chunk → embed → store in vector DB |
| Retrieval | Every query | Embed query → search → assemble context → generate |

The reason the two-phase model matters: decisions made in the indexing phase lock in constraints for the retrieval phase. A poor chunking decision made during indexing cannot be fixed by a better prompt. You have to reindex.
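To make the separation concrete, here is a minimal sketch of the two phases as two functions. It is an illustration only: `embed` is a toy bag-of-words stand-in for a real embedding model, the index is an in-memory list rather than a vector database, and all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents: list[str], chunk_words: int = 50) -> list[tuple[str, Counter]]:
    """Indexing phase: load -> chunk -> embed -> store. Runs once per corpus version."""
    index = []
    for doc in documents:
        words = doc.split()
        for start in range(0, len(words), chunk_words):
            chunk = " ".join(words[start:start + chunk_words])
            index.append((chunk, embed(chunk)))
    return index


def retrieve(index: list[tuple[str, Counter]], query: str, top_k: int = 3) -> list[str]:
    """Retrieval phase: embed query -> search. Runs on every query."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

Notice that `chunk_words` only appears in `build_index`. Nothing in `retrieve` can change how the text was split, which is why a chunking mistake means rebuilding the index.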

📌 Key Takeaway

Retrieval quality is determined almost entirely by indexing decisions — chunk size, embedding model, index type. The LLM at the end is only as good as what you feed it.

[Figure: two-phase RAG pipeline. The indexing pipeline (four steps) is on the left and the retrieval pipeline (five steps) is on the right, separated by a bold dividing line.]

Document ingestion — what goes in determines what comes out

The first step of the indexing pipeline is loading your documents. This sounds trivial. It is not.

The challenge is that real enterprise documents are messy. PDFs with scanned tables. Word documents with embedded images. HTML pages with navigation menus mixed into the body text. HTML extraction gives you boilerplate. PDF extraction breaks table structure. If you feed the pipeline noise at this stage, it embeds and stores that noise, and your retrieval will surface it faithfully.

Two things matter most at ingestion: extracting clean text and preserving metadata.

| What to preserve | Why it matters for retrieval |
| --- | --- |
| Document title and source | Allows citation grounding: the model can say where it found the information |
| Section headings | Adds structural context to chunks: a chunk from ‘Returns Policy’ is more retrievable than unattributed text |
| Document date or version | Lets you filter retrieval by recency, which is important for policy documents and release notes |
| File type / content type | Different parsing strategies for PDFs vs HTML vs markdown: mixing them without flagging degrades quality |
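As a rough illustration of both goals, clean text plus preserved metadata, here is a sketch of an HTML loader built on Python's standard `html.parser`. The `SourceDocument` class, the `load_html` function and the list of tags to skip are all assumptions made for this example. Real pages usually need a more robust extractor.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Assumption: these tags hold boilerplate rather than body content.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


@dataclass
class SourceDocument:
    text: str
    metadata: dict = field(default_factory=dict)


class _BodyTextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0
        self.in_title = False
        self.title = ""
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_title:
            self.title += text
        elif self.skip_depth == 0:
            self.parts.append(text)


def load_html(html: str, source: str, version: str) -> SourceDocument:
    """Return cleaned body text together with the metadata worth keeping."""
    parser = _BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return SourceDocument(
        text="\n".join(parser.parts),
        metadata={
            "title": parser.title,
            "source": source,
            "version": version,
            "content_type": "html",
        },
    )
```

Carrying `metadata` alongside the text from the first step means every chunk cut from this document can inherit its title, source and version later.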

💡 Practical Tip

Run a sample of 20–30 documents through your extraction step and read the raw output before indexing anything at scale. You will almost always find encoding errors, merged table cells, or navigation fragments that need cleaning. Fix them at source, not downstream.

Chunking — the decision most people get wrong

Chunking is the step that breaks your documents into the segments that actually get embedded and stored. It is the highest-leverage decision in the entire indexing pipeline — and the most commonly underestimated one.

The core tension is this: chunks that are too small lose context. Chunks that are too large dilute relevance. A chunk containing three unrelated topics will embed somewhere between all three — and will be retrieved by none of them reliably.

The three main strategies

| Strategy | How it works | When to use it |
| --- | --- | --- |
| Fixed-size | Split at a set token count, with optional overlap between adjacent chunks | Fast to implement, good baseline. Start here to benchmark. |
| Recursive | Split at natural boundaries (paragraphs, then sentences, then words) until chunks fit the target size | Better boundary preservation than fixed-size. Common default in most frameworks. |
| Semantic | Group sentences by embedding similarity, splitting where the topic shifts rather than at token counts | Higher retrieval quality for varied content, but slower at indexing time. Use after benchmarking shows the extra cost is worth it. |

For most production pipelines, recursive chunking at 400–512 tokens with 10–20% overlap is a sound starting point. It is the approach that balances quality with simplicity across the widest range of document types.

That said: there is no universal best chunk size. The right number depends on your corpus, your query patterns and your embedding model. Treat the starting point as a baseline to measure against, not a final answer.
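For a feel of what recursive chunking does, here is a minimal sketch. It is not how any particular framework implements it. The assumptions are: whitespace-separated words stand in for tokens, the separator list is a simple regex approximation of paragraph and sentence boundaries, and overlap is added as a separate pass, so overlapped chunks can exceed `max_tokens` by up to `overlap_tokens`.

```python
import re

# Coarsest to finest: paragraphs, then sentences, then words.
SEPARATORS = [r"\n\s*\n", r"(?<=[.!?])\s+", r"\s+"]


def count_tokens(text: str) -> int:
    """Assumption: whitespace words approximate tokens. Swap in your model's tokenizer."""
    return len(text.split())


def _pack(pieces: list[str], max_tokens: int) -> list[str]:
    """Greedily merge adjacent pieces while the merged chunk stays within max_tokens."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def recursive_chunk(text: str, max_tokens: int = 512, level: int = 0) -> list[str]:
    """Split at the coarsest boundary first; only go finer for pieces that are too big."""
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= max_tokens or level >= len(SEPARATORS):
        return [text]
    pieces = []
    for part in re.split(SEPARATORS[level], text):
        pieces.extend(recursive_chunk(part, max_tokens, level + 1))
    return _pack(pieces, max_tokens)


def add_overlap(chunks: list[str], overlap_tokens: int) -> list[str]:
    """Prefix each chunk with the tail of its predecessor so boundary context is shared."""
    if overlap_tokens <= 0:
        return list(chunks)
    result = chunks[:1]
    for previous, current in zip(chunks, chunks[1:]):
        tail = " ".join(previous.split()[-overlap_tokens:])
        result.append(f"{tail} {current}")
    return result
```

With `max_tokens=512`, an overlap of 10–20% corresponds to `overlap_tokens` of roughly 50 to 100.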

⚠️ Warning

Do not index everything and then tune chunk size. Reindexing is expensive — both computationally and in terms of the embedding API calls it requires. Run small-scale retrieval tests across a representative query set before committing to a chunking strategy at full scale.
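A small-scale retrieval test can be as simple as the sketch below: a handful of queries with hand-labelled relevant chunk IDs, scored with recall@k and mean reciprocal rank. The chunk IDs, the `run` and `gold` structures and the function names are illustrative assumptions, not part of any specific evaluation library.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(run: dict[str, list[str]], gold: dict[str, set[str]], k: int = 5) -> dict[str, float]:
    """Average both metrics over every query that has gold labels."""
    total = len(gold)
    recall = sum(recall_at_k(run.get(q, []), rel, k) for q, rel in gold.items()) / total
    mrr = sum(reciprocal_rank(run.get(q, []), rel) for q, rel in gold.items()) / total
    return {f"recall@{k}": recall, "mrr": mrr}
```

Run the same query set against each candidate chunking configuration on a sample of the corpus, and compare the numbers before paying for a full index.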

[Figure: chunking strategy comparison in three panels. Fixed-size chunking with equal blocks and potential mid-sentence splits, recursive chunking preserving paragraph boundaries, and semantic chunking grouping sentences by topic similarity.]

Embedding and indexing — the same-model rule

Once your chunks are ready, each one gets converted into an embedding — a numerical vector that represents its meaning — and stored in a vector database. The mechanics of this are covered in depth in the Vector Databases Explained post linked above.

One rule overrides everything else at this stage: use the same embedding model for indexing documents and embedding queries at search time. Vectors from different models are not comparable. Mixing them produces retrieval results that are essentially random — and failures that are very hard to diagnose.
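One way to make the same-model rule hard to break is to record the model name with the index and refuse any vector produced by a different one. The sketch below assumes a hypothetical in-memory `VectorIndex` class and made-up model names. In a real system the model identifier would live in the vector store's collection metadata.

```python
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    """In-memory index that remembers which embedding model produced its vectors."""
    embedding_model: str
    entries: list = field(default_factory=list)

    def _check_model(self, model: str) -> None:
        if model != self.embedding_model:
            raise ValueError(
                f"index was built with {self.embedding_model!r}, got vectors from {model!r}"
            )

    def add(self, chunk_id: str, vector: list[float], model: str) -> None:
        self._check_model(model)
        self.entries.append((chunk_id, vector))

    def search(self, query_vector: list[float], model: str, top_k: int = 5) -> list[str]:
        self._check_model(model)
        # Dot product as the similarity score (assumes normalised vectors).
        ranked = sorted(
            self.entries,
            key=lambda entry: -sum(q * v for q, v in zip(query_vector, entry[1])),
        )
        return [chunk_id for chunk_id, _ in ranked[:top_k]]
```

A loud `ValueError` at query time is much easier to diagnose than silently poor retrieval.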

Your choice of embedding model also determines the quality ceiling for your entire retrieval system. A weaker embedding model will produce lower-quality clusters in vector space, regardless of how well you chunk or retrieve.

✅ Best Practice

Pick your embedding model before you index a single document. Changing it later means reindexing everything. For most enterprise use cases, OpenAI text-embedding-3-large or Cohere embed-english-v3.0 are strong defaults. If you are building on SAP BTP, SAP AI Core provides access to embedding models via the Generative AI Hub — no external dependency required.

Retrieval — more than just nearest neighbour

Most introductory RAG tutorials use pure vector search: embed the query, find the N nearest vectors, return the chunks. That works fine in demos. It has real weaknesses in production.

The problem is that vector search finds semantic similarity — but some queries need lexical precision. A user searching for a specific product code, a regulation reference number, or a person’s name is not asking for semantic closeness. They want an exact match. Vector search can miss these entirely while confidently returning something semantically plausible but wrong.

Hybrid retrieval — the production default

Hybrid retrieval combines vector search with BM25, a keyword-based ranking function that scores documents by how often the query terms appear in them, weighted by how rare each term is across the corpus and normalised for document length. Run both in parallel, then merge the ranked results using Reciprocal Rank Fusion (RRF), which combines rankings using only each result's rank position, so no score normalisation is required.
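RRF is short enough to sketch in full. Each document scores the sum of 1 / (k + rank) over every ranking it appears in. The constant `k = 60` used here is a commonly used default and an assumption on my part, and the document IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["a", "b", "c"]
vector_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Here `b` comes out on top because it sits near the top of both lists, while `c` and `d`, which each appear in only one list, fall to the bottom. Only positions matter, so the BM25 and cosine scores never need to share a scale.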

Hybrid retrieval typically outperforms either method alone across document types, though the size of the gain depends on the corpus and should be measured on your own query set. For enterprise content (policy documents, technical manuals, financial records) the improvement tends to be significant because these documents contain precise terminology that lexical matching handles better than semantic search.

| Retrieval method | What it does well / where it falls short |
| --- | --- |
| Pure vector search | Handles paraphrasing and semantic variation well. Misses exact-term queries such as product codes, names and specific identifiers. |
| BM25 (keyword) | Precise on exact terms and domain jargon. Misses synonyms and paraphrased queries. |
| Hybrid (BM25 + vector via RRF) | Covers both. Recommended minimum baseline for production pipelines. |
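To see why the keyword side handles identifiers so well, here is a compact sketch of BM25 scoring in plain Python. The tokenizer is a naive lowercase whitespace split, `k1 = 1.5` and `b = 0.75` are typical defaults rather than tuned values, and the product codes are invented for the example. In production you would normally use the BM25 built into your search engine or database.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Assumption: lowercase whitespace split is enough for this illustration."""
    return text.lower().split()


class BM25:
    def __init__(self, corpus: list[str], k1: float = 1.5, b: float = 0.75) -> None:
        self.docs = [Counter(tokenize(doc)) for doc in corpus]
        self.lengths = [sum(doc.values()) for doc in self.docs]
        self.avg_len = sum(self.lengths) / len(self.lengths)
        self.k1, self.b = k1, b
        total = len(corpus)
        doc_freq = Counter(term for doc in self.docs for term in doc)
        # Rare terms get a high weight, common terms a low one.
        self.idf = {
            term: math.log((total - freq + 0.5) / (freq + 0.5) + 1.0)
            for term, freq in doc_freq.items()
        }

    def score(self, query: str, doc_index: int) -> float:
        doc, length = self.docs[doc_index], self.lengths[doc_index]
        total = 0.0
        for term in tokenize(query):
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            length_norm = 1 - self.b + self.b * length / self.avg_len
            total += self.idf[term] * tf * (self.k1 + 1) / (tf + self.k1 * length_norm)
        return total

    def rank(self, query: str) -> list[int]:
        """Indices of documents with a non-zero score, best first."""
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        return [i for s, i in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]
```

A query for one exact product code ranks the document containing that code first, even when a near-identical document differs by a single digit. That is a distinction an embedding model may blur.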

Reranking — precision after recall

After retrieval returns its top results, a reranker re-scores them using a cross-encoder model — one that processes the query and each candidate chunk together, rather than separately. This produces much more accurate relevance scores than the initial retrieval pass.

The pattern is: retrieve broadly (top 50–100 results), rerank precisely (return top 5–10). Retrieval optimises for recall. Reranking optimises for precision. The combination outperforms either stage alone by a wide margin.
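The retrieve-broadly, rerank-precisely pattern can be sketched as one small function. Both the retriever and the scorer are passed in as callables, so the sketch runs without dependencies. The function and parameter names are hypothetical. With Sentence Transformers, the `predict` method of a `CrossEncoder` takes a list of (query, passage) pairs and could be passed as `score_pairs`.

```python
from typing import Callable, Sequence


def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], list[str]],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    recall_k: int = 50,
    final_k: int = 5,
) -> list[str]:
    """Stage 1 fetches a broad candidate set; stage 2 keeps only the best few."""
    candidates = retrieve(query, recall_k)
    if not candidates:
        return []
    # The scorer sees the query and each candidate together, as a cross-encoder does.
    scores = score_pairs([(query, candidate) for candidate in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in ranked[:final_k]]
```

Keeping `recall_k` and `final_k` as separate parameters lets you tune recall and precision independently.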

Cohere rerank-english-v3.0 and cross-encoder models from Sentence Transformers are the standard tools. Reranking adds latency that grows with the number of candidates, because the cross-encoder runs once per query-chunk pair. For candidate sets of around 100 this is usually acceptable for interactive applications, but measure it on your own hardware and model.

📝 Note

Reranking is optional for simple pipelines, but becomes close to essential as document volume grows. The more noise your retrieval step returns, the more a reranker earns its place. If your context window is filling up with marginally relevant chunks, add a reranker before changing anything else.

[Figure: RAG retrieval pipeline in three stages. Hybrid retrieval combining BM25 and vector search via RRF fusion, reranking with a cross-encoder to produce the top 10 results, and context assembly feeding the LLM.]

💡 SAP context

If you are building on SAP BTP, the reference architecture for enterprise RAG uses SAP HANA Cloud as the vector store, SAP AI Core for model access via the Generative AI Hub, and CAP as the application layer. HANA Cloud’s Vector Engine handles both vector similarity search and structured SQL queries in one system — no separate vector database needed. The same-model rule applies: use the same embedding model consistently through AI Core for both indexing and query embedding.

Generation — the prompt that wraps the context

Once retrieval returns its results, the final step is assembling those chunks into the prompt you send to the LLM. This is called context assembly — and it has more impact on answer quality than most teams realise.

The naive approach is to concatenate the top N chunks and append the user question. That works. It also wastes context window space with redundant information, has no clear citation structure, and gives the model no guidance on what to do with the retrieved content.

A more deliberate approach specifies the structure: what the retrieved content is, where it came from, and what you want the model to do with it.

| Context assembly element | Why it matters |
| --- | --- |
| Source attribution per chunk | Enables citation grounding: the model can reference which document it drew from, reducing hallucination and building trust in the answer |
| Explicit instruction on how to use context | Without it, models sometimes ignore retrieved content in favour of parametric knowledge. Telling the model to answer only from the provided context constrains the output appropriately. |
| Handling ‘not found’ cases | Instruct the model to say it does not know rather than speculate when retrieved context does not contain the answer. This is the single most important guardrail in any RAG prompt. |
| Context ordering | More relevant chunks placed closer to the question tend to produce better answers, because the model attends more strongly to nearby context. |
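The four elements above can be combined in a small prompt builder. This is one possible layout, not a canonical template: the `RetrievedChunk` fields, the instruction wording and the refusal sentence are all assumptions you should adapt to your own model and domain.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    text: str
    source: str
    score: float  # relevance score from the reranker; higher is better


INSTRUCTIONS = (
    "Answer the question using only the numbered context passages below. "
    "Cite the passages you used by number, for example [2]. "
    "If the context does not contain the answer, reply exactly: "
    '"I could not find this in the provided documents."'
)


def build_prompt(question: str, chunks: list[RetrievedChunk], max_chunks: int = 5) -> str:
    """Assemble instructions, source-tagged context and the question into one prompt."""
    top = sorted(chunks, key=lambda chunk: chunk.score, reverse=True)[:max_chunks]
    # Least relevant first, so the most relevant passage sits closest to the question.
    ordered = list(reversed(top))
    blocks = [
        f"[{number}] Source: {chunk.source}\n{chunk.text}"
        for number, chunk in enumerate(ordered, start=1)
    ]
    context = "\n\n".join(blocks)
    return f"{INSTRUCTIONS}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
```

Numbering the passages gives the model something short to cite, and the fixed refusal sentence makes ‘not found’ answers easy to detect in logs.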

⚠️ Warning

Context window limits are real. A 128K token window sounds vast until you realise that 50 retrieved chunks at 512 tokens each use 25,600 tokens, roughly a fifth of the window, before you have written a single instruction. Be deliberate about how many chunks you pass. The reranker exists for exactly this reason: cut aggressively before assembly, not after.
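The arithmetic, plus a simple guard that enforces a budget, can be sketched as follows. The whitespace-based token counter is a rough assumption, so pass in your model's real tokenizer. The function names and the 128,000-token window size are illustrative.

```python
from typing import Callable


def context_tokens(num_chunks: int, tokens_per_chunk: int) -> int:
    """Tokens consumed by retrieved context alone, before instructions or the question."""
    return num_chunks * tokens_per_chunk


def fit_to_budget(
    ranked_chunks: list[str],
    budget_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> list[str]:
    """Keep chunks in rank order until the next one would exceed the budget."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept


used = context_tokens(num_chunks=50, tokens_per_chunk=512)  # 25600
share_of_window = used / 128_000  # 0.2, assuming a 128,000-token window
```

Applying `fit_to_budget` after reranking means the chunks that get dropped are always the least relevant ones.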

[Figure: RAG prompt anatomy in four horizontal bands. System instruction in dark navy, retrieved context chunks in teal with source tags, context instruction in amber and user query in green, all within a context window indicator.]

The decisions at a glance

| Decision | Recommended starting point | When to revisit |
| --- | --- | --- |
| Chunking strategy | Recursive at 400–512 tokens, 10–20% overlap | When retrieval quality benchmarks show boundary-related misses |
| Chunk size | 400–512 tokens for mixed query types; smaller (128–256) for factoid lookups | After measuring against your actual query distribution |
| Embedding model | text-embedding-3-large or embed-english-v3.0 for production; all-MiniLM-L6-v2 for low-cost local | Only when switching; remember to reindex everything |
| Retrieval method | Hybrid (BM25 + vector via RRF) as minimum baseline | Pure vector only if your corpus has no exact-term queries |
| Reranker | Add when retrieval returns too much noise; rerank-english-v3.0 or cross-encoder | Optional for small corpora; near-essential above 10K chunks |
| Context assembly | Source-attributed chunks + explicit ‘answer from context only’ instruction + ‘say if not found’ guardrail | When the model ignores retrieved content or hallucinates despite correct retrieval |
| Context volume | 5–10 reranked chunks; verify total token count before sending | When answers become vague, which is often a signal of too many low-relevance chunks, not too few |

What to take away

The RAG pipeline is not one decision — it is six or seven decisions made in sequence, each one constraining the next. A mistake at chunking cannot be fixed by better retrieval. A retrieval strategy that only uses vector search will miss exact-term queries regardless of how sophisticated the prompt is. The LLM is the last piece, not the most important one.

The teams who build reliable RAG systems are not the ones with the most sophisticated models. They are the ones who measured retrieval quality at each stage before moving to the next — who tested chunking on a representative sample, confirmed hybrid retrieval outperformed pure vector on their corpus, and verified that the model was actually using the retrieved context rather than its parametric memory.

Start with the two-phase model as your mental map. Build the indexing pipeline first and validate it independently before connecting retrieval. Add complexity — semantic chunking, reranking, metadata filtering — only where benchmarks show it improves results. Most production failures come from adding sophistication before the basics are solid.

🔗 Related posts on this site

RAG — Retrieval Augmented Generation Explained — the conceptual foundation — what RAG is, why it was introduced and how it fits into the AI landscape
Vector Databases Explained — the indexing layer in depth — embeddings, cosine similarity and choosing the right vector store
Fine-Tuning vs Prompt Engineering vs RAG — when RAG is the right approach versus fine-tuning or prompt engineering alone
AI Hallucinations — Why They Happen — hallucinations are what RAG is designed to reduce — this post explains the root cause

Published on rakeshnarayan.com — Articles

URL: https://rakeshnarayan.com/articles/building-a-rag-pipeline-the-decisions-that-actually-matter/