Building a RAG Pipeline — The Decisions That Actually Matter
Most RAG pipelines fail before the LLM gets involved. The chunking is wrong, the retrieval returns noise, or the context window is stuffed with irrelevant text. By the time the model generates an answer, the damage is done.
The frustrating part is that none of those failures are LLM problems. They are pipeline problems — upstream decisions made without enough thought. And because the output looks plausible rather than obviously broken, teams spend weeks tuning prompts when the issue is a chunk boundary from the indexing stage.
This post is about those upstream decisions. Not the code — the choices. What chunking strategy to start with, why hybrid retrieval beats pure vector search in most production cases, and how context assembly determines whether the model actually uses what you retrieved.
🔗 Foundation posts
RAG — Retrieval Augmented Generation Explained — what RAG is and why it exists — read this first if you are new to the concept
Vector Databases Explained — embeddings, cosine similarity and the same-model rule — foundational for the indexing sections below
The two phases every RAG pipeline has
Every RAG system operates in two distinct phases. Understanding this separation is the most important structural insight before you touch a single component.
The indexing pipeline runs once, then again whenever your documents change. It takes your source documents, processes them into a searchable form, and stores the result in a vector database.
The retrieval pipeline runs on every query. It takes the user’s question, finds the relevant chunks from the index, assembles them into a prompt, and sends that to the LLM to generate a grounded answer.
| Phase | When it runs | What it does |
|---|---|---|
| Indexing | Once (then on document updates) | Load → chunk → embed → store in vector DB |
| Retrieval | Every query | Embed query → search → assemble context → generate |
The reason the two-phase model matters: decisions made in the indexing phase lock in constraints for the retrieval phase. A poor chunking decision made during indexing cannot be fixed by a better prompt. You have to reindex.
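The two phases can be sketched in a few lines. This is a toy illustration, not a production design: `embed` is a stand-in bag-of-words counter rather than a real embedding model, the "vector database" is a Python list, and every function name here is hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    # Indexing phase: runs once, then again when documents change.
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    # Retrieval phase: runs on every query, with the same embed() used at indexing.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

index = build_index([
    "Refunds are issued within 14 days of a return.",
    "Shipping takes 3 to 5 business days.",
])
print(retrieve(index, "how long do refunds take", k=1))
```

Notice that `retrieve` can only ever return what `build_index` stored. If the chunks are wrong, nothing in the retrieval phase can repair them.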
📌 Key Takeaway
Retrieval quality is determined almost entirely by indexing decisions — chunk size, embedding model, index type. The LLM at the end is only as good as what you feed it.
Document ingestion — what goes in determines what comes out
The first step of the indexing pipeline is loading your documents. This sounds trivial. It is not.
The challenge is that real enterprise documents are messy: PDFs with scanned tables, Word documents with embedded images, HTML pages with navigation menus mixed into the body text. HTML extraction pulls in boilerplate, and PDF extraction breaks table structure. Any noise that survives this stage gets embedded and stored, and your retrieval will surface it faithfully.
Two things matter most at ingestion: extracting clean text and preserving metadata.
| What to preserve | Why it matters for retrieval |
|---|---|
| Document title and source | Allows citation grounding — the model can say where it found the information |
| Section headings | Adds structural context to chunks — a chunk from ‘Returns Policy’ is more retrievable than unattributed text |
| Document date or version | Lets you filter retrieval by recency — important for policy documents and release notes |
| File type / content type | Different parsing strategies for PDFs vs HTML vs markdown — mixing them without flagging degrades quality |
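One way to carry the metadata in the table alongside each chunk is a small record type. The sketch below is an assumption about shape, not a prescribed schema: the `Chunk` fields, `filter_by_recency` and `embedding_text` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Chunk:
    text: str
    title: str         # document title, for citation grounding
    source: str        # file path or URL the chunk came from
    section: str       # nearest section heading
    doc_date: date     # document date or version date
    content_type: str  # e.g. "pdf", "html", "markdown"

def filter_by_recency(chunks: list[Chunk], not_before: date) -> list[Chunk]:
    # Keep only chunks from documents dated on or after the cutoff.
    return [c for c in chunks if c.doc_date >= not_before]

def embedding_text(chunk: Chunk) -> str:
    # Prefix the title and heading so structural context is embedded with the body.
    return f"{chunk.title} > {chunk.section}\n{chunk.text}"

chunks = [
    Chunk("Items can be returned within 30 days.", "Customer Policy",
          "policy_v2.pdf", "Returns Policy", date(2024, 5, 1), "pdf"),
    Chunk("Items can be returned within 14 days.", "Customer Policy",
          "policy_v1.pdf", "Returns Policy", date(2021, 1, 1), "pdf"),
]
current = filter_by_recency(chunks, date(2023, 1, 1))
print([c.source for c in current])
```

Most vector stores accept a metadata payload per vector, so a record like this usually maps onto whatever filter syntax your store provides.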
💡 Practical Tip
Run a sample of 20–30 documents through your extraction step and read the raw output before indexing anything at scale. You will almost always find encoding errors, merged table cells, or navigation fragments that need cleaning. Fix them at source, not downstream.
Chunking — the decision most people get wrong
Chunking is the step that breaks your documents into the segments that actually get embedded and stored. It is the highest-leverage decision in the entire indexing pipeline — and the most commonly underestimated one.
The core tension is this: chunks that are too small lose context. Chunks that are too large dilute relevance. A chunk containing three unrelated topics will embed somewhere between all three, so queries about any one of them will not retrieve it reliably.
The three main strategies
| Strategy | How it works | When to use it |
|---|---|---|
| Fixed-size | Split at a set token count, with optional overlap between adjacent chunks | Fast to implement, good baseline. Start here to benchmark. |
| Recursive | Split at natural boundaries (paragraphs, then sentences, then words) until chunks fit the target size | Better boundary preservation than fixed-size. Common default in most frameworks. |
| Semantic | Group sentences by embedding similarity — split where topic shifts rather than at token counts | Higher retrieval quality for varied content, but slower at indexing time. Use after benchmarking shows the extra cost is worth it. |
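The fixed-size strategy is simple enough to write by hand, which is part of why it makes a good baseline. A minimal sketch, assuming the text has already been tokenised into a list (the function name and default values are illustrative, not taken from any framework):

```python
def fixed_size_chunks(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

tokens = [str(i) for i in range(10)]
for chunk in fixed_size_chunks(tokens, size=4, overlap=1):
    print(chunk)
```

With `size=4` and `overlap=1`, each window starts on the last token of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.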
For most production pipelines, recursive chunking at 400–512 tokens with 10–20% overlap is a sound starting point. It is the approach that balances quality with simplicity across the widest range of document types.
That said: there is no universal best chunk size. The right number depends on your corpus, your query patterns and your embedding model. Treat the starting point as a baseline to measure against, not a final answer.
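To make the recursive strategy concrete, here is a simplified sketch of the idea. It is not the implementation any particular framework uses. It assumes whitespace-separated words as a crude proxy for tokens, it omits overlap for brevity, and it drops the separator wherever a chunk boundary falls. `SEPARATORS`, `n_tokens` and `recursive_split` are hypothetical names.

```python
SEPARATORS = ["\n\n", "\n", ". ", " "]  # paragraphs, lines, sentences, words

def n_tokens(text: str) -> int:
    # Crude proxy: whitespace-separated words. Swap in a real tokenizer.
    return len(text.split())

def recursive_split(text: str, max_tokens: int, separators: list[str] = SEPARATORS) -> list[str]:
    if n_tokens(text) <= max_tokens or not separators:
        return [text.strip()] if text.strip() else []
    sep, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if n_tokens(candidate) <= max_tokens:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current.strip())
            current = ""
        if n_tokens(piece) <= max_tokens:
            current = piece
        else:
            # Piece is still too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, max_tokens, finer))
    if current:
        chunks.append(current.strip())
    return chunks

text = "One two three. Four five six.\n\nSeven eight nine ten."
print(recursive_split(text, max_tokens=4))
```

The second paragraph fits the limit and stays whole, while the first is too long and falls back to sentence boundaries. That fallback order is the whole point of the strategy.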
⚠️ Warning
Do not index everything and then tune chunk size. Reindexing is expensive — both computationally and in terms of the embedding API calls it requires. Run small-scale retrieval tests across a representative query set before committing to a chunking strategy at full scale.
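A small-scale retrieval test needs very little machinery. The sketch below assumes you have hand-labelled, for each test query, the IDs of the chunks that should come back. It uses recall@k as the metric, which is one reasonable choice among several, and the function names are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def mean_recall_at_k(results: dict[str, list[str]], labels: dict[str, set[str]], k: int) -> float:
    """Average recall@k over every labelled query."""
    scores = [recall_at_k(results[q], labels[q], k) for q in labels]
    return sum(scores) / len(scores) if scores else 0.0

# Ranked chunk IDs returned by the pipeline under test, per query.
results = {"q1": ["c3", "c1", "c9"], "q2": ["c4", "c5", "c6"]}
# Hand-labelled relevant chunk IDs, per query.
labels = {"q1": {"c1"}, "q2": {"c4", "c7"}}

print(mean_recall_at_k(results, labels, k=2))
```

Run the same labelled query set against each candidate chunking configuration on a sample of the corpus, and compare the scores before indexing at full scale.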
Embedding and indexing — the same-model rule
Once your chunks are ready, each one gets converted into an embedding — a numerical vector that represents its meaning — and stored in a vector database. The mechanics of this are covered in depth in the Vector Databases Explained post linked above.
One rule overrides everything else at this stage: use the same embedding model for indexing documents and embedding queries at search time. Vectors from different models are not comparable. Mixing them produces retrieval results that are essentially random — and failures that are very hard to diagnose.
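Because mismatched models fail silently, it can help to make the rule enforceable in code. The sketch below is one possible guard, assuming you record the model name when the index is created. `VectorIndex` is a hypothetical in-memory class, not the API of any real vector database, and the model names in the usage lines are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class VectorIndex:
    model_name: str  # recorded once, when the index is created
    dimension: int
    vectors: list[list[float]] = field(default_factory=list)

    def _check(self, vector: list[float], model_name: str) -> None:
        # Matching dimensions do not prove matching models, so compare names too.
        if model_name != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, got vector from {model_name!r}"
            )
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} dimensions, got {len(vector)}")

    def add(self, vector: list[float], model_name: str) -> None:
        self._check(vector, model_name)
        self.vectors.append(vector)

    def validate_query(self, vector: list[float], model_name: str) -> None:
        self._check(vector, model_name)

index = VectorIndex(model_name="model-a", dimension=3)
index.add([0.1, 0.2, 0.3], "model-a")
try:
    index.validate_query([0.3, 0.2, 0.1], "model-b")
except ValueError as err:
    print(err)
```

Turning a silent quality problem into a loud error at query time is usually a good trade.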
Your choice of embedding model also determines the quality ceiling for your entire retrieval system. A weaker embedding model will produce lower-quality clusters in vector space, regardless of how well you chunk or retrieve.
✅ Best Practice
Pick your embedding model before you index a single document. Changing it later means reindexing everything. For most enterprise use cases, OpenAI text-embedding-3-large or Cohere embed-english-v3.0 are strong defaults. If you are building on SAP BTP, SAP AI Core provides access to embedding models via the Generative AI Hub — no external dependency required.
Retrieval — more than just nearest neighbour
Most introductory RAG tutorials use pure vector search: embed the query, find the N nearest vectors, return the chunks. That works fine in demos. It has real weaknesses in production.
The problem is that vector search finds semantic similarity — but some queries need lexical precision. A user searching for a specific product code, a regulation reference number, or a person’s name is not asking for semantic closeness. They want an exact match. Vector search can miss these entirely while confidently returning something semantically plausible but wrong.
Hybrid retrieval — the production default
Hybrid retrieval combines vector search with BM25, a keyword-based ranking function that scores documents by term frequency, weighted by how rare each term is across the corpus and normalised for document length. Run both in parallel, then merge the ranked results using Reciprocal Rank Fusion (RRF), which combines rankings by rank position alone and so needs no score normalisation.
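To show what the keyword side is doing, here is a compact BM25 scorer. It is a sketch under assumptions: tokenisation is plain lowercase whitespace splitting, `k1=1.5` and `b=0.75` are commonly used defaults rather than tuned values, and the IDF uses the variant with 1 added inside the logarithm so scores never go negative. In production you would normally rely on the BM25 built into your search engine or library.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n if n else 0.0
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "error code E-4012 in billing module",
    "general billing questions and answers",
    "shipping policy overview",
]
print(bm25_scores("E-4012", docs))
```

Only the first document contains the identifier, so it is the only one with a non-zero score. That is the exact-match behaviour vector search cannot guarantee.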
Hybrid retrieval typically outperforms either method alone across document types, though the margin depends on your corpus and queries. For enterprise content (policy documents, technical manuals, financial records) the improvement tends to be significant because these documents contain precise terminology that lexical matching handles better than semantic search.
| Retrieval method | What it does well / where it falls short |
|---|---|
| Pure vector search | Handles paraphrasing and semantic variation well. Misses exact-term queries — product codes, names, specific identifiers. |
| BM25 (keyword) | Precise on exact terms and domain jargon. Misses synonyms and paraphrased queries. |
| Hybrid (BM25 + vector via RRF) | Covers both. Recommended minimum baseline for production pipelines. |
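The fusion step is short. In the sketch below, each document scores 1 / (k + rank) in every ranking where it appears, and the scores are summed. The constant `k=60` is the value conventionally used with RRF, and the function name and document IDs are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the position matters, never the raw retrieval score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

vector_ranking = ["a", "b", "c"]  # from vector search
bm25_ranking = ["c", "a", "d"]    # from keyword search
print(reciprocal_rank_fusion([vector_ranking, bm25_ranking]))
# ['a', 'c', 'b', 'd']
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever still survives lower in the list. Because only ranks are used, it does not matter that BM25 scores and cosine similarities sit on different scales.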
Reranking — precision after recall
After retrieval returns its top results, a reranker re-scores them using a cross-encoder model — one that processes the query and each candidate chunk together, rather than separately. This produces much more accurate relevance scores than the initial retrieval pass.
The pattern is: retrieve broadly (top 50–100 results), rerank precisely (return top 5–10). Retrieval optimises for recall. Reranking optimises for precision. The combination outperforms either stage alone by a wide margin.
Cohere rerank-english-v3.0 and cross-encoder models from Sentence Transformers are the standard tools. Reranking adds latency because every query-chunk pair needs its own model pass, but for candidate sets of around 100 the extra time is usually acceptable for interactive applications. Measure it on your own hardware before committing.
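The retrieve-broadly, rerank-precisely pattern looks like this in outline. The scoring function is passed in, so a real cross-encoder can be swapped for the toy one. `overlap_score` is only a placeholder that counts shared words and is not a substitute for a trained reranker. Both function names are hypothetical.

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Score every (query, chunk) pair together and keep the best top_n chunks."""
    scored = [(score(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

def overlap_score(query: str, chunk: str) -> float:
    # Placeholder scorer: share of query words that also occur in the chunk.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

candidates = [  # imagine 50 to 100 of these from the retrieval stage
    "Shipping takes 3 to 5 business days.",
    "Refund requests are processed within 14 days.",
    "Refund policy for damaged items and processed returns.",
]
print(rerank("when are refund requests processed", candidates, overlap_score, top_n=2))
```

The cost structure is visible in the first line of `rerank`: one scoring call per candidate. That is why the candidate set is capped before this stage rather than reranking the whole corpus.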
📝 Note
Reranking is optional for simple pipelines, but becomes close to essential as document volume grows. The more noise your retrieval step returns, the more a reranker earns its place. If your context window is filling up with marginally relevant chunks, add a reranker before changing anything else.
💡 SAP context
If you are building on SAP BTP, the reference architecture for enterprise RAG uses SAP HANA Cloud as the vector store, SAP AI Core for model access via the Generative AI Hub, and CAP as the application layer. HANA Cloud’s Vector Engine handles both vector similarity search and structured SQL queries in one system — no separate vector database needed. The same-model rule applies: use the same embedding model consistently through AI Core for both indexing and query embedding.
Generation — the prompt that wraps the context
Once retrieval returns its results, the final step is assembling those chunks into the prompt you send to the LLM. This is called context assembly — and it has more impact on answer quality than most teams realise.
The naive approach is to concatenate the top N chunks and append the user question. That works. It also wastes context window space with redundant information, has no clear citation structure, and gives the model no guidance on what to do with the retrieved content.
A more deliberate approach specifies the structure: what the retrieved content is, where it came from, and what you want the model to do with it.
| Context assembly element | Why it matters |
|---|---|
| Source attribution per chunk | Enables citation grounding — the model can reference which document it drew from, reducing hallucination and building trust in the answer |
| Explicit instruction on how to use context | Without it, models sometimes ignore retrieved content in favour of parametric knowledge. Telling the model to answer only from the provided context constrains the output appropriately. |
| Handling ‘not found’ cases | Instruct the model to say it does not know rather than speculate when retrieved context does not contain the answer. This is the single most important guardrail in any RAG prompt. |
| Context ordering | Place the most relevant chunks where the model attends best: typically at the very start of the context or at the end, closest to the question. Material buried in the middle of a long context is more likely to be overlooked. |
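The four elements in the table can be combined in one assembly function. This is a sketch of one possible prompt layout, not a recommended template. The wording of the instructions, the `[n]` citation format and the `build_prompt` name are all assumptions. It expects chunks ordered most relevant first, and it places the most relevant chunk last so that it sits next to the question.

```python
def build_prompt(question: str, chunks: list[dict[str, str]]) -> str:
    """Assemble a grounded prompt from chunks ordered most relevant first."""
    numbered = list(enumerate(chunks, start=1))
    # Reverse so the most relevant chunk ends up closest to the question.
    blocks = [
        f"[{n}] Source: {chunk['source']}\n{chunk['text']}"
        for n, chunk in reversed(numbered)
    ]
    instructions = (
        "Answer using only the numbered context below. "
        "Cite the sources you used as [n]. "
        "If the context does not contain the answer, say you do not know."
    )
    context = "\n\n".join(blocks)
    return f"{instructions}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_prompt(
    "How long do I have to return an item?",
    [
        {"source": "Customer Policy, Returns Policy", "text": "Items can be returned within 30 days."},
        {"source": "FAQ, Shipping", "text": "Shipping takes 3 to 5 business days."},
    ],
)
print(prompt)
```

Each chunk keeps its source label, the instruction restricts the model to the context, and the "do not know" clause covers the not-found case.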
⚠️ Warning
Context window limits are real. A 128K token window sounds vast until you realise that 50 retrieved chunks at 512 tokens each use 25,600 tokens, roughly a fifth of the window, before you have written a single instruction. Be deliberate about how many chunks you pass. The reranker exists for exactly this reason: cut aggressively before assembly, not after.
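The arithmetic is worth automating so the budget gets checked on every request. A small sketch, assuming you already know each chunk's token count from your tokenizer. The 4,096-token context budget is an arbitrary example value, and `chunks_that_fit` is a hypothetical helper.

```python
def chunks_that_fit(chunk_token_counts: list[int], budget: int) -> int:
    """Return how many chunks, taken in ranked order, fit within the token budget."""
    used = 0
    for count, tokens in enumerate(chunk_token_counts):
        if used + tokens > budget:
            return count
        used += tokens
    return len(chunk_token_counts)

retrieved = [512] * 50                   # 50 chunks of 512 tokens each
print(sum(retrieved))                    # 25600
print(chunks_that_fit(retrieved, 4096))  # 8
```

With a 4,096-token budget for context, only the first 8 of the 50 chunks get through. That makes the ranking at the top of the list the thing that matters most.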
The decisions at a glance
| Decision | Recommended starting point | When to revisit |
|---|---|---|
| Chunking strategy | Recursive at 400–512 tokens, 10–20% overlap | When retrieval quality benchmarks show boundary-related misses |
| Chunk size | 400–512 tokens for mixed query types; smaller (128–256) for factoid lookups | After measuring against your actual query distribution |
| Embedding model | text-embedding-3-large or embed-english-v3.0 for production; all-MiniLM-L6-v2 for low-cost local | Only when switching — remember to reindex everything |
| Retrieval method | Hybrid (BM25 + vector via RRF) as minimum baseline | Pure vector only if your corpus has no exact-term queries |
| Reranker | Add when retrieval returns too much noise; rerank-english-v3.0 or cross-encoder | Optional for small corpora; near-essential above 10K chunks |
| Context assembly | Source-attributed chunks + explicit ‘answer from context only’ instruction + ‘say if not found’ guardrail | When the model ignores retrieved content or hallucinates despite correct retrieval |
| Context volume | 5–10 reranked chunks; verify total token count before sending | When answers become vague — often a signal of too many low-relevance chunks, not too few |
What to take away
The RAG pipeline is not one decision — it is six or seven decisions made in sequence, each one constraining the next. A mistake at chunking cannot be fixed by better retrieval. A retrieval strategy that only uses vector search will miss exact-term queries regardless of how sophisticated the prompt is. The LLM is the last piece, not the most important one.
The teams who build reliable RAG systems are not the ones with the most sophisticated models. They are the ones who measured retrieval quality at each stage before moving to the next — who tested chunking on a representative sample, confirmed hybrid retrieval outperformed pure vector on their corpus, and verified that the model was actually using the retrieved context rather than its parametric memory.
Start with the two-phase model as your mental map. Build the indexing pipeline first and validate it independently before connecting retrieval. Add complexity — semantic chunking, reranking, metadata filtering — only where benchmarks show it improves results. Most production failures come from adding sophistication before the basics are solid.
🔗 Related posts on this site
RAG — Retrieval Augmented Generation Explained — the conceptual foundation — what RAG is, why it was introduced and how it fits into the AI landscape
Vector Databases Explained — the indexing layer in depth — embeddings, cosine similarity and choosing the right vector store
Fine-Tuning vs Prompt Engineering vs RAG — when RAG is the right approach versus fine-tuning or prompt engineering alone
AI Hallucinations — Why They Happen — hallucinations are what RAG is designed to reduce — this post explains the root cause
Published on rakeshnarayan.com — Articles
URL: https://rakeshnarayan.com/articles/building-a-rag-pipeline-the-decisions-that-actually-matter/



