Building a RAG Pipeline — The Decisions That Actually Matter

February 3, 2026 · Updated March 25, 2026 · 10 min read

Most RAG pipelines fail before the LLM gets involved. The chunking is wrong, the retrieval returns noise, or the context window is stuffed with irrelevant text. By the time the model generates an answer, the damage is done.

The frustrating part is that none of those failures are LLM problems. They are pipeline problems — upstream decisions made without enough thought. And because the output looks plausible rather than obviously broken, teams spend weeks tuning prompts when the issue is a chunk boundary from the indexing stage.

This post is about those upstream decisions. Not the code — the choices. What chunking strategy to start with, why hybrid retrieval beats pure vector search in most production cases, and how context assembly determines whether the model actually uses what you retrieved.

🔗 Foundation posts

RAG — Retrieval Augmented Generation Explained — what RAG is and why it exists — read this first if you are new to the concept
Vector Databases Explained — embeddings, cosine similarity and the same-model rule — foundational for the indexing sections below

The two phases every RAG pipeline has

Every RAG system operates in two distinct phases. Understanding this separation is the most important structural insight before you touch a single component.

The indexing pipeline runs once, or periodically when your documents change. It takes your source documents, processes them into a form the retriever can search, and stores the result in a vector database.

The retrieval pipeline runs on every query. It takes the user’s question, finds the relevant chunks from the index, assembles them into a prompt, and sends that to the LLM to generate a grounded answer.

Phase	When it runs	What it does
Indexing	Once (then on document updates)	Load → chunk → embed → store in vector DB
Retrieval	Every query	Embed query → search → assemble context → generate

The reason the two-phase model matters: decisions made in the indexing phase lock in constraints for the retrieval phase. A poor chunking decision made during indexing cannot be fixed by a better prompt. You have to reindex.
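To make the separation concrete, here is a toy sketch of the two phases. Everything in it is an assumption for illustration: `embed` is a bag-of-words stand-in for a real embedding model, the "vector database" is a plain list, and the function names are hypothetical.

```python
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def build_index(documents: list[str], chunk_words: int) -> list[tuple[str, Counter]]:
    """Indexing phase: load -> chunk -> embed -> store. Runs when documents change."""
    index = []
    for doc in documents:
        words = doc.split()
        for start in range(0, len(words), chunk_words):
            chunk = " ".join(words[start:start + chunk_words])
            index.append((chunk, embed(chunk)))
    return index


def retrieve(question: str, index: list[tuple[str, Counter]], top_n: int = 1) -> list[str]:
    """Retrieval phase: embed query -> search -> return chunks. Runs on every query."""
    q = embed(question)
    # Score each stored chunk by word overlap with the query (toy similarity).
    scored = sorted(index, key=lambda item: sum((q & item[1]).values()), reverse=True)
    return [chunk for chunk, _ in scored[:top_n]]


docs = [
    "Returns are accepted within 30 days of purchase",
    "Express delivery costs 9 euros per order",
]
index = build_index(docs, chunk_words=4)   # chunk size is fixed here
top = retrieve("express delivery costs", index)
```

Note that `chunk_words=4` splits "within 30 days" across two chunks. Nothing in `retrieve` can undo that boundary; the only fix is to call `build_index` again with different settings.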

📌 Key Takeaway

Indexing decisions (chunk size, embedding model, index type) set the ceiling on retrieval quality. The LLM at the end is only as good as what you feed it.

Document ingestion — what goes in determines what comes out

The first step of the indexing pipeline is loading your documents. This sounds trivial. It is not.

The challenge is that real enterprise documents are messy. PDFs with scanned tables. Word documents with embedded images. HTML pages with navigation menus mixed into the body text. HTML extraction gives you boilerplate. PDF extraction breaks table structure. If you feed the model noise at this stage, it embeds and stores that noise — and your retrieval will surface it faithfully.

Two things matter most at ingestion: extracting clean text and preserving metadata.

What to preserve	Why it matters for retrieval
Document title and source	Allows citation grounding — the model can say where it found the information
Section headings	Adds structural context to chunks — a chunk from ‘Returns Policy’ is more retrievable than unattributed text
Document date or version	Lets you filter retrieval by recency — important for policy documents and release notes
File type / content type	Different parsing strategies for PDFs vs HTML vs markdown — mixing them without flagging degrades quality
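One way to carry the metadata in the table above through the pipeline is to attach it to every chunk. The sketch below is a minimal illustration, not a prescribed schema: the `Chunk` class, its field names and the file paths are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Chunk:
    """A chunk of text plus the metadata worth keeping from ingestion."""
    text: str
    title: str         # document title, used for citations
    source: str        # path or URL of the source document
    section: str       # nearest section heading
    doc_date: date     # document date or version date
    content_type: str  # e.g. "pdf", "html", "markdown"


def filter_recent(chunks: list[Chunk], not_before: date) -> list[Chunk]:
    """Keep only chunks whose source document is dated on or after not_before."""
    return [c for c in chunks if c.doc_date >= not_before]


chunks = [
    Chunk("Returns are accepted within 30 days.", "Returns Policy",
          "policies/returns.pdf", "Returns Policy", date(2025, 11, 1), "pdf"),
    Chunk("Returns are accepted within 14 days.", "Returns Policy (old)",
          "policies/returns-2023.pdf", "Returns Policy", date(2023, 2, 1), "pdf"),
]

recent = filter_recent(chunks, date(2025, 1, 1))
```

With the date preserved, the outdated 14-day policy can be excluded before it ever reaches the model.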

💡 Practical Tip

Run a sample of 20–30 documents through your extraction step and read the raw output before indexing anything at scale. You will almost always find encoding errors, merged table cells, or navigation fragments that need cleaning. Fix them at source, not downstream.
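Reading the raw output is the real check, but a rough script can help you decide which documents to read first. The heuristics below are assumptions chosen for illustration (the patterns and thresholds are hypothetical); tune them to your own corpus.

```python
import re

# Hypothetical list of words that often appear in navigation fragments.
NAV_PATTERN = re.compile(r"\b(home|login|sign in|cookie|subscribe|menu)\b", re.IGNORECASE)
# Lowercase run followed by two capitalised words with no spaces, e.g. "QuantityPriceTotal".
RUN_TOGETHER = re.compile(r"[a-z]{2,}[A-Z][a-z]+[A-Z][a-z]+")


def extraction_warnings(text: str) -> list[str]:
    """Return human-readable warnings for common extraction problems."""
    warnings = []
    if "\ufffd" in text:
        warnings.append("replacement character found (likely encoding error)")
    if RUN_TOGETHER.search(text):
        warnings.append("words run together (possible merged cells or lost whitespace)")
    lines = [ln for ln in text.splitlines() if ln.strip()]
    short_nav = [ln for ln in lines if len(ln.split()) <= 3 and NAV_PATTERN.search(ln)]
    if len(short_nav) >= 2:
        warnings.append("navigation fragments found")
    return warnings


sample = "Home\nLogin\nOur returns policy allows refunds within 30 days."
found = extraction_warnings(sample)
```

Documents with warnings go to the top of the manual review pile; a clean result does not prove the extraction is good.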

Chunking — the decision most people get wrong

Chunking is the step that breaks your documents into the segments that actually get embedded and stored. It is the highest-leverage decision in the entire indexing pipeline — and the most commonly underestimated one.

The core tension is this: chunks that are too small lose context. Chunks that are too large dilute relevance. A chunk containing three unrelated topics will embed somewhere between all three, so it will not be retrieved reliably for any of them.

The three main strategies

Strategy	How it works	When to use it
Fixed-size	Split at a set token count, with optional overlap between adjacent chunks	Fast to implement, good baseline. Start here to benchmark.
Recursive	Split at natural boundaries (paragraphs, then sentences, then words) until chunks fit the target size	Better boundary preservation than fixed-size. Common default in most frameworks.
Semantic	Group sentences by embedding similarity — split where topic shifts rather than at token counts	Higher retrieval quality for varied content, but slower at indexing time. Use after benchmarking shows the extra cost is worth it.
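The fixed-size strategy in the table is simple enough to sketch in a few lines. This version assumes the text is already tokenised; in a real pipeline the tokens would come from your embedding model's tokenizer rather than the whitespace split used here.

```python
def fixed_size_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


tokens = "t0 t1 t2 t3 t4 t5 t6 t7 t8 t9".split()
windows = fixed_size_chunks(tokens, size=4, overlap=1)
```

Here each window repeats the last token of the previous one, so a sentence that straddles a boundary appears at least partly in both chunks.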

For most production pipelines, recursive chunking at 400–512 tokens with 10–20% overlap is a sound starting point. It is the approach that balances quality with simplicity across the widest range of document types.

That said: there is no universal best chunk size. The right number depends on your corpus, your query patterns and your embedding model. Treat the starting point as a baseline to measure against, not a final answer.
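For comparison, here is a minimal sketch of the recursive idea: try paragraph boundaries first, then sentences, then words, and merge small neighbours back together up to the target size. It counts whitespace-separated words instead of model tokens and leaves out overlap, both simplifications; framework implementations differ in the details.

```python
import re

# Coarse-to-fine boundaries: paragraphs, then sentences, then words.
SPLITTERS = [
    (re.compile(r"\n\s*\n"), "\n\n"),
    (re.compile(r"(?<=[.!?])\s+"), " "),
    (re.compile(r"\s+"), " "),
]


def word_count(text: str) -> int:
    return len(text.split())


def recursive_split(text: str, max_words: int, level: int = 0) -> list[str]:
    """Split at the coarsest boundary that keeps every chunk within max_words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    text = text.strip()
    if not text:
        return []
    if word_count(text) <= max_words or level >= len(SPLITTERS):
        return [text]
    pattern, joiner = SPLITTERS[level]
    chunks, buffer = [], ""
    for piece in pattern.split(text):
        if not piece.strip():
            continue
        candidate = f"{buffer}{joiner}{piece}" if buffer else piece
        if word_count(candidate) <= max_words:
            buffer = candidate  # keep merging small neighbours
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if word_count(piece) <= max_words:
            buffer = piece
        else:
            # This piece is still too large: retry it at the next finer boundary.
            chunks.extend(recursive_split(piece, max_words, level + 1))
    if buffer:
        chunks.append(buffer)
    return chunks


policy = (
    "Returns are accepted within 30 days. Refunds go to the original payment method.\n\n"
    "Shipping is free over 50 euros. Express delivery costs extra."
)
parts = recursive_split(policy, max_words=12)
```

The first paragraph is too long, so it is split into its two sentences; the second paragraph fits and stays whole.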

⚠️ Warning

Do not index everything and then tune chunk size. Reindexing is expensive — both computationally and in terms of the embedding API calls it requires. Run small-scale retrieval tests across a representative query set before committing to a chunking strategy at full scale.
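A small-scale retrieval test can be as simple as a hit rate over a labelled query set. The sketch below assumes you have, for each test query, the IDs of the chunks that should be retrieved and the ranked IDs your pipeline actually returned; the queries and chunk IDs shown are made up for illustration.

```python
def hit_rate_at_k(results: dict[str, list[str]], relevant: dict[str, set[str]], k: int) -> float:
    """Fraction of labelled queries with at least one relevant chunk ID in the top k results."""
    if not relevant:
        raise ValueError("need at least one labelled query")
    hits = sum(
        1
        for query, expected in relevant.items()
        if expected & set(results.get(query, [])[:k])
    )
    return hits / len(relevant)


# Hypothetical labels: which chunk should answer each query.
relevant = {
    "refund window": {"c1"},
    "express shipping cost": {"c7"},
    "warranty length": {"c9"},
}
# Hypothetical ranked output from one chunking configuration.
results_a = {
    "refund window": ["c1", "c4", "c2"],
    "express shipping cost": ["c3", "c7", "c5"],
    "warranty length": ["c2", "c4", "c6"],
}

hit_at_3 = hit_rate_at_k(results_a, relevant, k=3)
hit_at_1 = hit_rate_at_k(results_a, relevant, k=1)
```

Run the same query set against each candidate chunking configuration on a sample of the corpus and compare the numbers before indexing everything.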

Embedding and indexing — the same-model rule

Once your chunks are ready, each one gets converted into an embedding — a numerical vector that represents its meaning — and stored in a vector database. The mechanics of this are covered in depth in the Vector Databases Explained post linked above.

One rule overrides everything else at this stage: use the same embedding model for indexing documents and embedding queries at search time. Vectors from different models are not comparable. Mixing them produces retrieval results that are essentially random — and failures that are very hard to diagnose.
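A cheap way to enforce the rule is to record the model name alongside the index and refuse vectors from any other model. The class below is a hypothetical in-memory illustration, not the API of any particular vector database; most real stores let you keep the model name as index-level metadata.

```python
from dataclasses import dataclass, field


@dataclass
class GuardedIndex:
    """Hypothetical in-memory index that remembers which model built it."""
    embedding_model: str
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def _check(self, model: str) -> None:
        if model != self.embedding_model:
            raise ValueError(
                f"index was built with {self.embedding_model!r}, got vectors from {model!r}"
            )

    def add(self, chunk_id: str, vector: list[float], model: str) -> None:
        self._check(model)
        self.vectors[chunk_id] = vector

    def validate_query(self, vector: list[float], model: str) -> list[float]:
        """Return the query vector unchanged if it came from the indexing model."""
        self._check(model)
        return vector


guarded = GuardedIndex(embedding_model="text-embedding-3-large")
guarded.add("c1", [0.1, 0.2, 0.3], model="text-embedding-3-large")

try:
    guarded.validate_query([0.4, 0.5], model="all-MiniLM-L6-v2")
    mismatch_caught = False
except ValueError:
    mismatch_caught = True
```

Failing loudly at query time is much easier to debug than silently returning near-random neighbours.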

Your choice of embedding model also determines the quality ceiling for your entire retrieval system. A weaker embedding model will produce lower-quality clusters in vector space, regardless of how well you chunk or retrieve.

✅ Best Practice

Pick your embedding model before you index a single document. Changing it later means reindexing everything. For most enterprise use cases, OpenAI text-embedding-3-large and Cohere embed-english-v3.0 are strong defaults. If you are building on SAP BTP, SAP AI Core provides access to embedding models via the Generative AI Hub, with no external dependency required.

Retrieval — more than just nearest neighbour

Most introductory RAG tutorials use pure vector search: embed the query, find the N nearest vectors, return the chunks. That works fine in demos. It has real weaknesses in production.
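For reference, the tutorial version of pure vector search looks roughly like this. The three-dimensional vectors are invented for illustration, and the search is a brute-force scan; a real vector database would use an approximate nearest-neighbour index rather than comparing against every stored vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query_vec: list[float], index: dict[str, list[float]], n: int) -> list[tuple[str, float]]:
    """Return the n chunk IDs most similar to the query, best first."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Made-up embeddings standing in for real model output.
toy_index = {
    "returns": [0.9, 0.1, 0.0],
    "shipping": [0.1, 0.9, 0.0],
    "warranty": [0.0, 0.2, 0.9],
}
top_two = nearest([0.8, 0.2, 0.0], toy_index, n=2)
```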

The problem is that vector search finds semantic similarity — but some queries need lexical precision. A user searching for a specific product code, a regulation reference number, or a person’s name is not asking for semantic closeness. They want an exact match. Vector search can miss these entirely while confidently returning something semantically plausible but wrong.

Hybrid retrieval — the production default

Hybrid retrieval combines vector search with BM25, a keyword-based ranking function that scores documents by how often the query terms appear in them, weighted by how rare each term is across the corpus and normalised for document length. Run both in parallel, then merge the ranked results using Reciprocal Rank Fusion (RRF), an algorithm that combines rankings without requiring score normalisation.
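RRF is short enough to sketch in full. Each chunk earns 1 / (k + rank) from every ranking it appears in, and the sums decide the final order. The constant k = 60 used here is a commonly used default rather than something this post prescribes, and the chunk IDs are invented.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of IDs into one, using only rank positions."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_ranking = ["c2", "c5", "c1"]  # hypothetical output of vector search
bm25_ranking = ["c1", "c2", "c9"]    # hypothetical output of BM25
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
```

Because only rank positions are used, it does not matter that cosine similarities and BM25 scores live on completely different scales. Chunks that both retrievers rank highly (c2 and c1 here) rise to the top.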

Hybrid retrieval typically outperforms either method alone across document types. For enterprise content such as policy documents, technical manuals and financial records, the improvement is often significant because these documents contain precise terminology that lexical matching handles better than semantic search.

Retrieval method	What it does well / where it falls short
Pure vector search	Handles paraphrasing and semantic variation well. Misses exact-term queries — product codes, names, specific identifiers.
BM25 (keyword)	Precise on exact terms and domain jargon. Misses synonyms and paraphrased queries.
Hybrid (BM25 + vector via RRF)	Covers both. Recommended minimum baseline for production pipelines.
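To see why BM25 is precise on exact terms, here is a compact scorer. It is a sketch under stated assumptions: whitespace tokenisation, the common parameter values k1 = 1.5 and b = 0.75, and the non-negative IDF variant ln(1 + (N - n + 0.5) / (n + 0.5)). Production systems normally rely on a search engine's built-in implementation instead.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents containing each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


catalogue = [
    "Replacement filter for model XR-200 vacuum",
    "Replacement filter for model XR-250 vacuum",
    "How to clean a vacuum filter",
]
code_scores = bm25_scores("XR-200", catalogue)
```

Only the document containing the exact product code gets a non-zero score. An embedding model could easily place XR-200 and XR-250 close together, which is the failure mode the table describes.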

Reranking — precision after recall

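After retrieval returns its top results, a reranker re-scores them using a cross-encoder model, one that processes the query and each candidate chunk together rather than separately. Because the model sees both texts at once, it produces much more accurate relevance scores than the initial retrieval pass. The cost is one model pass per candidate, which is why reranking is applied to a shortlist rather than the whole index.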

The pattern is: retrieve broadly (top 50–100 results), rerank precisely (return top 5–10). Retrieval optimises for recall. Reranking optimises for precision. The combination outperforms either stage alone by a wide margin.
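The retrieve-then-rerank pattern is mostly plumbing, as the sketch below shows. The scoring function is passed in as a parameter; `toy_overlap_score` is a deliberately crude stand-in used so the example runs without a model, and it is not a cross-encoder. In practice `score_pair` would wrap a call to your reranking model.

```python
from typing import Callable


def retrieve_then_rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_k: int = 5,
) -> list[str]:
    """Re-score a broad candidate list against the query and keep the best top_k."""
    reranked = sorted(candidates, key=lambda chunk: score_pair(query, chunk), reverse=True)
    return reranked[:top_k]


def toy_overlap_score(query: str, chunk: str) -> float:
    """Stand-in scorer: share of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


candidates = [
    "Our office is closed on Sundays.",
    "Standard delivery is free over 50 euros.",
    "Express delivery costs 9 euros.",
]
best = retrieve_then_rerank("express delivery costs", candidates, toy_overlap_score, top_k=2)
```

The first stage can afford to be generous because the second stage discards most of what it returns.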

Cohere rerank-english-v3.0 and cross-encoder models from Sentence Transformers are common choices. Reranking adds latency that grows with the number of candidates. For top-100 candidate sets it is usually acceptable for interactive applications, but measure it on your own hardware.

📝 Note

Reranking is optional for simple pipelines, but becomes close to essential as document volume grows. The more noise your retrieval step returns, the more a reranker earns its place. If your context window is filling up with marginally relevant chunks, add a reranker before changing anything else.

💡 SAP context

If you are building on SAP BTP, the reference architecture for enterprise RAG uses SAP HANA Cloud as the vector store, SAP AI Core for model access via the Generative AI Hub, and CAP as the application layer. HANA Cloud’s Vector Engine handles both vector similarity search and structured SQL queries in one system — no separate vector database needed. The same-model rule applies: use the same embedding model consistently through AI Core for both indexing and query embedding.

Generation — the prompt that wraps the context

Once retrieval returns its results, the final step is assembling those chunks into the prompt you send to the LLM. This is called context assembly — and it has more impact on answer quality than most teams realise.

The naive approach is to concatenate the top N chunks and append the user question. That works. It also wastes context window space with redundant information, has no clear citation structure, and gives the model no guidance on what to do with the retrieved content.

A more deliberate approach specifies the structure: what the retrieved content is, where it came from, and what you want the model to do with it.

Context assembly element	Why it matters
Source attribution per chunk	Enables citation grounding — the model can reference which document it drew from, reducing hallucination and building trust in the answer
Explicit instruction on how to use context	Without it, models sometimes ignore retrieved content in favour of parametric knowledge. Telling the model to answer only from the provided context constrains the output appropriately.
Handling ‘not found’ cases	Instruct the model to say it does not know rather than speculate when retrieved context does not contain the answer. This is the single most important guardrail in any RAG prompt.
Context ordering	More relevant chunks placed closer to the question tend to produce better answers — the model attends more strongly to nearby context.
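The four elements in the table can be combined in one small prompt builder. The wording of the instructions, the function name and the chunk dictionary keys are all assumptions for illustration; adapt them to your model and use case.

```python
def build_prompt(question: str, chunks: list[dict[str, str]]) -> str:
    """Assemble a RAG prompt from chunks given most-relevant-first.

    Each chunk is a dict with 'source' and 'text' keys (hypothetical shape).
    """
    # Reverse the list so the most relevant chunk ends up closest to the question.
    ordered = list(reversed(chunks))
    blocks = [
        f"[{number}] Source: {chunk['source']}\n{chunk['text']}"
        for number, chunk in enumerate(ordered, start=1)
    ]
    instructions = (
        "Answer the question using only the numbered context below. "
        "Cite the source number for each claim. "
        "If the context does not contain the answer, say you do not know."
    )
    context = "\n\n".join(blocks)
    return f"{instructions}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


ranked_chunks = [
    {"source": "returns.pdf", "text": "Returns are accepted within 30 days."},
    {"source": "faq.html", "text": "Refunds take 5 business days."},
]
prompt = build_prompt("How long do I have to return an item?", ranked_chunks)
```

Numbering the chunks gives the model something concrete to cite, and the last instruction is the 'not found' guardrail from the table.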

⚠️ Warning

Context window limits are real. A 128K token window sounds vast until you realise that 50 retrieved chunks at 512 tokens each use 25,600 tokens, about a fifth of the window, before you have written a single instruction. Be deliberate about how many chunks you pass. The reranker exists for exactly this reason: cut aggressively before assembly, not after.
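The arithmetic is worth turning into a check that runs before every request. The sketch below assumes you already have a token count per chunk from your model's tokenizer; the 6,000-token context budget is a hypothetical figure, not a recommendation.

```python
def chunks_within_budget(chunk_token_counts: list[int], budget: int) -> int:
    """Return how many chunks, taken in ranked order, fit within the token budget."""
    used = 0
    kept = 0
    for tokens in chunk_token_counts:
        if used + tokens > budget:
            break  # stop at the first chunk that would overflow the budget
        used += tokens
        kept += 1
    return kept


retrieved = [512] * 50            # 50 chunks at 512 tokens each
total_tokens = sum(retrieved)     # the figure from the warning above
kept = chunks_within_budget(retrieved, budget=6000)
```

Because the chunks are consumed in ranked order, a tighter budget drops the least relevant ones first.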

The decisions at a glance

Decision	Recommended starting point	When to revisit
Chunking strategy	Recursive at 400–512 tokens, 10–20% overlap	When retrieval quality benchmarks show boundary-related misses
Chunk size	400–512 tokens for mixed query types; smaller (128–256) for factoid lookups	After measuring against your actual query distribution
Embedding model	text-embedding-3-large or embed-english-v3.0 for production; all-MiniLM-L6-v2 for low-cost local	Only when switching — remember to reindex everything
Retrieval method	Hybrid (BM25 + vector via RRF) as minimum baseline	Pure vector only if your corpus has no exact-term queries
Reranker	Add when retrieval returns too much noise; rerank-english-v3.0 or cross-encoder	Optional for small corpora; near-essential above 10K chunks
Context assembly	Source-attributed chunks + explicit ‘answer from context only’ instruction + ‘say if not found’ guardrail	When the model ignores retrieved content or hallucinates despite correct retrieval
Context volume	5–10 reranked chunks; verify total token count before sending	When answers become vague — often a signal of too many low-relevance chunks, not too few

What to take away

The RAG pipeline is not one decision. It is seven decisions made in sequence, each one constraining the next. A mistake at chunking cannot be fixed by better retrieval. A retrieval strategy that only uses vector search will miss exact-term queries regardless of how sophisticated the prompt is. The LLM is the last piece, not the most important one.

The teams who build reliable RAG systems are not the ones with the most sophisticated models. They are the ones who measured retrieval quality at each stage before moving to the next — who tested chunking on a representative sample, confirmed hybrid retrieval outperformed pure vector on their corpus, and verified that the model was actually using the retrieved context rather than its parametric memory.

Start with the two-phase model as your mental map. Build the indexing pipeline first and validate it independently before connecting retrieval. Add complexity — semantic chunking, reranking, metadata filtering — only where benchmarks show it improves results. Most production failures come from adding sophistication before the basics are solid.

🔗 Related posts on this site

RAG — Retrieval Augmented Generation Explained — the conceptual foundation — what RAG is, why it was introduced and how it fits into the AI landscape
Vector Databases Explained — the indexing layer in depth — embeddings, cosine similarity and choosing the right vector store
Fine-Tuning vs Prompt Engineering vs RAG — when RAG is the right approach versus fine-tuning or prompt engineering alone
AI Hallucinations — Why They Happen — hallucinations are what RAG is designed to reduce — this post explains the root cause

Published on rakeshnarayan.com — Articles

URL: https://rakeshnarayan.com/articles/building-a-rag-pipeline-the-decisions-that-actually-matter/
