Building a RAG Pipeline: A Practical Tutorial (2026)

A language model will make things up and sound sure about it. It will invent a refund policy that never existed and state it like fact. That is the whole problem.

RAG closes that gap. RAG means retrieval-augmented generation. It makes the model answer from your data (your docs, your contracts, your knowledge base) and cite where each claim came from. The model stops guessing. It starts quoting.

This is a build-it tutorial. I will walk the whole pipeline: ingestion, chunking, embeddings, retrieval, prompt assembly, and the evaluation step almost every tutorial skips. TypeScript throughout, because that is probably your stack.

First, the part nobody tells you.

When RAG Is the Wrong Tool

RAG has become the default answer to "make the LLM know about X." Half the time it is the wrong call.

If your data fits in the context window, just put it there. Models in 2026 handle 200K+ tokens fine. If you answer questions about a single 40-page contract, you do not need a vector store, an embedding model, and a retrieval layer. You paste the contract into the prompt and cache it. RAG adds latency, infrastructure, and a whole class of "the retriever missed the right chunk" bugs. Do not pay that cost to solve a problem you do not have.

If you want to change the model's behavior or style, RAG will not do it. Want it to write in your brand voice, follow a strict output format, or reason in a domain-specific way? That is fine-tuning territory. RAG injects facts. It does not reshape how the model thinks.

RAG is the right tool when your data is too big for context, changes often, and you need answers you can trace back to specific source documents. Support knowledge bases, internal wikis, legal and compliance docs, product docs that update weekly. That is the sweet spot. If that is you, keep reading.

Ingestion and Loading

The pipeline starts by getting your documents into a normal form: plain text plus metadata. PDFs, HTML, Markdown, Notion exports. They all become { text, source, title, url }.

Two things people get wrong here. First, they lose the metadata. The source URL and document title are what make citations possible later. Do not drop them. Second, they let bad extraction through. A PDF parser that mangles tables into word soup poisons everything downstream. Garbage in, confidently-cited garbage out. Spot-check your extracted text before you trust it.
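As a rough sketch of that normal form and the spot-check, the record below carries the four fields named above. The `looksMangled` heuristic, its name, and its thresholds are my own assumptions, meant only to surface extractions worth a manual look, not to replace one.

```typescript
// Normalized document record: the { text, source, title, url } shape.
interface LoadedDoc {
  text: string;
  source: string; // e.g. file path or system of origin
  title: string;
  url: string;
}

// Hypothetical heuristic: flag text that is mostly non-letters or made of
// very short tokens, which often indicates a mangled table or bad extraction.
// Assumes mostly ASCII English text; the thresholds are guesses to tune.
function looksMangled(text: string): boolean {
  const trimmed = text.trim();
  if (trimmed.length === 0) return true;
  const letters = (trimmed.match(/[A-Za-z]/g) ?? []).length;
  const words = trimmed.split(/\s+/);
  const nonSpaceChars = trimmed.replace(/\s+/g, '').length;
  const letterRatio = letters / trimmed.length;
  const avgWordLen = nonSpaceChars / words.length;
  return letterRatio < 0.5 || avgWordLen < 2.5;
}

// Return the documents a human should look at before indexing.
function flagForReview(docs: LoadedDoc[]): LoadedDoc[] {
  return docs.filter(d => looksMangled(d.text));
}
```

A flagged document is not necessarily broken. It is just the first place I would look.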

Chunking: Where Most Pipelines Quietly Fail

You cannot embed a whole document. You split it into chunks, embed each one, and retrieve the relevant ones. How you split is the single highest-leverage decision in the whole pipeline, and the lazy way is everywhere.

Naive fixed-size chunking ("every 500 characters, cut") fails because it slices through the middle of sentences, splits a heading from the paragraph it introduces, and cuts a code block in half. You end up embedding fragments that mean nothing on their own. The retriever then misses them, or pulls a chunk that is missing the context that made it useful.

Do two things instead.

Respect structure. Split on semantic boundaries (paragraphs, headings, list items), not arbitrary character counts. Keep a heading attached to the content beneath it.

Add overlap. Let consecutive chunks share a bit of text (10 to 20 percent) so a sentence spanning a boundary survives in at least one chunk.

Here is a structure-aware chunker I reach for as a starting point. It splits on paragraph boundaries, packs them up to a target size, and carries overlap forward:

// chunker.ts
interface Chunk {
  text: string;
  index: number;
}

export function chunkByStructure(
  doc: string,
  { targetSize = 900, overlap = 150 }: { targetSize?: number; overlap?: number } = {}
): Chunk[] {
  // Split on blank lines = paragraph / block boundaries
  const blocks = doc
    .split(/\n\s*\n/)
    .map(b => b.trim())
    .filter(Boolean);

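  const chunks: Chunk[] = [];
  let buffer = '';
  // True only when buffer holds content beyond the carried-over overlap tail.
  // Without this, a flush right after a flush would emit the tail as its own chunk.
  let hasNewContent = false;

  const flush = () => {
    if (!hasNewContent) return;
    chunks.push({ text: buffer.trim(), index: chunks.length });
    // Carry the tail of this chunk into the next one for continuity
    buffer = overlap > 0 ? buffer.slice(-overlap) : '';
    hasNewContent = false;
  };

  for (const block of blocks) {
    // A single block bigger than the target gets its own chunk
    if (block.length > targetSize) {
      flush();
      chunks.push({ text: block, index: chunks.length });
      buffer = '';
      continue;
    }
    // The overlap tail counts toward the size, so a chunk can run slightly over target
    if (buffer.length + block.length > targetSize) flush();
    buffer += (buffer ? '\n\n' : '') + block;
    hasNewContent = true;
  }
  flush();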

  return chunks;
}

This is deliberately simple. For Markdown or code, go further: split on headings, never break inside a fenced code block, and store the heading path (Section > Subsection) as metadata so a retrieved chunk knows where it lives. The rule holds. Chunk along the document's own seams, not a ruler.
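For the Markdown case, a minimal sketch of heading-aware splitting might look like this. The function name and the `Section` shape are my own, and it only handles `#`-style headings plus backtick or tilde fences, so treat it as a starting point and not a full Markdown parser.

```typescript
interface Section {
  headingPath: string; // e.g. "Guide > Install"
  text: string;
}

// Split Markdown into sections at #-style headings, tracking the heading path.
// Lines inside a fenced code block are never treated as headings, so a
// section boundary cannot fall inside a fence.
function splitMarkdownSections(md: string): Section[] {
  const sections: Section[] = [];
  const path: string[] = []; // heading stack, index = level - 1
  let body: string[] = [];
  let inFence = false;

  const flush = () => {
    const text = body.join('\n').trim();
    // filter(Boolean) skips gaps left by skipped heading levels (# then ###)
    if (text) sections.push({ headingPath: path.filter(Boolean).join(' > '), text });
    body = [];
  };

  for (const line of md.split('\n')) {
    // Toggle fence state on lines that open or close a code fence
    if (/^\s*(`{3,}|~{3,})/.test(line)) inFence = !inFence;
    const m = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (m) {
      flush(); // close the section under the previous heading
      const level = m[1].length;
      path.length = level - 1; // drop same-level and deeper headings
      path[level - 1] = m[2].trim();
      continue;
    }
    body.push(line);
  }
  flush();
  return sections;
}
```

Each section can then go through the paragraph packer above, with `headingPath` copied onto every chunk it produces.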

Embeddings and the Vector Store

Each chunk gets turned into an embedding: a list of numbers that captures its meaning. Similar meaning, similar numbers. That is what lets you find relevant chunks by meaning instead of keyword matching. The place you store these vectors is the vector store.
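To make "similar meaning, similar numbers" concrete, here is a small sketch using cosine similarity, one common way to compare embeddings. The three-dimensional vectors and chunk ids are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```typescript
// Cosine similarity: 1 means same direction, 0 means orthogonal (unrelated).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vectors must have the same dimension');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy "embeddings", invented for this example.
const queryVector = [0.9, 0.1, 0.0];
const storedChunks = [
  { id: 'billing-faq', vector: [0.1, 0.1, 0.9] },
  { id: 'account-recovery', vector: [0.8, 0.2, 0.1] },
];

// Rank stored chunks by similarity to the query, best first.
const rankedChunks = storedChunks
  .map(c => ({ id: c.id, score: cosineSimilarity(queryVector, c.vector) }))
  .sort((x, y) => y.score - x.score);
```

A vector store does the same comparison, just with an index so it does not have to scan every row.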

For the embedding model, OpenAI's text-embedding-3-small is a fine, cheap default. If your data cannot leave your infrastructure, run an open model like bge-large or nomic-embed locally. Pick one and stay consistent. You cannot mix embedding models between indexing and querying. The vectors will not be comparable.

For the store:

pgvector: if you already run Postgres, just add the extension. One database, transactional, no new infrastructure. My default for most projects. Fine into the millions of vectors.
Qdrant: purpose-built vector DB. Reach for it when you need serious scale, fast metadata filtering, or built-in hybrid search.
Pinecone / managed: when you do not want to run anything yourself.

Do not overthink this. Start with pgvector. You can move later if you actually hit a wall, and most projects never do.
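If you do start with pgvector, a nearest-neighbour lookup can be sketched roughly like this. The `chunks` table and its columns are hypothetical, and the helper names are mine. The `<=>` operator is pgvector's cosine distance, so smaller means closer.

```typescript
// Serialize a JS array into pgvector's text form, e.g. [0.1,0.2,0.3]
function toVectorLiteral(v: number[]): string {
  if (v.some(x => !isFinite(x))) throw new Error('Embedding contains a non-finite value');
  return `[${v.join(',')}]`;
}

// Build a parameterized nearest-neighbour query.
// Assumed (hypothetical) schema: chunks(id, text, source, embedding vector(N)).
function buildSearchQuery(queryEmbedding: number[], limit = 5) {
  return {
    text: `SELECT id, text, source, 1 - (embedding <=> $1::vector) AS similarity
FROM chunks
ORDER BY embedding <=> $1::vector
LIMIT $2`,
    values: [toVectorLiteral(queryEmbedding), limit] as [string, number],
  };
}
```

The returned `{ text, values }` object is shaped to be passed to a Postgres client such as node-postgres. Keeping the embedding as a bound parameter, not string-concatenated SQL, is the part I would not skip.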

Retrieval That Actually Works

This is where a mediocre pipeline and a good one split. Naive retrieval is "embed the query, grab the top-k nearest chunks, done." It works in demos and lets you down in production.

Two upgrades earn their keep, plus one prompt-assembly step.

Hybrid search. Pure semantic (vector) search misses exact terms: product SKUs, error codes, function names, surnames. Pure keyword search misses paraphrases. Run both and merge the results. Semantic catches "how do I reset my password" against a doc titled "Account Recovery." Keyword catches ERR_CONN_REFUSED word for word. You want both.
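One common way to merge the two result lists is Reciprocal Rank Fusion (RRF). That is my choice for this sketch, not the only option: it needs only the rank positions, so you do not have to reconcile vector distances with keyword scores. The ids below are invented, and `k = 60` is a conventional constant you can tune.

```typescript
interface RankedHit {
  id: string;
}

// Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document,
// where rank starts at 1. Documents found by both searches accumulate score.
function reciprocalRankFusion(lists: RankedHit[][], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((hit, index) => {
      const rank = index + 1;
      scores.set(hit.id, (scores.get(hit.id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort((a, b) => b.score - a.score);
}

// Hypothetical results from the two searches, best first.
const semanticHits = [{ id: 'account-recovery' }, { id: 'login-help' }, { id: 'billing' }];
const keywordHits = [{ id: 'err-conn-refused' }, { id: 'account-recovery' }];

const fused = reciprocalRankFusion([semanticHits, keywordHits]);
```

Here `account-recovery` comes out on top because both searches found it, which is the behavior you want from a hybrid.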

Re-ranking. Re-ranking means a second, more careful scoring pass. Retrieve wide first (top 20 or 30), then run a cross-encoder re-ranker (Cohere Rerank, or a local bge-reranker) to score each candidate against the query and keep the best 5. The first vector search is fast but rough. The re-ranker is slower but much more precise. Retrieve wide, re-rank narrow. This one change fixes more "it did not find the obvious answer" complaints than anything else.
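The retrieve-wide, re-rank-narrow shape can be sketched without committing to a specific re-ranking API. In the sketch below the scorer is injected, so `rerankWith`, `Scorer`, and `wordOverlapScorer` are all hypothetical names of mine. The word-overlap scorer is a stand-in for local testing and is not a cross-encoder.

```typescript
interface Candidate {
  text: string;
  source: string;
}

// A scorer takes (query, passage) and returns a relevance score, higher is better.
// In production this would call a cross-encoder model or a hosted rerank endpoint.
type Scorer = (query: string, passage: string) => Promise<number>;

// Score every candidate against the query, then keep only the best few.
async function rerankWith(
  scorer: Scorer,
  query: string,
  candidates: Candidate[],
  keep = 5
): Promise<Candidate[]> {
  const scored = await Promise.all(
    candidates.map(async c => ({ candidate: c, score: await scorer(query, c.text) }))
  );
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, keep)
    .map(s => s.candidate);
}

// Stand-in scorer for tests only: counts passage words that appear in the query.
const wordOverlapScorer: Scorer = async (query, passage) => {
  const queryWords = query.toLowerCase().split(/\W+/).filter(Boolean);
  return passage
    .toLowerCase()
    .split(/\W+/)
    .filter(w => queryWords.indexOf(w) !== -1).length;
};
```

Swapping the scorer for a real model call should not change the surrounding code, which also makes it easy to skip the re-ranker entirely for cheap queries.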

Then assemble the prompt with citations. Number your chunks, hand them to the model, and tell it to cite by number and to refuse when the context does not contain the answer.

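// retrieve-and-answer.ts
// hybridSearch, rerank, and llm are your own helpers. The signatures below are
// assumptions so the snippet type-checks; adapt them to your search and LLM client.
interface Retrieved { text: string; source: string }
declare function hybridSearch(query: string, opts: { limit: number }): Promise<Retrieved[]>;
declare function rerank(query: string, candidates: Retrieved[], opts: { keep: number }): Promise<Retrieved[]>;
declare const llm: {
  chat(request: { system: string; messages: { role: 'user'; content: string }[]; temperature: number }): Promise<unknown>;
};

async function answerQuestion(query: string) {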
  // 1. Hybrid retrieve, then re-rank down to the best few
  const candidates = await hybridSearch(query, { limit: 25 });
  const top = await rerank(query, candidates, { keep: 5 });

  // 2. Build a numbered, source-tagged context block
  const context = top.map((c, i) => `[${i + 1}] (source: ${c.source})\n${c.text}`).join('\n\n---\n\n');

  // 3. Constrain the model to the provided context
  const system = `Answer ONLY from the numbered context below.
Cite the sources you use with their number, like [2].
If the context does not contain the answer, say "I don't have that information."
Never invent facts that aren't in the context.`;

  return llm.chat({
    system,
    messages: [{ role: 'user', content: `${context}\n\nQuestion: ${query}` }],
    temperature: 0,
  });
}

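That temperature: 0 and the explicit "say I don't have that information" instruction matter. The first keeps sampling as deterministic as the provider allows. The second gives the model a sanctioned way to admit a gap. Together they push a model that confidently invents refund policies toward one that says what it does not know. The whole point of RAG is grounded, honest answers. Your prompt has to demand them.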
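Because the prompt asks for numbered citations, a cheap post-check can confirm the answer only cites chunks that were actually supplied. This is a sketch of my own, not part of any library: `checkCitations` simply parses markers like [2] out of the answer text.

```typescript
// Extract citation numbers like [2] from an answer and flag any that do not
// correspond to a chunk that was actually in the prompt (valid range: 1..chunkCount).
function checkCitations(answer: string, chunkCount: number) {
  const cited: number[] = [];
  const pattern = /\[(\d+)\]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    const n = parseInt(match[1], 10);
    if (cited.indexOf(n) === -1) cited.push(n); // keep each number once
  }
  const invalid = cited.filter(n => n < 1 || n > chunkCount);
  return { cited, invalid, hasCitations: cited.length > 0 };
}
```

An answer with no citations that is also not the refusal sentence, or one with an out-of-range number, is worth logging. It usually means the model drifted away from the context.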

The Step Everyone Skips: Evaluation

Most RAG tutorials end at the snippet above. "It works!" It does not necessarily work. It returns something, which is not the same thing.

You need to measure three things, and you need a small evaluation set to measure against: 30 to 50 real questions with known correct answers and known source documents.

Retrieval quality. For each question, did the right chunk actually make it into the retrieved set? If the right document never gets retrieved, no amount of prompt engineering saves you. Track recall@k. This is the first thing to check when answers are bad. Usually the problem is here, not in the model.
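A minimal sketch of that check, in the hit-rate form described here: a question counts as a hit if at least one of its known sources shows up in the top k. The `EvalCase` shape and function name are assumptions of mine.

```typescript
interface EvalCase {
  question: string;
  expectedSources: string[]; // sources known to contain the answer
}

// Fraction of questions for which at least one expected source appears
// among the first k retrieved sources.
function recallAtK(
  cases: EvalCase[],
  retrieved: string[][], // retrieved[i] = ranked sources returned for cases[i]
  k: number
): number {
  if (cases.length === 0) return 0;
  let hits = 0;
  cases.forEach((c, i) => {
    const topK = (retrieved[i] ?? []).slice(0, k);
    if (c.expectedSources.some(s => topK.indexOf(s) !== -1)) hits++;
  });
  return hits / cases.length;
}
```

If a question needs several documents to answer, you may prefer the stricter variant that measures what fraction of the expected sources were retrieved.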

Groundedness. Is every claim in the answer backed by the retrieved context? You can check this with an LLM-as-judge: feed it the answer and the sources and ask whether each sentence is backed by the provided text.

Hallucination rate. The flip side of groundedness: how often does the answer state something the context does not support? This is the number you watch over time. When it creeps up, something upstream broke.
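Once a judge has labeled each sentence, both numbers fall out of simple counting. The definitions in this sketch are my own working assumptions: groundedness as the share of supported sentences, hallucination rate as the share of answers with at least one unsupported sentence. The judge itself is out of scope here, so the verdicts are taken as input.

```typescript
// One verdict per answer sentence, as produced by whatever judge you use.
interface SentenceVerdict {
  sentence: string;
  supported: boolean; // true if the judge found backing in the retrieved context
}

// answers[i] holds the per-sentence verdicts for the i-th eval question.
function groundednessReport(answers: SentenceVerdict[][]) {
  let totalSentences = 0;
  let supportedSentences = 0;
  let answersWithUnsupported = 0;
  for (const verdicts of answers) {
    const unsupported = verdicts.filter(v => !v.supported).length;
    totalSentences += verdicts.length;
    supportedSentences += verdicts.length - unsupported;
    if (unsupported > 0) answersWithUnsupported++;
  }
  return {
    // Share of all sentences backed by the context
    groundedness: totalSentences === 0 ? 1 : supportedSentences / totalSentences,
    // Share of answers containing at least one unsupported sentence
    hallucinationRate: answers.length === 0 ? 0 : answersWithUnsupported / answers.length,
  };
}
```

Whatever definitions you pick, keep them fixed between runs, or the trend line stops meaning anything.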

Run this eval set every time you change anything: chunk size, embedding model, top-k, the prompt. Tools like Ragas, or a few dozen lines of your own judging code, do the job. Without this loop you are tuning blind, and "it felt better" is not a metric you can ship on. I have watched teams burn weeks tweaking prompts when their real problem was a chunker splitting answers across two chunks, neither of which got retrieved. Measure first.

Production Notes

A few things that bite once this is real and not a demo.

Your documents stay in your storage. RAG does not ship your whole corpus to the model provider. Only the handful of retrieved chunks for a given query ride along in the prompt. Your source of truth stays in your database, your S3 bucket, your control. That is a real selling point for clients nervous about data governance, and worth saying plainly.

Latency and cost. Each query now spans embedding, vector search, re-ranking, and generation. Of the retrieval hops, re-ranking is usually the slowest. Budget for it, cache hard, and consider skipping the re-ranker for low-stakes queries. Embeddings are cheap. Re-ranking and generation are where the bill lands.
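As one way to "cache hard", here is a small in-memory wrapper with expiry that can sit in front of any string-keyed async step, such as query embedding. It is a single-process sketch under my own naming. Across several instances you would want a shared cache instead.

```typescript
// Wrap an async function with an in-memory cache keyed by its string input.
// Entries expire after ttlMs. The clock is injectable to make testing easy.
function cached<T>(
  fn: (key: string) => Promise<T>,
  ttlMs: number,
  now: () => number = Date.now
): (key: string) => Promise<T> {
  const store = new Map<string, { value: Promise<T>; expires: number }>();
  return (key: string) => {
    const hit = store.get(key);
    if (hit && hit.expires > now()) return hit.value;
    // Cache the promise itself so concurrent callers share one in-flight call
    const value = fn(key);
    store.set(key, { value, expires: now() + ttlMs });
    // Do not keep failed calls: evict if this exact entry is still stored
    value.catch(() => {
      const current = store.get(key);
      if (current && current.value === value) store.delete(key);
    });
    return value;
  };
}
```

Usage would look like `const embedCached = cached(embedQuery, 600000)`, where `embedQuery` is whatever function calls your embedding model. Normalizing the query (trim, lowercase) before using it as a key raises the hit rate.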

Freshness and re-indexing. Your data changes. Stale chunks give confidently-cited wrong answers, the worst failure mode, because it looks trustworthy. Re-embed changed documents on a schedule or on write. Track which source version each chunk came from so you can clear it out cleanly when a doc updates.
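One simple way to track source versions is a content hash stored next to each indexed document. The sketch below assumes Node's built-in crypto module. The `IndexedDoc` shape and `planReindex` name are mine. It compares the current corpus with what is already indexed and works out what to re-embed and what to purge.

```typescript
import { createHash } from 'node:crypto';

interface IndexedDoc {
  source: string;
  contentHash: string; // hash of the text that was embedded
}

function hashContent(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

// Decide which sources need (re-)embedding and which should be removed.
function planReindex(
  current: { source: string; text: string }[],
  indexed: IndexedDoc[]
) {
  const indexedHashBySource = new Map<string, string>();
  indexed.forEach(d => indexedHashBySource.set(d.source, d.contentHash));
  const currentSources = current.map(d => d.source);

  // New documents have no stored hash; changed documents have a different one
  const toEmbed = current
    .filter(d => indexedHashBySource.get(d.source) !== hashContent(d.text))
    .map(d => d.source);
  // Anything indexed that no longer exists in the corpus is stale
  const toDelete = indexed
    .filter(d => currentSources.indexOf(d.source) === -1)
    .map(d => d.source);

  return { toEmbed, toDelete };
}
```

For a changed document, delete its old chunks before inserting the new ones, ideally in one transaction, so a query never sees both versions at once.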

That last point is exactly why RAG beats fine-tuning for fast-moving data: updating the knowledge is re-indexing a few documents, not retraining a model.

If you combine retrieval with tools and actions, this slots straight into AI agent development. The retriever just becomes one more tool the agent calls. I went deeper on the action side in building AI agents with blockchain integration, where the same isolation rules apply.

So before you reach for a vector store: does your data actually fit in the context window already?

If you are building a RAG pipeline and want a senior pair of eyes before it answers real customer questions, hire me. Or read more about how I approach AI agent development. The first call is free.