Dev.to #architecture · April 4, 2026

Designing RAG Pipelines: Ingestion and Query Shifts

This article provides a detailed breakdown of the two distinct operational shifts in a Retrieval Augmented Generation (RAG) pipeline: ingestion (offline) and query time (live). It emphasizes the architectural decisions and potential failure points within each shift, focusing on critical steps like document parsing, chunking, embedding, and retrieval to ensure accurate and contextually relevant AI responses. Understanding these shifts is crucial for building robust and debuggable RAG systems.


The Two Shifts of a RAG Pipeline

A RAG (Retrieval Augmented Generation) pipeline is fundamentally composed of two distinct operational shifts, each with unique characteristics, requirements, and failure modes. Recognizing this separation is key to designing, debugging, and scaling RAG systems effectively. These shifts are the Ingestion Shift (offline processing) and the Query-Time Shift (live request handling).

Shift 1: Ingestion Pipeline (Offline)

The ingestion shift focuses on preparing raw documents for efficient retrieval. It runs offline, typically whenever documents are updated. Key architectural considerations here revolve around data quality, processing efficiency, and the creation of an effective vector index.

  1. Document Parsing: This is often underestimated. Raw documents (Markdown, HTML, PDFs) need structure-aware parsing to preserve context (e.g., table rows, headings, numbered lists). Losing structure here irrevocably damages downstream retrieval quality.
  2. Chunking: Documents are split into smaller, semantically meaningful chunks to fit within LLM context windows. The challenge is balancing chunk size (enough context vs. too large) and boundary placement (avoid splitting coherent thoughts). Overlapping chunks can help preserve context.
  3. Embedding: Each chunk is converted into a vector representation using an embedding model. The choice of embedding model is critical; a domain-specific model will yield far better results than a general-purpose one, as it understands specialized vocabulary and relationships.
  4. Contextual Enrichment: Optionally, chunks can be enriched with metadata or LLM-generated summaries *before* embedding. This adds disambiguating context (e.g., product name, policy version) which improves retrieval accuracy, especially for similar-sounding chunks.
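The four ingestion steps above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the `embed()` stub stands in for a real embedding model call, and the chunk size, overlap, and metadata fields are assumptions chosen for demonstration.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows so that a thought
    cut at one chunk boundary survives intact in the neighboring chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

def enrich(chunk: str, metadata: dict) -> str:
    """Prepend disambiguating metadata (e.g. product name, policy version)
    so that similar-sounding chunks embed to distinguishable vectors."""
    header = " | ".join(f"{k}: {v}" for k, v in metadata.items())
    return f"[{header}]\n{chunk}"

def embed(text: str) -> list[float]:
    # Placeholder: a real pipeline would call an embedding model here.
    return [float(len(text))]

# Ingestion: chunk, enrich, embed, then store text + vector together.
doc = "Refunds are issued within 30 days. " * 40
records = [
    {"text": enriched, "vector": embed(enriched)}
    for enriched in (enrich(c, {"product": "WH-1000", "version": "2026-04"})
                     for c in chunk_text(doc))
]
```

Note that enrichment happens *before* `embed()` is called, matching step 4: the metadata header becomes part of the embedded text, not just a stored attribute.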
⚠️ Ingestion Challenges

Failure Point: Poor parsing destroys document structure, leading to irrelevant or incomplete retrieval. Incorrect chunking breaks logical continuity. A suboptimal embedding model fails to capture domain-specific meaning, producing answers that sound plausible but are wrong.

Shift 2: Query-Time Pipeline (Live)

The query-time shift executes live for every user question, emphasizing speed and accuracy. It leverages the index built by the ingestion pipeline to find relevant information and generate an answer.

  1. Query Embedding: The user's query is embedded using the *same* model as the ingestion pipeline to ensure consistent vector space comparison.
  2. Vector Search: The query's vector is used to find the most semantically similar chunks in the vector database.
  3. Prompt Assembly & Generation: The retrieved chunks are assembled into a prompt alongside the original query and sent to a Large Language Model (LLM) to generate the final answer. The LLM acts as a summarizer and synthesizer of the provided context.
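The three query-time steps can be sketched end to end. The toy bag-of-letters `embed()` and the in-memory corpus are stand-ins: a real system would call an actual embedding model and a vector database. What the sketch does show faithfully is the invariant from step 1, that query and documents must pass through the *same* embedding function.

```python
import math

def embed(text: str) -> list[float]:
    # Toy bag-of-letters embedding. Must be the same function used at
    # ingestion time so both sides live in one vector space.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieve(query: str, index: list[dict], k: int = 2) -> list[str]:
    # Step 2: rank stored chunks by similarity to the query vector.
    qv = embed(query)
    ranked = sorted(index, key=lambda rec: cosine(qv, rec["vector"]), reverse=True)
    return [rec["text"] for rec in ranked[:k]]

def build_prompt(query: str, chunks: list[str]) -> str:
    # Step 3: assemble retrieved context plus the original question.
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

index = [{"text": t, "vector": embed(t)} for t in
         ["Refunds are issued within 30 days of purchase.",
          "The WH-1000 headphones support Bluetooth 5.2.",
          "Our office is closed on public holidays."]]
question = "How long do refunds take?"
prompt = build_prompt(question, retrieve(question, index))
```

The final `prompt` string is what would be sent to the LLM; the model's job is synthesis over the supplied context, not open-ended recall.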
💡 Hybrid Retrieval

For questions requiring structured data (e.g., 'How many WH-1000 units were returned?'), RAG pipelines can integrate a 'structured data path.' This involves query routing or parallel retrieval from traditional databases to ensure comprehensive answering capabilities beyond just unstructured documents.
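A minimal sketch of that routing idea, under stated assumptions: the keyword heuristic and the path names are illustrative only, and a production router might instead use an LLM classifier or a trained intent model.

```python
# Aggregate-style questions go to a structured-data path (e.g. SQL
# against a warehouse); everything else goes to vector search.
STRUCTURED_TRIGGERS = ("how many", "count", "total", "sum", "average")

def route(query: str) -> str:
    q = query.lower()
    if any(trigger in q for trigger in STRUCTURED_TRIGGERS):
        return "structured"    # translate to a database query
    return "unstructured"      # embed and search the vector index
```

With this router, "How many WH-1000 units were returned?" takes the structured path, while "What is the refund policy?" goes to semantic search; the two paths can also run in parallel with their results merged before prompt assembly.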

