Dev.to #architecture · April 4, 2026

Designing RAG Pipelines: Ingestion and Query Shifts

This article provides a detailed breakdown of the two distinct operational shifts in a Retrieval Augmented Generation (RAG) pipeline: ingestion (offline) and query time (live). It emphasizes the architectural decisions and potential failure points within each shift, focusing on critical steps like document parsing, chunking, embedding, and retrieval to ensure accurate and contextually relevant AI responses. Understanding these shifts is crucial for building robust and debuggable RAG systems.


The Two Shifts of a RAG Pipeline

A RAG (Retrieval Augmented Generation) pipeline is fundamentally composed of two distinct operational shifts, each with unique characteristics, requirements, and failure modes. Recognizing this separation is key to designing, debugging, and scaling RAG systems effectively. These shifts are the Ingestion Shift (offline processing) and the Query-Time Shift (live request handling).

Shift 1: Ingestion Pipeline (Offline)

The ingestion shift focuses on preparing raw documents for efficient retrieval. It runs offline, typically whenever documents are updated. Key architectural considerations here revolve around data quality, processing efficiency, and the creation of an effective vector index.

  1. Document Parsing: This is often underestimated. Raw documents (Markdown, HTML, PDFs) need structure-aware parsing to preserve context (e.g., table rows, headings, numbered lists). Losing structure here irrevocably damages downstream retrieval quality.
  2. Chunking: Documents are split into smaller, semantically meaningful chunks to fit within LLM context windows. The challenge is balancing chunk size (enough context vs. too large) and boundary placement (avoid splitting coherent thoughts). Overlapping chunks can help preserve context.
  3. Embedding: Each chunk is converted into a vector representation using an embedding model. The choice of embedding model is critical; a domain-specific model will yield far better results than a general-purpose one, as it understands specialized vocabulary and relationships.
  4. Contextual Enrichment: Optionally, chunks can be enriched with metadata or LLM-generated summaries *before* embedding. This adds disambiguating context (e.g., product name, policy version) which improves retrieval accuracy, especially for similar-sounding chunks.
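The four ingestion steps above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the `embed()` stub stands in for a real embedding model call, and the chunk size, overlap, and metadata fields are assumptions chosen for demonstration.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows so that a thought
    cut at one chunk boundary survives intact in the neighboring chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

def enrich(chunk: str, metadata: dict) -> str:
    """Prepend disambiguating metadata (e.g. product name, policy version)
    so that similar-sounding chunks embed to distinguishable vectors."""
    header = " | ".join(f"{k}: {v}" for k, v in metadata.items())
    return f"[{header}]\n{chunk}"

def embed(text: str) -> list[float]:
    # Placeholder: a real pipeline would call an embedding model here.
    return [float(len(text))]

# Ingestion: chunk, enrich, embed, then store text + vector together.
doc = "Refunds are issued within 30 days. " * 40
records = [
    {"text": enriched, "vector": embed(enriched)}
    for enriched in (enrich(c, {"product": "WH-1000", "version": "2026-04"})
                     for c in chunk_text(doc))
]
```

Note that enrichment happens *before* `embed()` is called, matching step 4: the metadata header becomes part of the embedded text, not just a stored attribute.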
⚠️ Ingestion Challenges

Failure Point: Poor parsing destroys document structure, leading to irrelevant or incomplete retrieval. Incorrect chunking breaks logical continuity. A suboptimal embedding model fails to capture domain-specific meaning, producing answers that sound plausible but are wrong.

Shift 2: Query-Time Pipeline (Live)

The query-time shift executes live for every user question, emphasizing speed and accuracy. It leverages the index built by the ingestion pipeline to find relevant information and generate an answer.

  1. Query Embedding: The user's query is embedded using the *same* model as the ingestion pipeline to ensure consistent vector space comparison.
  2. Vector Search: The query's vector is used to find the most semantically similar chunks in the vector database.
  3. Prompt Assembly & Generation: The retrieved chunks are assembled into a prompt alongside the original query and sent to a Large Language Model (LLM) to generate the final answer. The LLM acts as a summarizer and synthesizer of the provided context.
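The three query-time steps can be sketched end to end. The toy bag-of-letters `embed()` and the in-memory corpus are stand-ins: a real system would call an actual embedding model and a vector database. What the sketch does show faithfully is the invariant from step 1, that query and documents must pass through the *same* embedding function.

```python
import math

def embed(text: str) -> list[float]:
    # Toy bag-of-letters embedding. Must be the same function used at
    # ingestion time so both sides live in one vector space.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieve(query: str, index: list[dict], k: int = 2) -> list[str]:
    # Step 2: rank stored chunks by similarity to the query vector.
    qv = embed(query)
    ranked = sorted(index, key=lambda rec: cosine(qv, rec["vector"]), reverse=True)
    return [rec["text"] for rec in ranked[:k]]

def build_prompt(query: str, chunks: list[str]) -> str:
    # Step 3: assemble retrieved context plus the original question.
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

index = [{"text": t, "vector": embed(t)} for t in
         ["Refunds are issued within 30 days of purchase.",
          "The WH-1000 headphones support Bluetooth 5.2.",
          "Our office is closed on public holidays."]]
question = "How long do refunds take?"
prompt = build_prompt(question, retrieve(question, index))
```

The final `prompt` string is what would be sent to the LLM; the model's job is synthesis over the supplied context, not open-ended recall.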
💡 Hybrid Retrieval

For questions requiring structured data (e.g., 'How many WH-1000 units were returned?'), RAG pipelines can integrate a 'structured data path.' This involves query routing or parallel retrieval from traditional databases to ensure comprehensive answering capabilities beyond just unstructured documents.
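A minimal sketch of that routing idea, under stated assumptions: the keyword heuristic and the path names are illustrative only, and a production router might instead use an LLM classifier or a trained intent model.

```python
# Aggregate-style questions go to a structured-data path (e.g. SQL
# against a warehouse); everything else goes to vector search.
STRUCTURED_TRIGGERS = ("how many", "count", "total", "sum", "average")

def route(query: str) -> str:
    q = query.lower()
    if any(trigger in q for trigger in STRUCTURED_TRIGGERS):
        return "structured"    # translate to a database query
    return "unstructured"      # embed and search the vector index
```

With this router, "How many WH-1000 units were returned?" takes the structured path, while "What is the refund policy?" goes to semantic search; the two paths can also run in parallel with their results merged before prompt assembly.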

