This article details the architecture and development of AskRich, a retrieval-backed chatbot designed to enhance technical screening by providing citation-backed answers from a candidate's portfolio. It explores the system's design, including a Cloudflare Worker at the edge, a LangGraph orchestrator, and a crucial feedback loop for continuous improvement of answer quality and retrieval effectiveness. The discussion also covers the implementation of a resilient rate limiting mechanism.
Read original on Dev.to

The AskRich chatbot addresses the challenge of insufficient detail in traditional resumes by enabling hiring teams to ask specific technical questions and receive answers grounded in a candidate's actual portfolio and writing, complete with verifiable citations. This core product decision, citation-backed output, noticeably improves the quality of follow-up questions and the overall utility of the conversation.
The system features a thin web client interacting with a retrieval-backed chat API. A Cloudflare Worker acts as an edge layer, handling requests with rate limiting and cache checks before forwarding to a LangGraph orchestrator. This orchestrator integrates a Content Index for retrieval and an LLM Response Layer for grounded generation, producing answers with citations for the UI.
Browser → POST /api/chat → Cloudflare Worker (rate limit + cache check)
                                   ↓
                         LangGraph Orchestrator
                          ↙                 ↘
               Content Index           LLM Response Layer
                (retrieval)           (grounded generation)
                          ↘                 ↙
                      Answer + Citations → UI Renderer

The Cloudflare Worker supports multiple runtime modes (upstream, local, openai) for flexible testing and routing without client redeployment, showcasing a thoughtful approach to operational agility in AI system development.
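The mode switch can be sketched as a small routing function inside the Worker. The three mode names come from the article; the function name and endpoint URLs below are illustrative placeholders, not the real configuration.

```typescript
// Hypothetical sketch of the Worker's runtime-mode routing.
// The mode names (upstream, local, openai) are from the article;
// everything else here is an assumption for illustration.
type RuntimeMode = "upstream" | "local" | "openai";

// Map each runtime mode to the backend the Worker forwards to.
function resolveTarget(mode: RuntimeMode): string {
  switch (mode) {
    case "upstream":
      // Deployed LangGraph orchestrator (placeholder URL).
      return "https://askrich-orchestrator.example.com/chat";
    case "local":
      // Local development backend (placeholder URL).
      return "http://localhost:8787/chat";
    case "openai":
      // Direct model call, bypassing retrieval, for comparison testing.
      return "https://api.openai.com/v1/chat/completions";
  }
}
```

Because the mode is resolved at the edge, switching backends is a Worker configuration change rather than a client redeploy, which is what makes A/B-style testing cheap.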
A critical component of AskRich is its structured feedback loop. It records events for every question, answer, and user interaction (thumbs-up/down), linked by stable event IDs. This allows for precise triage of low-rated answers, classifying failures into categories like Corpus gap, Retrieval/ranking issue, Prompt/format issue, or Out-of-scope. Changes are tested against baselines to ensure improvements without regressing citation clarity.
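The feedback loop described above can be sketched as a typed event record plus a triage query. The failure categories are the article's; the field names, types, and the `triageQueue` helper are assumptions for illustration, not the real schema.

```typescript
// Hypothetical event shapes for the feedback loop. The four failure
// categories come from the article; field names are assumptions.
type FailureCategory =
  | "corpus-gap"          // answer missing because the portfolio lacks the content
  | "retrieval-ranking"   // content exists but was not retrieved or ranked well
  | "prompt-format"       // right content, poorly presented or cited
  | "out-of-scope";       // question the system should decline

interface FeedbackEvent {
  eventId: string;             // stable ID linking question, answer, and rating
  question: string;
  answer: string;
  rating: "up" | "down";       // thumbs-up / thumbs-down from the user
  category?: FailureCategory;  // assigned during triage of low-rated answers
}

// Collect down-rated events that have not yet been classified,
// i.e. the triage backlog.
function triageQueue(events: FeedbackEvent[]): FeedbackEvent[] {
  return events.filter((e) => e.rating === "down" && e.category === undefined);
}
```

Stable event IDs are what make this workable: a rating arriving minutes after the answer can still be joined back to the exact question and retrieval results that produced it.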
Rate limiting is enforced at the Cloudflare Worker using a one-way hash of request context (IP + origin + user-agent) for client identification, avoiding persistent storage of raw IPs. Requests then pass through sequential guards: a minimum burst interval between requests and a per-client hourly quota. A key design decision is graceful degradation (fail-open) if the KV storage backing the limiter is unavailable, prioritizing availability over strict rate limiting in adverse conditions.
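A minimal sketch of that limiter, with an in-memory `Map` standing in for Workers KV: the hashed client key, the sequential burst and quota guards, and the fail-open branch are from the article, while the specific quota values, function names, and bucket layout are assumptions.

```typescript
import { createHash } from "node:crypto";

// Assumed limits for illustration; the article does not state the real values.
const HOURLY_QUOTA = 60;        // max requests per client per hour
const BURST_INTERVAL_MS = 2000; // minimum gap between consecutive requests
const HOUR_MS = 3_600_000;

// One-way hash of request context, so raw IPs are never persisted.
function clientKey(ip: string, origin: string, userAgent: string): string {
  return createHash("sha256").update(`${ip}|${origin}|${userAgent}`).digest("hex");
}

interface Bucket {
  count: number;       // requests seen in the current hourly window
  windowStart: number; // timestamp (ms) when the window opened
  lastSeen: number;    // timestamp (ms) of the most recent request
}

// `store` stands in for the KV namespace; passing null simulates KV being down.
function allowRequest(
  key: string,
  now: number,
  store: Map<string, Bucket> | null
): boolean {
  if (store === null) return true; // fail-open: prefer availability over strictness

  const bucket = store.get(key);
  if (!bucket || now - bucket.windowStart >= HOUR_MS) {
    // New client, or the hourly window has rolled over: start fresh.
    store.set(key, { count: 1, windowStart: now, lastSeen: now });
    return true;
  }
  // Guard 1: burst interval between consecutive requests.
  if (now - bucket.lastSeen < BURST_INTERVAL_MS) return false;
  // Guard 2: hourly quota.
  if (bucket.count >= HOURLY_QUOTA) return false;

  bucket.count++;
  bucket.lastSeen = now;
  return true;
}
```

The fail-open branch is the deliberate trade-off the article highlights: if KV is unreachable, the Worker serves traffic unlimited rather than rejecting legitimate users.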