Dev.to #architecture·April 2, 2026

Architecting a Retrieval-Augmented Generation (RAG) Chatbot for Resume Screening

This article details the architecture and development of AskRich, a retrieval-backed chatbot designed to enhance technical screening by providing citation-backed answers from a candidate's portfolio. It explores the system's design, including a Cloudflare Worker at the edge, a LangGraph orchestrator, and a crucial feedback loop for continuous improvement of answer quality and retrieval effectiveness. The discussion also covers the implementation of a resilient rate limiting mechanism.


The AskRich chatbot addresses the challenge of insufficient detail in traditional resumes by enabling hiring teams to ask specific technical questions and receive answers grounded in a candidate's actual portfolio and writing, complete with verifiable citations. This core product decision of citation-backed output significantly improves the quality of follow-up questions and overall conversation utility.

Architectural Overview of AskRich

The system features a thin web client interacting with a retrieval-backed chat API. A Cloudflare Worker acts as an edge layer, handling requests with rate limiting and cache checks before forwarding to a LangGraph orchestrator. This orchestrator integrates a Content Index for retrieval and an LLM Response Layer for grounded generation, producing answers with citations for the UI.

Browser → POST /api/chat → Cloudflare Worker (rate limit + cache check)
                                   ↓
                        LangGraph Orchestrator
                           ↙             ↘
                 Content Index      LLM Response Layer
                  (retrieval)      (grounded generation)
                           ↘             ↙
                    Answer + Citations → UI Renderer

The Cloudflare Worker supports multiple runtime modes (upstream, local, openai) for flexible testing and routing without client redeployment, showcasing a thoughtful approach to operational agility in AI system development.
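A minimal sketch of what such mode routing might look like inside the Worker. The mode names (upstream, local, openai) come from the article; the endpoint URLs, config shape, and fallback behavior are illustrative assumptions, not AskRich's actual implementation.

```typescript
// Hypothetical mode router: the Worker reads a runtime mode from its
// environment and picks a backend, so traffic can be re-routed without
// redeploying the client. Endpoint URLs below are placeholders.
type RuntimeMode = "upstream" | "local" | "openai";

interface ModeConfig {
  endpoint: string;
  description: string;
}

const MODE_TABLE: Record<RuntimeMode, ModeConfig> = {
  upstream: {
    endpoint: "https://askrich.example.com/chat", // assumed URL
    description: "production LangGraph orchestrator",
  },
  local: {
    endpoint: "http://localhost:8787/chat", // assumed local dev port
    description: "local orchestrator for testing",
  },
  openai: {
    endpoint: "https://api.openai.com/v1/chat/completions",
    description: "direct model call, bypassing retrieval",
  },
};

function resolveBackend(mode: string): ModeConfig {
  if (mode in MODE_TABLE) return MODE_TABLE[mode as RuntimeMode];
  // Unknown modes fall back to upstream rather than failing the request.
  return MODE_TABLE.upstream;
}
```

Keeping the mode table server-side means a single config change at the edge retargets every client at once.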

Feedback Loop for Continuous Improvement

A critical component of AskRich is its structured feedback loop. It records events for every question, answer, and user interaction (thumbs-up/down), linked by stable event IDs. This allows for precise triage of low-rated answers, classifying failures into categories like Corpus gap, Retrieval/ranking issue, Prompt/format issue, or Out-of-scope. Changes are tested against baselines to ensure improvements without regressing citation clarity.
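The event shape described above might be sketched as follows. The failure categories are taken from the article; the field names and helper function are assumptions for illustration, not AskRich's actual schema.

```typescript
// Hypothetical feedback-event record: every question, answer, and
// rating shares a stable eventId, so a thumbs-down can be joined back
// to the exact answer (and its citations) that produced it.
type FailureCategory =
  | "corpus-gap"        // the portfolio lacks the needed material
  | "retrieval-ranking" // right material exists but was not surfaced
  | "prompt-format"     // retrieval was fine, generation misfired
  | "out-of-scope";     // question falls outside the corpus by design

interface FeedbackEvent {
  eventId: string;          // stable ID linking question → answer → rating
  question: string;
  answer: string;
  citations: string[];
  rating?: "up" | "down";
  triage?: FailureCategory; // assigned during review of low-rated answers
}

// Pull out the low-rated answers that need triage.
function lowRated(events: FeedbackEvent[]): FeedbackEvent[] {
  return events.filter((e) => e.rating === "down");
}
```

Because the ID is stable across the whole interaction, a regression test can replay the exact question behind any low-rated event against a new retrieval or prompt configuration.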

Edge Rate Limiting Implementation

Rate limiting is enforced at the Cloudflare Worker, which identifies clients by a one-way hash of request context (IP + origin + user-agent) rather than storing raw IPs. Requests pass through sequential guards: an hourly quota and a minimum burst interval between requests. A key design decision is graceful degradation (fail-open): if the KV storage backing the limiter is unavailable, requests are allowed through, prioritizing availability over strict rate limiting in adverse conditions.
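A minimal sketch of the hash-then-count pattern with fail-open behavior. The FNV-1a hash here stands in for whatever one-way hash the Worker actually uses (a real deployment would likely use Web Crypto SHA-256), and the `KvLike` interface, key scheme, and quota number are assumptions.

```typescript
// Minimal interface mirroring the subset of Workers KV used here.
interface KvLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, opts?: { expirationTtl: number }): Promise<void>;
}

// One-way client key from request context; raw IPs are never stored.
// FNV-1a is used as a simple illustrative hash, not a security claim.
function clientKey(ip: string, origin: string, userAgent: string): string {
  let h = 0x811c9dc5;
  for (const ch of `${ip}|${origin}|${userAgent}`) {
    h ^= ch.codePointAt(0)!;
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// Hourly-quota guard. If KV errors, the request is allowed (fail-open).
async function allowRequest(
  kv: KvLike,
  key: string,
  hourlyQuota = 60, // assumed quota for illustration
): Promise<boolean> {
  try {
    const count = Number((await kv.get(key)) ?? "0");
    if (count >= hourlyQuota) return false;
    await kv.put(key, String(count + 1), { expirationTtl: 3600 });
    return true;
  } catch {
    // KV unavailable: degrade gracefully rather than reject traffic.
    return true;
  }
}
```

The try/catch is the whole fail-open decision: a limiter outage costs some over-serving for an hour, which the article treats as cheaper than turning away legitimate hiring teams.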

Tags: chatbot · RAG · Cloudflare Workers · LangGraph · rate limiting · feedback loops · system architecture · LLM
